* Re: Potential OSD deadlock?
       [not found]                     ` <CAJ4mKGYdJJfFERrOrdN7T8SzhdcyKhSDqJ69wOiee2aVj8vpEA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-09-22 21:22                       ` Robert LeBlanc
       [not found]                         ` <CAANLjFqbt0y-Ri=q6hXuuS06Sgi1S6phRdb1MJTgR6LTyHPtvw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-09-22 21:22 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel


OK, looping in ceph-devel to see if I can get some more eyes. I've
extracted what I think are the important entries from the logs for the
first blocked request. NTP is running on all the servers, so the logs
should be close in terms of time. Logs for 12:50 to 13:00 are
available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz

2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0

In the logs I can see that osd.17 dispatches the I/O to osd.13 and
osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
but for some reason osd.13 doesn't get the message until 53 seconds
later. osd.17 seems happy to just wait and doesn't resend the data
(well, I'm not 100% sure how to tell which entries are the actual data
transfer).
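
That works out to 12:55:59.790591 - 12:55:06.557160, i.e. about 53.2
seconds between osd.17 submitting the replicated write and osd.13 first
logging it. If anyone wants to follow along in the log archive,
something like this should pull out the relevant entries (the file
names are placeholders for the per-OSD logs in the tarball, and it
assumes the "debug ms = 1" messenger lines are present):

# follow one of the slow ops (client.250874.0:1388 from the warnings
# below) across the primary and both replicas, in timestamp order
grep -h 'client.250874.0:1388' ceph-osd.17.log ceph-osd.13.log ceph-osd.16.log | sort
# narrow it down to the replication traffic between osd.17 and osd.13
grep -h 'osd_repop' ceph-osd.17.log ceph-osd.13.log | grep 'client.250874.0:1388' | sort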

It looks like osd.17 is receiving responses to start the communication
with osd.13, but the op is not acknowledged until almost a minute
later. To me it seems that the message is getting received but not
passed to another thread right away or something. This test was done
with an idle cluster, a single fio client (rbd engine) with a single
thread.

The OSD servers are almost 100% idle during these blocked I/O
requests. I think I'm at the end of my troubleshooting, so I can use
some help.

Single Test started about
2015-09-22 12:52:36

2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.439150 secs
2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
cluster [WRN] slow request 30.439150 seconds old, received at
2015-09-22 12:55:06.487451:
 osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,16
2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
[WRN] 2 slow requests, 2 included below; oldest blocked for >
30.379680 secs
2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
[WRN] slow request 30.291520 seconds old, received at 2015-09-22
12:55:06.406303:
 osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,17
2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
[WRN] slow request 30.379680 seconds old, received at 2015-09-22
12:55:06.318144:
 osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,14
2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.954212 secs
2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
cluster [WRN] slow request 30.954212 seconds old, received at
2015-09-22 12:57:33.044003:
 osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 16,17
2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.704367 secs
2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
cluster [WRN] slow request 30.704367 seconds old, received at
2015-09-22 12:57:33.055404:
 osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
[set-alloc-hint object_size 4194304 write_size 4194304,write
0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
 currently waiting for subops from 13,17

Server   IP addr              OSD
nodev  - 192.168.55.11 - 12
nodew  - 192.168.55.12 - 13
nodex  - 192.168.55.13 - 16
nodey  - 192.168.55.14 - 17
nodez  - 192.168.55.15 - 14
nodezz - 192.168.55.16 - 15

fio job:
[rbd-test]
readwrite=write
blocksize=4M
#runtime=60
name=rbd-test
#readwrite=randwrite
#bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
#rwmixread=72
#norandommap
#size=1T
#blocksize=4k
ioengine=rbd
rbdname=test2
pool=rbd
clientname=admin
iodepth=8
#numjobs=4
#thread
#group_reporting
#time_based
#direct=1
#ramp_time=60
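
(For reference, the job is run with something like the line below; the
job file name is arbitrary, and fio has to be built with the rbd engine
for ioengine=rbd to be available.)

# run the job file above against the test2 image in the rbd pool
fio rbd-test.fio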


Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum <gfarnum-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>> Is there some way to tell in the logs that this is happening?
>
> You can search for the (mangled) name _split_collection
>> I'm not
>> seeing much I/O or CPU usage during these times. Is there some way to
>> prevent the splitting? Is there a negative side effect to doing so?
>
> Bump up the split and merge thresholds. You can search the list for
> this, it was discussed not too long ago.
>
>> We've had I/O block for over 900 seconds and as soon as the sessions
>> are aborted, they are reestablished and complete immediately.
>>
>> The fio test is just a sequential write; starting it over (rewriting
>> from the beginning) still causes the issue. I suspect that it is not
>> having to create new files and therefore not splitting collections.
>> This is on my test cluster with no other load.
>
> Hmm, that does make it seem less likely if you're really not creating
> new objects, if you're actually running fio in such a way that it's
> not allocating new FS blocks (this is probably hard to set up?).
>
>>
>> I'll be doing a lot of testing today. Which log options and depths
>> would be the most helpful for tracking this issue down?
>
> If you want to go log diving "debug osd = 20", "debug filestore = 20",
> "debug ms = 1" are what the OSD guys like to see. That should spit out
> everything you need to track exactly what each Op is doing.
> -Greg
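
(For reference, those debug levels can also be turned up at runtime
instead of restarting the OSDs; a sketch, worth double-checking against
the docs for this release:)

# raise OSD debug logging for the next test run (revert afterwards,
# these levels generate a lot of log data)
ceph tell osd.\* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
# equivalently, set them in ceph.conf under [osd]:
#   debug osd = 20
#   debug filestore = 20
#   debug ms = 1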


* Re: Potential OSD deadlock?
       [not found]                         ` <CAANLjFqbt0y-Ri=q6hXuuS06Sgi1S6phRdb1MJTgR6LTyHPtvw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-09-22 21:41                           ` Samuel Just
  2015-09-22 21:45                             ` [ceph-users] " Robert LeBlanc
       [not found]                             ` <CAN=+7FVo6D2AoufELCP_qeiJ23i0XEOqQs_yYLrFNXwiiQSthw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 45+ messages in thread
From: Samuel Just @ 2015-09-22 21:41 UTC (permalink / raw)
  To: Robert LeBlanc, Weil, Sage; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

I looked at the logs, it looks like there was a 53 second delay
between when osd.17 started sending the osd_repop message and when
osd.13 started reading it, which is pretty weird.  Sage, didn't we
once see a kernel issue which caused some messages to be mysteriously
delayed for many 10s of seconds?

What kernel are you running?
-Sam



* Re: [ceph-users] Potential OSD deadlock?
  2015-09-22 21:41                           ` Samuel Just
@ 2015-09-22 21:45                             ` Robert LeBlanc
       [not found]                             ` <CAN=+7FVo6D2AoufELCP_qeiJ23i0XEOqQs_yYLrFNXwiiQSthw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 45+ messages in thread
From: Robert LeBlanc @ 2015-09-22 21:45 UTC (permalink / raw)
  To: Samuel Just; +Cc: Weil, Sage, Gregory Farnum, ceph-users, ceph-devel


4.2.0-1.el7.elrepo.x86_64
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 3:41 PM, Samuel Just  wrote:
> I looked at the logs, it looks like there was a 53 second delay
> between when osd.17 started sending the osd_repop message and when
> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> once see a kernel issue which caused some messages to be mysteriously
> delayed for many 10s of seconds?
>
> What kernel are you running?
> -Sam



* Re: Potential OSD deadlock?
       [not found]                             ` <CAN=+7FVo6D2AoufELCP_qeiJ23i0XEOqQs_yYLrFNXwiiQSthw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-09-22 21:52                               ` Sage Weil
       [not found]                                 ` <alpine.DEB.2.00.1509221450490.11876-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Sage Weil @ 2015-09-22 21:52 UTC (permalink / raw)
  To: Samuel Just; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

On Tue, 22 Sep 2015, Samuel Just wrote:
> I looked at the logs, it looks like there was a 53 second delay
> between when osd.17 started sending the osd_repop message and when
> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> once see a kernel issue which caused some messages to be mysteriously
> delayed for many 10s of seconds?

Every time we have seen this behavior and diagnosed it in the wild it has 
been a network misconfiguration.  Usually related to jumbo frames.
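
A quick sanity check is to confirm that full-size frames actually pass
end to end between the OSD hosts, for example (sizes here assume a
9000-byte Ethernet MTU; adjust for whatever the links are set to):

# 8972 = 9000 MTU - 20 byte IP header - 8 byte ICMP header; -M do
# forbids fragmentation, so this fails if any hop drops jumbo frames
ping -M do -s 8972 -c 10 192.168.55.12
# and compare the configured MTU on each node's interface
ip link | grep mtu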

sage



* Re: Potential OSD deadlock?
       [not found]                                 ` <alpine.DEB.2.00.1509221450490.11876-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-09-22 22:15                                   ` Robert LeBlanc
  2015-09-23 18:48                                     ` [ceph-users] " Robert LeBlanc
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-09-22 22:15 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel


This is IPoIB and we have the MTU set to 64K. There were some issues
pinging hosts with "No buffer space available" (the hosts are currently
configured for 4GB to test SSD caching rather than page cache). I
found that an MTU under 32K worked reliably for ping, but we still had
the blocked I/O.

I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
the blocked I/O.
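
(The MTU change and ping check were along these lines; ib0 here is a
stand-in for whatever the IPoIB interface is actually called:)

# drop the IPoIB MTU to 1500 on each node, then re-test with
# don't-fragment pings sized to just fit (1472 = 1500 - 20 - 8)
ip link set dev ib0 mtu 1500
ping -M do -s 1472 -c 100 192.168.55.12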
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> On Tue, 22 Sep 2015, Samuel Just wrote:
>> I looked at the logs, it looks like there was a 53 second delay
>> between when osd.17 started sending the osd_repop message and when
>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>> once see a kernel issue which caused some messages to be mysteriously
>> delayed for many 10s of seconds?
>
> Every time we have seen this behavior and diagnosed it in the wild it has
> been a network misconfiguration.  Usually related to jumbo frames.
>
> sage



* Re: [ceph-users] Potential OSD deadlock?
  2015-09-22 22:15                                   ` Robert LeBlanc
@ 2015-09-23 18:48                                     ` Robert LeBlanc
       [not found]                                       ` <CAANLjFrrAyhJ=JR0+K3SX6G1ZsdcUyhTqFzUn_A0+bSh0D=bkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-09-23 18:48 UTC (permalink / raw)
  To: Sage Weil; +Cc: Samuel Just, Gregory Farnum, ceph-users, ceph-devel


OK, here is the update on the saga...

I traced some more of the blocked I/Os and it seems that communication
between two of the hosts was worse than between the others. I did a
two-way ping flood between those two hosts using max packet sizes
(1500). After 1.5M packets, no lost pings. I then kept the ping flood
running while I put Ceph load on the cluster, and the dropped pings
started increasing; after stopping the Ceph workload, the pings stopped
dropping.
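
(Roughly, per direction; flood ping needs root, and 1472 bytes of
payload is the largest that fits in a 1500-byte MTU:)

# two-way flood ping with maximum-size packets while the fio workload
# runs; the summary line at the end shows any packet loss
ping -f -s 1472 -c 1500000 192.168.55.12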

I then ran iperf between all the nodes with the same results, so that
ruled out Ceph to a large degree. I then booted into the
3.10.0-229.14.1.el7.x86_64 kernel, and after an hour of testing so far
there haven't been any dropped pings or blocked I/O. Our 40 Gb NICs
really need the network enhancements in the 4.x series to work well.
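
(The iperf runs were along these lines, repeated for each node pair;
the duration is just an example:)

# on the receiving node
iperf -s
# on the sending node, a 60 second run toward the receiver
iperf -c 192.168.55.12 -t 60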

Does this sound familiar to anyone? I'll probably start bisecting the
kernel to see where this issue is introduced. Both of the clusters
with this issue are running 4.x; other than that, they have pretty
different hardware and network configs.
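
(The plan is the usual kernel git bisect between the two versions,
treating the elrepo 4.2 and stock el7 3.10 kernels as roughly upstream
v4.2 and v3.10, rebuilding and re-running the fio test at each step:)

cd linux
git bisect start
git bisect bad v4.2        # kernel that shows the blocked I/O
git bisect good v3.10      # kernel that has been clean so far
# build and boot the suggested commit, run the fio test, then mark it
git bisect good            # or: git bisect bad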

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


>>> >> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>> Hash: SHA256
>>> >>>
>>> >>> Is there some way to tell in the logs that this is happening?
>>> >>
>>> >> You can search for the (mangled) name _split_collection
>>> >>> I'm not
>>> >>> seeing much I/O, CPU usage during these times. Is there some way to
>>> >>> prevent the splitting? Is there a negative side effect to doing so?
>>> >>
>>> >> Bump up the split and merge thresholds. You can search the list for
>>> >> this, it was discussed not too long ago.
>>> >>
>>> >>> We've had I/O block for over 900 seconds and as soon as the sessions
>>> >>> are aborted, they are reestablished and complete immediately.
>>> >>>
>>> >>> The fio test is just a seq write, starting it over (rewriting from the
>>> >>> beginning) is still causing the issue. I was suspect that it is not
>>> >>> having to create new file and therefore split collections. This is on
>>> >>> my test cluster with no other load.
>>> >>
>>> >> Hmm, that does make it seem less likely if you're really not creating
>>> >> new objects, if you're actually running fio in such a way that it's
>>> >> not allocating new FS blocks (this is probably hard to set up?).
>>> >>
>>> >>>
>>> >>> I'll be doing a lot of testing today. Which log options and depths
>>> >>> would be the most helpful for tracking this issue down?
>>> >>
>>> >> If you want to go log diving "debug osd = 20", "debug filestore = 20",
>>> >> "debug ms = 1" are what the OSD guys like to see. That should spit out
>>> >> everything you need to track exactly what each Op is doing.
>>> >> -Greg
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> > the body of a message to majordomo@vger.kernel.org
>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
> gcZm
> =CjwB
> -----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                       ` <CAANLjFrrAyhJ=JR0+K3SX6G1ZsdcUyhTqFzUn_A0+bSh0D=bkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-09-23 19:10                                         ` Mark Nelson
       [not found]                                           ` <5602F921.3010204-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Mark Nelson @ 2015-09-23 19:10 UTC (permalink / raw)
  To: Robert LeBlanc, Sage Weil; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

FWIW, we've got some 40GbE Intel cards in the community performance 
cluster on a Mellanox 40GbE switch that appear (knock on wood) to be 
running fine with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from 
Intel that older drivers might cause problems though.
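
For comparison, the driver and firmware versions on these cards can be
pulled with ethtool, e.g. (interface name as in the ifconfig output
below; adjust for your hardware):

   ethtool -i ens513f1                             # driver/firmware versions
   ethtool -S ens513f1 | egrep -i 'drop|err|miss'  # NIC drop/error counters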

Here's ifconfig from one of the nodes:

ens513f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20<link>
         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
         RX errors 0  dropped 0  overruns 0  frame 0
         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Mark

On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> OK, here is the update on the saga...
>
> I traced some more of blocked I/Os and it seems that communication
> between two hosts seemed worse than others. I did a two way ping flood
> between the two hosts using max packet sizes (1500). After 1.5M
> packets, no lost pings. Then then had the ping flood running while I
> put Ceph load on the cluster and the dropped pings started increasing
> after stopping the Ceph workload the pings stopped dropping.
>
> I then ran iperf between all the nodes with the same results, so that
> ruled out Ceph to a large degree. I then booted in the the
> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
> need the network enhancements in the 4.x series to work well.
>
> Does this sound familiar to anyone? I'll probably start bisecting the
> kernel to see where this issue in introduced. Both of the clusters
> with this issue are running 4.x, other than that, they are pretty
> differing hardware and network configs.
>
> Thanks,
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
> 4OEo
> =P33I
> -----END PGP SIGNATURE-----
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> This is IPoIB and we have the MTU set to 64K. There was some issues
>> pinging hosts with "No buffer space available" (hosts are currently
>> configured for 4GB to test SSD caching rather than page cache). I
>> found that MTU under 32K worked reliable for ping, but still had the
>> blocked I/O.
>>
>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>> the blocked I/O.
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>> I looked at the logs, it looks like there was a 53 second delay
>>>> between when osd.17 started sending the osd_repop message and when
>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>> once see a kernel issue which caused some messages to be mysteriously
>>>> delayed for many 10s of seconds?
>>>
>>> Every time we have seen this behavior and diagnosed it in the wild it has
>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>
>>> sage
>>>
>>>
>>>>
>>>> What kernel are you running?
>>>> -Sam
>>>>
>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>>>
>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>> extracted what I think are important entries from the logs for the
>>>>> first blocked request. NTP is running all the servers so the logs
>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>
>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>
>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>> transfer).
>>>>>
>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>> later. To me it seems that the message is getting received but not
>>>>> passed to another thread right away or something. This test was done
>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>> thread.
>>>>>
>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>> some help.
>>>>>
>>>>> Single Test started about
>>>>> 2015-09-22 12:52:36
>>>>>
>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>> 30.439150 secs
>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>> 2015-09-22 12:55:06.487451:
>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>   currently waiting for subops from 13,16
>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>> 30.379680 secs
>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>> 12:55:06.406303:
>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>   currently waiting for subops from 13,17
>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>> 12:55:06.318144:
>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>   currently waiting for subops from 13,14
>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>> 30.954212 secs
>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>> 2015-09-22 12:57:33.044003:
>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>   currently waiting for subops from 16,17
>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>> 30.704367 secs
>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>> 2015-09-22 12:57:33.055404:
>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>   currently waiting for subops from 13,17
>>>>>
>>>>> Server   IP addr              OSD
>>>>> nodev  - 192.168.55.11 - 12
>>>>> nodew  - 192.168.55.12 - 13
>>>>> nodex  - 192.168.55.13 - 16
>>>>> nodey  - 192.168.55.14 - 17
>>>>> nodez  - 192.168.55.15 - 14
>>>>> nodezz - 192.168.55.16 - 15
>>>>>
>>>>> fio job:
>>>>> [rbd-test]
>>>>> readwrite=write
>>>>> blocksize=4M
>>>>> #runtime=60
>>>>> name=rbd-test
>>>>> #readwrite=randwrite
>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>> #rwmixread=72
>>>>> #norandommap
>>>>> #size=1T
>>>>> #blocksize=4k
>>>>> ioengine=rbd
>>>>> rbdname=test2
>>>>> pool=rbd
>>>>> clientname=admin
>>>>> iodepth=8
>>>>> #numjobs=4
>>>>> #thread
>>>>> #group_reporting
>>>>> #time_based
>>>>> #direct=1
>>>>> #ramp_time=60
>>>>>
>>>>>
>>>>> Thanks,
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: Mailvelope v1.1.0
>>>>> Comment: https://www.mailvelope.com
>>>>>
>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>>> J3hS
>>>>> =0J7F
>>>>> -----END PGP SIGNATURE-----
>>>>> ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>
>>>>>
>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA256
>>>>>>>
>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>>
>>>>>> You can search for the (mangled) name _split_collection
>>>>>>> I'm not
>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>>
>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>> this, it was discussed not too long ago.
>>>>>>
>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>>>
>>>>>>> The fio test is just a seq write, starting it over (rewriting from the
>>>>>>> beginning) is still causing the issue. I was suspect that it is not
>>>>>>> having to create new file and therefore split collections. This is on
>>>>>>> my test cluster with no other load.
>>>>>>
>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>>
>>>>>>>
>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>>> would be the most helpful for tracking this issue down?
>>>>>>
>>>>>> If you want to go log diving "debug osd = 20", "debug filestore = 20",
>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit out
>>>>>> everything you need to track exactly what each Op is doing.
>>>>>> -Greg
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.1.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>> gcZm
>> =CjwB
>> -----END PGP SIGNATURE-----
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                           ` <5602F921.3010204-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-09-23 19:30                                             ` Robert LeBlanc
  2015-09-25 20:40                                               ` [ceph-users] " Robert LeBlanc
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-09-23 19:30 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

We were only able to get ~17Gb out of the XL710 (heavily tweaked)
until we went to the 4.x kernel, where we got ~36Gb (no tweaking). It
seems that there were some major reworks of the network handling in
the kernel to handle that network rate efficiently. If I remember
right, we also saw a drop in CPU utilization. I'm starting to think
that we did see packet loss while congesting our ISLs in our initial
testing, but we could not tell where the drops were happening. We saw
some on the switches, but it didn't seem bad unless we were trying to
congest things. We probably already saw this issue and just didn't
know it.
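
For anyone who wants to run the same kind of node-to-node throughput
test, an iperf invocation along these lines is what I mean (the peer
host, stream count and duration are only illustrative):

   iperf -s                        # on the receiving node
   iperf -c <peer> -P 4 -t 60      # on the sending node, 4 parallel streams

A single stream usually will not saturate a 40Gb link, which is why we
run several in parallel.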
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> FWIW, we've got some 40GbE Intel cards in the community performance cluster
> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
> drivers might cause problems though.
>
> Here's ifconfig from one of the nodes:
>
> ens513f1: flags=4163  mtu 1500
>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> Mark
>
>
> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> OK, here is the update on the saga...
>>
>> I traced some more of blocked I/Os and it seems that communication
>> between two hosts seemed worse than others. I did a two way ping flood
>> between the two hosts using max packet sizes (1500). After 1.5M
>> packets, no lost pings. Then then had the ping flood running while I
>> put Ceph load on the cluster and the dropped pings started increasing
>> after stopping the Ceph workload the pings stopped dropping.
>>
>> I then ran iperf between all the nodes with the same results, so that
>> ruled out Ceph to a large degree. I then booted in the the
>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>> need the network enhancements in the 4.x series to work well.
>>
>> Does this sound familiar to anyone? I'll probably start bisecting the
>> kernel to see where this issue in introduced. Both of the clusters
>> with this issue are running 4.x, other than that, they are pretty
>> differing hardware and network configs.
>>
>> Thanks,
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.1.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>> 4OEo
>> =P33I
>> -----END PGP SIGNATURE-----
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>> wrote:
>>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> This is IPoIB and we have the MTU set to 64K. There was some issues
>>> pinging hosts with "No buffer space available" (hosts are currently
>>> configured for 4GB to test SSD caching rather than page cache). I
>>> found that MTU under 32K worked reliable for ping, but still had the
>>> blocked I/O.
>>>
>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>> the blocked I/O.
>>> - ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>
>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>>
>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>>> between when osd.17 started sending the osd_repop message and when
>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>> delayed for many 10s of seconds?
>>>>
>>>>
>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>> has
>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> What kernel are you running?
>>>>> -Sam
>>>>>
>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>>>
>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> Hash: SHA256
>>>>>>
>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>>> extracted what I think are important entries from the logs for the
>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>>
>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>>
>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>>> transfer).
>>>>>>
>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>>> later. To me it seems that the message is getting received but not
>>>>>> passed to another thread right away or something. This test was done
>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>>> thread.
>>>>>>
>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>>> some help.
>>>>>>
>>>>>> Single Test started about
>>>>>> 2015-09-22 12:52:36
>>>>>>
>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>> 30.439150 secs
>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>>> 2015-09-22 12:55:06.487451:
>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>>   currently waiting for subops from 13,16
>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>>> 30.379680 secs
>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>>> 12:55:06.406303:
>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>>   currently waiting for subops from 13,17
>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>>> 12:55:06.318144:
>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>>   currently waiting for subops from 13,14
>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>> 30.954212 secs
>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>>> 2015-09-22 12:57:33.044003:
>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>>   currently waiting for subops from 16,17
>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>> 30.704367 secs
>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>>> 2015-09-22 12:57:33.055404:
>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>>   currently waiting for subops from 13,17
>>>>>>
>>>>>> Server   IP addr              OSD
>>>>>> nodev  - 192.168.55.11 - 12
>>>>>> nodew  - 192.168.55.12 - 13
>>>>>> nodex  - 192.168.55.13 - 16
>>>>>> nodey  - 192.168.55.14 - 17
>>>>>> nodez  - 192.168.55.15 - 14
>>>>>> nodezz - 192.168.55.16 - 15
>>>>>>
>>>>>> fio job:
>>>>>> [rbd-test]
>>>>>> readwrite=write
>>>>>> blocksize=4M
>>>>>> #runtime=60
>>>>>> name=rbd-test
>>>>>> #readwrite=randwrite
>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>>> #rwmixread=72
>>>>>> #norandommap
>>>>>> #size=1T
>>>>>> #blocksize=4k
>>>>>> ioengine=rbd
>>>>>> rbdname=test2
>>>>>> pool=rbd
>>>>>> clientname=admin
>>>>>> iodepth=8
>>>>>> #numjobs=4
>>>>>> #thread
>>>>>> #group_reporting
>>>>>> #time_based
>>>>>> #direct=1
>>>>>> #ramp_time=60
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> Version: Mailvelope v1.1.0
>>>>>> Comment: https://www.mailvelope.com
>>>>>>
>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>>>> J3hS
>>>>>> =0J7F
>>>>>> -----END PGP SIGNATURE-----
>>>>>> ----------------
>>>>>> Robert LeBlanc
>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>>>
>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>>>>
>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>> Hash: SHA256
>>>>>>>>
>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>>>
>>>>>>>
>>>>>>> You can search for the (mangled) name _split_collection
>>>>>>>>
>>>>>>>> I'm not
>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>>>
>>>>>>>
>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>>> this, it was discussed not too long ago.
>>>>>>>
>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>>>>
>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>>>>>>> the
>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
>>>>>>>> having to create new file and therefore split collections. This is
>>>>>>>> on
>>>>>>>> my test cluster with no other load.
>>>>>>>
>>>>>>>
>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>>>
>>>>>>>>
>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>>>> would be the most helpful for tracking this issue down?
>>>>>>>
>>>>>>>
>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>>>>> 20",
>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>>>>> out
>>>>>>> everything you need to track exactly what each Op is doing.
>>>>>>> -Greg
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in
>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.1.0
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>> gcZm
>>> =CjwB
>>> -----END PGP SIGNATURE-----
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
ae22
=AX+L
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [ceph-users] Potential OSD deadlock?
  2015-09-23 19:30                                             ` Robert LeBlanc
@ 2015-09-25 20:40                                               ` Robert LeBlanc
       [not found]                                                 ` <CAANLjFrw0AtqvHu5ioRPH7yKjV5ZqfnGKou+fKLTZukLTnLbgw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-09-25 20:40 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Sage Weil, Samuel Just, Gregory Farnum, ceph-users, ceph-devel

We dropped the replication on our cluster from 4 to 3 and it looks
like all the blocked I/O has stopped (no entries in the log for the
last 12 hours). This makes me believe that there is some issue with
the number of sockets or some other TCP limit. We have not messed with
ephemeral ports or TIME_WAIT at this point. There are 130 OSDs and 8
KVM hosts hosting about 150 VMs. The open files limit is 32K for the
OSD processes and 16K system-wide.
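
For reference, these are the kinds of knobs I mean by ephemeral ports
and TIME_WAIT (sysctl names only; the values shown are the usual
suggestions, not something we are running yet):

   net.ipv4.ip_local_port_range = 10000 65000   # widen the ephemeral port range
   net.ipv4.tcp_fin_timeout = 30                # shorten the FIN_WAIT2 hold time
   net.ipv4.tcp_tw_reuse = 1                    # allow reuse of TIME_WAIT sockets
   fs.file-max = 1000000                        # raise the system-wide open file limit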

Does this seem like the right spot to be looking? What are some
configuration items we should be looking at?

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
> seems that there were some major reworks in the network handling in
> the kernel to efficiently handle that network rate. If I remember
> right we also saw a drop in CPU utilization. I'm starting to think
> that we did see packet loss while congesting our ISLs in our initial
> testing, but we could not tell where the dropping was happening. We
> saw some on the switches, but it didn't seem to be bad if we weren't
> trying to congest things. We probably already saw this issue, just
> didn't know it.
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>> drivers might cause problems though.
>>
>> Here's ifconfig from one of the nodes:
>>
>> ens513f1: flags=4163  mtu 1500
>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>         RX errors 0  dropped 0  overruns 0  frame 0
>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>> Mark
>>
>>
>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> OK, here is the update on the saga...
>>>
>>> I traced some more of blocked I/Os and it seems that communication
>>> between two hosts seemed worse than others. I did a two way ping flood
>>> between the two hosts using max packet sizes (1500). After 1.5M
>>> packets, no lost pings. Then then had the ping flood running while I
>>> put Ceph load on the cluster and the dropped pings started increasing
>>> after stopping the Ceph workload the pings stopped dropping.
>>>
>>> I then ran iperf between all the nodes with the same results, so that
>>> ruled out Ceph to a large degree. I then booted in the the
>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>> need the network enhancements in the 4.x series to work well.
>>>
>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>> kernel to see where this issue in introduced. Both of the clusters
>>> with this issue are running 4.x, other than that, they are pretty
>>> differing hardware and network configs.
>>>
>>> Thanks,
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.1.0
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>> 4OEo
>>> =P33I
>>> -----END PGP SIGNATURE-----
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>> wrote:
>>>>
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA256
>>>>
>>>> This is IPoIB and we have the MTU set to 64K. There was some issues
>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>> found that MTU under 32K worked reliable for ping, but still had the
>>>> blocked I/O.
>>>>
>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>> the blocked I/O.
>>>> - ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>
>>>>
>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>>
>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>>>
>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>>>> between when osd.17 started sending the osd_repop message and when
>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>>> delayed for many 10s of seconds?
>>>>>
>>>>>
>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>>> has
>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> What kernel are you running?
>>>>>> -Sam
>>>>>>
>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>>>>
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA256
>>>>>>>
>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>>>> extracted what I think are important entries from the logs for the
>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>>>
>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>>>
>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>>>> transfer).
>>>>>>>
>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>>>> later. To me it seems that the message is getting received but not
>>>>>>> passed to another thread right away or something. This test was done
>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>>>> thread.
>>>>>>>
>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>>>> some help.
>>>>>>>
>>>>>>> Single Test started about
>>>>>>> 2015-09-22 12:52:36
>>>>>>>
>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>> 30.439150 secs
>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>>>> 2015-09-22 12:55:06.487451:
>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>>>   currently waiting for subops from 13,16
>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>>>> 30.379680 secs
>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>>>> 12:55:06.406303:
>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>>>   currently waiting for subops from 13,17
>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>>>> 12:55:06.318144:
>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>>>   currently waiting for subops from 13,14
>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>> 30.954212 secs
>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>>>> 2015-09-22 12:57:33.044003:
>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>   currently waiting for subops from 16,17
>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>> 30.704367 secs
>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>>>> 2015-09-22 12:57:33.055404:
>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>   currently waiting for subops from 13,17
>>>>>>>
>>>>>>> Server   IP addr              OSD
>>>>>>> nodev  - 192.168.55.11 - 12
>>>>>>> nodew  - 192.168.55.12 - 13
>>>>>>> nodex  - 192.168.55.13 - 16
>>>>>>> nodey  - 192.168.55.14 - 17
>>>>>>> nodez  - 192.168.55.15 - 14
>>>>>>> nodezz - 192.168.55.16 - 15
>>>>>>>
>>>>>>> fio job:
>>>>>>> [rbd-test]
>>>>>>> readwrite=write
>>>>>>> blocksize=4M
>>>>>>> #runtime=60
>>>>>>> name=rbd-test
>>>>>>> #readwrite=randwrite
>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>>>> #rwmixread=72
>>>>>>> #norandommap
>>>>>>> #size=1T
>>>>>>> #blocksize=4k
>>>>>>> ioengine=rbd
>>>>>>> rbdname=test2
>>>>>>> pool=rbd
>>>>>>> clientname=admin
>>>>>>> iodepth=8
>>>>>>> #numjobs=4
>>>>>>> #thread
>>>>>>> #group_reporting
>>>>>>> #time_based
>>>>>>> #direct=1
>>>>>>> #ramp_time=60
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>> Version: Mailvelope v1.1.0
>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>
>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>>>>> J3hS
>>>>>>> =0J7F
>>>>>>> -----END PGP SIGNATURE-----
>>>>>>> ----------------
>>>>>>> Robert LeBlanc
>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>>>>
>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>>>>>
>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>> Hash: SHA256
>>>>>>>>>
>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>>>>
>>>>>>>>
>>>>>>>> You can search for the (mangled) name _split_collection
>>>>>>>>>
>>>>>>>>> I'm not
>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>>>>
>>>>>>>>
>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>>>> this, it was discussed not too long ago.
>>>>>>>>
>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>>>>>
>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>>>>>>>> the
>>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
>>>>>>>>> having to create new file and therefore split collections. This is
>>>>>>>>> on
>>>>>>>>> my test cluster with no other load.
>>>>>>>>
>>>>>>>>
>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>>>>> would be the most helpful for tracking this issue down?
>>>>>>>>
>>>>>>>>
>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>>>>>> 20",
>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>>>>>> out
>>>>>>>> everything you need to track exactly what each Op is doing.
>>>>>>>> -Greg
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>> in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> -----BEGIN PGP SIGNATURE-----
>>>> Version: Mailvelope v1.1.0
>>>> Comment: https://www.mailvelope.com
>>>>
>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>>> gcZm
>>>> =CjwB
>>>> -----END PGP SIGNATURE-----
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
> ae22
> =AX+L
> -----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                 ` <CAANLjFrw0AtqvHu5ioRPH7yKjV5ZqfnGKou+fKLTZukLTnLbgw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-03 20:10                                                   ` Robert LeBlanc
       [not found]                                                     ` <CAANLjFoafHQ1X8U7LUrvhh2h8fu3WgNpUsMDR+QkSWpfW8ad0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-03 20:10 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

We are still struggling with this and have tried a lot of different
things. Unfortunately, Inktank (now Red Hat) no longer provides
consulting services for non-Red Hat systems. If there are any
certified Ceph consultants in the US who can do both remote and
on-site engagements, please let us know.

This certainly seems to be network related, but somewhere in the
kernel. We have tried increasing the network and TCP buffers and the
number of TCP sockets, and reducing the FIN_WAIT2 timeout. There is
about 25% idle CPU on the boxes; the disks are busy, but not
constantly at 100% (they cycle from <10% up to 100%, but not 100% for
more than a few seconds at a time). There seems to be no reasonable
explanation why I/O is blocked fairly frequently for longer than 30
seconds. We have verified jumbo frames by pinging from/to each node
with 9000 byte packets. The network admins have verified that packets
are not being dropped in the switches for these nodes. We have tried
different kernels, including the recent Google patch to cubic. This is
showing up on three clusters (two Ethernet and one IPoIB). I booted
one cluster into Debian Jessie (from CentOS 7.1) with similar results.
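
For reference, the network tuning we have been experimenting with
looks roughly like the following (the values here are illustrative,
not our exact production settings), and the jumbo frame check was just
a don't-fragment ping:

# illustrative sysctls we have been adjusting (values are examples)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
sysctl -w net.ipv4.tcp_fin_timeout=10
# jumbo frame check: 9000 MTU minus 28 bytes of IP/ICMP headers
ping -M do -s 8972 <peer node>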

The messages seem slightly different:
2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
100.087155 secs
2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
cluster [WRN] slow request 30.041999 seconds old, received at
2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
points reached

I don't know what "no flag points reached" means.

The problem is most pronounced when we have to reboot an OSD node (1
of 13): we will have hundreds of I/Os blocked, sometimes for up to 300
seconds, and it takes a good 15 minutes for things to settle down. The
production cluster is very busy, normally doing 8,000 IOPS and peaking
at 15,000. This is all 4TB spindles with SSD journals and the disks
are between 25-50% full. We are currently splitting PGs to distribute
the load better across the disks, but we are having to do this 10 PGs
at a time as we get blocked I/O. We have max_backfills and
max_recovery set to 1, and client op priority is set higher than
recovery priority. We tried increasing the number of op threads but
this didn't seem to help. It seems that as soon as PGs are finished
being checked they become active, and that could be the cause of the
slow I/O while the other PGs are being checked.
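
For context, the relevant throttles in our ceph.conf look roughly like
this (option names from memory; the priority and thread values are
illustrative rather than exact):

[osd]
osd max backfills = 1
osd recovery max active = 1
osd client op priority = 63
osd recovery op priority = 1
# bumping this did not seem to help
osd op threads = 4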

What I don't understand is why the messages are delayed. As soon as
the message is received by the Ceph OSD process, it is very quickly
committed to the journal and a response is sent back to the primary
OSD, which is received very quickly as well. I've adjusted
min_free_kbytes and it seems to keep the OSDs from crashing, but it
doesn't solve the main problem. We don't have swap, and there is 64 GB
of RAM per node for 10 OSDs.
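
(The vm tweak was just something along these lines; the value is
illustrative, not a recommendation:)

# raise the kernel's reserved memory floor
sysctl -w vm.min_free_kbytes=2097152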

Is there something that could cause the kernel to receive a packet but
not be able to dispatch it to Ceph, which could explain why we are
seeing I/O blocked for 30+ seconds? Are there any pointers on tracing
Ceph messages from the network buffer through the kernel to the Ceph
process?
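
For example, is there a sensible way to correlate what the OSD reports
on its admin socket with what the kernel reports for its sockets?
Something along these lines (osd.13 and the port are placeholders):

# what the OSD side thinks is stuck (admin socket on that node)
ceph daemon osd.13 dump_ops_in_flight
ceph daemon osd.13 dump_historic_ops
# what the kernel side thinks is queued on the OSD sockets
ss -tnoi 'sport = :6800'
# TCP retransmit counters
netstat -s | grep -i retrans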

We could really use some pointers, no matter how outrageous. We've had
over 6 people looking into this for weeks now and just can't think of
anything else.

Thanks,
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
l7OF
=OI++
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> We dropped the replication on our cluster from 4 to 3 and it looks
> like all the blocked I/O has stopped (no entries in the log for the
> last 12 hours). This makes me believe that there is some issue with
> the number of sockets or some other TCP issue. We have not messed with
> Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> processes and 16K system wide.
>
> Does this seem like the right spot to be looking? What are some
> configuration items we should be looking at?
>
> Thanks,
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>> seems that there were some major reworks in the network handling in
>> the kernel to efficiently handle that network rate. If I remember
>> right we also saw a drop in CPU utilization. I'm starting to think
>> that we did see packet loss while congesting our ISLs in our initial
>> testing, but we could not tell where the dropping was happening. We
>> saw some on the switches, but it didn't seem to be bad if we weren't
>> trying to congest things. We probably already saw this issue, just
>> didn't know it.
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>> drivers might cause problems though.
>>>
>>> Here's ifconfig from one of the nodes:
>>>
>>> ens513f1: flags=4163  mtu 1500
>>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>
>>> Mark
>>>
>>>
>>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>>
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA256
>>>>
>>>> OK, here is the update on the saga...
>>>>
>>>> I traced some more of blocked I/Os and it seems that communication
>>>> between two hosts seemed worse than others. I did a two way ping flood
>>>> between the two hosts using max packet sizes (1500). After 1.5M
>>>> packets, no lost pings. Then then had the ping flood running while I
>>>> put Ceph load on the cluster and the dropped pings started increasing
>>>> after stopping the Ceph workload the pings stopped dropping.
>>>>
>>>> I then ran iperf between all the nodes with the same results, so that
>>>> ruled out Ceph to a large degree. I then booted in the the
>>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>>> need the network enhancements in the 4.x series to work well.
>>>>
>>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>>> kernel to see where this issue in introduced. Both of the clusters
>>>> with this issue are running 4.x, other than that, they are pretty
>>>> differing hardware and network configs.
>>>>
>>>> Thanks,
>>>> -----BEGIN PGP SIGNATURE-----
>>>> Version: Mailvelope v1.1.0
>>>> Comment: https://www.mailvelope.com
>>>>
>>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>>> 4OEo
>>>> =P33I
>>>> -----END PGP SIGNATURE-----
>>>> ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>
>>>>
>>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>>> wrote:
>>>>>
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>>>
>>>>> This is IPoIB and we have the MTU set to 64K. There was some issues
>>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>>> found that MTU under 32K worked reliable for ping, but still had the
>>>>> blocked I/O.
>>>>>
>>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>>> the blocked I/O.
>>>>> - ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>
>>>>>
>>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>>>
>>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>>>>
>>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>>>>> between when osd.17 started sending the osd_repop message and when
>>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>>>> delayed for many 10s of seconds?
>>>>>>
>>>>>>
>>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>>>> has
>>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> What kernel are you running?
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>>>>>
>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>> Hash: SHA256
>>>>>>>>
>>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>>>>> extracted what I think are important entries from the logs for the
>>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>>>>
>>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>>>>
>>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>>>>> transfer).
>>>>>>>>
>>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>>>>> later. To me it seems that the message is getting received but not
>>>>>>>> passed to another thread right away or something. This test was done
>>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>>>>> thread.
>>>>>>>>
>>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>>>>> some help.
>>>>>>>>
>>>>>>>> Single Test started about
>>>>>>>> 2015-09-22 12:52:36
>>>>>>>>
>>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>> 30.439150 secs
>>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>>>>> 2015-09-22 12:55:06.487451:
>>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>   currently waiting for subops from 13,16
>>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>>>>> 30.379680 secs
>>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>>>>> 12:55:06.406303:
>>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>   currently waiting for subops from 13,17
>>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>>>>> 12:55:06.318144:
>>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>   currently waiting for subops from 13,14
>>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>> 30.954212 secs
>>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>>>>> 2015-09-22 12:57:33.044003:
>>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>   currently waiting for subops from 16,17
>>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>> 30.704367 secs
>>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>>>>> 2015-09-22 12:57:33.055404:
>>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>   currently waiting for subops from 13,17
>>>>>>>>
>>>>>>>> Server   IP addr              OSD
>>>>>>>> nodev  - 192.168.55.11 - 12
>>>>>>>> nodew  - 192.168.55.12 - 13
>>>>>>>> nodex  - 192.168.55.13 - 16
>>>>>>>> nodey  - 192.168.55.14 - 17
>>>>>>>> nodez  - 192.168.55.15 - 14
>>>>>>>> nodezz - 192.168.55.16 - 15
>>>>>>>>
>>>>>>>> fio job:
>>>>>>>> [rbd-test]
>>>>>>>> readwrite=write
>>>>>>>> blocksize=4M
>>>>>>>> #runtime=60
>>>>>>>> name=rbd-test
>>>>>>>> #readwrite=randwrite
>>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>>>>> #rwmixread=72
>>>>>>>> #norandommap
>>>>>>>> #size=1T
>>>>>>>> #blocksize=4k
>>>>>>>> ioengine=rbd
>>>>>>>> rbdname=test2
>>>>>>>> pool=rbd
>>>>>>>> clientname=admin
>>>>>>>> iodepth=8
>>>>>>>> #numjobs=4
>>>>>>>> #thread
>>>>>>>> #group_reporting
>>>>>>>> #time_based
>>>>>>>> #direct=1
>>>>>>>> #ramp_time=60
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>
>>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>>>>>> J3hS
>>>>>>>> =0J7F
>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>> ----------------
>>>>>>>> Robert LeBlanc
>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>>>>>>
>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>> Hash: SHA256
>>>>>>>>>>
>>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You can search for the (mangled) name _split_collection
>>>>>>>>>>
>>>>>>>>>> I'm not
>>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>>>>> this, it was discussed not too long ago.
>>>>>>>>>
>>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>>>>>>
>>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>>>>>>>>> the
>>>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
>>>>>>>>>> having to create new file and therefore split collections. This is
>>>>>>>>>> on
>>>>>>>>>> my test cluster with no other load.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>>>>>> would be the most helpful for tracking this issue down?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>>>>>>> 20",
>>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>>>>>>> out
>>>>>>>>> everything you need to track exactly what each Op is doing.
>>>>>>>>> -Greg
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>> in
>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: Mailvelope v1.1.0
>>>>> Comment: https://www.mailvelope.com
>>>>>
>>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>>>> gcZm
>>>>> =CjwB
>>>>> -----END PGP SIGNATURE-----
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.1.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>> ae22
>> =AX+L
>> -----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                     ` <CAANLjFoafHQ1X8U7LUrvhh2h8fu3WgNpUsMDR+QkSWpfW8ad0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-04  6:16                                                       ` Josef Johansson
       [not found]                                                         ` <CAOnYue-T061jkAvpe3cwH7Et4xXY_dkW73KyKcbwiefgzgs8cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-10-04 13:48                                                       ` Sage Weil
  1 sibling, 1 reply; 45+ messages in thread
From: Josef Johansson @ 2015-10-04  6:16 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Sage Weil, ceph-devel, ceph-users-idqoXFIVOFJgJs9I8MT0rw


[-- Attachment #1.1: Type: text/plain, Size: 25230 bytes --]

Hi,

I don't know what brand those 4TB spindles are, but I know that mine
are very bad at doing writes at the same time as reads, especially
small mixed reads and writes.

This has an absurdly bad effect when doing maintenance on ceph. That
being said, we see a big difference in performance between dumpling
and hammer on these drives, most likely because hammer is able to
read/write degraded PGs.

We have run into two different problems along the way. The first was
blocked requests, where we had to upgrade from 64GB of memory on each
node to 256GB. We thought that it was the only safe buy to make things
better.

I believe it worked because more reads were cached, so we had less
mixed read/write on the nodes, giving the spindles more room to
breathe. It was a shot in the dark at the time, but the price is not
that high even just to try it out, compared to 6 people working on it.
I believe the I/O on disk was not huge either, but what kills the
disks is high latency. How much bandwidth are the disks using? Ours
was very low, 3-5MB/s.
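
(A quick way to watch that, if you are not already, is per-disk await
and utilization from iostat:)

# per-disk latency (await) and utilization, 5 second samples
iostat -x 5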

The second problem was fragmentation hitting 70%; lowering that to 6%
made a lot of difference. Depending on the I/O pattern it grows at
different rates.
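
Assuming the OSD filesystems are XFS, checking and fixing it looks
roughly like this (device and mount point are placeholders):

# report the fragmentation factor (read-only)
xfs_db -c frag -r /dev/sdX
# online defragmentation of a mounted filesystem
xfs_fsr /var/lib/ceph/osd/ceph-NN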

TL;DR: reads kill the 4TB spindles.

Hope you guys get out of the woods soon.
/Josef
On 3 Oct 2015 10:10 pm, "Robert LeBlanc" <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> We are still struggling with this and have tried a lot of different
> things. Unfortunately, Inktank (now Red Hat) no longer provides
> consulting services for non-Red Hat systems. If there are some
> certified Ceph consultants in the US that we can do both remote and
> on-site engagements, please let us know.
>
> This certainly seems to be network related, but somewhere in the
> kernel. We have tried increasing the network and TCP buffers, number
> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
> on the boxes, the disks are busy, but not constantly at 100% (they
> cycle from <10% up to 100%, but not 100% for more than a few seconds
> at a time). There seems to be no reasonable explanation why I/O is
> blocked pretty frequently longer than 30 seconds. We have verified
> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> network admins have verified that packets are not being dropped in the
> switches for these nodes. We have tried different kernels including
> the recent Google patch to cubic. This is showing up on three cluster
> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> (from CentOS 7.1) with similar results.
>
> The messages seem slightly different:
> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> 100.087155 secs
> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> cluster [WRN] slow request 30.041999 seconds old, received at
> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> points reached
>
> I don't know what "no flag points reached" means.
>
> The problem is most pronounced when we have to reboot an OSD node (1
> of 13), we will have hundreds of I/O blocked for some times up to 300
> seconds. It takes a good 15 minutes for things to settle down. The
> production cluster is very busy doing normally 8,000 I/O and peaking
> at 15,000. This is all 4TB spindles with SSD journals and the disks
> are between 25-50% full. We are currently splitting PGs to distribute
> the load better across the disks, but we are having to do this 10 PGs
> at a time as we get blocked I/O. We have max_backfills and
> max_recovery set to 1, client op priority is set higher than recovery
> priority. We tried increasing the number of op threads but this didn't
> seem to help. It seems as soon as PGs are finished being checked, they
> become active and could be the cause for slow I/O while the other PGs
> are being checked.
>
> What I don't understand is that the messages are delayed. As soon as
> the message is received by Ceph OSD process, it is very quickly
> committed to the journal and a response is sent back to the primary
> OSD which is received very quickly as well. I've adjust
> min_free_kbytes and it seems to keep the OSDs from crashing, but
> doesn't solve the main problem. We don't have swap and there is 64 GB
> of RAM per nodes for 10 OSDs.
>
> Is there something that could cause the kernel to get a packet but not
> be able to dispatch it to Ceph such that it could be explaining why we
> are seeing these blocked I/O for 30+ seconds. Is there some pointers
> to tracing Ceph messages from the network buffer through the kernel to
> the Ceph process?
>
> We can really use some pointers no matter how outrageous. We've have
> over 6 people looking into this for weeks now and just can't think of
> anything else.
>
> Thanks,
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
> l7OF
> =OI++
> -----END PGP SIGNATURE-----
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org>
> wrote:
> > We dropped the replication on our cluster from 4 to 3 and it looks
> > like all the blocked I/O has stopped (no entries in the log for the
> > last 12 hours). This makes me believe that there is some issue with
> > the number of sockets or some other TCP issue. We have not messed with
> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> > processes and 16K system wide.
> >
> > Does this seem like the right spot to be looking? What are some
> > configuration items we should be looking at?
> >
> > Thanks,
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org>
> wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
> >> seems that there were some major reworks in the network handling in
> >> the kernel to efficiently handle that network rate. If I remember
> >> right we also saw a drop in CPU utilization. I'm starting to think
> >> that we did see packet loss while congesting our ISLs in our initial
> >> testing, but we could not tell where the dropping was happening. We
> >> saw some on the switches, but it didn't seem to be bad if we weren't
> >> trying to congest things. We probably already saw this issue, just
> >> didn't know it.
> >> - ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> >>> FWIW, we've got some 40GbE Intel cards in the community performance
> cluster
> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running
> fine
> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that
> older
> >>> drivers might cause problems though.
> >>>
> >>> Here's ifconfig from one of the nodes:
> >>>
> >>> ens513f1: flags=4163  mtu 1500
> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
> >>>         RX errors 0  dropped 0  overruns 0  frame 0
> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >>>
> >>> Mark
> >>>
> >>>
> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >>>>
> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>> Hash: SHA256
> >>>>
> >>>> OK, here is the update on the saga...
> >>>>
> >>>> I traced some more of blocked I/Os and it seems that communication
> >>>> between two hosts seemed worse than others. I did a two way ping flood
> >>>> between the two hosts using max packet sizes (1500). After 1.5M
> >>>> packets, no lost pings. Then then had the ping flood running while I
> >>>> put Ceph load on the cluster and the dropped pings started increasing
> >>>> after stopping the Ceph workload the pings stopped dropping.
> >>>>
> >>>> I then ran iperf between all the nodes with the same results, so that
> >>>> ruled out Ceph to a large degree. I then booted in the the
> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
> >>>> need the network enhancements in the 4.x series to work well.
> >>>>
> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
> >>>> kernel to see where this issue in introduced. Both of the clusters
> >>>> with this issue are running 4.x, other than that, they are pretty
> >>>> differing hardware and network configs.
> >>>>
> >>>> Thanks,
> >>>> -----BEGIN PGP SIGNATURE-----
> >>>> Version: Mailvelope v1.1.0
> >>>> Comment: https://www.mailvelope.com
> >>>>
> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
> >>>> 4OEo
> >>>> =P33I
> >>>> -----END PGP SIGNATURE-----
> >>>> ----------------
> >>>> Robert LeBlanc
> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>
> >>>>
> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> >>>> wrote:
> >>>>>
> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>> Hash: SHA256
> >>>>>
> >>>>> This is IPoIB and we have the MTU set to 64K. There was some issues
> >>>>> pinging hosts with "No buffer space available" (hosts are currently
> >>>>> configured for 4GB to test SSD caching rather than page cache). I
> >>>>> found that MTU under 32K worked reliable for ping, but still had the
> >>>>> blocked I/O.
> >>>>>
> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still
> seeing
> >>>>> the blocked I/O.
> >>>>> - ----------------
> >>>>> Robert LeBlanc
> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>
> >>>>>
> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> >>>>>>
> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >>>>>>>
> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
> >>>>>>> between when osd.17 started sending the osd_repop message and when
> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> >>>>>>> once see a kernel issue which caused some messages to be
> mysteriously
> >>>>>>> delayed for many 10s of seconds?
> >>>>>>
> >>>>>>
> >>>>>> Every time we have seen this behavior and diagnosed it in the wild
> it
> >>>>>> has
> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
> >>>>>>
> >>>>>> sage
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> What kernel are you running?
> >>>>>>> -Sam
> >>>>>>>
> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> >>>>>>>>
> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>>> Hash: SHA256
> >>>>>>>>
> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
> >>>>>>>> extracted what I think are important entries from the logs for the
> >>>>>>>> first blocked request. NTP is running all the servers so the logs
> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
> >>>>>>>> available at
> http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >>>>>>>>
> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from
> osd.16
> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk
> result=0
> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150
> sec
> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from
> osd.13
> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk
> result=0
> >>>>>>>>
> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right
> away,
> >>>>>>>> but for some reason osd.13 doesn't get the message until 53
> seconds
> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual
> data
> >>>>>>>> transfer).
> >>>>>>>>
> >>>>>>>> It looks like osd.17 is receiving responses to start the
> communication
> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
> >>>>>>>> later. To me it seems that the message is getting received but not
> >>>>>>>> passed to another thread right away or something. This test was
> done
> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a
> single
> >>>>>>>> thread.
> >>>>>>>>
> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can
> use
> >>>>>>>> some help.
> >>>>>>>>
> >>>>>>>> Single Test started about
> >>>>>>>> 2015-09-22 12:52:36
> >>>>>>>>
> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked
> for >
> >>>>>>>> 30.439150 secs
> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
> >>>>>>>> 2015-09-22 12:55:06.487451:
> >>>>>>>>   osd_op(client.250874.0:1388
> rbd_data.3380e2ae8944a.0000000000000545
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,16
> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 :
> cluster
> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> >>>>>>>> 30.379680 secs
> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 :
> cluster
> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> >>>>>>>> 12:55:06.406303:
> >>>>>>>>   osd_op(client.250874.0:1384
> rbd_data.3380e2ae8944a.0000000000000541
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,17
> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 :
> cluster
> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> >>>>>>>> 12:55:06.318144:
> >>>>>>>>   osd_op(client.250874.0:1382
> rbd_data.3380e2ae8944a.000000000000053f
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,14
> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked
> for >
> >>>>>>>> 30.954212 secs
> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
> >>>>>>>> 2015-09-22 12:57:33.044003:
> >>>>>>>>   osd_op(client.250874.0:1873
> rbd_data.3380e2ae8944a.000000000000070d
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 16,17
> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked
> for >
> >>>>>>>> 30.704367 secs
> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
> >>>>>>>> 2015-09-22 12:57:33.055404:
> >>>>>>>>   osd_op(client.250874.0:1874
> rbd_data.3380e2ae8944a.000000000000070e
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,17
> >>>>>>>>
> >>>>>>>> Server   IP addr              OSD
> >>>>>>>> nodev  - 192.168.55.11 - 12
> >>>>>>>> nodew  - 192.168.55.12 - 13
> >>>>>>>> nodex  - 192.168.55.13 - 16
> >>>>>>>> nodey  - 192.168.55.14 - 17
> >>>>>>>> nodez  - 192.168.55.15 - 14
> >>>>>>>> nodezz - 192.168.55.16 - 15
> >>>>>>>>
> >>>>>>>> fio job:
> >>>>>>>> [rbd-test]
> >>>>>>>> readwrite=write
> >>>>>>>> blocksize=4M
> >>>>>>>> #runtime=60
> >>>>>>>> name=rbd-test
> >>>>>>>> #readwrite=randwrite
> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> >>>>>>>> #rwmixread=72
> >>>>>>>> #norandommap
> >>>>>>>> #size=1T
> >>>>>>>> #blocksize=4k
> >>>>>>>> ioengine=rbd
> >>>>>>>> rbdname=test2
> >>>>>>>> pool=rbd
> >>>>>>>> clientname=admin
> >>>>>>>> iodepth=8
> >>>>>>>> #numjobs=4
> >>>>>>>> #thread
> >>>>>>>> #group_reporting
> >>>>>>>> #time_based
> >>>>>>>> #direct=1
> >>>>>>>> #ramp_time=60
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -----BEGIN PGP SIGNATURE-----
> >>>>>>>> Version: Mailvelope v1.1.0
> >>>>>>>> Comment: https://www.mailvelope.com
> >>>>>>>>
> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
> >>>>>>>> J3hS
> >>>>>>>> =0J7F
> >>>>>>>> -----END PGP SIGNATURE-----
> >>>>>>>> ----------------
> >>>>>>>> Robert LeBlanc
> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
> >>>>>>>>>
> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
> >>>>>>>>>>
> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>>>>> Hash: SHA256
> >>>>>>>>>>
> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> You can search for the (mangled) name _split_collection
> >>>>>>>>>>
> >>>>>>>>>> I'm not
> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some
> way to
> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing
> so?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Bump up the split and merge thresholds. You can search the list
> for
> >>>>>>>>> this, it was discussed not too long ago.
> >>>>>>>>>
> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the
> sessions
> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
> >>>>>>>>>>
> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting
> from
> >>>>>>>>>> the
> >>>>>>>>>> beginning) is still causing the issue. I was suspect that it is
> not
> >>>>>>>>>> having to create new file and therefore split collections. This
> is
> >>>>>>>>>> on
> >>>>>>>>>> my test cluster with no other load.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hmm, that does make it seem less likely if you're really not
> creating
> >>>>>>>>> new objects, if you're actually running fio in such a way that
> it's
> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and
> depths
> >>>>>>>>>> would be the most helpful for tracking this issue down?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
> >>>>>>>>> 20",
> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should
> spit
> >>>>>>>>> out
> >>>>>>>>> everything you need to track exactly what each Op is doing.
> >>>>>>>>> -Greg
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe
> ceph-devel"
> >>>>>>>> in
> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>>>>>>> More majordomo info at
> http://vger.kernel.org/majordomo-info.html
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>> -----BEGIN PGP SIGNATURE-----
> >>>>> Version: Mailvelope v1.1.0
> >>>>> Comment: https://www.mailvelope.com
> >>>>>
> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
> >>>>> gcZm
> >>>>> =CjwB
> >>>>> -----END PGP SIGNATURE-----
> >>>>
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in
> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>
> >>>
> >>
> >> -----BEGIN PGP SIGNATURE-----
> >> Version: Mailvelope v1.1.0
> >> Comment: https://www.mailvelope.com
> >>
> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
> >> ae22
> >> =AX+L
> >> -----END PGP SIGNATURE-----
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

[-- Attachment #1.2: Type: text/html, Size: 36493 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                     ` <CAANLjFoafHQ1X8U7LUrvhh2h8fu3WgNpUsMDR+QkSWpfW8ad0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-10-04  6:16                                                       ` Josef Johansson
@ 2015-10-04 13:48                                                       ` Sage Weil
       [not found]                                                         ` <alpine.DEB.2.00.1510040646020.5233-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  1 sibling, 1 reply; 45+ messages in thread
From: Sage Weil @ 2015-10-04 13:48 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

On Sat, 3 Oct 2015, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> We are still struggling with this and have tried a lot of different
> things. Unfortunately, Inktank (now Red Hat) no longer provides
> consulting services for non-Red Hat systems. If there are some
> certified Ceph consultants in the US that we can do both remote and
> on-site engagements, please let us know.
> 
> This certainly seems to be network related, but somewhere in the
> kernel. We have tried increasing the network and TCP buffers, number
> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
> on the boxes, the disks are busy, but not constantly at 100% (they
> cycle from <10% up to 100%, but not 100% for more than a few seconds
> at a time). There seems to be no reasonable explanation why I/O is
> blocked pretty frequently longer than 30 seconds. We have verified
> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> network admins have verified that packets are not being dropped in the
> switches for these nodes. We have tried different kernels including
> the recent Google patch to cubic. This is showing up on three cluster
> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> (from CentOS 7.1) with similar results.
> 
> The messages seem slightly different:
> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> 100.087155 secs
> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> cluster [WRN] slow request 30.041999 seconds old, received at
> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> points reached
> 
> I don't know what "no flag points reached" means.

Just that the op hasn't been marked as reaching any interesting points 
(op->mark_*() calls).

Is it possible to gather a log with debug ms = 20 and debug osd = 20?  
It's extremely verbose but it'll let us see where the op is getting 
blocked.  If you see the "slow request" message it means the op is 
received by ceph (that's when the clock starts), so I suspect it's not 
something we can blame on the network stack.
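
Something like this should let you turn it up and back down at runtime
without restarting the OSDs (the "after" values are just a quieter
baseline, adjust as you like):

ceph tell osd.* injectargs '--debug_ms 20 --debug_osd 20'
# reproduce the slow request, grab the logs, then quiet it down
ceph tell osd.* injectargs '--debug_ms 1 --debug_osd 5'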

sage


> 
> The problem is most pronounced when we have to reboot an OSD node (1
> of 13), we will have hundreds of I/O blocked for some times up to 300
> seconds. It takes a good 15 minutes for things to settle down. The
> production cluster is very busy doing normally 8,000 I/O and peaking
> at 15,000. This is all 4TB spindles with SSD journals and the disks
> are between 25-50% full. We are currently splitting PGs to distribute
> the load better across the disks, but we are having to do this 10 PGs
> at a time as we get blocked I/O. We have max_backfills and
> max_recovery set to 1, client op priority is set higher than recovery
> priority. We tried increasing the number of op threads but this didn't
> seem to help. It seems as soon as PGs are finished being checked, they
> become active and could be the cause for slow I/O while the other PGs
> are being checked.
> 
> What I don't understand is that the messages are delayed. As soon as
> the message is received by Ceph OSD process, it is very quickly
> committed to the journal and a response is sent back to the primary
> OSD which is received very quickly as well. I've adjust
> min_free_kbytes and it seems to keep the OSDs from crashing, but
> doesn't solve the main problem. We don't have swap and there is 64 GB
> of RAM per nodes for 10 OSDs.
> 
> Is there something that could cause the kernel to get a packet but not
> be able to dispatch it to Ceph such that it could be explaining why we
> are seeing these blocked I/O for 30+ seconds. Is there some pointers
> to tracing Ceph messages from the network buffer through the kernel to
> the Ceph process?
> 
> We can really use some pointers no matter how outrageous. We've have
> over 6 people looking into this for weeks now and just can't think of
> anything else.
> 
> Thanks,
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
> l7OF
> =OI++
> -----END PGP SIGNATURE-----
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> > We dropped the replication on our cluster from 4 to 3 and it looks
> > like all the blocked I/O has stopped (no entries in the log for the
> > last 12 hours). This makes me believe that there is some issue with
> > the number of sockets or some other TCP issue. We have not messed with
> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> > processes and 16K system wide.
> >
> > Does this seem like the right spot to be looking? What are some
> > configuration items we should be looking at?
> >
> > Thanks,
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
> >> seems that there were some major reworks in the network handling in
> >> the kernel to efficiently handle that network rate. If I remember
> >> right we also saw a drop in CPU utilization. I'm starting to think
> >> that we did see packet loss while congesting our ISLs in our initial
> >> testing, but we could not tell where the dropping was happening. We
> >> saw some on the switches, but it didn't seem to be bad if we weren't
> >> trying to congest things. We probably already saw this issue, just
> >> didn't know it.
> >> - ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
> >>> drivers might cause problems though.
> >>>
> >>> Here's ifconfig from one of the nodes:
> >>>
> >>> ens513f1: flags=4163  mtu 1500
> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
> >>>         RX errors 0  dropped 0  overruns 0  frame 0
> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >>>
> >>> Mark
> >>>
> >>>
> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >>>>
> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>> Hash: SHA256
> >>>>
> >>>> OK, here is the update on the saga...
> >>>>
> >>>> I traced some more of blocked I/Os and it seems that communication
> >>>> between two hosts seemed worse than others. I did a two way ping flood
> >>>> between the two hosts using max packet sizes (1500). After 1.5M
> >>>> packets, no lost pings. Then then had the ping flood running while I
> >>>> put Ceph load on the cluster and the dropped pings started increasing
> >>>> after stopping the Ceph workload the pings stopped dropping.
> >>>>
> >>>> I then ran iperf between all the nodes with the same results, so that
> >>>> ruled out Ceph to a large degree. I then booted in the the
> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
> >>>> need the network enhancements in the 4.x series to work well.
> >>>>
> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
> >>>> kernel to see where this issue in introduced. Both of the clusters
> >>>> with this issue are running 4.x, other than that, they are pretty
> >>>> differing hardware and network configs.
> >>>>
> >>>> Thanks,
> >>>> -----BEGIN PGP SIGNATURE-----
> >>>> Version: Mailvelope v1.1.0
> >>>> Comment: https://www.mailvelope.com
> >>>>
> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
> >>>> 4OEo
> >>>> =P33I
> >>>> -----END PGP SIGNATURE-----
> >>>> ----------------
> >>>> Robert LeBlanc
> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>
> >>>>
> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> >>>> wrote:
> >>>>>
> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>> Hash: SHA256
> >>>>>
> >>>>> This is IPoIB and we have the MTU set to 64K. There was some issues
> >>>>> pinging hosts with "No buffer space available" (hosts are currently
> >>>>> configured for 4GB to test SSD caching rather than page cache). I
> >>>>> found that MTU under 32K worked reliable for ping, but still had the
> >>>>> blocked I/O.
> >>>>>
> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
> >>>>> the blocked I/O.
> >>>>> - ----------------
> >>>>> Robert LeBlanc
> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>
> >>>>>
> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> >>>>>>
> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >>>>>>>
> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
> >>>>>>> between when osd.17 started sending the osd_repop message and when
> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
> >>>>>>> delayed for many 10s of seconds?
> >>>>>>
> >>>>>>
> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
> >>>>>> has
> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
> >>>>>>
> >>>>>> sage
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> What kernel are you running?
> >>>>>>> -Sam
> >>>>>>>
> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> >>>>>>>>
> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>>> Hash: SHA256
> >>>>>>>>
> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
> >>>>>>>> extracted what I think are important entries from the logs for the
> >>>>>>>> first blocked request. NTP is running all the servers so the logs
> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >>>>>>>>
> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
> >>>>>>>>
> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
> >>>>>>>> transfer).
> >>>>>>>>
> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
> >>>>>>>> later. To me it seems that the message is getting received but not
> >>>>>>>> passed to another thread right away or something. This test was done
> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
> >>>>>>>> thread.
> >>>>>>>>
> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
> >>>>>>>> some help.
> >>>>>>>>
> >>>>>>>> Single Test started about
> >>>>>>>> 2015-09-22 12:52:36
> >>>>>>>>
> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.439150 secs
> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
> >>>>>>>> 2015-09-22 12:55:06.487451:
> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,16
> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> >>>>>>>> 30.379680 secs
> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> >>>>>>>> 12:55:06.406303:
> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,17
> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> >>>>>>>> 12:55:06.318144:
> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,14
> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.954212 secs
> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
> >>>>>>>> 2015-09-22 12:57:33.044003:
> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 16,17
> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.704367 secs
> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
> >>>>>>>> 2015-09-22 12:57:33.055404:
> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,17
> >>>>>>>>
> >>>>>>>> Server   IP addr              OSD
> >>>>>>>> nodev  - 192.168.55.11 - 12
> >>>>>>>> nodew  - 192.168.55.12 - 13
> >>>>>>>> nodex  - 192.168.55.13 - 16
> >>>>>>>> nodey  - 192.168.55.14 - 17
> >>>>>>>> nodez  - 192.168.55.15 - 14
> >>>>>>>> nodezz - 192.168.55.16 - 15
> >>>>>>>>
> >>>>>>>> fio job:
> >>>>>>>> [rbd-test]
> >>>>>>>> readwrite=write
> >>>>>>>> blocksize=4M
> >>>>>>>> #runtime=60
> >>>>>>>> name=rbd-test
> >>>>>>>> #readwrite=randwrite
> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> >>>>>>>> #rwmixread=72
> >>>>>>>> #norandommap
> >>>>>>>> #size=1T
> >>>>>>>> #blocksize=4k
> >>>>>>>> ioengine=rbd
> >>>>>>>> rbdname=test2
> >>>>>>>> pool=rbd
> >>>>>>>> clientname=admin
> >>>>>>>> iodepth=8
> >>>>>>>> #numjobs=4
> >>>>>>>> #thread
> >>>>>>>> #group_reporting
> >>>>>>>> #time_based
> >>>>>>>> #direct=1
> >>>>>>>> #ramp_time=60
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -----BEGIN PGP SIGNATURE-----
> >>>>>>>> Version: Mailvelope v1.1.0
> >>>>>>>> Comment: https://www.mailvelope.com
> >>>>>>>>
> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
> >>>>>>>> J3hS
> >>>>>>>> =0J7F
> >>>>>>>> -----END PGP SIGNATURE-----
> >>>>>>>> ----------------
> >>>>>>>> Robert LeBlanc
> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
> >>>>>>>>>
> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
> >>>>>>>>>>
> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>>>>> Hash: SHA256
> >>>>>>>>>>
> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> You can search for the (mangled) name _split_collection
> >>>>>>>>>>
> >>>>>>>>>> I'm not
> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
> >>>>>>>>> this, it was discussed not too long ago.
> >>>>>>>>>
> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
> >>>>>>>>>>
> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
> >>>>>>>>>> the
> >>>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
> >>>>>>>>>> having to create new file and therefore split collections. This is
> >>>>>>>>>> on
> >>>>>>>>>> my test cluster with no other load.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
> >>>>>>>>>> would be the most helpful for tracking this issue down?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
> >>>>>>>>> 20",
> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
> >>>>>>>>> out
> >>>>>>>>> everything you need to track exactly what each Op is doing.
> >>>>>>>>> -Greg
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>>>>>> in
> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>> -----BEGIN PGP SIGNATURE-----
> >>>>> Version: Mailvelope v1.1.0
> >>>>> Comment: https://www.mailvelope.com
> >>>>>
> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
> >>>>> gcZm
> >>>>> =CjwB
> >>>>> -----END PGP SIGNATURE-----
> >>>>
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>
> >>>
> >>
> >> -----BEGIN PGP SIGNATURE-----
> >> Version: Mailvelope v1.1.0
> >> Comment: https://www.mailvelope.com
> >>
> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
> >> ae22
> >> =AX+L
> >> -----END PGP SIGNATURE-----
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                         ` <CAOnYue-T061jkAvpe3cwH7Et4xXY_dkW73KyKcbwiefgzgs8cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-04 15:08                                                           ` Alex Gorbachev
  2015-10-04 16:13                                                           ` Robert LeBlanc
  1 sibling, 0 replies; 45+ messages in thread
From: Alex Gorbachev @ 2015-10-04 15:08 UTC (permalink / raw)
  To: Josef Johansson; +Cc: Sage Weil, ceph-devel, ceph-users-idqoXFIVOFJgJs9I8MT0rw


[-- Attachment #1.1: Type: text/plain, Size: 27388 bytes --]

We had multiple issues with 4TB drives and delays.  Here is the
configuration that works fairly well for us on Ubuntu (but we are about to
significantly increase the I/O load, so this may change).

NTP: always use NTP and make sure it is actually in sync - Ceph is very
sensitive to clock skew
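A quick sanity check on each node (assuming classic ntpd; use the chrony
equivalents if that is what you run, and the hostnames below are just
placeholders):

# offset and jitter should be down in the low milliseconds
ntpq -pn
# or scripted across the cluster
for h in ceph1 ceph2 ceph3; do ssh $h 'ntpq -pn | tail -n +3'; done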

/etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="elevator=noop nomodeset splash=silent
vga=normal net.ifnames=0 biosdevname=0 scsi_mod.use_blk_mq=Y"

blk_mq really helps with spreading the IO load over multiple cores.
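Rough sketch of applying and verifying this on Ubuntu (sda is just an
example device):

update-grub && reboot
# confirm the SCSI layer is using blk-mq after the reboot
cat /sys/module/scsi_mod/parameters/use_blk_mq
# with blk-mq on these kernels the classic elevators disappear,
# so this typically reports "none"
cat /sys/block/sda/queue/scheduler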

I used to use intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll, but
it seems allowing idle states can actually improve performance by letting
the CPUs run cooler, so I will likely remove this soon.

chmod -x /etc/init.d/ondemand - in order to prevent CPU throttling
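To confirm the cores are actually staying in the performance governor
(these are the standard cpufreq sysfs paths):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# force it on all cores if needed (as root)
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > $g
done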

use Mellanox OFED on pre-4.x kernels

check your flow control settings on server and switch using ethtool
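For example (eth2 is a placeholder for your cluster-facing interface;
whether pause frames should be on or off depends on what the switch side
is doing):

ethtool -a eth2                    # show current flow control settings
ethtool -A eth2 rx on tx on        # enable RX/TX pause frames
ethtool -S eth2 | grep -i pause    # pause counters (names vary by driver)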

test network performance with iperf
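Something like this between every pair of OSD hosts (iperf2 syntax; the
hostname is a placeholder), then swap roles to test the other direction:

# on the receiving node
iperf -s
# on the sending node: 4 parallel streams for 30 seconds
iperf -c ceph1 -P 4 -t 30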

disable firewall rules or just uninstall firewall (e.g. ufw)
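On Ubuntu that is simply:

ufw status verbose
ufw disable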

In the BIOS, turn off any virtualization technology (VT-d etc.) and (see the
note above re C-states) maybe also disable power saving features

/etc/sysctl.conf:
kernel.pid_max = 4194303
vm.swappiness=1
vm.min_free_kbytes=1048576
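These apply without a reboot once the file is loaded:

sysctl -p /etc/sysctl.conf
# verify the values took effect
sysctl kernel.pid_max vm.swappiness vm.min_free_kbytes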

Hope this helps.

Alex


On Sun, Oct 4, 2015 at 2:16 AM, Josef Johansson <josef86-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> Hi,
>
> I don't know what brand those 4TB spindles are, but I know that mine are
> very bad at doing writes at the same time as reads, especially small
> mixed reads and writes.
>
> This has an absurdly bad effect when doing maintenance on Ceph. That being
> said, we see a big difference in performance between Dumpling and Hammer
> on these drives, most likely due to Hammer being able to read/write
> degraded PGs.
>
> We have run into two different problems along the way. The first was
> blocked requests, where we had to upgrade from 64GB of memory on each node
> to 256GB. We thought that it was the only safe buy that would make things
> better.
>
> I believe it worked because more reads were cached, so we had less mixed
> read/write on the nodes, thus giving the spindles more room to breathe. It
> was a shot in the dark at the time, but the price is not that high even to
> just try it out... compared to 6 people working on it. I believe the I/O
> on disk was not huge either, but what kills the disks is high latency. How
> much bandwidth are the disks using? Ours was very low: 3-5MB/s.
>
> The second problem was fragmentation hitting 70%; lowering that to 6%
> made a lot of difference. Depending on the I/O pattern, this grows at
> different rates.
>
> TL;DR: reads kill the 4TB spindles.
>
> Hope you guys get out of the woods soon.
> /Josef
> On 3 Oct 2015 10:10 pm, "Robert LeBlanc" <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> We are still struggling with this and have tried a lot of different
>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>> consulting services for non-Red Hat systems. If there are any
>> certified Ceph consultants in the US who can do both remote and
>> on-site engagements, please let us know.
>>
>> This certainly seems to be network related, but somewhere in the
>> kernel. We have tried increasing the network and TCP buffers and the
>> number of TCP sockets, and reducing the FIN_WAIT2 timeout. There is
>> about 25% idle on the boxes; the disks are busy, but not constantly at
>> 100% (they cycle from <10% up to 100%, but not 100% for more than a
>> few seconds at a time). There seems to be no reasonable explanation
>> for why I/O is blocked for longer than 30 seconds fairly frequently.
>> We have verified jumbo frames by pinging from/to each node with
>> 9000-byte packets. The network admins have verified that packets are
>> not being dropped in the switches for these nodes. We have tried
>> different kernels, including the recent Google patch to CUBIC. This is
>> showing up on three clusters (two Ethernet and one IPoIB). I booted
>> one cluster into Debian Jessie (from CentOS 7.1) with similar results.
>>
>> The messages seem slightly different:
>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>> 100.087155 secs
>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>> cluster [WRN] slow request 30.041999 seconds old, received at
>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>> points reached
>>
>> I don't know what "no flag points reached" means.
>>
>> The problem is most pronounced when we have to reboot an OSD node (1
>> of 13): we will have hundreds of I/Os blocked, sometimes for up to 300
>> seconds. It takes a good 15 minutes for things to settle down. The
>> production cluster is very busy doing normally 8,000 I/O and peaking
>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>> are between 25-50% full. We are currently splitting PGs to distribute
>> the load better across the disks, but we are having to do this 10 PGs
>> at a time as we get blocked I/O. We have max_backfills and
>> max_recovery set to 1, client op priority is set higher than recovery
>> priority. We tried increasing the number of op threads but this didn't
>> seem to help. It seems as soon as PGs are finished being checked, they
>> become active and could be the cause for slow I/O while the other PGs
>> are being checked.
>>
>> What I don't understand is that the messages are delayed. As soon as
>> the message is received by the Ceph OSD process, it is very quickly
>> committed to the journal and a response is sent back to the primary
>> OSD, which is received very quickly as well. I've adjusted
>> min_free_kbytes and it seems to keep the OSDs from crashing, but it
>> doesn't solve the main problem. We don't have swap and there is 64 GB
>> of RAM per node for 10 OSDs.
>>
>> Is there something that could cause the kernel to receive a packet but
>> not be able to dispatch it to Ceph, which would explain why we are
>> seeing this blocked I/O for 30+ seconds? Are there any pointers on
>> tracing Ceph messages from the network buffer through the kernel to
>> the Ceph process?
>>
>> We could really use some pointers, no matter how outrageous. We've had
>> over six people looking into this for weeks now and we just can't
>> think of anything else.
>>
>> Thanks,
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.1.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>> l7OF
>> =OI++
>> -----END PGP SIGNATURE-----
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org>
>> wrote:
>> > We dropped the replication on our cluster from 4 to 3 and it looks
>> > like all the blocked I/O has stopped (no entries in the log for the
>> > last 12 hours). This makes me believe that there is some issue with
>> > the number of sockets or some other TCP issue. We have not messed with
>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>> > processes and 16K system wide.
>> >
>> > Does this seem like the right spot to be looking? What are some
>> > configuration items we should be looking at?
>> >
>> > Thanks,
>> > ----------------
>> > Robert LeBlanc
>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >
>> >
>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org>
>> wrote:
>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >> Hash: SHA256
>> >>
>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>> >> seems that there were some major reworks in the network handling in
>> >> the kernel to efficiently handle that network rate. If I remember
>> >> right we also saw a drop in CPU utilization. I'm starting to think
>> >> that we did see packet loss while congesting our ISLs in our initial
>> >> testing, but we could not tell where the dropping was happening. We
>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>> >> trying to congest things. We probably already saw this issue, just
>> >> didn't know it.
>> >> - ----------------
>> >> Robert LeBlanc
>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>
>> >>
>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>> >>> FWIW, we've got some 40GbE Intel cards in the community performance
>> cluster
>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running
>> fine
>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that
>> older
>> >>> drivers might cause problems though.
>> >>>
>> >>> Here's ifconfig from one of the nodes:
>> >>>
>> >>> ens513f1: flags=4163  mtu 1500
>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>> >>>
>> >>> Mark
>> >>>
>> >>>
>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>> >>>>
>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>> Hash: SHA256
>> >>>>
>> >>>> OK, here is the update on the saga...
>> >>>>
>> >>>> I traced some more of blocked I/Os and it seems that communication
>> >>>> between two hosts seemed worse than others. I did a two way ping
>> flood
>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>> >>>> packets, no lost pings. Then then had the ping flood running while I
>> >>>> put Ceph load on the cluster and the dropped pings started increasing
>> >>>> after stopping the Ceph workload the pings stopped dropping.
>> >>>>
>> >>>> I then ran iperf between all the nodes with the same results, so that
>> >>>> ruled out Ceph to a large degree. I then booted in the the
>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>> >>>> need the network enhancements in the 4.x series to work well.
>> >>>>
>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>> >>>> kernel to see where this issue in introduced. Both of the clusters
>> >>>> with this issue are running 4.x, other than that, they are pretty
>> >>>> differing hardware and network configs.
>> >>>>
>> >>>> Thanks,
>> >>>> -----BEGIN PGP SIGNATURE-----
>> >>>> Version: Mailvelope v1.1.0
>> >>>> Comment: https://www.mailvelope.com
>> >>>>
>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>> >>>> 4OEo
>> >>>> =P33I
>> >>>> -----END PGP SIGNATURE-----
>> >>>> ----------------
>> >>>> Robert LeBlanc
>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>
>> >>>>
>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>> >>>> wrote:
>> >>>>>
>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>> Hash: SHA256
>> >>>>>
>> >>>>> This is IPoIB and we have the MTU set to 64K. There was some issues
>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>> >>>>> found that MTU under 32K worked reliable for ping, but still had the
>> >>>>> blocked I/O.
>> >>>>>
>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still
>> seeing
>> >>>>> the blocked I/O.
>> >>>>> - ----------------
>> >>>>> Robert LeBlanc
>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>> >>>>>>
>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>> >>>>>>>
>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>> >>>>>>> once see a kernel issue which caused some messages to be
>> mysteriously
>> >>>>>>> delayed for many 10s of seconds?
>> >>>>>>
>> >>>>>>
>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild
>> it
>> >>>>>> has
>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>> >>>>>>
>> >>>>>> sage
>> >>>>>>
>> >>>>>>
>> >>>>>>>
>> >>>>>>> What kernel are you running?
>> >>>>>>> -Sam
>> >>>>>>>
>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>> >>>>>>>>
>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>>> Hash: SHA256
>> >>>>>>>>
>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes.
>> I've
>> >>>>>>>> extracted what I think are important entries from the logs for
>> the
>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>> >>>>>>>> available at
>> http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>> >>>>>>>>
>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from
>> osd.16
>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk
>> result=0
>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150
>> sec
>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from
>> osd.13
>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk
>> result=0
>> >>>>>>>>
>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13
>> and
>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right
>> away,
>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53
>> seconds
>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the
>> data
>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the
>> actual data
>> >>>>>>>> transfer).
>> >>>>>>>>
>> >>>>>>>> It looks like osd.17 is receiving responses to start the
>> communication
>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>> >>>>>>>> later. To me it seems that the message is getting received but
>> not
>> >>>>>>>> passed to another thread right away or something. This test was
>> done
>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a
>> single
>> >>>>>>>> thread.
>> >>>>>>>>
>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can
>> use
>> >>>>>>>> some help.
>> >>>>>>>>
>> >>>>>>>> Single Test started about
>> >>>>>>>> 2015-09-22 12:52:36
>> >>>>>>>>
>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked
>> for >
>> >>>>>>>> 30.439150 secs
>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>> >>>>>>>> 2015-09-22 12:55:06.487451:
>> >>>>>>>>   osd_op(client.250874.0:1388
>> rbd_data.3380e2ae8944a.0000000000000545
>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected
>> e56785)
>> >>>>>>>>   currently waiting for subops from 13,16
>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 :
>> cluster
>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>> >>>>>>>> 30.379680 secs
>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 :
>> cluster
>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>> >>>>>>>> 12:55:06.406303:
>> >>>>>>>>   osd_op(client.250874.0:1384
>> rbd_data.3380e2ae8944a.0000000000000541
>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected
>> e56785)
>> >>>>>>>>   currently waiting for subops from 13,17
>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 :
>> cluster
>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>> >>>>>>>> 12:55:06.318144:
>> >>>>>>>>   osd_op(client.250874.0:1382
>> rbd_data.3380e2ae8944a.000000000000053f
>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected
>> e56785)
>> >>>>>>>>   currently waiting for subops from 13,14
>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked
>> for >
>> >>>>>>>> 30.954212 secs
>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>> >>>>>>>> 2015-09-22 12:57:33.044003:
>> >>>>>>>>   osd_op(client.250874.0:1873
>> rbd_data.3380e2ae8944a.000000000000070d
>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected
>> e56785)
>> >>>>>>>>   currently waiting for subops from 16,17
>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked
>> for >
>> >>>>>>>> 30.704367 secs
>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>> >>>>>>>> 2015-09-22 12:57:33.055404:
>> >>>>>>>>   osd_op(client.250874.0:1874
>> rbd_data.3380e2ae8944a.000000000000070e
>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected
>> e56785)
>> >>>>>>>>   currently waiting for subops from 13,17
>> >>>>>>>>
>> >>>>>>>> Server   IP addr              OSD
>> >>>>>>>> nodev  - 192.168.55.11 - 12
>> >>>>>>>> nodew  - 192.168.55.12 - 13
>> >>>>>>>> nodex  - 192.168.55.13 - 16
>> >>>>>>>> nodey  - 192.168.55.14 - 17
>> >>>>>>>> nodez  - 192.168.55.15 - 14
>> >>>>>>>> nodezz - 192.168.55.16 - 15
>> >>>>>>>>
>> >>>>>>>> fio job:
>> >>>>>>>> [rbd-test]
>> >>>>>>>> readwrite=write
>> >>>>>>>> blocksize=4M
>> >>>>>>>> #runtime=60
>> >>>>>>>> name=rbd-test
>> >>>>>>>> #readwrite=randwrite
>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>> >>>>>>>> #rwmixread=72
>> >>>>>>>> #norandommap
>> >>>>>>>> #size=1T
>> >>>>>>>> #blocksize=4k
>> >>>>>>>> ioengine=rbd
>> >>>>>>>> rbdname=test2
>> >>>>>>>> pool=rbd
>> >>>>>>>> clientname=admin
>> >>>>>>>> iodepth=8
>> >>>>>>>> #numjobs=4
>> >>>>>>>> #thread
>> >>>>>>>> #group_reporting
>> >>>>>>>> #time_based
>> >>>>>>>> #direct=1
>> >>>>>>>> #ramp_time=60
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>> >>>>>>>> Version: Mailvelope v1.1.0
>> >>>>>>>> Comment: https://www.mailvelope.com
>> >>>>>>>>
>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>> >>>>>>>> J3hS
>> >>>>>>>> =0J7F
>> >>>>>>>> -----END PGP SIGNATURE-----
>> >>>>>>>> ----------------
>> >>>>>>>> Robert LeBlanc
>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62
>> B9F1
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>>>>> Hash: SHA256
>> >>>>>>>>>>
>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> You can search for the (mangled) name _split_collection
>> >>>>>>>>>>
>> >>>>>>>>>> I'm not
>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some
>> way to
>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to
>> doing so?
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list
>> for
>> >>>>>>>>> this, it was discussed not too long ago.
>> >>>>>>>>>
>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the
>> sessions
>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>> >>>>>>>>>>
>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting
>> from
>> >>>>>>>>>> the
>> >>>>>>>>>> beginning) is still causing the issue. I was suspect that it
>> is not
>> >>>>>>>>>> having to create new file and therefore split collections.
>> This is
>> >>>>>>>>>> on
>> >>>>>>>>>> my test cluster with no other load.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not
>> creating
>> >>>>>>>>> new objects, if you're actually running fio in such a way that
>> it's
>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and
>> depths
>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore
>> =
>> >>>>>>>>> 20",
>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should
>> spit
>> >>>>>>>>> out
>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>> >>>>>>>>> -Greg
>> >>>>>>>>
>> >>>>>>>> --
>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>> ceph-devel"
>> >>>>>>>> in
>> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >>>>>>>> More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>> -----BEGIN PGP SIGNATURE-----
>> >>>>> Version: Mailvelope v1.1.0
>> >>>>> Comment: https://www.mailvelope.com
>> >>>>>
>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>> >>>>> gcZm
>> >>>>> =CjwB
>> >>>>> -----END PGP SIGNATURE-----
>> >>>>
>> >>>> --
>> >>>> To unsubscribe from this list: send the line "unsubscribe
>> ceph-devel" in
>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>>>
>> >>>
>> >>
>> >> -----BEGIN PGP SIGNATURE-----
>> >> Version: Mailvelope v1.1.0
>> >> Comment: https://www.mailvelope.com
>> >>
>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>> >> ae22
>> >> =AX+L
>> >> -----END PGP SIGNATURE-----
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

[-- Attachment #1.2: Type: text/html, Size: 39019 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                         ` <CAOnYue-T061jkAvpe3cwH7Et4xXY_dkW73KyKcbwiefgzgs8cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-10-04 15:08                                                           ` Alex Gorbachev
@ 2015-10-04 16:13                                                           ` Robert LeBlanc
       [not found]                                                             ` <CAANLjFosyB_cqk19Ax=5wgXDnSGOOa-_5sY1FuCZyYTx3rS9JQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-04 16:13 UTC (permalink / raw)
  To: Josef Johansson; +Cc: Sage Weil, ceph-devel, ceph-users-idqoXFIVOFJgJs9I8MT0rw


[-- Attachment #1.1: Type: text/plain, Size: 46855 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

These are Toshiba MG03ACA400 drives.

sd{a,b} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79
series chipset 6-Port SATA AHCI Controller (rev 05) at 3.0 Gb
sd{c,d} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79
series chipset 6-Port SATA AHCI Controller (rev 05) at 6.0 Gb
sde is SATADOM with OS install
sd{f..i,l,m} are 4TB on 01:00.0 Serial Attached SCSI controller: LSI
Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
sd{j,k} are 240 GB Intel SSDSC2BB240G4 on 01:00.0 Serial Attached SCSI
controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT
SAS-2 (rev 05)

There is probably some performance optimization that we can do in this
area; however, unless I'm missing something, I don't see anything that
should cause I/O to take 30-60+ seconds to complete from a disk
standpoint.

[root@ceph1 ~]# for i in {{a..d},{f..i},{l,m}}; do echo -n "sd${i}1:
"; xfs_db -c frag -r /dev/sd${i}1; done
sda1: actual 924229, ideal 414161, fragmentation factor 55.19%
sdb1: actual 1703083, ideal 655321, fragmentation factor 61.52%
sdc1: actual 2161827, ideal 746418, fragmentation factor 65.47%
sdd1: actual 1807008, ideal 654214, fragmentation factor 63.80%
sdf1: actual 735471, ideal 311837, fragmentation factor 57.60%
sdg1: actual 1463859, ideal 507362, fragmentation factor 65.34%
sdh1: actual 1684905, ideal 556571, fragmentation factor 66.97%
sdi1: actual 1833980, ideal 608499, fragmentation factor 66.82%
sdl1: actual 1641128, ideal 554364, fragmentation factor 66.22%
sdm1: actual 2032644, ideal 697129, fragmentation factor 65.70%
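
(For reference, if the fragmentation turns out to matter as Josef
suggested, an online defrag with xfs_fsr would look roughly like the
following; the mount point assumes the default Ceph OSD layout and the
OSD id is just an example:)

# limit the run to 10 minutes against a single OSD filesystem
xfs_fsr -v -t 600 /var/lib/ceph/osd/ceph-13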


[root@ceph1 ~]# iostat -xd 2
Linux 4.2.1-1.el7.elrepo.x86_64 (ceph1)   10/04/2015      _x86_64_
   (16 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.09     2.06    9.24   36.18   527.28  1743.71
100.00     8.96  197.32   17.50  243.23   4.07  18.47
sdb               0.17     3.61   16.70   74.44   949.65  2975.30
86.13     6.74   73.95   23.94   85.16   4.31  39.32
sdc               0.14     4.67   15.69   87.80   818.02  3860.11
90.41     9.56   92.38   26.73  104.11   4.44  45.91
sdd               0.17     3.43    7.16   69.13   480.96  2847.42
87.25     4.80   62.89   30.00   66.30   4.33  33.00
sde               0.01     1.13    0.34    0.99     8.35    12.01
30.62     0.01    7.37    2.64    9.02   1.64   0.22
sdj               0.00     1.22    0.01  348.22     0.03 11302.65
64.91     0.23    0.66    0.14    0.66   0.15   5.15
sdk               0.00     1.99    0.01  369.94     0.03 12876.74
69.61     0.26    0.71    0.13    0.71   0.16   5.75
sdf               0.01     1.79    1.55   31.12    39.64  1431.37
90.06     4.07  124.67   16.25  130.05   3.11  10.17
sdi               0.22     3.17   23.92   72.90  1386.45  2676.28
83.93     7.75   80.00   24.31   98.27   4.31  41.77
sdm               0.16     3.10   17.63   72.84   986.29  2767.24
82.98     6.57   72.64   23.67   84.50   4.23  38.30
sdl               0.11     3.01   12.10   55.14   660.85  2361.40
89.89    17.87  265.80   21.64  319.36   4.08  27.45
sdg               0.08     2.45    9.75   53.90   489.67  1929.42
76.01    17.27  271.30   20.77  316.61   3.98  25.33
sdh               0.10     2.76   11.28   60.97   600.10  2114.48
75.14     1.70   23.55   22.92   23.66   4.10  29.60

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.50    0.00   146.00     0.00
584.00     0.01   16.00   16.00    0.00  16.00   0.80
sdb               0.00     0.50    9.00  119.00  2036.00  2578.00
72.09     0.68    5.50    7.06    5.39   2.36  30.25
sdc               0.00     4.00   34.00  129.00   494.00  6987.75
91.80     1.70   10.44   17.00    8.72   4.44  72.40
sdd               0.00     1.50    1.50   95.50    74.00  2396.50
50.94     0.85    8.75   23.33    8.52   7.53  73.05
sde               0.00    37.00   11.00    1.00    46.00   152.00
33.00     0.01    1.00    0.64    5.00   0.54   0.65
sdj               0.00     0.50    0.00  970.50     0.00 12594.00
25.95     0.09    0.09    0.00    0.09   0.08   8.20
sdk               0.00     0.00    0.00  977.50     0.00 12016.00
24.59     0.10    0.10    0.00    0.10   0.09   8.90
sdf               0.00     0.50    0.50   37.50     2.00   230.25
12.22     9.63   10.58    8.00   10.61   1.79   6.80
sdi               2.00     0.00   10.50    0.00  2528.00     0.00
481.52     0.10    9.33    9.33    0.00   7.76   8.15
sdm               0.00     0.50   15.00  116.00   546.00   833.25
21.06     0.94    7.17   14.03    6.28   4.13  54.15
sdl               0.00     0.00    3.00    0.00    26.00     0.00
17.33     0.02    7.50    7.50    0.00   7.50   2.25
sdg               0.00     3.50    1.00   64.50     4.00  2929.25
89.56     0.40    6.04    9.00    5.99   3.42  22.40
sdh               0.50     0.50    4.00   64.00   770.00  1105.00
55.15     4.96  189.42   21.25  199.93   4.21  28.60

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    8.50    0.00   110.00     0.00
25.88     0.01    1.59    1.59    0.00   1.53   1.30
sdb               0.00     4.00    6.50  117.50   494.00  4544.50
81.27     0.87    6.99   11.62    6.73   3.28  40.70
sdc               0.00     0.50    5.50  202.50   526.00  4123.00
44.70     1.80    8.66   18.73    8.39   2.08  43.30
sdd               0.00     3.00    2.50  227.00   108.00  6952.00
61.53    46.10  197.44   30.20  199.29   3.86  88.60
sde               0.00     0.00    0.00    1.50     0.00     6.00
8.00     0.00    2.33    0.00    2.33   1.33   0.20
sdj               0.00     0.00    0.00  834.00     0.00  9912.00
23.77     0.08    0.09    0.00    0.09   0.08   6.75
sdk               0.00     0.00    0.00  777.00     0.00 12318.00
31.71     0.12    0.15    0.00    0.15   0.10   7.70
sdf               0.00     1.00    4.50  117.00   198.00   693.25
14.67    34.86  362.88   84.33  373.60   3.59  43.65
sdi               0.00     0.00    1.50    0.00     6.00     0.00
8.00     0.01    9.00    9.00    0.00   9.00   1.35
sdm               0.50     3.00    3.50  143.00  1014.00  4205.25
71.25     0.93    5.95   20.43    5.59   3.08  45.15
sdl               0.50     0.00    8.00  148.50  1578.00  2128.50
47.37     0.82    5.27    6.44    5.21   3.40  53.20
sdg               1.50     2.00   10.50  100.50  2540.00  2039.50
82.51     0.77    7.00   14.19    6.25   5.42  60.20
sdh               0.50     0.00    5.00    0.00  1050.00     0.00   420.00     0.04    7.10    7.10    0.00   7.10   3.55

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    6.00    0.00   604.00     0.00   201.33     0.03    5.58    5.58    0.00   5.58   3.35
sdb               0.00     6.00    7.00  236.00   132.00  8466.00    70.77    45.48  186.59   31.79  191.18   1.62  39.45
sdc               2.00     0.00   19.50   46.50  6334.00   686.00   212.73     0.39    5.96    7.97    5.12   3.57  23.55
sdd               0.00     1.00    3.00   20.00    72.00  1527.25   139.07     0.31   47.67    6.17   53.90   3.11   7.15
sde               0.00    17.00    0.00    4.50     0.00   184.00    81.78     0.01    2.33    0.00    2.33   2.33   1.05
sdj               0.00     0.00    0.00  805.50     0.00 12760.00    31.68     0.21    0.27    0.00    0.27   0.09   7.35
sdk               0.00     0.00    0.00  438.00     0.00 14300.00    65.30     0.24    0.54    0.00    0.54   0.13   5.65
sdf               0.00     0.00    1.00    0.00     6.00     0.00    12.00     0.00    2.50    2.50    0.00   2.50   0.25
sdi               0.00     5.50   14.50   27.50   394.00  6459.50   326.36     0.86   20.18   11.00   25.02   7.42  31.15
sdm               0.00     1.00    9.00  175.00   554.00  3173.25    40.51     1.12    6.38    7.22    6.34   2.41  44.40
sdl               0.00     2.00    2.50  100.50    26.00  2483.00    48.72     0.77    7.47   11.80    7.36   2.10  21.65
sdg               0.00     4.50    9.00  214.00   798.00  7417.00    73.68    66.56  298.46   28.83  309.80   3.35  74.70
sdh               0.00     0.00   16.50    0.00   344.00     0.00    41.70     0.09    5.61    5.61    0.00   4.55   7.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               1.00     0.00    9.00    0.00  3162.00     0.00   702.67     0.07    8.06    8.06    0.00   6.06   5.45
sdb               0.50     0.00   12.50   13.00  1962.00   298.75   177.31     0.63   30.00    4.84   54.19   9.96  25.40
sdc               0.00     0.50    3.50  131.00    18.00  1632.75    24.55     0.87    6.48   16.86    6.20   3.51  47.25
sdd               0.00     0.00    4.00    0.00    72.00    16.00    44.00     0.26   10.38   10.38    0.00  23.38   9.35
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00  843.50     0.00 16334.00    38.73     0.19    0.23    0.00    0.23   0.11   9.10
sdk               0.00     0.00    0.00  803.00     0.00 10394.00    25.89     0.07    0.08    0.00    0.08   0.08   6.25
sdf               0.00     4.00   11.00   90.50   150.00  2626.00    54.70     0.59    5.84    3.82    6.08   4.06  41.20
sdi               0.00     3.50   17.50  130.50  2132.00  6309.50   114.07     1.84   12.55   25.60   10.80   5.76  85.30
sdm               0.00     4.00    2.00  139.00    44.00  1957.25    28.39     0.89    6.28   14.50    6.17   3.55  50.10
sdl               0.00     0.50   12.00  101.00   334.00  1449.75    31.57     0.94    8.28   10.17    8.06   2.11  23.85
sdg               0.00     0.00    2.50    3.00   204.00    17.00    80.36     0.02    5.27    4.60    5.83   3.91   2.15
sdh               0.00     0.50    9.50   32.50  1810.00   199.50    95.69     0.28    6.69    3.79    7.54   5.12  21.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.50     0.50   25.00   24.50  1248.00   394.25    66.35     0.76   15.30   11.62   19.06   5.25  26.00
sdb               1.50     0.00   13.50   30.00  2628.00   405.25   139.46     0.27    5.94    8.19    4.93   5.31  23.10
sdc               0.00     6.00    3.00  163.00    60.00  9889.50   119.87     1.66    9.83   28.67    9.48   5.95  98.70
sdd               0.00    11.00    5.50  353.50    50.00  2182.00    12.43   118.42  329.26   30.27  333.91   2.78  99.90
sde               0.00     5.50    0.00    1.50     0.00    28.00    37.33     0.00    2.33    0.00    2.33   2.33   0.35
sdj               0.00     0.00    0.00 1227.50     0.00 22064.00    35.95     0.50    0.41    0.00    0.41   0.10  12.50
sdk               0.00     0.50    0.00 1073.50     0.00 19248.00    35.86     0.24    0.23    0.00    0.23   0.10  10.40
sdf               0.00     4.00    0.00  109.00     0.00  4145.00    76.06     0.59    5.44    0.00    5.44   3.63  39.55
sdi               0.00     1.00    8.50   95.50   218.00  2091.75    44.42     1.06    9.70   18.71    8.90   7.00  72.80
sdm               0.00     0.00    8.00  177.50    82.00  3173.00    35.09     1.24    6.65   14.31    6.30   3.53  65.40
sdl               0.00     3.50    3.00  187.50    32.00  2175.25    23.17     1.47    7.68   18.50    7.50   3.85  73.35
sdg               0.00     0.00    1.00    0.00    12.00     0.00    24.00     0.00    1.50    1.50    0.00   1.50   0.15
sdh               0.50     1.00   14.00  169.50  2364.00  4568.00    75.55     1.50    8.12   21.25    7.03   4.91  90.10

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     4.00    3.00   60.00   212.00  2542.00    87.43     0.58    8.02   15.50    7.64   7.95  50.10
sdb               0.00     0.50    2.50   98.00   682.00  1652.00    46.45     0.51    5.13    6.20    5.10   3.05  30.65
sdc               0.00     2.50    4.00  146.00    16.00  4623.25    61.86     1.07    7.33   13.38    7.17   2.22  33.25
sdd               0.00     0.50    9.50   30.00   290.00   358.00    32.81     0.84   32.22   49.16   26.85  12.28  48.50
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdj               0.00     0.50    0.00  530.00     0.00  7138.00    26.94     0.06    0.11    0.00    0.11   0.09   4.65
sdk               0.00     0.00    0.00  625.00     0.00  8254.00    26.41     0.07    0.12    0.00    0.12   0.09   5.75
sdf               0.00     0.00    0.00    4.00     0.00    18.00     9.00     0.01    3.62    0.00    3.62   3.12   1.25
sdi               0.00     2.50    8.00   61.00   836.00  2681.50   101.96     0.58    9.25   15.12    8.48   6.71  46.30
sdm               0.00     4.50   11.00  273.00  2100.00  8562.00    75.08    13.49   47.53   24.95   48.44   1.83  52.00
sdl               0.00     1.00    0.50   49.00     2.00  1038.00    42.02     0.23    4.83   14.00    4.73   2.45  12.15
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdh               1.00     1.00    9.00  109.00  2082.00  2626.25    79.80     0.85    7.34    7.83    7.30   3.83  45.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.50   10.00  177.00   284.00  4857.00    54.98    36.26  194.27   21.85  204.01   3.53  66.00
sdb               1.00     0.50   39.50  119.50  1808.00  2389.25    52.80     1.58    9.96   12.32    9.18   2.42  38.45
sdc               0.00     2.00   15.00  200.50   116.00  4951.00    47.03    14.37   66.70   73.87   66.16   2.23  47.95
sdd               0.00     3.50    6.00   54.50   180.00  2360.50    83.98     0.69   11.36   20.42   10.36   7.99  48.35
sde               0.00     7.50    0.00   32.50     0.00   160.00     9.85     1.64   50.51    0.00   50.51   1.48   4.80
sdj               0.00     0.00    0.00  835.00     0.00 10198.00    24.43     0.07    0.09    0.00    0.09   0.08   6.50
sdk               0.00     0.00    0.00  802.00     0.00 12534.00    31.26     0.23    0.29    0.00    0.29   0.10   8.05
sdf               0.00     2.50    2.00  133.50    14.00  5272.25    78.03     4.37   32.21    4.50   32.63   1.73  23.40
sdi               0.00     4.50   17.00  125.50  2676.00  8683.25   159.43     1.86   13.02   27.97   11.00   4.95  70.55
sdm               0.00     0.00    7.00    0.50   540.00    32.00   152.53     0.05    7.07    7.57    0.00   7.07   5.30
sdl               0.00     7.00   27.00  276.00  2374.00 11955.50    94.58    25.87   85.36   15.20   92.23   1.84  55.90
sdg               0.00     0.00   45.00    0.00   828.00     0.00    36.80     0.07    1.62    1.62    0.00   0.68   3.05
sdh               0.00     0.50    0.50   65.50     2.00  1436.25    43.58     0.51    7.79   16.00    7.73   3.61  23.80

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     8.00   14.50  150.00   122.00   929.25    12.78    20.65   70.61    7.55   76.71   1.46  24.05
sdb               0.00     5.00    8.00  283.50    86.00  2757.50    19.51    69.43  205.40   51.75  209.73   2.40  69.85
sdc               0.00     0.00   12.50    1.50   350.00    48.25    56.89     0.25   17.75   17.00   24.00   4.75   6.65
sdd               0.00     3.50   36.50  141.00   394.00  2338.75    30.79     1.50    8.42   16.16    6.41   4.56  80.95
sde               0.00     1.50    0.00    1.00     0.00    10.00    20.00     0.00    2.00    0.00    2.00   2.00   0.20
sdj               0.00     0.00    0.00 1059.00     0.00 18506.00    34.95     0.19    0.18    0.00    0.18   0.10  10.75
sdk               0.00     0.00    0.00 1103.00     0.00 14220.00    25.78     0.09    0.08    0.00    0.08   0.08   8.35
sdf               0.00     5.50    2.00   19.50     8.00  5158.75   480.63     0.17    8.05    6.50    8.21   6.95  14.95
sdi               0.00     5.50   28.00  224.50  2210.00  8971.75    88.57   122.15  328.47   27.43  366.02   3.71  93.70
sdm               0.00     0.00   13.00    4.00   718.00    16.00    86.35     0.15    3.76    4.23    2.25   3.62   6.15
sdl               0.00     0.00   16.50    0.00   832.00     0.00   100.85     0.02    1.12    1.12    0.00   1.09   1.80
sdg               0.00     2.50   17.00   23.50  1032.00  3224.50   210.20     0.25    6.25    2.56    8.91   3.41  13.80
sdh               0.00    10.50    4.50  241.00    66.00  7252.00    59.62    23.00   91.66    4.22   93.29   2.11  51.85

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.50    3.50   91.00    92.00   552.75    13.65    36.27  479.41   81.57  494.71   5.65  53.35
sdb               0.00     1.00    6.00  168.00   224.00   962.50    13.64    83.35  533.92   62.00  550.77   5.75 100.00
sdc               0.00     1.00    3.00  171.00    16.00  1640.00    19.03     1.08    6.18   11.83    6.08   3.15  54.80
sdd               0.00     5.00    5.00  107.50   132.00  6576.75   119.27     0.79    7.06   18.80    6.51   5.13  57.70
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00 1111.50     0.00 22346.00    40.21     0.27    0.24    0.00    0.24   0.11  12.10
sdk               0.00     0.00    0.00 1022.00     0.00 33040.00    64.66     0.68    0.67    0.00    0.67   0.13  13.60
sdf               0.00     5.50    2.50   91.00    12.00  4977.25   106.72     2.29   24.48   14.40   24.76   2.42  22.60
sdi               0.00     0.00   10.00   69.50   368.00   858.50    30.86     7.40  586.41    5.50  669.99   4.21  33.50
sdm               0.00     4.00    8.00  210.00   944.00  5833.50    62.18     1.57    7.62   18.62    7.20   4.57  99.70
sdl               0.00     0.00    7.50   22.50   104.00   253.25    23.82     0.14    4.82    5.07    4.73   4.03  12.10
sdg               0.00     4.00    1.00   84.00     4.00  3711.75    87.43     0.58    6.88   12.50    6.81   5.75  48.90
sdh               0.00     3.50    7.50   44.00    72.00  2954.25   117.52     1.54   39.50   61.73   35.72   6.40  32.95

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    1.00    0.00    20.00     0.00    40.00     0.01   14.50   14.50    0.00  14.50   1.45
sdb               0.00     7.00   10.50  198.50  2164.00  6014.75    78.27     1.94    9.29   28.90    8.25   4.77  99.75
sdc               0.00     2.00    4.00   95.50   112.00  5152.25   105.81     0.94    9.46   24.25    8.84   4.68  46.55
sdd               0.00     1.00    2.00  131.00    10.00  7167.25   107.93     4.55   34.23   83.25   33.48   2.52  33.55
sde               0.00     0.00    0.00    0.50     0.00     2.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00  541.50     0.00  6468.00    23.89     0.05    0.10    0.00    0.10   0.09   5.00
sdk               0.00     0.00    0.00  509.00     0.00  7704.00    30.27     0.07    0.14    0.00    0.14   0.10   4.85
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi               0.00     0.00    3.50    0.00    90.00     0.00    51.43     0.04   10.14   10.14    0.00  10.14   3.55
sdm               0.00     2.00    5.00  102.50  1186.00  4583.00   107.33     0.81    7.56   23.20    6.80   2.78  29.85
sdl               0.00    14.00   10.00  216.00   112.00  3645.50    33.25    73.45  311.05   46.30  323.31   3.51  79.35
sdg               0.00     1.00    0.00   52.50     0.00   240.00     9.14     0.25    4.76    0.00    4.76   4.48  23.50
sdh               0.00     0.00    3.50    0.00    18.00     0.00    10.29     0.02    7.00    7.00    0.00   7.00   2.45

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    1.00    0.00     4.00     0.00     8.00     0.01   14.50   14.50    0.00  14.50   1.45
sdb               0.00     9.00    2.00  292.00   192.00 10925.75    75.63    36.98  100.27   54.75  100.58   2.95  86.60
sdc               0.00     9.00   10.50  151.00    78.00  6771.25    84.82    36.06   94.60   26.57   99.33   3.77  60.85
sdd               0.00     0.00    5.00    1.00    74.00    24.00    32.67     0.03    5.00    6.00    0.00   5.00   3.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00  787.50     0.00  9418.00    23.92     0.07    0.10    0.00    0.10   0.09   6.70
sdk               0.00     0.00    0.00  766.50     0.00  9400.00    24.53     0.08    0.11    0.00    0.11   0.10   7.70
sdf               0.00     0.00    0.50   41.50     6.00   391.00    18.90     0.24    5.79    9.00    5.75   5.50  23.10
sdi               0.00    10.00    9.00  268.00    92.00  1618.75    12.35    68.20  150.90   15.50  155.45   2.36  65.30
sdm               0.00    11.50   10.00  330.50    72.00  3201.25    19.23    68.83  139.38   37.45  142.46   1.84  62.80
sdl               0.00     2.50    2.50  228.50    14.00  2526.00    21.99    90.42  404.71  242.40  406.49   4.33 100.00
sdg               0.00     5.50    7.50  298.00    68.00  5275.25    34.98    75.31  174.85   26.73  178.58   2.67  81.60
sdh               0.00     0.00    2.50    2.00    28.00    24.00    23.11     0.01    2.78    5.00    0.00   2.78   1.25

- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Sun, Oct 4, 2015 at 12:16 AM, Josef Johansson  wrote:
Hi,

I don't know what brand those 4TB spindles are, but I know that mine
are very bad at doing writes at the same time as reads, especially
small mixed reads and writes.

This has an absurdly bad effect when doing maintenance on Ceph. That
said, we see a big difference between dumpling and hammer in
performance on these drives, most likely because hammer is able to
read and write degraded PGs.

We have run into two different problems along the way. The first was
blocked requests, where we had to upgrade from 64GB of memory on each
node to 256GB. We thought it was the only safe way to make things better.

I believe it worked because more reads were cached, so we had less
mixed read/write on the nodes, giving the spindles more room to
breathe. It was a shot in the dark at the time, but the price is not
that high just to try it out, compared to 6 people working on it. I
believe the I/O on disk was not huge either, but what kills the disks
is high latency. How much bandwidth are the disks using? Ours was
very low, 3-5MB/s.

The second problem was fragmentation hitting 70%; lowering that to
6% made a lot of difference. Depending on the I/O pattern it grows at
different rates.
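
(For reference, the number I am talking about is the XFS fragmentation
factor, the same thing Robert's xfs_db output further down shows; the
device name here is just an example:)

  # report the fragmentation factor of one OSD data partition
  xfs_db -c frag -r /dev/sdb1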

TL;DR read kills the 4TB spindles.

Hope you guys get out of the woods.
/Josef

On 3 Oct 2015 10:10 pm, "Robert LeBlanc"  wrote:
- -----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

We are still struggling with this and have tried a lot of different
things. Unfortunately, Inktank (now Red Hat) no longer provides
consulting services for non-Red Hat systems. If there are any
certified Ceph consultants in the US with whom we can do both remote
and on-site engagements, please let us know.

This certainly seems to be network related, but somewhere in the
kernel. We have tried increasing the network and TCP buffers and the
number of TCP sockets, and reducing the FIN_WAIT2 timeout. There is about 25% idle
on the boxes, the disks are busy, but not constantly at 100% (they
cycle from <10% up to 100%, but not 100% for more than a few seconds
at a time). There seems to be no reasonable explanation why I/O is
blocked pretty frequently longer than 30 seconds. We have verified
Jumbo frames by pinging from/to each node with 9000 byte packets. The
network admins have verified that packets are not being dropped in the
switches for these nodes. We have tried different kernels including
the recent Google patch to cubic. This is showing up on three clusters
(two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
(from CentOS 7.1) with similar results.
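
For what it's worth, the tuning we have been trying is along these
lines (the numbers are just examples of what we tested, not a
recommendation), and the ping test is how we verify jumbo frames end
to end:

  # larger socket buffers and a shorter FIN_WAIT2 timeout (example values)
  sysctl -w net.core.rmem_max=67108864
  sysctl -w net.core.wmem_max=67108864
  sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"
  sysctl -w net.ipv4.tcp_fin_timeout=10

  # 8972 bytes of ICMP payload + 28 bytes of headers = 9000; -M do sets
  # don't-fragment so a smaller MTU anywhere in the path returns an error
  ping -M do -s 8972 -c 4 192.168.55.12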

The messages seem slightly different:
2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
100.087155 secs
2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
cluster [WRN] slow request 30.041999 seconds old, received at
2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
points reached

I don't know what "no flag points reached" means.

The problem is most pronounced when we have to reboot an OSD node (1
of 13): we will have hundreds of I/Os blocked, sometimes for up to 300
seconds, and it takes a good 15 minutes for things to settle down. The
production cluster is very busy, normally doing 8,000 I/O and peaking
at 15,000. This is all 4TB spindles with SSD journals and the disks
are between 25-50% full. We are currently splitting PGs to distribute
the load better across the disks, but we are having to do this 10 PGs
at a time as we get blocked I/O. We have max_backfills and
max_recovery set to 1, and client op priority is set higher than
recovery priority. We tried increasing the number of op threads but
this didn't seem to help. It seems that as soon as PGs are finished
being checked they become active, which could be the cause of the slow
I/O while the other PGs are still being checked.
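
For reference, the relevant pieces of our ceph.conf look roughly like
this (option names are the standard ones; the two priorities are just
an example of "client above recovery"):

  [osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd client op priority = 63
  osd recovery op priority = 1

The same values can also be pushed into running OSDs with
'ceph tell osd.* injectargs' instead of restarting them.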

What I don't understand is why the messages are delayed. As soon as
the message is received by the Ceph OSD process, it is very quickly
committed to the journal and a response is sent back to the primary
OSD, which is received very quickly as well. I've adjusted
min_free_kbytes and it seems to keep the OSDs from crashing, but it
doesn't solve the main problem. We don't have swap, and there is 64 GB
of RAM per node for 10 OSDs.
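
The min_free_kbytes change itself is just a sysctl; the value below is
only an example of the sort of reserve we mean:

  # keep a bigger reserve of free pages so atomic allocations (such as
  # network receives) don't stall under memory pressure
  sysctl -w vm.min_free_kbytes=1048576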

Is there something that could cause the kernel to get a packet but not
be able to dispatch it to Ceph, which would explain why we are seeing
this blocked I/O for 30+ seconds? Are there any pointers to tracing
Ceph messages from the network buffer through the kernel to the Ceph
process?
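
The best we have come up with so far is to capture traffic on an OSD's
messenger port and line it up with that OSD's "debug ms = 1" log
timestamps (the interface, port and OSD number below are just examples):

  # capture everything to/from one OSD's messenger port
  tcpdump -i eth2 -w osd13.pcap 'tcp port 6804'

  # log every message the OSD messenger sends and receives
  ceph tell osd.13 injectargs '--debug-ms 1'

  # look for receive-side pressure or drops while the test runs
  netstat -s | grep -iE 'prune|collapse|overflow|drop'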

We could really use some pointers, no matter how outrageous. We've had
over 6 people looking into this for weeks now and just can't think of
anything else.

Thanks,
- -----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
l7OF
=OI++
- -----END PGP SIGNATURE-----
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
> We dropped the replication on our cluster from 4 to 3 and it looks
> like all the blocked I/O has stopped (no entries in the log for the
> last 12 hours). This makes me believe that there is some issue with
> the number of sockets or some other TCP issue. We have not messed with
> Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> processes and 16K system wide.
>
> Does this seem like the right spot to be looking? What are some
> configuration items we should be looking at?
>
> Thanks,
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>> seems that there were some major reworks in the network handling in
>> the kernel to efficiently handle that network rate. If I remember
>> right we also saw a drop in CPU utilization. I'm starting to think
>> that we did see packet loss while congesting our ISLs in our initial
>> testing, but we could not tell where the dropping was happening. We
>> saw some on the switches, but it didn't seem to be bad if we weren't
>> trying to congest things. We probably already saw this issue, just
>> didn't know it.
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>> drivers might cause problems though.
>>>
>>> Here's ifconfig from one of the nodes:
>>>
>>> ens513f1: flags=4163  mtu 1500
>>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>
>>> Mark
>>>
>>>
>>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>>
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA256
>>>>
>>>> OK, here is the update on the saga...
>>>>
>>>> I traced some more of the blocked I/Os and it seems that communication
>>>> between two hosts was worse than between the others. I did a two-way
>>>> ping flood between the two hosts using max packet sizes (1500). After
>>>> 1.5M packets, no lost pings. I then had the ping flood running while I
>>>> put Ceph load on the cluster and the dropped pings started increasing;
>>>> after stopping the Ceph workload the pings stopped dropping.
>>>>
>>>> I then ran iperf between all the nodes with the same results, so that
>>>> ruled out Ceph to a large degree. I then booted into the
>>>> 3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far
>>>> there haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>>> need the network enhancements in the 4.x series to work well.
>>>>
>>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>>> kernel to see where this issue is introduced. Both of the clusters
>>>> with this issue are running 4.x; other than that, they have pretty
>>>> different hardware and network configs.
>>>>
>>>> Thanks,
>>>> -----BEGIN PGP SIGNATURE-----
>>>> Version: Mailvelope v1.1.0
>>>> Comment: https://www.mailvelope.com
>>>>
>>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>>> 4OEo
>>>> =P33I
>>>> -----END PGP SIGNATURE-----
>>>> ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>
>>>>
>>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>>> wrote:
>>>>>
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>>>
>>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>>> found that an MTU under 32K worked reliably for ping, but still had the
>>>>> blocked I/O.
>>>>>
>>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>>> the blocked I/O.
>>>>> - ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>
>>>>>
>>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>>>
>>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>>>>
>>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>>>>> between when osd.17 started sending the osd_repop message and when
>>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>>>> delayed for many 10s of seconds?
>>>>>>
>>>>>>
>>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>>>> has
>>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> What kernel are you running?
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>>>>>
>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>> Hash: SHA256
>>>>>>>>
>>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>>>>> extracted what I think are important entries from the logs for the
>>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>>>>
>>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>>>>
>>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>>>>> transfer).
>>>>>>>>
>>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>>>>> later. To me it seems that the message is getting received but not
>>>>>>>> passed to another thread right away or something. This test was done
>>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>>>>> thread.
>>>>>>>>
>>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>>>>> some help.
>>>>>>>>
>>>>>>>> Single Test started about
>>>>>>>> 2015-09-22 12:52:36
>>>>>>>>
>>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>> 30.439150 secs
>>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>>>>> 2015-09-22 12:55:06.487451:
>>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>   currently waiting for subops from 13,16
>>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>>>>> 30.379680 secs
>>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>>>>> 12:55:06.406303:
>>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>   currently waiting for subops from 13,17
>>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>>>>> 12:55:06.318144:
>>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>   currently waiting for subops from 13,14
>>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>> 30.954212 secs
>>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>>>>> 2015-09-22 12:57:33.044003:
>>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>   currently waiting for subops from 16,17
>>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>> 30.704367 secs
>>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>>>>> 2015-09-22 12:57:33.055404:
>>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>   currently waiting for subops from 13,17
>>>>>>>>
>>>>>>>> Server   IP addr              OSD
>>>>>>>> nodev  - 192.168.55.11 - 12
>>>>>>>> nodew  - 192.168.55.12 - 13
>>>>>>>> nodex  - 192.168.55.13 - 16
>>>>>>>> nodey  - 192.168.55.14 - 17
>>>>>>>> nodez  - 192.168.55.15 - 14
>>>>>>>> nodezz - 192.168.55.16 - 15
>>>>>>>>
>>>>>>>> fio job:
>>>>>>>> [rbd-test]
>>>>>>>> readwrite=write
>>>>>>>> blocksize=4M
>>>>>>>> #runtime=60
>>>>>>>> name=rbd-test
>>>>>>>> #readwrite=randwrite
>>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>>>>> #rwmixread=72
>>>>>>>> #norandommap
>>>>>>>> #size=1T
>>>>>>>> #blocksize=4k
>>>>>>>> ioengine=rbd
>>>>>>>> rbdname=test2
>>>>>>>> pool=rbd
>>>>>>>> clientname=admin
>>>>>>>> iodepth=8
>>>>>>>> #numjobs=4
>>>>>>>> #thread
>>>>>>>> #group_reporting
>>>>>>>> #time_based
>>>>>>>> #direct=1
>>>>>>>> #ramp_time=60
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>
>>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>>>>>> J3hS
>>>>>>>> =0J7F
>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>> ----------------
>>>>>>>> Robert LeBlanc
>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>>>>>>
>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>> Hash: SHA256
>>>>>>>>>>
>>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You can search for the (mangled) name _split_collection
>>>>>>>>>>
>>>>>>>>>> I'm not
>>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>>>>> this, it was discussed not too long ago.
>>>>>>>>>
>>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>>>>>>
>>>>>>>>>> The fio test is just a seq write; starting it over (rewriting from
>>>>>>>>>> the beginning) is still causing the issue. I suspected that it is
>>>>>>>>>> not having to create new files and therefore split collections.
>>>>>>>>>> This is on my test cluster with no other load.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>>>>>> would be the most helpful for tracking this issue down?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>>>>>>> 20",
>>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>>>>>>> out
>>>>>>>>> everything you need to track exactly what each Op is doing.
>>>>>>>>> -Greg
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>> in
>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: Mailvelope v1.1.0
>>>>> Comment: https://www.mailvelope.com
>>>>>
>>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>>>> gcZm
>>>>> =CjwB
>>>>> -----END PGP SIGNATURE-----
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.1.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>> ae22
>> =AX+L
>> -----END PGP SIGNATURE-----
_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFIuzjOF24JIZ5en40jlLrCI@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWEVA5CRDmVDuy+mK58QAAY4EP/2jTEGPrbR3KDOC1d6FU
7TkVeFtow7UCe9/TwArLtcEVTr8rdaXNWRi7gat99zbL5pw+96Sj6bGqpKVz
ZBSHcBlLIl42Hj10Ju7Svpwn7Q9RnSGOvjEdghEKsTxnf37gZD/KjvMbidJu
jlPGEfnGEdYbQ+vDYoCoUIuvUNPbCvWQTjJpnTXrMZfhhEBoOepMzF9s6L6B
AWR9WUrtz4HtGSMT42U1gd3LDOUh/5Ioy6FuhJe04piaf3ikRg+pjX47/WBd
mQupmKJOblaULCswOrMLTS9R2+p6yaWj0zlUb3OAOErO7JR8OWZ2H7tYjkQN
rGPsIRNv4yKw2Z5vJdHLksVdYhBQY1I4N1GO3+hf+j/yotPC9Ay4BYLZrQwf
3L+uhqSEu80erZjsJF4lilmw0l9nbDSoXc0MqRoXrpUIqyVtmaCBynv5Xq7s
L5idaH6iVPBwy4Y6qzVuQpP0LaHp48ojIRx7likQJt0MSeDzqnslnp5B/9nb
Ppu3peRUKf5GEKISRQ6gOI3C4gTSSX6aBatWdtpm01Et0T6ysoxAP/VoO3Nb
0PDsuYYT0U1MYqi0USouiNc4yRWNb9hkkBHEJrwjtP52moL1WYdYleL6w+FS
Y1YQ1DU8YsEtVniBmZc4TBQJRRIS6SaQjH108JCjUcy9oVNwRtOqbcT1aiI6
EP/Q
=efx7
-----END PGP SIGNATURE-----

[-- Attachment #1.2: Type: text/html, Size: 61225 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                             ` <CAANLjFosyB_cqk19Ax=5wgXDnSGOOa-_5sY1FuCZyYTx3rS9JQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-04 16:16                                                               ` Josef Johansson
  0 siblings, 0 replies; 45+ messages in thread
From: Josef Johansson @ 2015-10-04 16:16 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Sage Weil, ceph-devel, ceph-users-idqoXFIVOFJgJs9I8MT0rw


[-- Attachment #1.1: Type: text/plain, Size: 49774 bytes --]

I would start with defragging the drives; the good part is that you can
just run the defrag with the time parameter and it will take all
available XFS drives.
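
Something like this is what I mean; with no device argument xfs_fsr goes
over all mounted XFS filesystems, and -t just caps how long it runs (the
two hours here is only an example):

  # online defrag of all mounted XFS filesystems, stop after 7200 seconds
  xfs_fsr -v -t 7200
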
On 4 Oct 2015 6:13 pm, "Robert LeBlanc" <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> These are Toshiba MG03ACA400 drives.
>
> sd{a,b} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 05) at 3.0 Gb
> sd{c,d} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 05) at 6.0 Gb
> sde is SATADOM with OS install
> sd{f..i,l,m} are 4TB on 01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
> sd{j,k} are 240 GB Intel SSDSC2BB240G4 on 01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
>
> There is probably some performance optimization that we can do in this area, however unless I'm missing something, I don't see anything that should cause I/O to take 30-60+ seconds to complete from a disk standpoint.
>
> [root@ceph1 ~]# for i in {{a..d},{f..i},{l,m}}; do echo -n "sd${i}1: "; xfs_db -c frag -r /dev/sd${i}1; done
> sda1: actual 924229, ideal 414161, fragmentation factor 55.19%
> sdb1: actual 1703083, ideal 655321, fragmentation factor 61.52%
> sdc1: actual 2161827, ideal 746418, fragmentation factor 65.47%
> sdd1: actual 1807008, ideal 654214, fragmentation factor 63.80%
> sdf1: actual 735471, ideal 311837, fragmentation factor 57.60%
> sdg1: actual 1463859, ideal 507362, fragmentation factor 65.34%
> sdh1: actual 1684905, ideal 556571, fragmentation factor 66.97%
> sdi1: actual 1833980, ideal 608499, fragmentation factor 66.82%
> sdl1: actual 1641128, ideal 554364, fragmentation factor 66.22%
> sdm1: actual 2032644, ideal 697129, fragmentation factor 65.70%
>
>
> [root@ceph1 ~]# iostat -xd 2
> Linux 4.2.1-1.el7.elrepo.x86_64 (ceph1)   10/04/2015      _x86_64_        (16 CPU)
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.09     2.06    9.24   36.18   527.28  1743.71   100.00     8.96  197.32   17.50  243.23   4.07  18.47
> sdb               0.17     3.61   16.70   74.44   949.65  2975.30    86.13     6.74   73.95   23.94   85.16   4.31  39.32
> sdc               0.14     4.67   15.69   87.80   818.02  3860.11    90.41     9.56   92.38   26.73  104.11   4.44  45.91
> sdd               0.17     3.43    7.16   69.13   480.96  2847.42    87.25     4.80   62.89   30.00   66.30   4.33  33.00
> sde               0.01     1.13    0.34    0.99     8.35    12.01    30.62     0.01    7.37    2.64    9.02   1.64   0.22
> sdj               0.00     1.22    0.01  348.22     0.03 11302.65    64.91     0.23    0.66    0.14    0.66   0.15   5.15
> sdk               0.00     1.99    0.01  369.94     0.03 12876.74    69.61     0.26    0.71    0.13    0.71   0.16   5.75
> sdf               0.01     1.79    1.55   31.12    39.64  1431.37    90.06     4.07  124.67   16.25  130.05   3.11  10.17
> sdi               0.22     3.17   23.92   72.90  1386.45  2676.28    83.93     7.75   80.00   24.31   98.27   4.31  41.77
> sdm               0.16     3.10   17.63   72.84   986.29  2767.24    82.98     6.57   72.64   23.67   84.50   4.23  38.30
> sdl               0.11     3.01   12.10   55.14   660.85  2361.40    89.89    17.87  265.80   21.64  319.36   4.08  27.45
> sdg               0.08     2.45    9.75   53.90   489.67  1929.42    76.01    17.27  271.30   20.77  316.61   3.98  25.33
> sdh               0.10     2.76   11.28   60.97   600.10  2114.48    75.14     1.70   23.55   22.92   23.66   4.10  29.60
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    0.50    0.00   146.00     0.00   584.00     0.01   16.00   16.00    0.00  16.00   0.80
> sdb               0.00     0.50    9.00  119.00  2036.00  2578.00    72.09     0.68    5.50    7.06    5.39   2.36  30.25
> sdc               0.00     4.00   34.00  129.00   494.00  6987.75    91.80     1.70   10.44   17.00    8.72   4.44  72.40
> sdd               0.00     1.50    1.50   95.50    74.00  2396.50    50.94     0.85    8.75   23.33    8.52   7.53  73.05
> sde               0.00    37.00   11.00    1.00    46.00   152.00    33.00     0.01    1.00    0.64    5.00   0.54   0.65
> sdj               0.00     0.50    0.00  970.50     0.00 12594.00    25.95     0.09    0.09    0.00    0.09   0.08   8.20
> sdk               0.00     0.00    0.00  977.50     0.00 12016.00    24.59     0.10    0.10    0.00    0.10   0.09   8.90
> sdf               0.00     0.50    0.50   37.50     2.00   230.25    12.22     9.63   10.58    8.00   10.61   1.79   6.80
> sdi               2.00     0.00   10.50    0.00  2528.00     0.00   481.52     0.10    9.33    9.33    0.00   7.76   8.15
> sdm               0.00     0.50   15.00  116.00   546.00   833.25    21.06     0.94    7.17   14.03    6.28   4.13  54.15
> sdl               0.00     0.00    3.00    0.00    26.00     0.00    17.33     0.02    7.50    7.50    0.00   7.50   2.25
> sdg               0.00     3.50    1.00   64.50     4.00  2929.25    89.56     0.40    6.04    9.00    5.99   3.42  22.40
> sdh               0.50     0.50    4.00   64.00   770.00  1105.00    55.15     4.96  189.42   21.25  199.93   4.21  28.60
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    8.50    0.00   110.00     0.00    25.88     0.01    1.59    1.59    0.00   1.53   1.30
> sdb               0.00     4.00    6.50  117.50   494.00  4544.50    81.27     0.87    6.99   11.62    6.73   3.28  40.70
> sdc               0.00     0.50    5.50  202.50   526.00  4123.00    44.70     1.80    8.66   18.73    8.39   2.08  43.30
> sdd               0.00     3.00    2.50  227.00   108.00  6952.00    61.53    46.10  197.44   30.20  199.29   3.86  88.60
> sde               0.00     0.00    0.00    1.50     0.00     6.00     8.00     0.00    2.33    0.00    2.33   1.33   0.20
> sdj               0.00     0.00    0.00  834.00     0.00  9912.00    23.77     0.08    0.09    0.00    0.09   0.08   6.75
> sdk               0.00     0.00    0.00  777.00     0.00 12318.00    31.71     0.12    0.15    0.00    0.15   0.10   7.70
> sdf               0.00     1.00    4.50  117.00   198.00   693.25    14.67    34.86  362.88   84.33  373.60   3.59  43.65
> sdi               0.00     0.00    1.50    0.00     6.00     0.00     8.00     0.01    9.00    9.00    0.00   9.00   1.35
> sdm               0.50     3.00    3.50  143.00  1014.00  4205.25    71.25     0.93    5.95   20.43    5.59   3.08  45.15
> sdl               0.50     0.00    8.00  148.50  1578.00  2128.50    47.37     0.82    5.27    6.44    5.21   3.40  53.20
> sdg               1.50     2.00   10.50  100.50  2540.00  2039.50    82.51     0.77    7.00   14.19    6.25   5.42  60.20
> sdh               0.50     0.00    5.00    0.00  1050.00     0.00   420.00     0.04    7.10    7.10    0.00   7.10   3.55
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    6.00    0.00   604.00     0.00   201.33     0.03    5.58    5.58    0.00   5.58   3.35
> sdb               0.00     6.00    7.00  236.00   132.00  8466.00    70.77    45.48  186.59   31.79  191.18   1.62  39.45
> sdc               2.00     0.00   19.50   46.50  6334.00   686.00   212.73     0.39    5.96    7.97    5.12   3.57  23.55
> sdd               0.00     1.00    3.00   20.00    72.00  1527.25   139.07     0.31   47.67    6.17   53.90   3.11   7.15
> sde               0.00    17.00    0.00    4.50     0.00   184.00    81.78     0.01    2.33    0.00    2.33   2.33   1.05
> sdj               0.00     0.00    0.00  805.50     0.00 12760.00    31.68     0.21    0.27    0.00    0.27   0.09   7.35
> sdk               0.00     0.00    0.00  438.00     0.00 14300.00    65.30     0.24    0.54    0.00    0.54   0.13   5.65
> sdf               0.00     0.00    1.00    0.00     6.00     0.00    12.00     0.00    2.50    2.50    0.00   2.50   0.25
> sdi               0.00     5.50   14.50   27.50   394.00  6459.50   326.36     0.86   20.18   11.00   25.02   7.42  31.15
> sdm               0.00     1.00    9.00  175.00   554.00  3173.25    40.51     1.12    6.38    7.22    6.34   2.41  44.40
> sdl               0.00     2.00    2.50  100.50    26.00  2483.00    48.72     0.77    7.47   11.80    7.36   2.10  21.65
> sdg               0.00     4.50    9.00  214.00   798.00  7417.00    73.68    66.56  298.46   28.83  309.80   3.35  74.70
> sdh               0.00     0.00   16.50    0.00   344.00     0.00    41.70     0.09    5.61    5.61    0.00   4.55   7.50
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               1.00     0.00    9.00    0.00  3162.00     0.00   702.67     0.07    8.06    8.06    0.00   6.06   5.45
> sdb               0.50     0.00   12.50   13.00  1962.00   298.75   177.31     0.63   30.00    4.84   54.19   9.96  25.40
> sdc               0.00     0.50    3.50  131.00    18.00  1632.75    24.55     0.87    6.48   16.86    6.20   3.51  47.25
> sdd               0.00     0.00    4.00    0.00    72.00    16.00    44.00     0.26   10.38   10.38    0.00  23.38   9.35
> sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00  843.50     0.00 16334.00    38.73     0.19    0.23    0.00    0.23   0.11   9.10
> sdk               0.00     0.00    0.00  803.00     0.00 10394.00    25.89     0.07    0.08    0.00    0.08   0.08   6.25
> sdf               0.00     4.00   11.00   90.50   150.00  2626.00    54.70     0.59    5.84    3.82    6.08   4.06  41.20
> sdi               0.00     3.50   17.50  130.50  2132.00  6309.50   114.07     1.84   12.55   25.60   10.80   5.76  85.30
> sdm               0.00     4.00    2.00  139.00    44.00  1957.25    28.39     0.89    6.28   14.50    6.17   3.55  50.10
> sdl               0.00     0.50   12.00  101.00   334.00  1449.75    31.57     0.94    8.28   10.17    8.06   2.11  23.85
> sdg               0.00     0.00    2.50    3.00   204.00    17.00    80.36     0.02    5.27    4.60    5.83   3.91   2.15
> sdh               0.00     0.50    9.50   32.50  1810.00   199.50    95.69     0.28    6.69    3.79    7.54   5.12  21.50
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.50     0.50   25.00   24.50  1248.00   394.25    66.35     0.76   15.30   11.62   19.06   5.25  26.00
> sdb               1.50     0.00   13.50   30.00  2628.00   405.25   139.46     0.27    5.94    8.19    4.93   5.31  23.10
> sdc               0.00     6.00    3.00  163.00    60.00  9889.50   119.87     1.66    9.83   28.67    9.48   5.95  98.70
> sdd               0.00    11.00    5.50  353.50    50.00  2182.00    12.43   118.42  329.26   30.27  333.91   2.78  99.90
> sde               0.00     5.50    0.00    1.50     0.00    28.00    37.33     0.00    2.33    0.00    2.33   2.33   0.35
> sdj               0.00     0.00    0.00 1227.50     0.00 22064.00    35.95     0.50    0.41    0.00    0.41   0.10  12.50
> sdk               0.00     0.50    0.00 1073.50     0.00 19248.00    35.86     0.24    0.23    0.00    0.23   0.10  10.40
> sdf               0.00     4.00    0.00  109.00     0.00  4145.00    76.06     0.59    5.44    0.00    5.44   3.63  39.55
> sdi               0.00     1.00    8.50   95.50   218.00  2091.75    44.42     1.06    9.70   18.71    8.90   7.00  72.80
> sdm               0.00     0.00    8.00  177.50    82.00  3173.00    35.09     1.24    6.65   14.31    6.30   3.53  65.40
> sdl               0.00     3.50    3.00  187.50    32.00  2175.25    23.17     1.47    7.68   18.50    7.50   3.85  73.35
> sdg               0.00     0.00    1.00    0.00    12.00     0.00    24.00     0.00    1.50    1.50    0.00   1.50   0.15
> sdh               0.50     1.00   14.00  169.50  2364.00  4568.00    75.55     1.50    8.12   21.25    7.03   4.91  90.10
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     4.00    3.00   60.00   212.00  2542.00    87.43     0.58    8.02   15.50    7.64   7.95  50.10
> sdb               0.00     0.50    2.50   98.00   682.00  1652.00    46.45     0.51    5.13    6.20    5.10   3.05  30.65
> sdc               0.00     2.50    4.00  146.00    16.00  4623.25    61.86     1.07    7.33   13.38    7.17   2.22  33.25
> sdd               0.00     0.50    9.50   30.00   290.00   358.00    32.81     0.84   32.22   49.16   26.85  12.28  48.50
> sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdj               0.00     0.50    0.00  530.00     0.00  7138.00    26.94     0.06    0.11    0.00    0.11   0.09   4.65
> sdk               0.00     0.00    0.00  625.00     0.00  8254.00    26.41     0.07    0.12    0.00    0.12   0.09   5.75
> sdf               0.00     0.00    0.00    4.00     0.00    18.00     9.00     0.01    3.62    0.00    3.62   3.12   1.25
> sdi               0.00     2.50    8.00   61.00   836.00  2681.50   101.96     0.58    9.25   15.12    8.48   6.71  46.30
> sdm               0.00     4.50   11.00  273.00  2100.00  8562.00    75.08    13.49   47.53   24.95   48.44   1.83  52.00
> sdl               0.00     1.00    0.50   49.00     2.00  1038.00    42.02     0.23    4.83   14.00    4.73   2.45  12.15
> sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdh               1.00     1.00    9.00  109.00  2082.00  2626.25    79.80     0.85    7.34    7.83    7.30   3.83  45.20
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     1.50   10.00  177.00   284.00  4857.00    54.98    36.26  194.27   21.85  204.01   3.53  66.00
> sdb               1.00     0.50   39.50  119.50  1808.00  2389.25    52.80     1.58    9.96   12.32    9.18   2.42  38.45
> sdc               0.00     2.00   15.00  200.50   116.00  4951.00    47.03    14.37   66.70   73.87   66.16   2.23  47.95
> sdd               0.00     3.50    6.00   54.50   180.00  2360.50    83.98     0.69   11.36   20.42   10.36   7.99  48.35
> sde               0.00     7.50    0.00   32.50     0.00   160.00     9.85     1.64   50.51    0.00   50.51   1.48   4.80
> sdj               0.00     0.00    0.00  835.00     0.00 10198.00    24.43     0.07    0.09    0.00    0.09   0.08   6.50
> sdk               0.00     0.00    0.00  802.00     0.00 12534.00    31.26     0.23    0.29    0.00    0.29   0.10   8.05
> sdf               0.00     2.50    2.00  133.50    14.00  5272.25    78.03     4.37   32.21    4.50   32.63   1.73  23.40
> sdi               0.00     4.50   17.00  125.50  2676.00  8683.25   159.43     1.86   13.02   27.97   11.00   4.95  70.55
> sdm               0.00     0.00    7.00    0.50   540.00    32.00   152.53     0.05    7.07    7.57    0.00   7.07   5.30
> sdl               0.00     7.00   27.00  276.00  2374.00 11955.50    94.58    25.87   85.36   15.20   92.23   1.84  55.90
> sdg               0.00     0.00   45.00    0.00   828.00     0.00    36.80     0.07    1.62    1.62    0.00   0.68   3.05
> sdh               0.00     0.50    0.50   65.50     2.00  1436.25    43.58     0.51    7.79   16.00    7.73   3.61  23.80
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     8.00   14.50  150.00   122.00   929.25    12.78    20.65   70.61    7.55   76.71   1.46  24.05
> sdb               0.00     5.00    8.00  283.50    86.00  2757.50    19.51    69.43  205.40   51.75  209.73   2.40  69.85
> sdc               0.00     0.00   12.50    1.50   350.00    48.25    56.89     0.25   17.75   17.00   24.00   4.75   6.65
> sdd               0.00     3.50   36.50  141.00   394.00  2338.75    30.79     1.50    8.42   16.16    6.41   4.56  80.95
> sde               0.00     1.50    0.00    1.00     0.00    10.00    20.00     0.00    2.00    0.00    2.00   2.00   0.20
> sdj               0.00     0.00    0.00 1059.00     0.00 18506.00    34.95     0.19    0.18    0.00    0.18   0.10  10.75
> sdk               0.00     0.00    0.00 1103.00     0.00 14220.00    25.78     0.09    0.08    0.00    0.08   0.08   8.35
> sdf               0.00     5.50    2.00   19.50     8.00  5158.75   480.63     0.17    8.05    6.50    8.21   6.95  14.95
> sdi               0.00     5.50   28.00  224.50  2210.00  8971.75    88.57   122.15  328.47   27.43  366.02   3.71  93.70
> sdm               0.00     0.00   13.00    4.00   718.00    16.00    86.35     0.15    3.76    4.23    2.25   3.62   6.15
> sdl               0.00     0.00   16.50    0.00   832.00     0.00   100.85     0.02    1.12    1.12    0.00   1.09   1.80
> sdg               0.00     2.50   17.00   23.50  1032.00  3224.50   210.20     0.25    6.25    2.56    8.91   3.41  13.80
> sdh               0.00    10.50    4.50  241.00    66.00  7252.00    59.62    23.00   91.66    4.22   93.29   2.11  51.85
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.50    3.50   91.00    92.00   552.75    13.65    36.27  479.41   81.57  494.71   5.65  53.35
> sdb               0.00     1.00    6.00  168.00   224.00   962.50    13.64    83.35  533.92   62.00  550.77   5.75 100.00
> sdc               0.00     1.00    3.00  171.00    16.00  1640.00    19.03     1.08    6.18   11.83    6.08   3.15  54.80
> sdd               0.00     5.00    5.00  107.50   132.00  6576.75   119.27     0.79    7.06   18.80    6.51   5.13  57.70
> sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00 1111.50     0.00 22346.00    40.21     0.27    0.24    0.00    0.24   0.11  12.10
> sdk               0.00     0.00    0.00 1022.00     0.00 33040.00    64.66     0.68    0.67    0.00    0.67   0.13  13.60
> sdf               0.00     5.50    2.50   91.00    12.00  4977.25   106.72     2.29   24.48   14.40   24.76   2.42  22.60
> sdi               0.00     0.00   10.00   69.50   368.00   858.50    30.86     7.40  586.41    5.50  669.99   4.21  33.50
> sdm               0.00     4.00    8.00  210.00   944.00  5833.50    62.18     1.57    7.62   18.62    7.20   4.57  99.70
> sdl               0.00     0.00    7.50   22.50   104.00   253.25    23.82     0.14    4.82    5.07    4.73   4.03  12.10
> sdg               0.00     4.00    1.00   84.00     4.00  3711.75    87.43     0.58    6.88   12.50    6.81   5.75  48.90
> sdh               0.00     3.50    7.50   44.00    72.00  2954.25   117.52     1.54   39.50   61.73   35.72   6.40  32.95
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    1.00    0.00    20.00     0.00    40.00     0.01   14.50   14.50    0.00  14.50   1.45
> sdb               0.00     7.00   10.50  198.50  2164.00  6014.75    78.27     1.94    9.29   28.90    8.25   4.77  99.75
> sdc               0.00     2.00    4.00   95.50   112.00  5152.25   105.81     0.94    9.46   24.25    8.84   4.68  46.55
> sdd               0.00     1.00    2.00  131.00    10.00  7167.25   107.93     4.55   34.23   83.25   33.48   2.52  33.55
> sde               0.00     0.00    0.00    0.50     0.00     2.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00  541.50     0.00  6468.00    23.89     0.05    0.10    0.00    0.10   0.09   5.00
> sdk               0.00     0.00    0.00  509.00     0.00  7704.00    30.27     0.07    0.14    0.00    0.14   0.10   4.85
> sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdi               0.00     0.00    3.50    0.00    90.00     0.00    51.43     0.04   10.14   10.14    0.00  10.14   3.55
> sdm               0.00     2.00    5.00  102.50  1186.00  4583.00   107.33     0.81    7.56   23.20    6.80   2.78  29.85
> sdl               0.00    14.00   10.00  216.00   112.00  3645.50    33.25    73.45  311.05   46.30  323.31   3.51  79.35
> sdg               0.00     1.00    0.00   52.50     0.00   240.00     9.14     0.25    4.76    0.00    4.76   4.48  23.50
> sdh               0.00     0.00    3.50    0.00    18.00     0.00    10.29     0.02    7.00    7.00    0.00   7.00   2.45
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    1.00    0.00     4.00     0.00     8.00     0.01   14.50   14.50    0.00  14.50   1.45
> sdb               0.00     9.00    2.00  292.00   192.00 10925.75    75.63    36.98  100.27   54.75  100.58   2.95  86.60
> sdc               0.00     9.00   10.50  151.00    78.00  6771.25    84.82    36.06   94.60   26.57   99.33   3.77  60.85
> sdd               0.00     0.00    5.00    1.00    74.00    24.00    32.67     0.03    5.00    6.00    0.00   5.00   3.00
> sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00  787.50     0.00  9418.00    23.92     0.07    0.10    0.00    0.10   0.09   6.70
> sdk               0.00     0.00    0.00  766.50     0.00  9400.00    24.53     0.08    0.11    0.00    0.11   0.10   7.70
> sdf               0.00     0.00    0.50   41.50     6.00   391.00    18.90     0.24    5.79    9.00    5.75   5.50  23.10
> sdi               0.00    10.00    9.00  268.00    92.00  1618.75    12.35    68.20  150.90   15.50  155.45   2.36  65.30
> sdm               0.00    11.50   10.00  330.50    72.00  3201.25    19.23    68.83  139.38   37.45  142.46   1.84  62.80
> sdl               0.00     2.50    2.50  228.50    14.00  2526.00    21.99    90.42  404.71  242.40  406.49   4.33 100.00
> sdg               0.00     5.50    7.50  298.00    68.00  5275.25    34.98    75.31  174.85   26.73  178.58   2.67  81.60
> sdh               0.00     0.00    2.50    2.00    28.00    24.00    23.11     0.01    2.78    5.00    0.00   2.78   1.25
>
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> On Sun, Oct 4, 2015 at 12:16 AM, Josef Johansson  wrote:
> Hi,
>
> I don't know what brand those 4TB spindles are, but I know that mine are very bad at doing writes at the same time as reads, especially small mixed reads and writes.
>
> This has an absurdly bad effect when doing maintenance on Ceph. That being said, we see a big difference between Dumpling and Hammer in performance on these drives, most likely because Hammer is able to read/write degraded PGs.
>
> We have run into two different problems along the way. The first was blocked requests, where we had to upgrade from 64GB of memory on each node to 256GB. We thought that buying more memory was the only safe way to make things better.
>
> I believe it worked because more reads were cached, so we had less mixed read/write on the nodes, giving the spindles more room to breathe. It was a shot in the dark at the time, but the price is not that high even just to try it out, compared to 6 people working on the problem. I believe the IO on the disks was not huge either, but what kills the disks is high latency. How much bandwidth are the disks using? Ours was very low, 3-5MB/s.
>
> The second problem was fragmentation hitting 70%; lowering that to 6% made a lot of difference. Depending on the IO pattern, fragmentation builds back up at different rates.
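
For reference, XFS fragmentation on an OSD data partition can be measured and reduced roughly as follows. This is only a sketch; the device, mount point, and OSD id shown are assumptions, not details from this thread:

    # report the fragmentation factor of the filesystem backing an OSD
    xfs_db -r -c frag /dev/sdb1

    # defragment the files under that OSD's data directory (I/O intensive,
    # so best done while the OSD is throttled or out)
    xfs_fsr -v /var/lib/ceph/osd/ceph-13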
>
> TL;DR read kills the 4TB spindles.
>
> Hope you guys get clear of the woods.
> /Josef
>
> On 3 Oct 2015 10:10 pm, "Robert LeBlanc"  wrote:
> - -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> We are still struggling with this and have tried a lot of different
> things. Unfortunately, Inktank (now Red Hat) no longer provides
> consulting services for non-Red Hat systems. If there are any
> certified Ceph consultants in the US who can do both remote and
> on-site engagements, please let us know.
>
> This certainly seems to be network related, but somewhere in the
> kernel. We have tried increasing the network and TCP buffers and the
> number of TCP sockets, and reducing the FIN_WAIT2 timeout. There is about 25% idle
> on the boxes, the disks are busy, but not constantly at 100% (they
> cycle from <10% up to 100%, but not 100% for more than a few seconds
> at a time). There seems to be no reasonable explanation why I/O is
> blocked pretty frequently longer than 30 seconds. We have verified
> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> network admins have verified that packets are not being dropped in the
> switches for these nodes. We have tried different kernels including
> the recent Google patch to cubic. This is showing up on three clusters
> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> (from CentOS 7.1) with similar results.
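
The tuning described above is normally done through sysctl; a rough sketch of the kind of knobs involved is below. The values are illustrative assumptions rather than what was actually used on these clusters, and the jumbo-frame check sends 8972-byte payloads because 9000-byte frames leave 28 bytes for the IP and ICMP headers:

    # enlarge socket buffers and shorten FIN_WAIT2 (example values only)
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
    sysctl -w net.ipv4.tcp_fin_timeout=15

    # verify jumbo frames end to end without fragmentation (host is an example)
    ping -M do -s 8972 -c 5 192.168.55.12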
>
> The messages seem slightly different:
> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> 100.087155 secs
> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> cluster [WRN] slow request 30.041999 seconds old, received at
> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> points reached
>
> I don't know what "no flag points reached" means.
>
> The problem is most pronounced when we have to reboot an OSD node (1
> of 13): we will have hundreds of I/Os blocked, sometimes for up to 300
> seconds. It takes a good 15 minutes for things to settle down. The
> production cluster is very busy, normally doing 8,000 IOPS and peaking
> at 15,000. This is all 4TB spindles with SSD journals and the disks
> are between 25-50% full. We are currently splitting PGs to distribute
> the load better across the disks, but we are having to do this 10 PGs
> at a time as we get blocked I/O. We have max_backfills and
> max_recovery set to 1, client op priority is set higher than recovery
> priority. We tried increasing the number of op threads but this didn't
> seem to help. It seems that as soon as PGs are finished being checked
> they become active, which could be the cause of slow I/O while the
> other PGs are still being checked.
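
For context, the throttling described here maps onto a handful of OSD options; a hedged sketch of setting them at runtime is below. The option names are the standard Hammer-era ones, but only the backfill/recovery limits of 1 come from the thread; the priority values are assumptions:

    # limit recovery impact and prefer client ops over recovery traffic
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
    ceph tell osd.* injectargs '--osd-client-op-priority 63 --osd-recovery-op-priority 1'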
>
> What I don't understand is that the messages are delayed. As soon as
> the message is received by Ceph OSD process, it is very quickly
> committed to the journal and a response is sent back to the primary
> OSD, which is received very quickly as well. I've adjusted
> min_free_kbytes and it seems to keep the OSDs from crashing, but it
> doesn't solve the main problem. We don't have swap, and there is 64 GB
> of RAM per node for 10 OSDs.
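
A minimal sketch of the min_free_kbytes adjustment mentioned above; the value shown is an assumption, since the thread does not say what was used:

    # reserve more free memory for the kernel to reduce allocation stalls
    sysctl -w vm.min_free_kbytes=2097152
    # persist across reboots
    echo 'vm.min_free_kbytes = 2097152' >> /etc/sysctl.conf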
>
> Is there something that could cause the kernel to get a packet but not
> be able to dispatch it to Ceph, which would explain why we are seeing
> this blocked I/O for 30+ seconds? Are there any pointers on tracing
> Ceph messages from the network buffer through the kernel to the Ceph
> process?
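
One hedged way to check whether data is sitting in kernel socket buffers rather than being read by the OSD process is to watch the receive queues on the OSD connections and capture the messenger traffic for later correlation with the Ceph logs. This is only a sketch; the interface name and the single port are assumptions (OSDs actually use a range of ports starting at 6800):

    # a growing Recv-Q on an established OSD connection suggests the
    # process is not draining the socket
    ss -tni '( sport >= :6800 )'

    # capture messenger traffic headers for offline analysis
    tcpdump -i eth2 -s 128 -w osd_msgr.pcap 'tcp port 6800'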
>
> We could really use some pointers, no matter how outrageous. We've had
> over 6 people looking into this for weeks now and just can't think of
> anything else.
>
> Thanks,
> - -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
> l7OF
> =OI++
> - -----END PGP SIGNATURE-----
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
> > We dropped the replication on our cluster from 4 to 3 and it looks
> > like all the blocked I/O has stopped (no entries in the log for the
> > last 12 hours). This makes me believe that there is some issue with
> > the number of sockets or some other TCP issue. We have not messed with
> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> > processes and 16K system wide.
> >
> > Does this seem like the right spot to be looking? What are some
> > configuration items we should be looking at?
> >
> > Thanks,
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
> >> seems that there were some major reworks in the network handling in
> >> the kernel to efficiently handle that network rate. If I remember
> >> right we also saw a drop in CPU utilization. I'm starting to think
> >> that we did see packet loss while congesting our ISLs in our initial
> >> testing, but we could not tell where the dropping was happening. We
> >> saw some on the switches, but it didn't seem to be bad if we weren't
> >> trying to congest things. We probably already saw this issue, just
> >> didn't know it.
> >> - ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
> >>> drivers might cause problems though.
> >>>
> >>> Here's ifconfig from one of the nodes:
> >>>
> >>> ens513f1: flags=4163  mtu 1500
> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
> >>>         RX errors 0  dropped 0  overruns 0  frame 0
> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >>>
> >>> Mark
> >>>
> >>>
> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >>>>
> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>> Hash: SHA256
> >>>>
> >>>> OK, here is the update on the saga...
> >>>>
> >>>> I traced some more of blocked I/Os and it seems that communication
> >>>> between two hosts seemed worse than others. I did a two way ping flood
> >>>> between the two hosts using max packet sizes (1500). After 1.5M
> >>>> packets, no lost pings. I then had the ping flood running while I
> >>>> put Ceph load on the cluster, and the dropped pings started increasing;
> >>>> after stopping the Ceph workload the pings stopped dropping.
> >>>>
> >>>> I then ran iperf between all the nodes with the same results, so that
> >>>> ruled out Ceph to a large degree. I then booted into the
> >>>> 3.10.0-229.14.1.el7.x86_64 kernel, and after an hour of testing so far
> >>>> there haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
> >>>> need the network enhancements in the 4.x series to work well.
> >>>>
> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
> >>>> kernel to see where this issue is introduced. Both of the clusters
> >>>> with this issue are running 4.x; other than that, they have pretty
> >>>> different hardware and network configs.
> >>>>
> >>>> Thanks,
> >>>> -----BEGIN PGP SIGNATURE-----
> >>>> Version: Mailvelope v1.1.0
> >>>> Comment: https://www.mailvelope.com
> >>>>
> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
> >>>> 4OEo
> >>>> =P33I
> >>>> -----END PGP SIGNATURE-----
> >>>> ----------------
> >>>> Robert LeBlanc
> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>
> >>>>
> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> >>>> wrote:
> >>>>>
> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>> Hash: SHA256
> >>>>>
> >>>>> This is IPoIB and we have the MTU set to 64K. There was some issues
> >>>>> pinging hosts with "No buffer space available" (hosts are currently
> >>>>> configured for 4GB to test SSD caching rather than page cache). I
> >>>>> found that MTU under 32K worked reliable for ping, but still had the
> >>>>> blocked I/O.
> >>>>>
> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
> >>>>> the blocked I/O.
> >>>>> - ----------------
> >>>>> Robert LeBlanc
> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>
> >>>>>
> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> >>>>>>
> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >>>>>>>
> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
> >>>>>>> between when osd.17 started sending the osd_repop message and when
> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
> >>>>>>> delayed for many 10s of seconds?
> >>>>>>
> >>>>>>
> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
> >>>>>> has
> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
> >>>>>>
> >>>>>> sage
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> What kernel are you running?
> >>>>>>> -Sam
> >>>>>>>
> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> >>>>>>>>
> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>>> Hash: SHA256
> >>>>>>>>
> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
> >>>>>>>> extracted what I think are important entries from the logs for the
> >>>>>>>> first blocked request. NTP is running all the servers so the logs
> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >>>>>>>>
> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
> >>>>>>>>
> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
> >>>>>>>> transfer).
> >>>>>>>>
> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
> >>>>>>>> later. To me it seems that the message is getting received but not
> >>>>>>>> passed to another thread right away or something. This test was done
> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
> >>>>>>>> thread.
> >>>>>>>>
> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
> >>>>>>>> some help.
> >>>>>>>>
> >>>>>>>> Single Test started about
> >>>>>>>> 2015-09-22 12:52:36
> >>>>>>>>
> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.439150 secs
> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
> >>>>>>>> 2015-09-22 12:55:06.487451:
> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,16
> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> >>>>>>>> 30.379680 secs
> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> >>>>>>>> 12:55:06.406303:
> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,17
> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> >>>>>>>> 12:55:06.318144:
> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,14
> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.954212 secs
> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
> >>>>>>>> 2015-09-22 12:57:33.044003:
> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 16,17
> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.704367 secs
> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
> >>>>>>>> 2015-09-22 12:57:33.055404:
> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,17
> >>>>>>>>
> >>>>>>>> Server   IP addr              OSD
> >>>>>>>> nodev  - 192.168.55.11 - 12
> >>>>>>>> nodew  - 192.168.55.12 - 13
> >>>>>>>> nodex  - 192.168.55.13 - 16
> >>>>>>>> nodey  - 192.168.55.14 - 17
> >>>>>>>> nodez  - 192.168.55.15 - 14
> >>>>>>>> nodezz - 192.168.55.16 - 15
> >>>>>>>>
> >>>>>>>> fio job:
> >>>>>>>> [rbd-test]
> >>>>>>>> readwrite=write
> >>>>>>>> blocksize=4M
> >>>>>>>> #runtime=60
> >>>>>>>> name=rbd-test
> >>>>>>>> #readwrite=randwrite
> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> >>>>>>>> #rwmixread=72
> >>>>>>>> #norandommap
> >>>>>>>> #size=1T
> >>>>>>>> #blocksize=4k
> >>>>>>>> ioengine=rbd
> >>>>>>>> rbdname=test2
> >>>>>>>> pool=rbd
> >>>>>>>> clientname=admin
> >>>>>>>> iodepth=8
> >>>>>>>> #numjobs=4
> >>>>>>>> #thread
> >>>>>>>> #group_reporting
> >>>>>>>> #time_based
> >>>>>>>> #direct=1
> >>>>>>>> #ramp_time=60
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -----BEGIN PGP SIGNATURE-----
> >>>>>>>> Version: Mailvelope v1.1.0
> >>>>>>>> Comment: https://www.mailvelope.com
> >>>>>>>>
> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
> >>>>>>>> J3hS
> >>>>>>>> =0J7F
> >>>>>>>> -----END PGP SIGNATURE-----
> >>>>>>>> ----------------
> >>>>>>>> Robert LeBlanc
> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
> >>>>>>>>>
> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
> >>>>>>>>>>
> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>>>>> Hash: SHA256
> >>>>>>>>>>
> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> You can search for the (mangled) name _split_collection
> >>>>>>>>>>
> >>>>>>>>>> I'm not
> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
> >>>>>>>>> this, it was discussed not too long ago.
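
The thresholds Greg refers to are FileStore options; a hedged ceph.conf sketch is below. The values are arbitrary illustrations of "bumping up" the defaults, not recommendations from the thread:

    [osd]
    filestore split multiple = 8
    filestore merge threshold = 40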
> >>>>>>>>>
> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
> >>>>>>>>>>
> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
> >>>>>>>>>> the
> >>>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
> >>>>>>>>>> having to create new file and therefore split collections. This is
> >>>>>>>>>> on
> >>>>>>>>>> my test cluster with no other load.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
> >>>>>>>>>> would be the most helpful for tracking this issue down?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
> >>>>>>>>> 20",
> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
> >>>>>>>>> out
> >>>>>>>>> everything you need to track exactly what each Op is doing.
> >>>>>>>>> -Greg
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>>>>>> in
> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>> -----BEGIN PGP SIGNATURE-----
> >>>>> Version: Mailvelope v1.1.0
> >>>>> Comment: https://www.mailvelope.com
> >>>>>
> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
> >>>>> gcZm
> >>>>> =CjwB
> >>>>> -----END PGP SIGNATURE-----
> >>>>
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>
> >>>
> >>
> >> -----BEGIN PGP SIGNATURE-----
> >> Version: Mailvelope v1.1.0
> >> Comment: https://www.mailvelope.com
> >>
> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
> >> ae22
> >> =AX+L
> >> -----END PGP SIGNATURE-----
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFIuzjOF24JIZ5en40jlLrCI@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWEVA5CRDmVDuy+mK58QAAY4EP/2jTEGPrbR3KDOC1d6FU
> 7TkVeFtow7UCe9/TwArLtcEVTr8rdaXNWRi7gat99zbL5pw+96Sj6bGqpKVz
> ZBSHcBlLIl42Hj10Ju7Svpwn7Q9RnSGOvjEdghEKsTxnf37gZD/KjvMbidJu
> jlPGEfnGEdYbQ+vDYoCoUIuvUNPbCvWQTjJpnTXrMZfhhEBoOepMzF9s6L6B
> AWR9WUrtz4HtGSMT42U1gd3LDOUh/5Ioy6FuhJe04piaf3ikRg+pjX47/WBd
> mQupmKJOblaULCswOrMLTS9R2+p6yaWj0zlUb3OAOErO7JR8OWZ2H7tYjkQN
> rGPsIRNv4yKw2Z5vJdHLksVdYhBQY1I4N1GO3+hf+j/yotPC9Ay4BYLZrQwf
> 3L+uhqSEu80erZjsJF4lilmw0l9nbDSoXc0MqRoXrpUIqyVtmaCBynv5Xq7s
> L5idaH6iVPBwy4Y6qzVuQpP0LaHp48ojIRx7likQJt0MSeDzqnslnp5B/9nb
> Ppu3peRUKf5GEKISRQ6gOI3C4gTSSX6aBatWdtpm01Et0T6ysoxAP/VoO3Nb
> 0PDsuYYT0U1MYqi0USouiNc4yRWNb9hkkBHEJrwjtP52moL1WYdYleL6w+FS
> Y1YQ1DU8YsEtVniBmZc4TBQJRRIS6SaQjH108JCjUcy9oVNwRtOqbcT1aiI6
> EP/Q
> =efx7
> -----END PGP SIGNATURE-----
>
>

[-- Attachment #1.2: Type: text/html, Size: 62107 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                         ` <alpine.DEB.2.00.1510040646020.5233-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-10-04 21:04                                                           ` Robert LeBlanc
       [not found]                                                             ` <CAANLjFqgJbLEsBYEW=bk0h+Lmop-MLX=eA7qx98h-3Z6M1x7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-04 21:04 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I have eight nodes running the fio job rbd_test_real to different RBD
volumes. I've included the CRUSH map in the tarball.

I stopped one OSD process and marked it out. I let it recover for a
few minutes and then I started the process again and marked it in. I
started getting blocked I/O messages during the recovery.
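
That test procedure corresponds to roughly the following commands; the OSD id and the service manager are assumptions made for illustration:

    systemctl stop ceph-osd@13     # stop one OSD process
    ceph osd out 13                # mark it out and let recovery run
    # ...wait a few minutes...
    systemctl start ceph-osd@13    # start the process again
    ceph osd in 13                 # mark it back in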

The logs are located at http://162.144.87.113/files/ushou1.tar.xz

Thanks,
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
3EPx
=UDIV
-----END PGP SIGNATURE-----

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> We are still struggling with this and have tried a lot of different
>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>> consulting services for non-Red Hat systems. If there are some
>> certified Ceph consultants in the US that we can do both remote and
>> on-site engagements, please let us know.
>>
>> This certainly seems to be network related, but somewhere in the
>> kernel. We have tried increasing the network and TCP buffers, number
>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>> on the boxes, the disks are busy, but not constantly at 100% (they
>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>> at a time). There seems to be no reasonable explanation why I/O is
>> blocked pretty frequently longer than 30 seconds. We have verified
>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>> network admins have verified that packets are not being dropped in the
>> switches for these nodes. We have tried different kernels including
>> the recent Google patch to cubic. This is showing up on three cluster
>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>> (from CentOS 7.1) with similar results.
>>
>> The messages seem slightly different:
>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>> 100.087155 secs
>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>> cluster [WRN] slow request 30.041999 seconds old, received at
>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>> points reached
>>
>> I don't know what "no flag points reached" means.
>
> Just that the op hasn't been marked as reaching any interesting points
> (op->mark_*() calls).
>
> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
> It's extremely verbose but it'll let us see where the op is getting
> blocked.  If you see the "slow request" message it means the op is
> received by ceph (that's when the clock starts), so I suspect it's not
> something we can blame on the network stack.
>
> sage
>
>
>>
>> The problem is most pronounced when we have to reboot an OSD node (1
>> of 13), we will have hundreds of I/O blocked for some times up to 300
>> seconds. It takes a good 15 minutes for things to settle down. The
>> production cluster is very busy doing normally 8,000 I/O and peaking
>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>> are between 25-50% full. We are currently splitting PGs to distribute
>> the load better across the disks, but we are having to do this 10 PGs
>> at a time as we get blocked I/O. We have max_backfills and
>> max_recovery set to 1, client op priority is set higher than recovery
>> priority. We tried increasing the number of op threads but this didn't
>> seem to help. It seems as soon as PGs are finished being checked, they
>> become active and could be the cause for slow I/O while the other PGs
>> are being checked.
>>
>> What I don't understand is that the messages are delayed. As soon as
>> the message is received by Ceph OSD process, it is very quickly
>> committed to the journal and a response is sent back to the primary
>> OSD which is received very quickly as well. I've adjust
>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>> doesn't solve the main problem. We don't have swap and there is 64 GB
>> of RAM per nodes for 10 OSDs.
>>
>> Is there something that could cause the kernel to get a packet but not
>> be able to dispatch it to Ceph such that it could be explaining why we
>> are seeing these blocked I/O for 30+ seconds. Is there some pointers
>> to tracing Ceph messages from the network buffer through the kernel to
>> the Ceph process?
>>
>> We can really use some pointers no matter how outrageous. We've have
>> over 6 people looking into this for weeks now and just can't think of
>> anything else.
>>
>> Thanks,
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.1.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>> l7OF
>> =OI++
>> -----END PGP SIGNATURE-----
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>> > We dropped the replication on our cluster from 4 to 3 and it looks
>> > like all the blocked I/O has stopped (no entries in the log for the
>> > last 12 hours). This makes me believe that there is some issue with
>> > the number of sockets or some other TCP issue. We have not messed with
>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>> > processes and 16K system wide.
>> >
>> > Does this seem like the right spot to be looking? What are some
>> > configuration items we should be looking at?
>> >
>> > Thanks,
>> > ----------------
>> > Robert LeBlanc
>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >
>> >
>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >> Hash: SHA256
>> >>
>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>> >> seems that there were some major reworks in the network handling in
>> >> the kernel to efficiently handle that network rate. If I remember
>> >> right we also saw a drop in CPU utilization. I'm starting to think
>> >> that we did see packet loss while congesting our ISLs in our initial
>> >> testing, but we could not tell where the dropping was happening. We
>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>> >> trying to congest things. We probably already saw this issue, just
>> >> didn't know it.
>> >> - ----------------
>> >> Robert LeBlanc
>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>
>> >>
>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>> >>> drivers might cause problems though.
>> >>>
>> >>> Here's ifconfig from one of the nodes:
>> >>>
>> >>> ens513f1: flags=4163  mtu 1500
>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>> >>>
>> >>> Mark
>> >>>
>> >>>
>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>> >>>>
>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>> Hash: SHA256
>> >>>>
>> >>>> OK, here is the update on the saga...
>> >>>>
>> >>>> I traced some more of blocked I/Os and it seems that communication
>> >>>> between two hosts seemed worse than others. I did a two way ping flood
>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>> >>>> packets, no lost pings. Then then had the ping flood running while I
>> >>>> put Ceph load on the cluster and the dropped pings started increasing
>> >>>> after stopping the Ceph workload the pings stopped dropping.
>> >>>>
>> >>>> I then ran iperf between all the nodes with the same results, so that
>> >>>> ruled out Ceph to a large degree. I then booted in the the
>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>> >>>> need the network enhancements in the 4.x series to work well.
>> >>>>
>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>> >>>> kernel to see where this issue in introduced. Both of the clusters
>> >>>> with this issue are running 4.x, other than that, they are pretty
>> >>>> differing hardware and network configs.
>> >>>>
>> >>>> Thanks,
>> >>>> -----BEGIN PGP SIGNATURE-----
>> >>>> Version: Mailvelope v1.1.0
>> >>>> Comment: https://www.mailvelope.com
>> >>>>
>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>> >>>> 4OEo
>> >>>> =P33I
>> >>>> -----END PGP SIGNATURE-----
>> >>>> ----------------
>> >>>> Robert LeBlanc
>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>
>> >>>>
>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>> >>>> wrote:
>> >>>>>
>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>> Hash: SHA256
>> >>>>>
>> >>>>> This is IPoIB and we have the MTU set to 64K. There was some issues
>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>> >>>>> found that MTU under 32K worked reliable for ping, but still had the
>> >>>>> blocked I/O.
>> >>>>>
>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>> >>>>> the blocked I/O.
>> >>>>> - ----------------
>> >>>>> Robert LeBlanc
>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>> >>>>>>
>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>> >>>>>>>
>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>> >>>>>>> delayed for many 10s of seconds?
>> >>>>>>
>> >>>>>>
>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>> >>>>>> has
>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>> >>>>>>
>> >>>>>> sage
>> >>>>>>
>> >>>>>>
>> >>>>>>>
>> >>>>>>> What kernel are you running?
>> >>>>>>> -Sam
>> >>>>>>>
>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>> >>>>>>>>
>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>>> Hash: SHA256
>> >>>>>>>>
>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>> >>>>>>>> extracted what I think are important entries from the logs for the
>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>> >>>>>>>>
>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>> >>>>>>>>
>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>> >>>>>>>> transfer).
>> >>>>>>>>
>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>> >>>>>>>> later. To me it seems that the message is getting received but not
>> >>>>>>>> passed to another thread right away or something. This test was done
>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>> >>>>>>>> thread.
>> >>>>>>>>
>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>> >>>>>>>> some help.
>> >>>>>>>>
>> >>>>>>>> Single Test started about
>> >>>>>>>> 2015-09-22 12:52:36
>> >>>>>>>>
>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>>>>>>> 30.439150 secs
>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>> >>>>>>>> 2015-09-22 12:55:06.487451:
>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>> >>>>>>>>   currently waiting for subops from 13,16
>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>> >>>>>>>> 30.379680 secs
>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>> >>>>>>>> 12:55:06.406303:
>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>> >>>>>>>>   currently waiting for subops from 13,17
>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>> >>>>>>>> 12:55:06.318144:
>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>> >>>>>>>>   currently waiting for subops from 13,14
>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>>>>>>> 30.954212 secs
>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>> >>>>>>>> 2015-09-22 12:57:33.044003:
>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>> >>>>>>>>   currently waiting for subops from 16,17
>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>>>>>>> 30.704367 secs
>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>> >>>>>>>> 2015-09-22 12:57:33.055404:
>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>> >>>>>>>>   currently waiting for subops from 13,17
>> >>>>>>>>
>> >>>>>>>> Server   IP addr              OSD
>> >>>>>>>> nodev  - 192.168.55.11 - 12
>> >>>>>>>> nodew  - 192.168.55.12 - 13
>> >>>>>>>> nodex  - 192.168.55.13 - 16
>> >>>>>>>> nodey  - 192.168.55.14 - 17
>> >>>>>>>> nodez  - 192.168.55.15 - 14
>> >>>>>>>> nodezz - 192.168.55.16 - 15
>> >>>>>>>>
>> >>>>>>>> fio job:
>> >>>>>>>> [rbd-test]
>> >>>>>>>> readwrite=write
>> >>>>>>>> blocksize=4M
>> >>>>>>>> #runtime=60
>> >>>>>>>> name=rbd-test
>> >>>>>>>> #readwrite=randwrite
>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>> >>>>>>>> #rwmixread=72
>> >>>>>>>> #norandommap
>> >>>>>>>> #size=1T
>> >>>>>>>> #blocksize=4k
>> >>>>>>>> ioengine=rbd
>> >>>>>>>> rbdname=test2
>> >>>>>>>> pool=rbd
>> >>>>>>>> clientname=admin
>> >>>>>>>> iodepth=8
>> >>>>>>>> #numjobs=4
>> >>>>>>>> #thread
>> >>>>>>>> #group_reporting
>> >>>>>>>> #time_based
>> >>>>>>>> #direct=1
>> >>>>>>>> #ramp_time=60
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>> >>>>>>>> Version: Mailvelope v1.1.0
>> >>>>>>>> Comment: https://www.mailvelope.com
>> >>>>>>>>
>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>> >>>>>>>> J3hS
>> >>>>>>>> =0J7F
>> >>>>>>>> -----END PGP SIGNATURE-----
>> >>>>>>>> ----------------
>> >>>>>>>> Robert LeBlanc
>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>>>>> Hash: SHA256
>> >>>>>>>>>>
>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> You can search for the (mangled) name _split_collection
>> >>>>>>>>>>
>> >>>>>>>>>> I'm not
>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>> >>>>>>>>> this, it was discussed not too long ago.
>> >>>>>>>>>
>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>> >>>>>>>>>>
>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>> >>>>>>>>>> the
>> >>>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
>> >>>>>>>>>> having to create new file and therefore split collections. This is
>> >>>>>>>>>> on
>> >>>>>>>>>> my test cluster with no other load.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>> >>>>>>>>> 20",
>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>> >>>>>>>>> out
>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>> >>>>>>>>> -Greg
>> >>>>>>>>
>> >>>>>>>> --
>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> >>>>>>>> in
>> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>> -----BEGIN PGP SIGNATURE-----
>> >>>>> Version: Mailvelope v1.1.0
>> >>>>> Comment: https://www.mailvelope.com
>> >>>>>
>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>> >>>>> gcZm
>> >>>>> =CjwB
>> >>>>> -----END PGP SIGNATURE-----
>> >>>>
>> >>>> --
>> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>>>
>> >>>
>> >>
>> >> -----BEGIN PGP SIGNATURE-----
>> >> Version: Mailvelope v1.1.0
>> >> Comment: https://www.mailvelope.com
>> >>
>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>> >> ae22
>> >> =AX+L
>> >> -----END PGP SIGNATURE-----
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>


* Re: Potential OSD deadlock?
       [not found]                                                             ` <CAANLjFqgJbLEsBYEW=bk0h+Lmop-MLX=eA7qx98h-3Z6M1x7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-06  3:35                                                               ` Robert LeBlanc
       [not found]                                                                 ` <CAANLjFruw-1yySqO=aY05c0bzuqdkBH0-WcKmyP_+JtSyA1kpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-06  3:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

With some off-list help, we have adjusted
osd_client_message_cap=10000. This seems to have helped a bit and we
have seen some OSDs have a value up to 4,000 for client messages. But
it does not solve the problem with the blocked I/O.
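
For anyone who wants to try the same knob, this is roughly how we
applied it (a sketch; whether you inject it at runtime or persist it
in ceph.conf is up to you, and 10000 is just the value we landed on):

# inject on all running OSDs (takes effect now, not persistent)
ceph tell osd.* injectargs '--osd_client_message_cap 10000'

# or persist it under [osd] in ceph.conf and restart the OSDs:
#   osd client message cap = 10000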

One thing that I have noticed is that almost exactly 30 seconds elapse
between when an OSD boots and the first blocked I/O message. I don't
know if the OSD doesn't have time to get its brain right about a PG
before it starts servicing it, or what exactly is going on.
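
To line those two events up I have just been grepping the OSD log,
something along these lines (a sketch, using osd.17 and the default
log path as an example):

# the startup banner shows when the daemon came up; then look for the
# first slow request warning after it
grep -nE 'ceph version|slow request' /var/log/ceph/ceph-osd.17.log | head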

On another note, I tried upgrading our CentOS dev cluster from Hammer
to master and things didn't go so well. The OSDs would not start
because /var/lib/ceph was not owned by ceph. I chowned the directory
and all of the OSD directories, and the OSDs then started, but never
became active in the cluster. They just sat there after reading all
the PGs. There were sockets open to the monitor, but no OSD-to-OSD
sockets. I tried downgrading to the Infernalis branch and still had
no luck getting the OSDs to come up. The OSD processes were idle
after the initial boot. All packages were installed from gitbuilder.
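
For what it's worth, this is more or less how I was checking the
socket situation (a sketch; 6789 is the default monitor port and
6800-7300 are the default OSD ports):

# TCP connections owned by the OSD processes; a healthy OSD should have
# peer connections in the 6800-7300 range, not just the mon on 6789
ss -tnp | grep ceph-osd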

Thanks,
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
GdXC
=Aigq
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> I have eight nodes running the fio job rbd_test_real to different RBD
> volumes. I've included the CRUSH map in the tarball.
>
> I stopped one OSD process and marked it out. I let it recover for a
> few minutes and then I started the process again and marked it in. I
> started getting blocked I/O messages during the recovery.
>
> The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>
> Thanks,
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
> 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
> jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
> 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
> OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
> ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
> R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
> boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
> sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
> GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
> SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
> PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
> 3EPx
> =UDIV
> -----END PGP SIGNATURE-----
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> We are still struggling with this and have tried a lot of different
>>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>> consulting services for non-Red Hat systems. If there are any
>>> certified Ceph consultants in the US with whom we can do both remote
>>> and on-site engagements, please let us know.
>>>
>>> This certainly seems to be network related, but somewhere in the
>>> kernel. We have tried increasing the network and TCP buffers, number
>>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>>> on the boxes, the disks are busy, but not constantly at 100% (they
>>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>> at a time). There seems to be no reasonable explanation why I/O is
>>> blocked pretty frequently longer than 30 seconds. We have verified
>>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>> network admins have verified that packets are not being dropped in the
>>> switches for these nodes. We have tried different kernels including
>>> the recent Google patch to cubic. This is showing up on three clusters
>>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>>> (from CentOS 7.1) with similar results.
>>>
>>> The messages seem slightly different:
>>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>> 100.087155 secs
>>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>> cluster [WRN] slow request 30.041999 seconds old, received at
>>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>> points reached
>>>
>>> I don't know what "no flag points reached" means.
>>
>> Just that the op hasn't been marked as reaching any interesting points
>> (op->mark_*() calls).
>>
>> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
>> It's extremely verbose but it'll let us see where the op is getting
>> blocked.  If you see the "slow request" message it means the op is
>> received by ceph (that's when the clock starts), so I suspect it's not
>> something we can blame on the network stack.
>>
>> sage
>>
>>
>>>
>>> The problem is most pronounced when we have to reboot an OSD node (1
>>> of 13), we will have hundreds of I/Os blocked, sometimes for up to 300
>>> seconds. It takes a good 15 minutes for things to settle down. The
>>> production cluster is very busy doing normally 8,000 I/O and peaking
>>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>>> are between 25-50% full. We are currently splitting PGs to distribute
>>> the load better across the disks, but we are having to do this 10 PGs
>>> at a time as we get blocked I/O. We have max_backfills and
>>> max_recovery set to 1, client op priority is set higher than recovery
>>> priority. We tried increasing the number of op threads but this didn't
>>> seem to help. It seems as soon as PGs are finished being checked, they
>>> become active and could be the cause for slow I/O while the other PGs
>>> are being checked.
>>>
>>> What I don't understand is that the messages are delayed. As soon as
>>> the message is received by the Ceph OSD process, it is very quickly
>>> committed to the journal and a response is sent back to the primary
>>> OSD, which is received very quickly as well. I've adjusted
>>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>>> doesn't solve the main problem. We don't have swap and there is 64 GB
>>> of RAM per nodes for 10 OSDs.
>>>
>>> Is there something that could cause the kernel to get a packet but not
>>> be able to dispatch it to Ceph, such that it could explain why we
>>> are seeing this blocked I/O for 30+ seconds? Are there some pointers
>>> to tracing Ceph messages from the network buffer through the kernel to
>>> the Ceph process?
>>>
>>> We can really use some pointers, no matter how outrageous. We've had
>>> over 6 people looking into this for weeks now and just can't think of
>>> anything else.
>>>
>>> Thanks,
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.1.0
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>>> l7OF
>>> =OI++
>>> -----END PGP SIGNATURE-----
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>>> > We dropped the replication on our cluster from 4 to 3 and it looks
>>> > like all the blocked I/O has stopped (no entries in the log for the
>>> > last 12 hours). This makes me believe that there is some issue with
>>> > the number of sockets or some other TCP issue. We have not messed with
>>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>>> > processes and 16K system wide.
>>> >
>>> > Does this seem like the right spot to be looking? What are some
>>> > configuration items we should be looking at?
>>> >
>>> > Thanks,
>>> > ----------------
>>> > Robert LeBlanc
>>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >
>>> >
>>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>> >> Hash: SHA256
>>> >>
>>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>>> >> seems that there were some major reworks in the network handling in
>>> >> the kernel to efficiently handle that network rate. If I remember
>>> >> right we also saw a drop in CPU utilization. I'm starting to think
>>> >> that we did see packet loss while congesting our ISLs in our initial
>>> >> testing, but we could not tell where the dropping was happening. We
>>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>>> >> trying to congest things. We probably already saw this issue, just
>>> >> didn't know it.
>>> >> - ----------------
>>> >> Robert LeBlanc
>>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>
>>> >>
>>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>> >>> drivers might cause problems though.
>>> >>>
>>> >>> Here's ifconfig from one of the nodes:
>>> >>>
>>> >>> ens513f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20<link>
>>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>> >>>
>>> >>> Mark
>>> >>>
>>> >>>
>>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>> >>>>
>>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>> Hash: SHA256
>>> >>>>
>>> >>>> OK, here is the update on the saga...
>>> >>>>
>>> >>>> I traced some more of blocked I/Os and it seems that communication
>>> >>>> between two hosts seemed worse than others. I did a two way ping flood
>>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>>> >>>> packets, no lost pings. I then had the ping flood running while I
>>> >>>> put Ceph load on the cluster and the dropped pings started increasing;
>>> >>>> after stopping the Ceph workload the pings stopped dropping.
>>> >>>>
>>> >>>> I then ran iperf between all the nodes with the same results, so that
>>> >>>> ruled out Ceph to a large degree. I then booted into the
>>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>>> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>> >>>> need the network enhancements in the 4.x series to work well.
>>> >>>>
>>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>> >>>> kernel to see where this issue is introduced. Both of the clusters
>>> >>>> with this issue are running 4.x, other than that, they are pretty
>>> >>>> differing hardware and network configs.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> -----BEGIN PGP SIGNATURE-----
>>> >>>> Version: Mailvelope v1.1.0
>>> >>>> Comment: https://www.mailvelope.com
>>> >>>>
>>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>> >>>> 4OEo
>>> >>>> =P33I
>>> >>>> -----END PGP SIGNATURE-----
>>> >>>> ----------------
>>> >>>> Robert LeBlanc
>>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>> Hash: SHA256
>>> >>>>>
>>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>>> >>>>> found that MTU under 32K worked reliably for ping, but still had the
>>> >>>>> blocked I/O.
>>> >>>>>
>>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>> >>>>> the blocked I/O.
>>> >>>>> - ----------------
>>> >>>>> Robert LeBlanc
>>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>
>>> >>>>>
>>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>> >>>>>>
>>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>> >>>>>>>
>>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>> >>>>>>> delayed for many 10s of seconds?
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>> >>>>>> has
>>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>> >>>>>>
>>> >>>>>> sage
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>>
>>> >>>>>>> What kernel are you running?
>>> >>>>>>> -Sam
>>> >>>>>>>
>>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>> >>>>>>>>
>>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>>> Hash: SHA256
>>> >>>>>>>>
>>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>> >>>>>>>> extracted what I think are important entries from the logs for the
>>> >>>>>>>> first blocked request. NTP is running on all the servers so the logs
>>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>> >>>>>>>>
>>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>> >>>>>>>>
>>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>> >>>>>>>> osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
>>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>> >>>>>>>> transfer).
>>> >>>>>>>>
>>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>> >>>>>>>> later. To me it seems that the message is getting received but not
>>> >>>>>>>> passed to another thread right away or something. This test was done
>>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>> >>>>>>>> thread.
>>> >>>>>>>>
>>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>> >>>>>>>> some help.
>>> >>>>>>>>
>>> >>>>>>>> Single Test started about
>>> >>>>>>>> 2015-09-22 12:52:36
>>> >>>>>>>>
>>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>> >>>>>>>> 30.439150 secs
>>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>> >>>>>>>> 2015-09-22 12:55:06.487451:
>>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>> >>>>>>>>   currently waiting for subops from 13,16
>>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>> >>>>>>>> 30.379680 secs
>>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>> >>>>>>>> 12:55:06.406303:
>>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>> >>>>>>>>   currently waiting for subops from 13,17
>>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>> >>>>>>>> 12:55:06.318144:
>>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>> >>>>>>>>   currently waiting for subops from 13,14
>>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>> >>>>>>>> 30.954212 secs
>>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>> >>>>>>>> 2015-09-22 12:57:33.044003:
>>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>> >>>>>>>>   currently waiting for subops from 16,17
>>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>> >>>>>>>> 30.704367 secs
>>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>> >>>>>>>> 2015-09-22 12:57:33.055404:
>>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>> >>>>>>>>   currently waiting for subops from 13,17
>>> >>>>>>>>
>>> >>>>>>>> Server   IP addr              OSD
>>> >>>>>>>> nodev  - 192.168.55.11 - 12
>>> >>>>>>>> nodew  - 192.168.55.12 - 13
>>> >>>>>>>> nodex  - 192.168.55.13 - 16
>>> >>>>>>>> nodey  - 192.168.55.14 - 17
>>> >>>>>>>> nodez  - 192.168.55.15 - 14
>>> >>>>>>>> nodezz - 192.168.55.16 - 15
>>> >>>>>>>>
>>> >>>>>>>> fio job:
>>> >>>>>>>> [rbd-test]
>>> >>>>>>>> readwrite=write
>>> >>>>>>>> blocksize=4M
>>> >>>>>>>> #runtime=60
>>> >>>>>>>> name=rbd-test
>>> >>>>>>>> #readwrite=randwrite
>>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>> >>>>>>>> #rwmixread=72
>>> >>>>>>>> #norandommap
>>> >>>>>>>> #size=1T
>>> >>>>>>>> #blocksize=4k
>>> >>>>>>>> ioengine=rbd
>>> >>>>>>>> rbdname=test2
>>> >>>>>>>> pool=rbd
>>> >>>>>>>> clientname=admin
>>> >>>>>>>> iodepth=8
>>> >>>>>>>> #numjobs=4
>>> >>>>>>>> #thread
>>> >>>>>>>> #group_reporting
>>> >>>>>>>> #time_based
>>> >>>>>>>> #direct=1
>>> >>>>>>>> #ramp_time=60
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> Thanks,
>>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>>> >>>>>>>> Version: Mailvelope v1.1.0
>>> >>>>>>>> Comment: https://www.mailvelope.com
>>> >>>>>>>>
>>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>> >>>>>>>> J3hS
>>> >>>>>>>> =0J7F
>>> >>>>>>>> -----END PGP SIGNATURE-----
>>> >>>>>>>> ----------------
>>> >>>>>>>> Robert LeBlanc
>>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>>>>> Hash: SHA256
>>> >>>>>>>>>>
>>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> You can search for the (mangled) name _split_collection
>>> >>>>>>>>>>
>>> >>>>>>>>>> I'm not
>>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>> >>>>>>>>> this, it was discussed not too long ago.
>>> >>>>>>>>>
>>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>> >>>>>>>>>>
>>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>> >>>>>>>>>> the
>>> >>>>>>>>>> beginning) is still causing the issue. I suspect that it is not
>>> >>>>>>>>>> having to create new files and therefore split collections. This is
>>> >>>>>>>>>> on
>>> >>>>>>>>>> my test cluster with no other load.
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>> >>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>> >>>>>>>>> 20",
>>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>> >>>>>>>>> out
>>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>>> >>>>>>>>> -Greg
>>> >>>>>>>>
>>> >>>>>>>> --
>>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> >>>>>>>> in
>>> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>
>>> >>>>> -----BEGIN PGP SIGNATURE-----
>>> >>>>> Version: Mailvelope v1.1.0
>>> >>>>> Comment: https://www.mailvelope.com
>>> >>>>>
>>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>> >>>>> gcZm
>>> >>>>> =CjwB
>>> >>>>> -----END PGP SIGNATURE-----
>>> >>>>
>>> >>>> --
>>> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >>>>
>>> >>>
>>> >>
>>> >> -----BEGIN PGP SIGNATURE-----
>>> >> Version: Mailvelope v1.1.0
>>> >> Comment: https://www.mailvelope.com
>>> >>
>>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>>> >> ae22
>>> >> =AX+L
>>> >> -----END PGP SIGNATURE-----
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>


* Re: Potential OSD deadlock?
       [not found]                                                                 ` <CAANLjFruw-1yySqO=aY05c0bzuqdkBH0-WcKmyP_+JtSyA1kpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-06 12:37                                                                   ` Sage Weil
       [not found]                                                                     ` <alpine.DEB.2.00.1510060534010.32037-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  2015-10-06 18:03                                                                     ` [ceph-users] " Robert LeBlanc
  0 siblings, 2 replies; 45+ messages in thread
From: Sage Weil @ 2015-10-06 12:37 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> With some off-list help, we have adjusted
> osd_client_message_cap=10000. This seems to have helped a bit and we
> have seen some OSDs have a value up to 4,000 for client messages. But
> it does not solve the problem with the blocked I/O.
> 
> One thing that I have noticed is that almost exactly 30 seconds elapse
> between when an OSD boots and the first blocked I/O message. I don't
> know if the OSD doesn't have time to get its brain right about a PG
> before it starts servicing it, or what exactly is going on.

I'm downloading the logs from yesterday now; sorry it's taking so long.

> On another note, I tried upgrading our CentOS dev cluster from Hammer
> to master and things didn't go so well. The OSDs would not start
> because /var/lib/ceph was not owned by ceph. I chowned the directory
> and all of the OSD directories, and the OSDs then started, but never
> became active in the cluster. They just sat there after reading all
> the PGs. There were sockets open to the monitor, but no OSD-to-OSD
> sockets. I tried downgrading to the Infernalis branch and still had
> no luck getting the OSDs to come up. The OSD processes were idle
> after the initial boot. All packages were installed from gitbuilder.

Did you chown -R ?

	https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer

My guess is you only chowned the root dir, and the OSD didn't throw 
an error when it encountered the other files?  If you can generate a debug 
osd = 20 log, that would be helpful.. thanks!
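
Something along these lines (just a sketch; stop the daemons before
the chown, and the release notes above have the full procedure):

# recursive ownership fix across the whole tree, not just the top dir
chown -R ceph:ceph /var/lib/ceph

# for the log, add this under [osd] in ceph.conf and restart the OSD:
#   debug osd = 20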

sage


> 
> Thanks,
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
> GdXC
> =Aigq
> -----END PGP SIGNATURE-----
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA256
> >
> > I have eight nodes running the fio job rbd_test_real to different RBD
> > volumes. I've included the CRUSH map in the tarball.
> >
> > I stopped one OSD process and marked it out. I let it recover for a
> > few minutes and then I started the process again and marked it in. I
> > started getting blocked I/O messages during the recovery.
> >
> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
> >
> > Thanks,
> > -----BEGIN PGP SIGNATURE-----
> > Version: Mailvelope v1.2.0
> > Comment: https://www.mailvelope.com
> >
> > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
> > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
> > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
> > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
> > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
> > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
> > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
> > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
> > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
> > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
> > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
> > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
> > 3EPx
> > =UDIV
> > -----END PGP SIGNATURE-----
> >
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA256
> >>>
> >>> We are still struggling with this and have tried a lot of different
> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
> >>> consulting services for non-Red Hat systems. If there are any
> >>> certified Ceph consultants in the US with whom we can do both remote
> >>> and on-site engagements, please let us know.
> >>>
> >>> This certainly seems to be network related, but somewhere in the
> >>> kernel. We have tried increasing the network and TCP buffers, number
> >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
> >>> on the boxes, the disks are busy, but not constantly at 100% (they
> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
> >>> at a time). There seems to be no reasonable explanation why I/O is
> >>> blocked pretty frequently longer than 30 seconds. We have verified
> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> >>> network admins have verified that packets are not being dropped in the
> >>> switches for these nodes. We have tried different kernels including
> >>> the recent Google patch to cubic. This is showing up on three clusters
> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> >>> (from CentOS 7.1) with similar results.
> >>>
> >>> The messages seem slightly different:
> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> >>> 100.087155 secs
> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> >>> cluster [WRN] slow request 30.041999 seconds old, received at
> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> >>> points reached
> >>>
> >>> I don't know what "no flag points reached" means.
> >>
> >> Just that the op hasn't been marked as reaching any interesting points
> >> (op->mark_*() calls).
> >>
> >> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
> >> It's extremely verbose but it'll let us see where the op is getting
> >> blocked.  If you see the "slow request" message it means the op is
> >> received by ceph (that's when the clock starts), so I suspect it's not
> >> something we can blame on the network stack.
> >>
> >> sage
> >>
> >>
> >>>
> >>> The problem is most pronounced when we have to reboot an OSD node (1
> >>> of 13), we will have hundreds of I/Os blocked, sometimes for up to 300
> >>> seconds. It takes a good 15 minutes for things to settle down. The
> >>> production cluster is very busy doing normally 8,000 I/O and peaking
> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
> >>> are between 25-50% full. We are currently splitting PGs to distribute
> >>> the load better across the disks, but we are having to do this 10 PGs
> >>> at a time as we get blocked I/O. We have max_backfills and
> >>> max_recovery set to 1, client op priority is set higher than recovery
> >>> priority. We tried increasing the number of op threads but this didn't
> >>> seem to help. It seems as soon as PGs are finished being checked, they
> >>> become active and could be the cause for slow I/O while the other PGs
> >>> are being checked.
> >>>
> >>> What I don't understand is that the messages are delayed. As soon as
> >>> the message is received by the Ceph OSD process, it is very quickly
> >>> committed to the journal and a response is sent back to the primary
> >>> OSD, which is received very quickly as well. I've adjusted
> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
> >>> of RAM per nodes for 10 OSDs.
> >>>
> >>> Is there something that could cause the kernel to get a packet but not
> >>> be able to dispatch it to Ceph, such that it could explain why we
> >>> are seeing this blocked I/O for 30+ seconds? Are there some pointers
> >>> to tracing Ceph messages from the network buffer through the kernel to
> >>> the Ceph process?
> >>>
> >>> We can really use some pointers, no matter how outrageous. We've had
> >>> over 6 people looking into this for weeks now and just can't think of
> >>> anything else.
> >>>
> >>> Thanks,
> >>> -----BEGIN PGP SIGNATURE-----
> >>> Version: Mailvelope v1.1.0
> >>> Comment: https://www.mailvelope.com
> >>>
> >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
> >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
> >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
> >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
> >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
> >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
> >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
> >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
> >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
> >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
> >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
> >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
> >>> l7OF
> >>> =OI++
> >>> -----END PGP SIGNATURE-----
> >>> ----------------
> >>> Robert LeBlanc
> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>
> >>>
> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
> >>> > like all the blocked I/O has stopped (no entries in the log for the
> >>> > last 12 hours). This makes me believe that there is some issue with
> >>> > the number of sockets or some other TCP issue. We have not messed with
> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> >>> > processes and 16K system wide.
> >>> >
> >>> > Does this seem like the right spot to be looking? What are some
> >>> > configuration items we should be looking at?
> >>> >
> >>> > Thanks,
> >>> > ----------------
> >>> > Robert LeBlanc
> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >
> >>> >
> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >> Hash: SHA256
> >>> >>
> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
> >>> >> seems that there were some major reworks in the network handling in
> >>> >> the kernel to efficiently handle that network rate. If I remember
> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
> >>> >> that we did see packet loss while congesting our ISLs in our initial
> >>> >> testing, but we could not tell where the dropping was happening. We
> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
> >>> >> trying to congest things. We probably already saw this issue, just
> >>> >> didn't know it.
> >>> >> - ----------------
> >>> >> Robert LeBlanc
> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>
> >>> >>
> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
> >>> >>> drivers might cause problems though.
> >>> >>>
> >>> >>> Here's ifconfig from one of the nodes:
> >>> >>>
> >>> >>> ens513f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20<link>
> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >>> >>>
> >>> >>> Mark
> >>> >>>
> >>> >>>
> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >>> >>>>
> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>> Hash: SHA256
> >>> >>>>
> >>> >>>> OK, here is the update on the saga...
> >>> >>>>
> >>> >>>> I traced some more of blocked I/Os and it seems that communication
> >>> >>>> between two hosts seemed worse than others. I did a two way ping flood
> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
> >>> >>>> packets, no lost pings. I then had the ping flood running while I
> >>> >>>> put Ceph load on the cluster and the dropped pings started increasing;
> >>> >>>> after stopping the Ceph workload the pings stopped dropping.
> >>> >>>>
> >>> >>>> I then ran iperf between all the nodes with the same results, so that
> >>> >>>> ruled out Ceph to a large degree. I then booted into the
> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
> >>> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
> >>> >>>> need the network enhancements in the 4.x series to work well.
> >>> >>>>
> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
> >>> >>>> with this issue are running 4.x, other than that, they are pretty
> >>> >>>> differing hardware and network configs.
> >>> >>>>
> >>> >>>> Thanks,
> >>> >>>> -----BEGIN PGP SIGNATURE-----
> >>> >>>> Version: Mailvelope v1.1.0
> >>> >>>> Comment: https://www.mailvelope.com
> >>> >>>>
> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
> >>> >>>> 4OEo
> >>> >>>> =P33I
> >>> >>>> -----END PGP SIGNATURE-----
> >>> >>>> ----------------
> >>> >>>> Robert LeBlanc
> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>
> >>> >>>>
> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> >>> >>>> wrote:
> >>> >>>>>
> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>> Hash: SHA256
> >>> >>>>>
> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
> >>> >>>>> found that MTU under 32K worked reliably for ping, but still had the
> >>> >>>>> blocked I/O.
> >>> >>>>>
> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
> >>> >>>>> the blocked I/O.
> >>> >>>>> - ----------------
> >>> >>>>> Robert LeBlanc
> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> >>> >>>>>>
> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >>> >>>>>>>
> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
> >>> >>>>>>> delayed for many 10s of seconds?
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
> >>> >>>>>> has
> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
> >>> >>>>>>
> >>> >>>>>> sage
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>>
> >>> >>>>>>> What kernel are you running?
> >>> >>>>>>> -Sam
> >>> >>>>>>>
> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> >>> >>>>>>>>
> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>>> Hash: SHA256
> >>> >>>>>>>>
> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
> >>> >>>>>>>> extracted what I think are important entries from the logs for the
> >>> >>>>>>>> first blocked request. NTP is running on all the servers so the logs
> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >>> >>>>>>>>
> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
> >>> >>>>>>>>
> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> >>> >>>>>>>> osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
> >>> >>>>>>>> transfer).
> >>> >>>>>>>>
> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
> >>> >>>>>>>> later. To me it seems that the message is getting received but not
> >>> >>>>>>>> passed to another thread right away or something. This test was done
> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
> >>> >>>>>>>> thread.
> >>> >>>>>>>>
> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
> >>> >>>>>>>> some help.
> >>> >>>>>>>>
> >>> >>>>>>>> Single Test started about
> >>> >>>>>>>> 2015-09-22 12:52:36
> >>> >>>>>>>>
> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>> >>>>>>>> 30.439150 secs
> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >>> >>>>>>>>   currently waiting for subops from 13,16
> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> >>> >>>>>>>> 30.379680 secs
> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> >>> >>>>>>>> 12:55:06.406303:
> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >>> >>>>>>>>   currently waiting for subops from 13,17
> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> >>> >>>>>>>> 12:55:06.318144:
> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >>> >>>>>>>>   currently waiting for subops from 13,14
> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>> >>>>>>>> 30.954212 secs
> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >>> >>>>>>>>   currently waiting for subops from 16,17
> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>> >>>>>>>> 30.704367 secs
> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
> >>> >>>>>>>>   currently waiting for subops from 13,17
> >>> >>>>>>>>
> >>> >>>>>>>> Server   IP addr              OSD
> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
> >>> >>>>>>>>
> >>> >>>>>>>> fio job:
> >>> >>>>>>>> [rbd-test]
> >>> >>>>>>>> readwrite=write
> >>> >>>>>>>> blocksize=4M
> >>> >>>>>>>> #runtime=60
> >>> >>>>>>>> name=rbd-test
> >>> >>>>>>>> #readwrite=randwrite
> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> >>> >>>>>>>> #rwmixread=72
> >>> >>>>>>>> #norandommap
> >>> >>>>>>>> #size=1T
> >>> >>>>>>>> #blocksize=4k
> >>> >>>>>>>> ioengine=rbd
> >>> >>>>>>>> rbdname=test2
> >>> >>>>>>>> pool=rbd
> >>> >>>>>>>> clientname=admin
> >>> >>>>>>>> iodepth=8
> >>> >>>>>>>> #numjobs=4
> >>> >>>>>>>> #thread
> >>> >>>>>>>> #group_reporting
> >>> >>>>>>>> #time_based
> >>> >>>>>>>> #direct=1
> >>> >>>>>>>> #ramp_time=60
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> Thanks,
> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
> >>> >>>>>>>> Version: Mailvelope v1.1.0
> >>> >>>>>>>> Comment: https://www.mailvelope.com
> >>> >>>>>>>>
> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
> >>> >>>>>>>> J3hS
> >>> >>>>>>>> =0J7F
> >>> >>>>>>>> -----END PGP SIGNATURE-----
> >>> >>>>>>>> ----------------
> >>> >>>>>>>> Robert LeBlanc
> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
> >>> >>>>>>>>>
> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>>>>> Hash: SHA256
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> I'm not
> >>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
> >>> >>>>>>>>> this, it was discussed not too long ago.
> >>> >>>>>>>>>
> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> The fio test is just a seq write; starting it over (rewriting from
> >>> >>>>>>>>>> the
> >>> >>>>>>>>>> beginning) is still causing the issue. I suspect that it is not
> >>> >>>>>>>>>> having to create new files and therefore split collections. This is
> >>> >>>>>>>>>> on
> >>> >>>>>>>>>> my test cluster with no other load.
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
> >>> >>>>>>>>>
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
> >>> >>>>>>>>> 20",
> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
> >>> >>>>>>>>> out
> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
> >>> >>>>>>>>> -Greg
> >>> >>>>>>>>
> >>> >>>>>>>> --
> >>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>> >>>>>>>> in
> >>> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>
> >>> >>>>> -----BEGIN PGP SIGNATURE-----
> >>> >>>>> Version: Mailvelope v1.1.0
> >>> >>>>> Comment: https://www.mailvelope.com
> >>> >>>>>
> >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
> >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
> >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
> >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
> >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
> >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
> >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
> >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
> >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
> >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
> >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
> >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
> >>> >>>>> gcZm
> >>> >>>>> =CjwB
> >>> >>>>> -----END PGP SIGNATURE-----
> >>> >>>>
> >>> >>>> --
> >>> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> >>>>
> >>> >>>
> >>> >>
> >>> >> -----BEGIN PGP SIGNATURE-----
> >>> >> Version: Mailvelope v1.1.0
> >>> >> Comment: https://www.mailvelope.com
> >>> >>
> >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
> >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
> >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
> >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
> >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
> >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
> >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
> >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
> >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
> >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
> >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
> >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
> >>> >> ae22
> >>> >> =AX+L
> >>> >> -----END PGP SIGNATURE-----
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>>
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                     ` <alpine.DEB.2.00.1510060534010.32037-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-10-06 14:30                                                                       ` Robert LeBlanc
       [not found]                                                                         ` <CAANLjFo==i7wivrGR9LJFs3GOrD2iQHLdCpEfR-AruHyOMLi-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-06 14:30 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, ceph-users-idqoXFIVOFJgJs9I8MT0rw


[-- Attachment #1.1: Type: text/plain, Size: 32868 bytes --]

Thanks for your time, Sage. It sounds like a few people may be helped if you
can find something.

I did a recursive chown as in the instructions (although I didn't know
about the doc at the time). I did an osd debug at 20/20 but didn't see
anything. I'll also enable ms logging and make the logs available, and I'll
review the document to make sure I didn't miss anything else.
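
For reference, this is roughly what I plan to use (just a sketch; the
injectargs form assumes the admin keyring is available on the node, and
osd.13 is only an example id):

    # raise logging on a running OSD without restarting it
    ceph tell osd.13 injectargs '--debug-osd 20/20 --debug-ms 1'

    # or the equivalent persistent settings in ceph.conf (restart to apply)
    [osd]
        debug osd = 20/20
        debug ms = 1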

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Oct 6, 2015 6:37 AM, "Sage Weil" <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA256
> >
> > With some off-list help, we have adjusted
> > osd_client_message_cap=10000. This seems to have helped a bit and we
> > have seen some OSDs have a value up to 4,000 for client messages. But
> > it does not solve the problem with the blocked I/O.
> >
> > One thing that I have noticed is that almost exactly 30 seconds elapse
> > between when an OSD boots and the first blocked I/O message. I don't know
> > if the OSD doesn't have time to get its brain right about a PG before
> > it starts servicing it or what exactly.
>
> I'm downloading the logs from yesterday now; sorry it's taking so long.
>
> > On another note, I tried upgrading our CentOS dev cluster from Hammer
> > to master and things didn't go so well. The OSDs would not start
> > because /var/lib/ceph was not owned by ceph. I chowned the directory
> > and all OSDs and the OSD then started, but never became active in the
> > cluster. It just sat there after reading all the PGs. There were
> > sockets open to the monitor, but no OSD to OSD sockets. I tried
> > downgrading to the Infernalis branch and still no luck getting the
> > OSDs to come up. The OSD processes were idle after the initial boot.
> > All packages were installed from gitbuilder.
>
> Did you chown -R ?
>
>
> https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>
> My guess is you only chowned the root dir, and the OSD didn't throw
> an error when it encountered the other files?  If you can generate a debug
> osd = 20 log, that would be helpful.. thanks!
>
> sage
>
>
> >
> > Thanks,
> > -----BEGIN PGP SIGNATURE-----
> > Version: Mailvelope v1.2.0
> > Comment: https://www.mailvelope.com
> >
> > wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
> > YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
> > 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
> > aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
> > y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
> > 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
> > ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
> > zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
> > D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
> > CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
> > 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
> > fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
> > GdXC
> > =Aigq
> > -----END PGP SIGNATURE-----
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org>
> wrote:
> > > -----BEGIN PGP SIGNED MESSAGE-----
> > > Hash: SHA256
> > >
> > > I have eight nodes running the fio job rbd_test_real to different RBD
> > > volumes. I've included the CRUSH map in the tarball.
> > >
> > > I stopped one OSD process and marked it out. I let it recover for a
> > > few minutes and then I started the process again and marked it in. I
> > > started getting block I/O messages during the recovery.
> > >
> > > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
> > >
> > > Thanks,
> > > -----BEGIN PGP SIGNATURE-----
> > > Version: Mailvelope v1.2.0
> > > Comment: https://www.mailvelope.com
> > >
> > > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
> > > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
> > > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
> > > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
> > > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
> > > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
> > > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
> > > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
> > > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
> > > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
> > > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
> > > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
> > > 3EPx
> > > =UDIV
> > > -----END PGP SIGNATURE-----
> > >
> > > ----------------
> > > Robert LeBlanc
> > > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > >
> > >
> > > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
> > >>> -----BEGIN PGP SIGNED MESSAGE-----
> > >>> Hash: SHA256
> > >>>
> > >>> We are still struggling with this and have tried a lot of different
> > >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
> > >>> consulting services for non-Red Hat systems. If there are some
> > >>> certified Ceph consultants in the US with whom we can do both remote and
> > >>> on-site engagements, please let us know.
> > >>>
> > >>> This certainly seems to be network related, but somewhere in the
> > >>> kernel. We have tried increasing the network and TCP buffers, number
> > >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
> > >>> on the boxes, the disks are busy, but not constantly at 100% (they
> > >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
> > >>> at a time). There seems to be no reasonable explanation why I/O is
> > >>> blocked pretty frequently longer than 30 seconds. We have verified
> > >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> > >>> network admins have verified that packets are not being dropped in
> the
> > >>> switches for these nodes. We have tried different kernels including
> > >>> the recent Google patch to cubic. This is showing up on three clusters
> > >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> > >>> (from CentOS 7.1) with similar results.
> > >>>
> > >>> The messages seem slightly different:
> > >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> > >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for
> >
> > >>> 100.087155 secs
> > >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> > >>> cluster [WRN] slow request 30.041999 seconds old, received at
> > >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> > >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
> > >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> > >>> points reached
> > >>>
> > >>> I don't know what "no flag points reached" means.
> > >>
> > >> Just that the op hasn't been marked as reaching any interesting points
> > >> (op->mark_*() calls).
> > >>
> > >> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
> > >> It's extremely verbose but it'll let us see where the op is getting
> > >> blocked.  If you see the "slow request" message it means the op is
> > >> received by ceph (that's when the clock starts), so I suspect it's not
> > >> something we can blame on the network stack.
> > >>
> > >> sage
> > >>
> > >>
> > >>>
> > >>> The problem is most pronounced when we have to reboot an OSD node (1
> > >>> of 13), we will have hundreds of I/O blocked for some times up to 300
> > >>> seconds. It takes a good 15 minutes for things to settle down. The
> > >>> production cluster is very busy doing normally 8,000 I/O and peaking
> > >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
> > >>> are between 25-50% full. We are currently splitting PGs to distribute
> > >>> the load better across the disks, but we are having to do this 10 PGs
> > >>> at a time as we get blocked I/O. We have max_backfills and
> > >>> max_recovery set to 1, client op priority is set higher than recovery
> > >>> priority. We tried increasing the number of op threads but this
> didn't
> > >>> seem to help. It seems as soon as PGs are finished being checked,
> they
> > >>> become active and could be the cause for slow I/O while the other PGs
> > >>> are being checked.
> > >>>
> > >>> What I don't understand is that the messages are delayed. As soon as
> > >>> the message is received by Ceph OSD process, it is very quickly
> > >>> committed to the journal and a response is sent back to the primary
> > >>> OSD which is received very quickly as well. I've adjusted
> > >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
> > >>> doesn't solve the main problem. We don't have swap and there is 64 GB
> > >>> of RAM per nodes for 10 OSDs.
> > >>>
> > >>> Is there something that could cause the kernel to get a packet but
> not
> > >>> be able to dispatch it to Ceph such that it could explain why
> we
> > >>> are seeing these blocked I/O for 30+ seconds. Are there any pointers
> > >>> to tracing Ceph messages from the network buffer through the kernel
> to
> > >>> the Ceph process?
> > >>>
> > >>> We can really use some pointers no matter how outrageous. We have
> > >>> over 6 people looking into this for weeks now and just can't think of
> > >>> anything else.
> > >>>
> > >>> Thanks,
> > >>> -----BEGIN PGP SIGNATURE-----
> > >>> Version: Mailvelope v1.1.0
> > >>> Comment: https://www.mailvelope.com
> > >>>
> > >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
> > >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
> > >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
> > >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
> > >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
> > >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
> > >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
> > >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
> > >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
> > >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
> > >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
> > >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
> > >>> l7OF
> > >>> =OI++
> > >>> -----END PGP SIGNATURE-----
> > >>> ----------------
> > >>> Robert LeBlanc
> > >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > >>>
> > >>>
> > >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc <
> robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> > >>> > We dropped the replication on our cluster from 4 to 3 and it looks
> > >>> > like all the blocked I/O has stopped (no entries in the log for the
> > >>> > last 12 hours). This makes me believe that there is some issue with
> > >>> > the number of sockets or some other TCP issue. We have not messed
> with
> > >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8
> KVM
> > >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> > >>> > processes and 16K system wide.
> > >>> >
> > >>> > Does this seem like the right spot to be looking? What are some
> > >>> > configuration items we should be looking at?
> > >>> >
> > >>> > Thanks,
> > >>> > ----------------
> > >>> > Robert LeBlanc
> > >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > >>> >
> > >>> >
> > >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc <
> robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> > >>> >> -----BEGIN PGP SIGNED MESSAGE-----
> > >>> >> Hash: SHA256
> > >>> >>
> > >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
> > >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking).
> It
> > >>> >> seems that there were some major reworks in the network handling
> in
> > >>> >> the kernel to efficiently handle that network rate. If I remember
> > >>> >> right we also saw a drop in CPU utilization. I'm starting to think
> > >>> >> that we did see packet loss while congesting our ISLs in our
> initial
> > >>> >> testing, but we could not tell where the dropping was happening.
> We
> > >>> >> saw some on the switches, but it didn't seem to be bad if we
> weren't
> > >>> >> trying to congest things. We probably already saw this issue, just
> > >>> >> didn't know it.
> > >>> >> - ----------------
> > >>> >> Robert LeBlanc
> > >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > >>> >>
> > >>> >>
> > >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> > >>> >>> FWIW, we've got some 40GbE Intel cards in the community
> performance cluster
> > >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be
> running fine
> > >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel
> that older
> > >>> >>> drivers might cause problems though.
> > >>> >>>
> > >>> >>> Here's ifconfig from one of the nodes:
> > >>> >>>
> > >>> >>> ens513f1: flags=4163  mtu 1500
> > >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast
> 10.0.10.255
> > >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid
> 0x20
> > >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> > >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5
> TiB)
> > >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
> > >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5
> TiB)
> > >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions
> 0
> > >>> >>>
> > >>> >>> Mark
> > >>> >>>
> > >>> >>>
> > >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> > >>> >>>>
> > >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> > >>> >>>> Hash: SHA256
> > >>> >>>>
> > >>> >>>> OK, here is the update on the saga...
> > >>> >>>>
> > >>> >>>> I traced some more of blocked I/Os and it seems that
> communication
> > >>> >>>> between two hosts seemed worse than others. I did a two way
> ping flood
> > >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
> > >>> >>>> packets, no lost pings. I then had the ping flood running
> while I
> > >>> >>>> put Ceph load on the cluster and the dropped pings started
> increasing;
> > >>> >>>> after stopping the Ceph workload the pings stopped dropping.
> > >>> >>>>
> > >>> >>>> I then ran iperf between all the nodes with the same results,
> so that
> > >>> >>>> ruled out Ceph to a large degree. I then booted into the
> > >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far
> there
> > >>> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs
> really
> > >>> >>>> need the network enhancements in the 4.x series to work well.
> > >>> >>>>
> > >>> >>>> Does this sound familiar to anyone? I'll probably start
> bisecting the
> > >>> >>>> kernel to see where this issue is introduced. Both of the
> clusters
> > >>> >>>> with this issue are running 4.x, other than that, they are
> pretty
> > >>> >>>> differing hardware and network configs.
> > >>> >>>>
> > >>> >>>> Thanks,
> > >>> >>>> -----BEGIN PGP SIGNATURE-----
> > >>> >>>> Version: Mailvelope v1.1.0
> > >>> >>>> Comment: https://www.mailvelope.com
> > >>> >>>>
> > >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
> > >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
> > >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
> > >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
> > >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
> > >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
> > >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
> > >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
> > >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
> > >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
> > >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
> > >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
> > >>> >>>> 4OEo
> > >>> >>>> =P33I
> > >>> >>>> -----END PGP SIGNATURE-----
> > >>> >>>> ----------------
> > >>> >>>> Robert LeBlanc
> > >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62
> B9F1
> > >>> >>>>
> > >>> >>>>
> > >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> > >>> >>>> wrote:
> > >>> >>>>>
> > >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> > >>> >>>>> Hash: SHA256
> > >>> >>>>>
> > >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some
> issues
> > >>> >>>>> pinging hosts with "No buffer space available" (hosts are
> currently
> > >>> >>>>> configured for 4GB to test SSD caching rather than page
> cache). I
> > >>> >>>>> found that MTU under 32K worked reliably for ping, but still
> had the
> > >>> >>>>> blocked I/O.
> > >>> >>>>>
> > >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm
> still seeing
> > >>> >>>>> the blocked I/O.
> > >>> >>>>> - ----------------
> > >>> >>>>> Robert LeBlanc
> > >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62
> B9F1
> > >>> >>>>>
> > >>> >>>>>
> > >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> > >>> >>>>>>
> > >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> > >>> >>>>>>>
> > >>> >>>>>>> I looked at the logs, it looks like there was a 53 second
> delay
> > >>> >>>>>>> between when osd.17 started sending the osd_repop message
> and when
> > >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage,
> didn't we
> > >>> >>>>>>> once see a kernel issue which caused some messages to be
> mysteriously
> > >>> >>>>>>> delayed for many 10s of seconds?
> > >>> >>>>>>
> > >>> >>>>>>
> > >>> >>>>>> Every time we have seen this behavior and diagnosed it in the
> wild it
> > >>> >>>>>> has
> > >>> >>>>>> been a network misconfiguration.  Usually related to jumbo
> frames.
> > >>> >>>>>>
> > >>> >>>>>> sage
> > >>> >>>>>>
> > >>> >>>>>>
> > >>> >>>>>>>
> > >>> >>>>>>> What kernel are you running?
> > >>> >>>>>>> -Sam
> > >>> >>>>>>>
> > >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> > >>> >>>>>>>>
> > >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> > >>> >>>>>>>> Hash: SHA256
> > >>> >>>>>>>>
> > >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more
> eyes. I've
> > >>> >>>>>>>> extracted what I think are important entries from the logs
> for the
> > >>> >>>>>>>> first blocked request. NTP is running all the servers so
> the logs
> > >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00
> are
> > >>> >>>>>>>> available at
> http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> > >>> >>>>>>>>
> > >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> > >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> > >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> > >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> > >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0
> from osd.16
> > >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17
> ondisk result=0
> > >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O >
> 30.439150 sec
> > >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> > >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0
> from osd.13
> > >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17
> ondisk result=0
> > >>> >>>>>>>>
> > >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to
> osd.13 and
> > >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O
> right away,
> > >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53
> seconds
> > >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend
> the data
> > >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the
> actual data
> > >>> >>>>>>>> transfer).
> > >>> >>>>>>>>
> > >>> >>>>>>>> It looks like osd.17 is receiving responses to start the
> communication
> > >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a
> minute
> > >>> >>>>>>>> later. To me it seems that the message is getting received
> but not
> > >>> >>>>>>>> passed to another thread right away or something. This test
> was done
> > >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with
> a single
> > >>> >>>>>>>> thread.
> > >>> >>>>>>>>
> > >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked
> I/O
> > >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so
> I can use
> > >>> >>>>>>>> some help.
> > >>> >>>>>>>>
> > >>> >>>>>>>> Single Test started about
> > >>> >>>>>>>> 2015-09-22 12:52:36
> > >>> >>>>>>>>
> > >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726
> 56 :
> > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest
> blocked for >
> > >>> >>>>>>>> 30.439150 secs
> > >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726
> 57 :
> > >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received
> at
> > >>> >>>>>>>> 2015-09-22 12:55:06.487451:
> > >>> >>>>>>>>   osd_op(client.250874.0:1388
> rbd_data.3380e2ae8944a.0000000000000545
> > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> > >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected
> e56785)
> > >>> >>>>>>>>   currently waiting for subops from 13,16
> > >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410
> 7 : cluster
> > >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for
> >
> > >>> >>>>>>>> 30.379680 secs
> > >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410
> 8 : cluster
> > >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at
> 2015-09-22
> > >>> >>>>>>>> 12:55:06.406303:
> > >>> >>>>>>>>   osd_op(client.250874.0:1384
> rbd_data.3380e2ae8944a.0000000000000541
> > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> > >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected
> e56785)
> > >>> >>>>>>>>   currently waiting for subops from 13,17
> > >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410
> 9 : cluster
> > >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at
> 2015-09-22
> > >>> >>>>>>>> 12:55:06.318144:
> > >>> >>>>>>>>   osd_op(client.250874.0:1382
> rbd_data.3380e2ae8944a.000000000000053f
> > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> > >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected
> e56785)
> > >>> >>>>>>>>   currently waiting for subops from 13,14
> > >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574
> 130 :
> > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest
> blocked for >
> > >>> >>>>>>>> 30.954212 secs
> > >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574
> 131 :
> > >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received
> at
> > >>> >>>>>>>> 2015-09-22 12:57:33.044003:
> > >>> >>>>>>>>   osd_op(client.250874.0:1873
> rbd_data.3380e2ae8944a.000000000000070d
> > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> > >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected
> e56785)
> > >>> >>>>>>>>   currently waiting for subops from 16,17
> > >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410
> 10 :
> > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest
> blocked for >
> > >>> >>>>>>>> 30.704367 secs
> > >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410
> 11 :
> > >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received
> at
> > >>> >>>>>>>> 2015-09-22 12:57:33.055404:
> > >>> >>>>>>>>   osd_op(client.250874.0:1874
> rbd_data.3380e2ae8944a.000000000000070e
> > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> > >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected
> e56785)
> > >>> >>>>>>>>   currently waiting for subops from 13,17
> > >>> >>>>>>>>
> > >>> >>>>>>>> Server   IP addr              OSD
> > >>> >>>>>>>> nodev  - 192.168.55.11 - 12
> > >>> >>>>>>>> nodew  - 192.168.55.12 - 13
> > >>> >>>>>>>> nodex  - 192.168.55.13 - 16
> > >>> >>>>>>>> nodey  - 192.168.55.14 - 17
> > >>> >>>>>>>> nodez  - 192.168.55.15 - 14
> > >>> >>>>>>>> nodezz - 192.168.55.16 - 15
> > >>> >>>>>>>>
> > >>> >>>>>>>> fio job:
> > >>> >>>>>>>> [rbd-test]
> > >>> >>>>>>>> readwrite=write
> > >>> >>>>>>>> blocksize=4M
> > >>> >>>>>>>> #runtime=60
> > >>> >>>>>>>> name=rbd-test
> > >>> >>>>>>>> #readwrite=randwrite
> > >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> > >>> >>>>>>>> #rwmixread=72
> > >>> >>>>>>>> #norandommap
> > >>> >>>>>>>> #size=1T
> > >>> >>>>>>>> #blocksize=4k
> > >>> >>>>>>>> ioengine=rbd
> > >>> >>>>>>>> rbdname=test2
> > >>> >>>>>>>> pool=rbd
> > >>> >>>>>>>> clientname=admin
> > >>> >>>>>>>> iodepth=8
> > >>> >>>>>>>> #numjobs=4
> > >>> >>>>>>>> #thread
> > >>> >>>>>>>> #group_reporting
> > >>> >>>>>>>> #time_based
> > >>> >>>>>>>> #direct=1
> > >>> >>>>>>>> #ramp_time=60
> > >>> >>>>>>>>
> > >>> >>>>>>>>
> > >>> >>>>>>>> Thanks,
> > >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
> > >>> >>>>>>>> Version: Mailvelope v1.1.0
> > >>> >>>>>>>> Comment: https://www.mailvelope.com
> > >>> >>>>>>>>
> > >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
> > >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
> > >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
> > >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
> > >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
> > >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
> > >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
> > >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
> > >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
> > >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
> > >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
> > >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
> > >>> >>>>>>>> J3hS
> > >>> >>>>>>>> =0J7F
> > >>> >>>>>>>> -----END PGP SIGNATURE-----
> > >>> >>>>>>>> ----------------
> > >>> >>>>>>>> Robert LeBlanc
> > >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
> FA62 B9F1
> > >>> >>>>>>>>
> > >>> >>>>>>>>
> > >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
> > >>> >>>>>>>>>>
> > >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> > >>> >>>>>>>>>> Hash: SHA256
> > >>> >>>>>>>>>>
> > >>> >>>>>>>>>> Is there some way to tell in the logs that this is
> happening?
> > >>> >>>>>>>>>
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> You can search for the (mangled) name _split_collection
> > >>> >>>>>>>>>>
> > >>> >>>>>>>>>> I'm not
> > >>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there
> some way to
> > >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to
> doing so?
> > >>> >>>>>>>>>
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the
> list for
> > >>> >>>>>>>>> this, it was discussed not too long ago.
> > >>> >>>>>>>>>
> > >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as
> the sessions
> > >>> >>>>>>>>>> are aborted, they are reestablished and complete
> immediately.
> > >>> >>>>>>>>>>
> > >>> >>>>>>>>>> The fio test is just a seq write, starting it over
> (rewriting from
> > >>> >>>>>>>>>> the
> > >>> >>>>>>>>>> beginning) is still causing the issue. I was suspect that
> it is not
> > >>> >>>>>>>>>> having to create new file and therefore split
> collections. This is
> > >>> >>>>>>>>>> on
> > >>> >>>>>>>>>> my test cluster with no other load.
> > >>> >>>>>>>>>
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really
> not creating
> > >>> >>>>>>>>> new objects, if you're actually running fio in such a way
> that it's
> > >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set
> up?).
> > >>> >>>>>>>>>
> > >>> >>>>>>>>>>
> > >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options
> and depths
> > >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
> > >>> >>>>>>>>>
> > >>> >>>>>>>>>
> > >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug
> filestore =
> > >>> >>>>>>>>> 20",
> > >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That
> should spit
> > >>> >>>>>>>>> out
> > >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
> > >>> >>>>>>>>> -Greg
> > >>> >>>>>>>>
> > >>> >>>>>>>> --
> > >>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe
> ceph-devel"
> > >>> >>>>>>>> in
> > >>> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > >>> >>>>>>>> More majordomo info at
> http://vger.kernel.org/majordomo-info.html
> > >>> >>>>>>>
> > >>> >>>>>>>
> > >>> >>>>>>>
> > >>> >>>>>
> > >>> >>>>> -----BEGIN PGP SIGNATURE-----
> > >>> >>>>> Version: Mailvelope v1.1.0
> > >>> >>>>> Comment: https://www.mailvelope.com
> > >>> >>>>>
> > >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
> > >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
> > >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
> > >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
> > >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
> > >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
> > >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
> > >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
> > >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
> > >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
> > >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
> > >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
> > >>> >>>>> gcZm
> > >>> >>>>> =CjwB
> > >>> >>>>> -----END PGP SIGNATURE-----
> > >>> >>>>
> > >>> >>>> --
> > >>> >>>> To unsubscribe from this list: send the line "unsubscribe
> ceph-devel" in
> > >>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > >>> >>>> More majordomo info at
> http://vger.kernel.org/majordomo-info.html
> > >>> >>>>
> > >>> >>>
> > >>> >>
> > >>> >> -----BEGIN PGP SIGNATURE-----
> > >>> >> Version: Mailvelope v1.1.0
> > >>> >> Comment: https://www.mailvelope.com
> > >>> >>
> > >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
> > >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
> > >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
> > >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
> > >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
> > >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
> > >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
> > >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
> > >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
> > >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
> > >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
> > >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
> > >>> >> ae22
> > >>> >> =AX+L
> > >>> >> -----END PGP SIGNATURE-----
> > >>> _______________________________________________
> > >>> ceph-users mailing list
> > >>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>>
> > >>>
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
>

[-- Attachment #1.2: Type: text/html, Size: 51596 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                         ` <CAANLjFo==i7wivrGR9LJFs3GOrD2iQHLdCpEfR-AruHyOMLi-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-06 14:38                                                                           ` Sage Weil
  2015-10-06 15:51                                                                             ` [ceph-users] " Robert LeBlanc
  2015-10-06 16:26                                                                             ` Ken Dreyer
  0 siblings, 2 replies; 45+ messages in thread
From: Sage Weil @ 2015-10-06 14:38 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-devel, ceph-users-idqoXFIVOFJgJs9I8MT0rw

[-- Attachment #1: Type: TEXT/PLAIN, Size: 39980 bytes --]

On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> Thanks for your time, Sage. It sounds like a few people may be helped if you
> can find something.
> 
> I did a recursive chown as in the instructions (although I didn't know about
> the doc at the time). I did an osd debug at 20/20 but didn't see anything.
> I'll also enable ms logging and make the logs available, and I'll review the
> document to make sure I didn't miss anything else.

Oh... I bet you didn't upgrade the OSDs to 0.94.4 (or the latest hammer build)
first.  They won't be allowed to boot until that happens: all upgrades 
must stop at 0.94.4 first.  And that isn't released yet... we'll try to 
do that today.  In the meantime, you can use the hammer gitbuilder 
build...
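
Roughly, per node, something like this (just a sketch; the exact service
commands depend on whether you use sysvinit or systemd, and the
ceph-osd@<id> unit name for the systemd case is an assumption):

    # confirm the OSDs are running 0.94.4 (or the hammer gitbuilder build)
    ceph tell osd.* version

    # then, with the OSDs on the node stopped:
    service ceph stop osd.13            # or: systemctl stop ceph-osd@13
    chown -R ceph:ceph /var/lib/ceph
    service ceph start osd.13           # or: systemctl start ceph-osd@13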

sage


> 
> Robert LeBlanc
> 
> Sent from a mobile device please excuse any typos.
> 
> On Oct 6, 2015 6:37 AM, "Sage Weil" <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>       On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>       > -----BEGIN PGP SIGNED MESSAGE-----
>       > Hash: SHA256
>       >
>       > With some off-list help, we have adjusted
>       > osd_client_message_cap=10000. This seems to have helped a bit
>       and we
>       > have seen some OSDs have a value up to 4,000 for client
>       messages. But
>       > it does not solve the problem with the blocked I/O.
>       >
>       > One thing that I have noticed is that almost exactly 30
>       seconds elapse
>       > between an OSD boots and the first blocked I/O message. I
>       don't know
>       > if the OSD doesn't have time to get it's brain right about a
>       PG before
>       > it starts servicing it or what exactly.
> 
>       I'm downloading the logs from yesterday now; sorry it's taking
>       so long.
> 
>       > On another note, I tried upgrading our CentOS dev cluster from
>       Hammer
>       > to master and things didn't go so well. The OSDs would not
>       start
>       > because /var/lib/ceph was not owned by ceph. I chowned the
>       directory
>       > and all OSDs and the OSD then started, but never became active
>       in the
>       > cluster. It just sat there after reading all the PGs. There
>       were
>       > sockets open to the monitor, but no OSD to OSD sockets. I
>       tried
>       > downgrading to the Infernalis branch and still no luck getting
>       the
>       > OSDs to come up. The OSD processes were idle after the initial
>       boot.
>       > All packages were installed from gitbuilder.
> 
>       Did you chown -R ?
> 
>              https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgradin
>       g-from-hammer
> 
>       My guess is you only chowned the root dir, and the OSD didn't
>       throw
>       an error when it encountered the other files?  If you can
>       generate a debug
>       osd = 20 log, that would be helpful.. thanks!
> 
>       sage
> 
> 
>       >
>       > Thanks,
>       > -----BEGIN PGP SIGNATURE-----
>       > Version: Mailvelope v1.2.0
>       > Comment: https://www.mailvelope.com
>       >
>       > wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>       > YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>       > 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>       > aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>       > y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>       > 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>       > ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>       > zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>       > D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>       > CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>       > 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>       > fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>       > GdXC
>       > =Aigq
>       > -----END PGP SIGNATURE-----
>       > ----------------
>       > Robert LeBlanc
>       > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62
>       B9F1
>       >
>       >
>       > On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc
>       <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>       > > -----BEGIN PGP SIGNED MESSAGE-----
>       > > Hash: SHA256
>       > >
>       > > I have eight nodes running the fio job rbd_test_real to
>       different RBD
>       > > volumes. I've included the CRUSH map in the tarball.
>       > >
>       > > I stopped one OSD process and marked it out. I let it
>       recover for a
>       > > few minutes and then I started the process again and marked
>       it in. I
>       > > started getting block I/O messages during the recovery.
>       > >
>       > > The logs are located at
>       http://162.144.87.113/files/ushou1.tar.xz
>       > >
>       > > Thanks,
>       > > -----BEGIN PGP SIGNATURE-----
>       > > Version: Mailvelope v1.2.0
>       > > Comment: https://www.mailvelope.com
>       > >
>       > > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>       > > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>       > > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>       > > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>       > > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>       > > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>       > > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>       > > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>       > > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>       > > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>       > > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>       > > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>       > > 3EPx
>       > > =UDIV
>       > > -----END PGP SIGNATURE-----
>       > >
>       > > ----------------
>       > > Robert LeBlanc
>       > > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
>       FA62 B9F1
>       > >
>       > >
>       > > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>       wrote:
>       > >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>       > >>> -----BEGIN PGP SIGNED MESSAGE-----
>       > >>> Hash: SHA256
>       > >>>
>       > >>> We are still struggling with this and have tried a lot of
>       different
>       > >>> things. Unfortunately, Inktank (now Red Hat) no longer
>       provides
>       > >>> consulting services for non-Red Hat systems. If there are
>       some
>       > >>> certified Ceph consultants in the US that we can do both
>       remote and
>       > >>> on-site engagements, please let us know.
>       > >>>
>       > >>> This certainly seems to be network related, but somewhere
>       in the
>       > >>> kernel. We have tried increasing the network and TCP
>       buffers, number
>       > >>> of TCP sockets, reduced the FIN_WAIT2 state. There is
>       about 25% idle
>       > >>> on the boxes, the disks are busy, but not constantly at
>       100% (they
>       > >>> cycle from <10% up to 100%, but not 100% for more than a
>       few seconds
>       > >>> at a time). There seems to be no reasonable explanation
>       why I/O is
>       > >>> blocked pretty frequently longer than 30 seconds. We have
>       verified
>       > >>> Jumbo frames by pinging from/to each node with 9000 byte
>       packets. The
>       > >>> network admins have verified that packets are not being
>       dropped in the
>       > >>> switches for these nodes. We have tried different kernels
>       including
>       > >>> the recent Google patch to cubic. This is showing up on
>       three cluster
>       > >>> (two Ethernet and one IPoIB). I booted one cluster into
>       Debian Jessie
>       > >>> (from CentOS 7.1) with similar results.
>       > >>>
>       > >>> The messages seem slightly different:
>       > >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425
>       439 :
>       > >>> cluster [WRN] 14 slow requests, 1 included below; oldest
>       blocked for >
>       > >>> 100.087155 secs
>       > >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425
>       440 :
>       > >>> cluster [WRN] slow request 30.041999 seconds old, received
>       at
>       > >>> 2015-10-03 14:37:53.151014:
>       osd_op(client.1328605.0:7082862
>       > >>> rbd_data.13fdcb2ae8944a.000000000001264f [read
>       975360~4096]
>       > >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently
>       no flag
>       > >>> points reached
>       > >>>
>       > >>> I don't know what "no flag points reached" means.
>       > >>
>       > >> Just that the op hasn't been marked as reaching any
>       interesting points
>       > >> (op->mark_*() calls).
>       > >>
>       > >> Is it possible to gather a lot with debug ms = 20 and debug
>       osd = 20?
>       > >> It's extremely verbose but it'll let us see where the op is
>       getting
>       > >> blocked.  If you see the "slow request" message it means
>       the op in
>       > >> received by ceph (that's when the clock starts), so I
>       suspect it's not
>       > >> something we can blame on the network stack.
>       > >>
>       > >> sage
>       > >>
>       > >>
>       > >>>
>       > >>> The problem is most pronounced when we have to reboot an
>       OSD node (1
>       > >>> of 13), we will have hundreds of I/O blocked for some
>       times up to 300
>       > >>> seconds. It takes a good 15 minutes for things to settle
>       down. The
>       > >>> production cluster is very busy doing normally 8,000 I/O
>       and peaking
>       > >>> at 15,000. This is all 4TB spindles with SSD journals and
>       the disks
>       > >>> are between 25-50% full. We are currently splitting PGs to
>       distribute
>       > >>> the load better across the disks, but we are having to do
>       this 10 PGs
>       > >>> at a time as we get blocked I/O. We have max_backfills and
>       > >>> max_recovery set to 1, client op priority is set higher
>       than recovery
>       > >>> priority. We tried increasing the number of op threads but
>       this didn't
>       > >>> seem to help. It seems as soon as PGs are finished being
>       checked, they
>       > >>> become active and could be the cause for slow I/O while
>       the other PGs
>       > >>> are being checked.
>       > >>>
>       > >>> What I don't understand is that the messages are delayed.
>       As soon as
>       > >>> the message is received by Ceph OSD process, it is very
>       quickly
>       > >>> committed to the journal and a response is sent back to
>       the primary
>       > >>> OSD which is received very quickly as well. I've adjust
>       > >>> min_free_kbytes and it seems to keep the OSDs from
>       crashing, but
>       > >>> doesn't solve the main problem. We don't have swap and
>       there is 64 GB
>       > >>> of RAM per nodes for 10 OSDs.
>       > >>>
>       > >>> Is there something that could cause the kernel to get a
>       packet but not
>       > >>> be able to dispatch it to Ceph such that it could be
>       explaining why we
>       > >>> are seeing these blocked I/O for 30+ seconds. Is there
>       some pointers
>       > >>> to tracing Ceph messages from the network buffer through
>       the kernel to
>       > >>> the Ceph process?
>       > >>>
>       > >>> We can really use some pointers no matter how outrageous.
>       We've have
>       > >>> over 6 people looking into this for weeks now and just
>       can't think of
>       > >>> anything else.
>       > >>>
>       > >>> Thanks,
>       > >>> -----BEGIN PGP SIGNATURE-----
>       > >>> Version: Mailvelope v1.1.0
>       > >>> Comment: https://www.mailvelope.com
>       > >>>
>       > >>>
>       wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>       > >>>
>       NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>       > >>>
>       prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>       > >>>
>       K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>       > >>>
>       h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>       > >>>
>       iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>       > >>>
>       Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>       > >>>
>       Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>       > >>>
>       JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>       > >>>
>       8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>       > >>>
>       lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>       > >>>
>       4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>       > >>> l7OF
>       > >>> =OI++
>       > >>> -----END PGP SIGNATURE-----
>       > >>> ----------------
>       > >>> Robert LeBlanc
>       > >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
>       FA62 B9F1
>       > >>>
>       > >>>
>       > >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc
>       <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>       > >>> > We dropped the replication on our cluster from 4 to 3
>       and it looks
>       > >>> > like all the blocked I/O has stopped (no entries in the
>       log for the
>       > >>> > last 12 hours). This makes me believe that there is some
>       issue with
>       > >>> > the number of sockets or some other TCP issue. We have
>       not messed with
>       > >>> > Ephemeral ports and TIME_WAIT at this point. There are
>       130 OSDs, 8 KVM
>       > >>> > hosts hosting about 150 VMs. Open files is set at 32K
>       for the OSD
>       > >>> > processes and 16K system wide.
>       > >>> >
>       > >>> > Does this seem like the right spot to be looking? What
>       are some
>       > >>> > configuration items we should be looking at?
>       > >>> >
>       > >>> > Thanks,
>       > >>> > ----------------
>       > >>> > Robert LeBlanc
>       > >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
>       FA62 B9F1
>       > >>> >
>       > >>> >
>       > >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc
>       <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>       > >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>       > >>> >> Hash: SHA256
>       > >>> >>
>       > >>> >> We were able to only get ~17Gb out of the XL710
>       (heavily tweaked)
>       > >>> >> until we went to the 4.x kernel where we got ~36Gb (no
>       tweaking). It
>       > >>> >> seems that there were some major reworks in the network
>       handling in
>       > >>> >> the kernel to efficiently handle that network rate. If
>       I remember
>       > >>> >> right we also saw a drop in CPU utilization. I'm
>       starting to think
>       > >>> >> that we did see packet loss while congesting our ISLs
>       in our initial
>       > >>> >> testing, but we could not tell where the dropping was
>       happening. We
>       > >>> >> saw some on the switches, but it didn't seem to be bad
>       if we weren't
>       > >>> >> trying to congest things. We probably already saw this
>       issue, just
>       > >>> >> didn't know it.
>       > >>> >> - ----------------
>       > >>> >> Robert LeBlanc
>       > >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
>       3BB2 FA62 B9F1
>       > >>> >>
>       > >>> >>
>       > >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>       > >>> >>> FWIW, we've got some 40GbE Intel cards in the
>       community performance cluster
>       > >>> >>> on a Mellanox 40GbE switch that appear (knock on wood)
>       to be running fine
>       > >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback
>       from Intel that older
>       > >>> >>> drivers might cause problems though.
>       > >>> >>>
>       > >>> >>> Here's ifconfig from one of the nodes:
>       > >>> >>>
>       > >>> >>> ens513f1: flags=4163  mtu 1500
>       > >>> >>>         inet 10.0.10.101  netmask 255.255.255.0 
>       broadcast 10.0.10.255
>       > >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64 
>       scopeid 0x20
>       > >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000 
>       (Ethernet)
>       > >>> >>>         RX packets 169232242875  bytes 229346261232279
>       (208.5 TiB)
>       > >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>       > >>> >>>         TX packets 153491686361  bytes 203976410836881
>       (185.5 TiB)
>       > >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0 
>       collisions 0
>       > >>> >>>
>       > >>> >>> Mark
>       > >>> >>>
>       > >>> >>>
>       > >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>       > >>> >>>>
>       > >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>       > >>> >>>> Hash: SHA256
>       > >>> >>>>
>       > >>> >>>> OK, here is the update on the saga...
>       > >>> >>>>
>       > >>> >>>> I traced some more of blocked I/Os and it seems that
>       communication
>       > >>> >>>> between two hosts seemed worse than others. I did a
>       two way ping flood
>       > >>> >>>> between the two hosts using max packet sizes (1500).
>       After 1.5M
>       > >>> >>>> packets, no lost pings. I then had the ping flood
>       running while I
>       > >>> >>>> put Ceph load on the cluster and the dropped pings
>       started increasing;
>       > >>> >>>> after stopping the Ceph workload the pings stopped
>       > >>> >>>> dropping.
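A ping flood along these lines (run in both directions) matches the test described above; 1472 bytes of payload plus 28 bytes of headers fills the 1500-byte MTU, and -f needs root. The host name is a placeholder:

    ping -f -s 1472 <other-host>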
>       > >>> >>>>
>       > >>> >>>> I then ran iperf between all the nodes with the same
>       results, so that
>       > >>> >>>> ruled out Ceph to a large degree. I then booted into
>       the
>       > >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour
>       test so far there
>       > >>> >>>> hasn't been any dropped pings or blocked I/O. Our 40
>       Gb NICs really
>       > >>> >>>> need the network enhancements in the 4.x series to
>       work well.
>       > >>> >>>>
>       > >>> >>>> Does this sound familiar to anyone? I'll probably
>       start bisecting the
>       > >>> >>>> kernel to see where this issue is introduced. Both of
>       the clusters
>       > >>> >>>> with this issue are running 4.x; other than that,
>       they have pretty
>       > >>> >>>> different hardware and network configs.
>       > >>> >>>>
>       > >>> >>>> Thanks,
>       > >>> >>>> -----BEGIN PGP SIGNATURE-----
>       > >>> >>>> Version: Mailvelope v1.1.0
>       > >>> >>>> Comment: https://www.mailvelope.com
>       > >>> >>>>
>       > >>> >>>>
>       wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>       > >>> >>>>
>       RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>       > >>> >>>>
>       AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>       > >>> >>>>
>       7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>       > >>> >>>>
>       cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>       > >>> >>>>
>       F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>       > >>> >>>>
>       byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>       > >>> >>>>
>       /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>       > >>> >>>>
>       LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>       > >>> >>>>
>       UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>       > >>> >>>>
>       sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>       > >>> >>>>
>       KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>       > >>> >>>> 4OEo
>       > >>> >>>> =P33I
>       > >>> >>>> -----END PGP SIGNATURE-----
>       > >>> >>>> ----------------
>       > >>> >>>> Robert LeBlanc
>       > >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
>       3BB2 FA62 B9F1
>       > >>> >>>>
>       > >>> >>>>
>       > >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>       > >>> >>>> wrote:
>       > >>> >>>>>
>       > >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>       > >>> >>>>> Hash: SHA256
>       > >>> >>>>>
>       > >>> >>>>> This is IPoIB and we have the MTU set to 64K. There
>       were some issues
>       > >>> >>>>> pinging hosts with "No buffer space available"
>       (hosts are currently
>       > >>> >>>>> configured for 4GB to test SSD caching rather than
>       page cache). I
>       > >>> >>>>> found that MTU under 32K worked reliably for ping,
>       but still had the
>       > >>> >>>>> blocked I/O.
>       > >>> >>>>>
>       > >>> >>>>> I reduced the MTU to 1500 and checked pings (OK),
>       but I'm still seeing
>       > >>> >>>>> the blocked I/O.
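The MTU changes themselves are just the usual ip link toggle; the interface name below is a stand-in for the IPoIB device:

    ip link set dev ib0 mtu 1500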
>       > >>> >>>>> - ----------------
>       > >>> >>>>> Robert LeBlanc
>       > >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
>       3BB2 FA62 B9F1
>       > >>> >>>>>
>       > >>> >>>>>
>       > >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>       > >>> >>>>>>
>       > >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>       > >>> >>>>>>>
>       > >>> >>>>>>> I looked at the logs, it looks like there was a 53
>       second delay
>       > >>> >>>>>>> between when osd.17 started sending the osd_repop
>       message and when
>       > >>> >>>>>>> osd.13 started reading it, which is pretty weird. 
>       Sage, didn't we
>       > >>> >>>>>>> once see a kernel issue which caused some messages
>       to be mysteriously
>       > >>> >>>>>>> delayed for many 10s of seconds?
>       > >>> >>>>>>
>       > >>> >>>>>>
>       > >>> >>>>>> Every time we have seen this behavior and diagnosed
>       it in the wild it
>       > >>> >>>>>> has
>       > >>> >>>>>> been a network misconfiguration.  Usually related
>       to jumbo frames.
>       > >>> >>>>>>
>       > >>> >>>>>> sage
>       > >>> >>>>>>
>       > >>> >>>>>>
>       > >>> >>>>>>>
>       > >>> >>>>>>> What kernel are you running?
>       > >>> >>>>>>> -Sam
>       > >>> >>>>>>>
>       > >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc 
>       wrote:
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>       > >>> >>>>>>>> Hash: SHA256
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> OK, looping in ceph-devel to see if I can get
>       some more eyes. I've
>       > >>> >>>>>>>> extracted what I think are important entries from
>       the logs for the
>       > >>> >>>>>>>> first blocked request. NTP is running on all the
>       servers so the logs
>       > >>> >>>>>>>> should be close in terms of time. Logs for 12:50
>       to 13:00 are
>       > >>> >>>>>>>> available at
>       http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from
>       client
>       > >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O
>       to osd.13
>       > >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O
>       to osd.16
>       > >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from
>       osd.17
>       > >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk
>       result=0 from osd.16
>       > >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to
>       osd.17 ondisk result=0
>       > >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow
>       I/O > 30.439150 sec
>       > >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from
>       osd.17
>       > >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk
>       result=0 from osd.13
>       > >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to
>       osd.17 ondisk result=0
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> In the logs I can see that osd.17 dispatches the
>       I/O to osd.13 and
>       > >>> >>>>>>>> osd.16 almost simultaneously. osd.16 seems to get
>       the I/O right away,
>       > >>> >>>>>>>> but for some reason osd.13 doesn't get the
>       message until 53 seconds
>       > >>> >>>>>>>> later. osd.17 seems happy to just wait and
>       doesn't resend the data
>       > >>> >>>>>>>> (well, I'm not 100% sure how to tell which
>       entries are the actual data
>       > >>> >>>>>>>> transfer).
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> It looks like osd.17 is receiving responses to
>       start the communication
>       > >>> >>>>>>>> with osd.13, but the op is not acknowledged until
>       almost a minute
>       > >>> >>>>>>>> later. To me it seems that the message is getting
>       received but not
>       > >>> >>>>>>>> passed to another thread right away or something.
>       This test was done
>       > >>> >>>>>>>> with an idle cluster, a single fio client (rbd
>       engine) with a single
>       > >>> >>>>>>>> thread.
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> The OSD servers are almost 100% idle during these
>       blocked I/O
>       > >>> >>>>>>>> requests. I think I'm at the end of my
>       troubleshooting, so I can use
>       > >>> >>>>>>>> some help.
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> Single Test started about
>       > >>> >>>>>>>> 2015-09-22 12:52:36
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17
>       192.168.55.14:6800/16726 56 :
>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
>       oldest blocked for >
>       > >>> >>>>>>>> 30.439150 secs
>       > >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17
>       192.168.55.14:6800/16726 57 :
>       > >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old,
>       received at
>       > >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>       > >>> >>>>>>>>   osd_op(client.250874.0:1388
>       rbd_data.3380e2ae8944a.0000000000000545
>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>       4194304,write
>       > >>> >>>>>>>> 0~4194304] 8.bbf3e8ff
>       ack+ondisk+write+known_if_redirected e56785)
>       > >>> >>>>>>>>   currently waiting for subops from 13,16
>       > >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16
>       192.168.55.13:6800/29410 7 : cluster
>       > >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest
>       blocked for >
>       > >>> >>>>>>>> 30.379680 secs
>       > >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16
>       192.168.55.13:6800/29410 8 : cluster
>       > >>> >>>>>>>> [WRN] slow request 30.291520 seconds old,
>       received at 2015-09-22
>       > >>> >>>>>>>> 12:55:06.406303:
>       > >>> >>>>>>>>   osd_op(client.250874.0:1384
>       rbd_data.3380e2ae8944a.0000000000000541
>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>       4194304,write
>       > >>> >>>>>>>> 0~4194304] 8.5fb2123f
>       ack+ondisk+write+known_if_redirected e56785)
>       > >>> >>>>>>>>   currently waiting for subops from 13,17
>       > >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16
>       192.168.55.13:6800/29410 9 : cluster
>       > >>> >>>>>>>> [WRN] slow request 30.379680 seconds old,
>       received at 2015-09-22
>       > >>> >>>>>>>> 12:55:06.318144:
>       > >>> >>>>>>>>   osd_op(client.250874.0:1382
>       rbd_data.3380e2ae8944a.000000000000053f
>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>       4194304,write
>       > >>> >>>>>>>> 0~4194304] 8.312e69ca
>       ack+ondisk+write+known_if_redirected e56785)
>       > >>> >>>>>>>>   currently waiting for subops from 13,14
>       > >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13
>       192.168.55.12:6804/4574 130 :
>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
>       oldest blocked for >
>       > >>> >>>>>>>> 30.954212 secs
>       > >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13
>       192.168.55.12:6804/4574 131 :
>       > >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old,
>       received at
>       > >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>       > >>> >>>>>>>>   osd_op(client.250874.0:1873
>       rbd_data.3380e2ae8944a.000000000000070d
>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>       4194304,write
>       > >>> >>>>>>>> 0~4194304] 8.e69870d4
>       ack+ondisk+write+known_if_redirected e56785)
>       > >>> >>>>>>>>   currently waiting for subops from 16,17
>       > >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16
>       192.168.55.13:6800/29410 10 :
>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
>       oldest blocked for >
>       > >>> >>>>>>>> 30.704367 secs
>       > >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16
>       192.168.55.13:6800/29410 11 :
>       > >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old,
>       received at
>       > >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>       > >>> >>>>>>>>   osd_op(client.250874.0:1874
>       rbd_data.3380e2ae8944a.000000000000070e
>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>       4194304,write
>       > >>> >>>>>>>> 0~4194304] 8.f7635819
>       ack+ondisk+write+known_if_redirected e56785)
>       > >>> >>>>>>>>   currently waiting for subops from 13,17
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> Server   IP addr              OSD
>       > >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>       > >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>       > >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>       > >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>       > >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>       > >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> fio job:
>       > >>> >>>>>>>> [rbd-test]
>       > >>> >>>>>>>> readwrite=write
>       > >>> >>>>>>>> blocksize=4M
>       > >>> >>>>>>>> #runtime=60
>       > >>> >>>>>>>> name=rbd-test
>       > >>> >>>>>>>> #readwrite=randwrite
>       > >>> >>>>>>>>
>       #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>       > >>> >>>>>>>> #rwmixread=72
>       > >>> >>>>>>>> #norandommap
>       > >>> >>>>>>>> #size=1T
>       > >>> >>>>>>>> #blocksize=4k
>       > >>> >>>>>>>> ioengine=rbd
>       > >>> >>>>>>>> rbdname=test2
>       > >>> >>>>>>>> pool=rbd
>       > >>> >>>>>>>> clientname=admin
>       > >>> >>>>>>>> iodepth=8
>       > >>> >>>>>>>> #numjobs=4
>       > >>> >>>>>>>> #thread
>       > >>> >>>>>>>> #group_reporting
>       > >>> >>>>>>>> #time_based
>       > >>> >>>>>>>> #direct=1
>       > >>> >>>>>>>> #ramp_time=60
>       > >>> >>>>>>>>
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> Thanks,
>       > >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>       > >>> >>>>>>>> Version: Mailvelope v1.1.0
>       > >>> >>>>>>>> Comment: https://www.mailvelope.com
>       > >>> >>>>>>>>
>       > >>> >>>>>>>>
>       wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>       > >>> >>>>>>>>
>       tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>       > >>> >>>>>>>>
>       h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>       > >>> >>>>>>>>
>       903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>       > >>> >>>>>>>>
>       sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>       > >>> >>>>>>>>
>       FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>       > >>> >>>>>>>>
>       pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>       > >>> >>>>>>>>
>       5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>       > >>> >>>>>>>>
>       B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>       > >>> >>>>>>>>
>       4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>       > >>> >>>>>>>>
>       o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>       > >>> >>>>>>>>
>       gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>       > >>> >>>>>>>> J3hS
>       > >>> >>>>>>>> =0J7F
>       > >>> >>>>>>>> -----END PGP SIGNATURE-----
>       > >>> >>>>>>>> ----------------
>       > >>> >>>>>>>> Robert LeBlanc
>       > >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E
>       E654 3BB2 FA62 B9F1
>       > >>> >>>>>>>>
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum 
>       wrote:
>       > >>> >>>>>>>>>
>       > >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc 
>       wrote:
>       > >>> >>>>>>>>>>
>       > >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>       > >>> >>>>>>>>>> Hash: SHA256
>       > >>> >>>>>>>>>>
>       > >>> >>>>>>>>>> Is there some way to tell in the logs that this
>       is happening?
>       > >>> >>>>>>>>>
>       > >>> >>>>>>>>>
>       > >>> >>>>>>>>> You can search for the (mangled) name
>       _split_collection
>       > >>> >>>>>>>>>>
>       > >>> >>>>>>>>>> I'm not
>       > >>> >>>>>>>>>> seeing much I/O, CPU usage during these times.
>       Is there some way to
>       > >>> >>>>>>>>>> prevent the splitting? Is there a negative side
>       effect to doing so?
>       > >>> >>>>>>>>>
>       > >>> >>>>>>>>>
>       > >>> >>>>>>>>> Bump up the split and merge thresholds. You can
>       search the list for
>       > >>> >>>>>>>>> this, it was discussed not too long ago.
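The thresholds Greg refers to are the FileStore directory split/merge settings; a hedged example of bumping them (the values are illustrative and workload-dependent):

    [osd]
        filestore merge threshold = 40
        filestore split multiple = 8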
>       > >>> >>>>>>>>>
>       > >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as
>       soon as the sessions
>       > >>> >>>>>>>>>> are aborted, they are reestablished and
>       complete immediately.
>       > >>> >>>>>>>>>>
>       > >>> >>>>>>>>>> The fio test is just a seq write, starting it
>       over (rewriting from
>       > >>> >>>>>>>>>> the
>       > >>> >>>>>>>>>> beginning) is still causing the issue. I
>       suspect that it is not
>       > >>> >>>>>>>>>> having to create new files and therefore split
>       collections. This is
>       > >>> >>>>>>>>>> on
>       > >>> >>>>>>>>>> my test cluster with no other load.
>       > >>> >>>>>>>>>
>       > >>> >>>>>>>>>
>       > >>> >>>>>>>>> Hmm, that does make it seem less likely if
>       you're really not creating
>       > >>> >>>>>>>>> new objects, if you're actually running fio in
>       such a way that it's
>       > >>> >>>>>>>>> not allocating new FS blocks (this is probably
>       hard to set up?).
>       > >>> >>>>>>>>>
>       > >>> >>>>>>>>>>
>       > >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log
>       options and depths
>       > >>> >>>>>>>>>> would be the most helpful for tracking this
>       issue down?
>       > >>> >>>>>>>>>
>       > >>> >>>>>>>>>
>       > >>> >>>>>>>>> If you want to go log diving "debug osd = 20",
>       "debug filestore =
>       > >>> >>>>>>>>> 20",
>       > >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to
>       see. That should spit
>       > >>> >>>>>>>>> out
>       > >>> >>>>>>>>> everything you need to track exactly what each
>       Op is doing.
>       > >>> >>>>>>>>> -Greg
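Put together, the logging Greg asks for might look like this, either in ceph.conf before a restart or injected into the running OSDs:

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

    # or on a live cluster:
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'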
>       > >>> >>>>>>>>
>       > >>> >>>>>>>> --
>       > >>> >>>>>>>> To unsubscribe from this list: send the line
>       "unsubscribe ceph-devel"
>       > >>> >>>>>>>> in
>       > >>> >>>>>>>> the body of a message to
>       majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>       > >>> >>>>>>>> More majordomo info at 
>       http://vger.kernel.org/majordomo-info.html
>       > >>> >>>>>>>
>       > >>> >>>>>>>
>       > >>> >>>>>>>
>       > >>> >>>>>
>       > >>> >>>>> -----BEGIN PGP SIGNATURE-----
>       > >>> >>>>> Version: Mailvelope v1.1.0
>       > >>> >>>>> Comment: https://www.mailvelope.com
>       > >>> >>>>>
>       > >>> >>>>>
>       wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>       > >>> >>>>>
>       a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>       > >>> >>>>>
>       a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>       > >>> >>>>>
>       s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>       > >>> >>>>>
>       iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>       > >>> >>>>>
>       izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>       > >>> >>>>>
>       caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>       > >>> >>>>>
>       efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>       > >>> >>>>>
>       GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>       > >>> >>>>>
>       glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>       > >>> >>>>>
>       +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>       > >>> >>>>>
>       pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>       > >>> >>>>> gcZm
>       > >>> >>>>> =CjwB
>       > >>> >>>>> -----END PGP SIGNATURE-----
>       > >>> >>>>
>       > >>> >>>> --
>       > >>> >>>> To unsubscribe from this list: send the line
>       "unsubscribe ceph-devel" in
>       > >>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>       > >>> >>>> More majordomo info at 
>       http://vger.kernel.org/majordomo-info.html
>       > >>> >>>>
>       > >>> >>>
>       > >>> >>
>       > >>> >> -----BEGIN PGP SIGNATURE-----
>       > >>> >> Version: Mailvelope v1.1.0
>       > >>> >> Comment: https://www.mailvelope.com
>       > >>> >>
>       > >>> >>
>       wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>       > >>> >>
>       S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>       > >>> >>
>       lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>       > >>> >>
>       0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>       > >>> >>
>       JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>       > >>> >>
>       dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>       > >>> >>
>       nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>       > >>> >>
>       krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>       > >>> >>
>       FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>       > >>> >>
>       tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>       > >>> >>
>       hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>       > >>> >>
>       BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>       > >>> >> ae22
>       > >>> >> =AX+L
>       > >>> >> -----END PGP SIGNATURE-----
>       > >>> _______________________________________________
>       > >>> ceph-users mailing list
>       > >>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>       > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>       > >>>
>       > >>>
>       > _______________________________________________
>       > ceph-users mailing list
>       > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>       > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>       >
>       >
> 
> 
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [ceph-users] Potential OSD deadlock?
  2015-10-06 14:38                                                                           ` Sage Weil
@ 2015-10-06 15:51                                                                             ` Robert LeBlanc
       [not found]                                                                               ` <CAANLjFpEPQbvpnMREu-kcPORK28V1CdWBe7655wHp-74AwwQUg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-10-06 16:26                                                                             ` Ken Dreyer
  1 sibling, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-06 15:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, ceph-users

I downgraded to the hammer gitbuilder branch, but it looks like I've
passed the point of no return:

2015-10-06 09:44:52.210873 7fd3dd8b78c0 -1 ERROR: on disk data
includes unsupported features:
compat={},rocompat={},incompat={7=support shec erasure code}
2015-10-06 09:44:52.210922 7fd3dd8b78c0 -1 error checking features:
(1) Operation not permitted

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 8:38 AM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> Thanks for your time Sage. It sounds like a few people may be helped if you
>> can find something.
>>
>> I did a recursive chown as in the instructions (although I didn't know about
>> the doc at the time). I did an osd debug at 20/20 but didn't see anything.
>> I'll also do ms and make the logs available. I'll also review the document
>> to make sure I didn't miss anything else.
>
> Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build)
> first.  They won't be allowed to boot until that happens... all upgrades
> must stop at 0.94.4 first.  And that isn't released yet.. we'll try to
> do that today.  In the meantime, you can use the hammer gitbuilder
> build...
>
> sage
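For anyone else stuck here, a quick sanity check of what each daemon is actually running before and after stepping through 0.94.4 (a sketch; the wildcard form assumes a recent enough ceph CLI):

    # installed package on each node
    ceph --version
    # version each running OSD reports
    ceph tell osd.* version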
>
>
>>
>> Robert LeBlanc
>>
>> Sent from a mobile device please excuse any typos.
>>
>> On Oct 6, 2015 6:37 AM, "Sage Weil" <sweil@redhat.com> wrote:
>>       On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>       > -----BEGIN PGP SIGNED MESSAGE-----
>>       > Hash: SHA256
>>       >
>>       > With some off-list help, we have adjusted
>>       > osd_client_message_cap=10000. This seems to have helped a bit
>>       and we
>>       > have seen some OSDs have a value up to 4,000 for client
>>       messages. But
>>       > it does not solve the problem with the blocked I/O.
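For reference, that throttle can go in ceph.conf or be injected on the fly; 10000 is just the value from the test above:

    [osd]
        osd client message cap = 10000

    # or, on running OSDs:
    ceph tell osd.* injectargs '--osd_client_message_cap 10000'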
>>       >
>>       > One thing that I have noticed is that almost exactly 30
>>       seconds elapse
>>       > between when an OSD boots and the first blocked I/O message. I
>>       don't know
>>       > if the OSD doesn't have time to get its brain right about a
>>       PG before
>>       > it starts servicing it or what exactly.
>>
>>       I'm downloading the logs from yesterday now; sorry it's taking
>>       so long.
>>
>>       > On another note, I tried upgrading our CentOS dev cluster from
>>       Hammer
>>       > to master and things didn't go so well. The OSDs would not
>>       start
>>       > because /var/lib/ceph was not owned by ceph. I chowned the
>>       directory
>>       > and all OSD directories, and the OSDs then started, but never became active
>>       in the
>>       > cluster. It just sat there after reading all the PGs. There
>>       were
>>       > sockets open to the monitor, but no OSD to OSD sockets. I
>>       tried
>>       > downgrading to the Infernalis branch and still no luck getting
>>       the
>>       > OSDs to come up. The OSD processes were idle after the initial
>>       boot.
>>       > All packages were installed from gitbuilder.
>>
>>       Did you chown -R ?
>>
>>              https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgradin
>>       g-from-hammer
>>
>>       My guess is you only chowned the root dir, and the OSD didn't
>>       throw
>>       an error when it encountered the other files?  If you can
>>       generate a debug
>>       osd = 20 log, that would be helpful.. thanks!
>>
>>       sage
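For reference, a minimal sketch of the full ownership fix plus the logging Sage asks for (paths per the release notes above; stop the daemons first):

    # recurse over everything under /var/lib/ceph, not just the top-level dir
    chown -R ceph:ceph /var/lib/ceph
    # then turn up OSD logging before restarting, e.g. in ceph.conf:
    [osd]
        debug osd = 20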
>>
>>
>>       >
>>       > Thanks,
>>       > -----BEGIN PGP SIGNATURE-----
>>       > Version: Mailvelope v1.2.0
>>       > Comment: https://www.mailvelope.com
>>       >
>>       > wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>>       > YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>>       > 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>>       > aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>>       > y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>>       > 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>>       > ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>>       > zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>>       > D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>>       > CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>>       > 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>>       > fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>>       > GdXC
>>       > =Aigq
>>       > -----END PGP SIGNATURE-----
>>       > ----------------
>>       > Robert LeBlanc
>>       > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62
>>       B9F1
>>       >
>>       >
>>       > On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc
>>       <robert@leblancnet.us> wrote:
>>       > > -----BEGIN PGP SIGNED MESSAGE-----
>>       > > Hash: SHA256
>>       > >
>>       > > I have eight nodes running the fio job rbd_test_real to
>>       different RBD
>>       > > volumes. I've included the CRUSH map in the tarball.
>>       > >
>>       > > I stopped one OSD process and marked it out. I let it
>>       recover for a
>>       > > few minutes and then I started the process again and marked
>>       it in. I
>>       > > started getting block I/O messages during the recovery.
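A sketch of the out/in cycle used for this kind of test (the osd id is a placeholder, and the daemon is stopped/started with whatever your init system uses):

    ceph osd out 13      # after stopping the daemon, mark it out
    # ...let recovery run for a few minutes, restart the daemon, then:
    ceph osd in 13       # mark it back in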
>>       > >
>>       > > The logs are located at
>>       http://162.144.87.113/files/ushou1.tar.xz
>>       > >
>>       > > Thanks,
>>       > > -----BEGIN PGP SIGNATURE-----
>>       > > Version: Mailvelope v1.2.0
>>       > > Comment: https://www.mailvelope.com
>>       > >
>>       > > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>>       > > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>>       > > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>>       > > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>>       > > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>>       > > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>>       > > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>>       > > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>>       > > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>>       > > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>>       > > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>>       > > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>>       > > 3EPx
>>       > > =UDIV
>>       > > -----END PGP SIGNATURE-----
>>       > >
>>       > > ----------------
>>       > > Robert LeBlanc
>>       > > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
>>       FA62 B9F1
>>       > >
>>       > >
>>       > > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil <sweil@redhat.com>
>>       wrote:
>>       > >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>       > >>> -----BEGIN PGP SIGNED MESSAGE-----
>>       > >>> Hash: SHA256
>>       > >>>
>>       > >>> We are still struggling with this and have tried a lot of
>>       different
>>       > >>> things. Unfortunately, Inktank (now Red Hat) no longer
>>       provides
>>       > >>> consulting services for non-Red Hat systems. If there are
>>       any
>>       > >>> certified Ceph consultants in the US who can do both
>>       remote and
>>       > >>> on-site engagements, please let us know.
>>       > >>>
>>       > >>> This certainly seems to be network related, but somewhere
>>       in the
>>       > >>> kernel. We have tried increasing the network and TCP
>>       buffers and the number
>>       > >>> of TCP sockets, and reducing the FIN_WAIT2 timeout. There is
>>       about 25% idle
>>       > >>> on the boxes, the disks are busy, but not constantly at
>>       100% (they
>>       > >>> cycle from <10% up to 100%, but not 100% for more than a
>>       few seconds
>>       > >>> at a time). There seems to be no reasonable explanation
>>       why I/O is
>>       > >>> blocked pretty frequently longer than 30 seconds. We have
>>       verified
>>       > >>> Jumbo frames by pinging from/to each node with 9000 byte
>>       packets. The
>>       > >>> network admins have verified that packets are not being
>>       dropped in the
>>       > >>> switches for these nodes. We have tried different kernels
>>       including
>>       > >>> the recent Google patch to cubic. This is showing up on
>>       three cluster
>>       > >>> (two Ethernet and one IPoIB). I booted one cluster into
>>       Debian Jessie
>>       > >>> (from CentOS 7.1) with similar results.
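For what it's worth, a don't-fragment ping sized to fill a 9000-byte frame is enough for this kind of check (8972 bytes of payload once IP/ICMP headers are added; the host name is a placeholder):

    ping -M do -s 8972 -c 10 <peer-node>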
>>       > >>>
>>       > >>> The messages seem slightly different:
>>       > >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425
>>       439 :
>>       > >>> cluster [WRN] 14 slow requests, 1 included below; oldest
>>       blocked for >
>>       > >>> 100.087155 secs
>>       > >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425
>>       440 :
>>       > >>> cluster [WRN] slow request 30.041999 seconds old, received
>>       at
>>       > >>> 2015-10-03 14:37:53.151014:
>>       osd_op(client.1328605.0:7082862
>>       > >>> rbd_data.13fdcb2ae8944a.000000000001264f [read
>>       975360~4096]
>>       > >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently
>>       no flag
>>       > >>> points reached
>>       > >>>
>>       > >>> I don't know what "no flag points reached" means.
>>       > >>
>>       > >> Just that the op hasn't been marked as reaching any
>>       interesting points
>>       > >> (op->mark_*() calls).
>>       > >>
>>       > >> Is it possible to gather a lot with debug ms = 20 and debug
>>       osd = 20?
>>       > >> It's extremely verbose but it'll let us see where the op is
>>       getting
>>       > >> blocked.  If you see the "slow request" message it means
>>       the op in
>>       > >> received by ceph (that's when the clock starts), so I
>>       suspect it's not
>>       > >> something we can blame on the network stack.
>>       > >>
>>       > >> sage
>>       > >>
>>       > >>
>>       > >>>
>>       > >>> The problem is most pronounced when we have to reboot an
>>       OSD node (1
>>       > >>> of 13), we will have hundreds of I/Os blocked, sometimes
>>       for up to 300
>>       > >>> seconds. It takes a good 15 minutes for things to settle
>>       down. The
>>       > >>> production cluster is very busy doing normally 8,000 I/O
>>       and peaking
>>       > >>> at 15,000. This is all 4TB spindles with SSD journals and
>>       the disks
>>       > >>> are between 25-50% full. We are currently splitting PGs to
>>       distribute
>>       > >>> the load better across the disks, but we are having to do
>>       this 10 PGs
>>       > >>> at a time as we get blocked I/O. We have max_backfills and
>>       > >>> max_recovery set to 1, client op priority is set higher
>>       than recovery
>>       > >>> priority. We tried increasing the number of op threads but
>>       this didn't
>>       > >>> seem to help. It seems as soon as PGs are finished being
>>       checked, they
>>       > >>> become active and could be the cause for slow I/O while
>>       the other PGs
>>       > >>> are being checked.
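The recovery tuning described above corresponds roughly to these OSD options (a sketch; the two priority values are illustrative, not the exact ones in use here):

    [osd]
        osd max backfills = 1
        osd recovery max active = 1
        osd client op priority = 63
        osd recovery op priority = 1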
>>       > >>>
>>       > >>> What I don't understand is that the messages are delayed.
>>       As soon as
>>       > >>> the message is received by Ceph OSD process, it is very
>>       quickly
>>       > >>> committed to the journal and a response is sent back to
>>       the primary
>>       > >>> OSD which is received very quickly as well. I've adjusted
>>       > >>> min_free_kbytes and it seems to keep the OSDs from
>>       crashing, but
>>       > >>> doesn't solve the main problem. We don't have swap and
>>       there is 64 GB
>>       > >>> of RAM per node for 10 OSDs.
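The knob in question is the VM sysctl; an illustrative way of raising it (the value is only an example, not a recommendation):

    sysctl -w vm.min_free_kbytes=2097152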
>>       > >>>
>>       > >>> Is there something that could cause the kernel to get a
>>       packet but not
>>       > >>> be able to dispatch it to Ceph, such that it could
>>       explain why we
>>       > >>> are seeing these blocked I/Os for 30+ seconds? Are there
>>       any pointers
>>       > >>> to tracing Ceph messages from the network buffer through
>>       the kernel to
>>       > >>> the Ceph process?
>>       > >>>
>>       > >>> We can really use some pointers, no matter how outrageous.
>>       We've had
>>       > >>> over 6 people looking into this for weeks now and just
>>       can't think of
>>       > >>> anything else.
>>       > >>>
>>       > >>> Thanks,
>>       > >>> -----BEGIN PGP SIGNATURE-----
>>       > >>> Version: Mailvelope v1.1.0
>>       > >>> Comment: https://www.mailvelope.com
>>       > >>>
>>       > >>>
>>       wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>>       > >>>
>>       NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>>       > >>>
>>       prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>>       > >>>
>>       K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>>       > >>>
>>       h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>>       > >>>
>>       iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>>       > >>>
>>       Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>>       > >>>
>>       Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>>       > >>>
>>       JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>>       > >>>
>>       8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>>       > >>>
>>       lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>>       > >>>
>>       4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>>       > >>> l7OF
>>       > >>> =OI++
>>       > >>> -----END PGP SIGNATURE-----
>>       > >>> ----------------
>>       > >>> Robert LeBlanc
>>       > >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
>>       FA62 B9F1
>>       > >>>
>>       > >>>
>>       > >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc
>>       <robert@leblancnet.us> wrote:
>>       > >>> > We dropped the replication on our cluster from 4 to 3
>>       and it looks
>>       > >>> > like all the blocked I/O has stopped (no entries in the
>>       log for the
>>       > >>> > last 12 hours). This makes me believe that there is some
>>       issue with
>>       > >>> > the number of sockets or some other TCP issue. We have
>>       not messed with
>>       > >>> > Ephemeral ports and TIME_WAIT at this point. There are
>>       130 OSDs, 8 KVM
>>       > >>> > hosts hosting about 150 VMs. Open files is set at 32K
>>       for the OSD
>>       > >>> > processes and 16K system wide.
>>       > >>> >
>>       > >>> > Does this seem like the right spot to be looking? What
>>       are some
>>       > >>> > configuration items we should be looking at?
>>       > >>> >
>>       > >>> > Thanks,
>>       > >>> > ----------------
>>       > >>> > Robert LeBlanc
>>       > >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
>>       FA62 B9F1
>>       > >>> >
>>       > >>> >
>>       > >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc
>>       <robert@leblancnet.us> wrote:
>>       > >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>       > >>> >> Hash: SHA256
>>       > >>> >>
>>       > >>> >> We were able to only get ~17Gb out of the XL710
>>       (heavily tweaked)
>>       > >>> >> until we went to the 4.x kernel where we got ~36Gb (no
>>       tweaking). It
>>       > >>> >> seems that there were some major reworks in the network
>>       handling in
>>       > >>> >> the kernel to efficiently handle that network rate. If
>>       I remember
>>       > >>> >> right we also saw a drop in CPU utilization. I'm
>>       starting to think
>>       > >>> >> that we did see packet loss while congesting our ISLs
>>       in our initial
>>       > >>> >> testing, but we could not tell where the dropping was
>>       happening. We
>>       > >>> >> saw some on the switches, but it didn't seem to be bad
>>       if we weren't
>>       > >>> >> trying to congest things. We probably already saw this
>>       issue, just
>>       > >>> >> didn't know it.
>>       > >>> >> - ----------------
>>       > >>> >> Robert LeBlanc
>>       > >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
>>       3BB2 FA62 B9F1
>>       > >>> >>
>>       > >>> >>
>>       > >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>       > >>> >>> FWIW, we've got some 40GbE Intel cards in the
>>       community performance cluster
>>       > >>> >>> on a Mellanox 40GbE switch that appear (knock on wood)
>>       to be running fine
>>       > >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback
>>       from Intel that older
>>       > >>> >>> drivers might cause problems though.
>>       > >>> >>>
>>       > >>> >>> Here's ifconfig from one of the nodes:
>>       > >>> >>>
>>       > >>> >>> ens513f1: flags=4163  mtu 1500
>>       > >>> >>>         inet 10.0.10.101  netmask 255.255.255.0
>>       broadcast 10.0.10.255
>>       > >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64
>>       scopeid 0x20
>>       > >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000
>>       (Ethernet)
>>       > >>> >>>         RX packets 169232242875  bytes 229346261232279
>>       (208.5 TiB)
>>       > >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>>       > >>> >>>         TX packets 153491686361  bytes 203976410836881
>>       (185.5 TiB)
>>       > >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0
>>       collisions 0
>>       > >>> >>>
>>       > >>> >>> Mark
>>       > >>> >>>
>>       > >>> >>>
>>       > >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>       > >>> >>>>
>>       > >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>>       > >>> >>>> Hash: SHA256
>>       > >>> >>>>
>>       > >>> >>>> OK, here is the update on the saga...
>>       > >>> >>>>
>>       > >>> >>>> I traced some more of blocked I/Os and it seems that
>>       communication
>>       > >>> >>>> between two hosts seemed worse than others. I did a
>>       two way ping flood
>>       > >>> >>>> between the two hosts using max packet sizes (1500).
>>       After 1.5M
>>       > >>> >>>> packets, no lost pings. I then had the ping flood
>>       running while I
>>       > >>> >>>> put Ceph load on the cluster and the dropped pings
>>       started increasing;
>>       > >>> >>>> after stopping the Ceph workload the pings stopped
>>       dropping.
>>       > >>> >>>>
>>       > >>> >>>> I then ran iperf between all the nodes with the same
>>       results, so that
>>       > >>> >>>> ruled out Ceph to a large degree. I then booted into
>>       the
>>       > >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour
>>       test so far there
>>       > >>> >>>> hasn't been any dropped pings or blocked I/O. Our 40
>>       Gb NICs really
>>       > >>> >>>> need the network enhancements in the 4.x series to
>>       work well.
>>       > >>> >>>>
>>       > >>> >>>> Does this sound familiar to anyone? I'll probably
>>       start bisecting the
>>       > >>> >>>> kernel to see where this issue is introduced. Both of
>>       the clusters
>>       > >>> >>>> with this issue are running 4.x; other than that,
>>       they have pretty
>>       > >>> >>>> different hardware and network configs.
>>       > >>> >>>>
>>       > >>> >>>> Thanks,
>>       > >>> >>>> -----BEGIN PGP SIGNATURE-----
>>       > >>> >>>> Version: Mailvelope v1.1.0
>>       > >>> >>>> Comment: https://www.mailvelope.com
>>       > >>> >>>>
>>       > >>> >>>>
>>       wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>       > >>> >>>>
>>       RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>       > >>> >>>>
>>       AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>       > >>> >>>>
>>       7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>       > >>> >>>>
>>       cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>       > >>> >>>>
>>       F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>       > >>> >>>>
>>       byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>       > >>> >>>>
>>       /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>       > >>> >>>>
>>       LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>       > >>> >>>>
>>       UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>       > >>> >>>>
>>       sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>       > >>> >>>>
>>       KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>       > >>> >>>> 4OEo
>>       > >>> >>>> =P33I
>>       > >>> >>>> -----END PGP SIGNATURE-----
>>       > >>> >>>> ----------------
>>       > >>> >>>> Robert LeBlanc
>>       > >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
>>       3BB2 FA62 B9F1
>>       > >>> >>>>
>>       > >>> >>>>
>>       > >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>       > >>> >>>> wrote:
>>       > >>> >>>>>
>>       > >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>       > >>> >>>>> Hash: SHA256
>>       > >>> >>>>>
>>       > >>> >>>>> This is IPoIB and we have the MTU set to 64K. There
>>       were some issues
>>       > >>> >>>>> pinging hosts with "No buffer space available"
>>       (hosts are currently
>>       > >>> >>>>> configured for 4GB to test SSD caching rather than
>>       page cache). I
>>       > >>> >>>>> found that MTU under 32K worked reliably for ping,
>>       but still had the
>>       > >>> >>>>> blocked I/O.
>>       > >>> >>>>>
>>       > >>> >>>>> I reduced the MTU to 1500 and checked pings (OK),
>>       but I'm still seeing
>>       > >>> >>>>> the blocked I/O.
>>       > >>> >>>>> - ----------------
>>       > >>> >>>>> Robert LeBlanc
>>       > >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
>>       3BB2 FA62 B9F1
>>       > >>> >>>>>
>>       > >>> >>>>>
>>       > >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>       > >>> >>>>>>
>>       > >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>       > >>> >>>>>>>
>>       > >>> >>>>>>> I looked at the logs, it looks like there was a 53
>>       second delay
>>       > >>> >>>>>>> between when osd.17 started sending the osd_repop
>>       message and when
>>       > >>> >>>>>>> osd.13 started reading it, which is pretty weird.
>>       Sage, didn't we
>>       > >>> >>>>>>> once see a kernel issue which caused some messages
>>       to be mysteriously
>>       > >>> >>>>>>> delayed for many 10s of seconds?
>>       > >>> >>>>>>
>>       > >>> >>>>>>
>>       > >>> >>>>>> Every time we have seen this behavior and diagnosed
>>       it in the wild it
>>       > >>> >>>>>> has
>>       > >>> >>>>>> been a network misconfiguration.  Usually related
>>       to jumbo frames.
>>       > >>> >>>>>>
>>       > >>> >>>>>> sage
>>       > >>> >>>>>>
>>       > >>> >>>>>>
>>       > >>> >>>>>>>
>>       > >>> >>>>>>> What kernel are you running?
>>       > >>> >>>>>>> -Sam
>>       > >>> >>>>>>>
>>       > >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc
>>       wrote:
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>       > >>> >>>>>>>> Hash: SHA256
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> OK, looping in ceph-devel to see if I can get
>>       some more eyes. I've
>>       > >>> >>>>>>>> extracted what I think are important entries from
>>       the logs for the
>>       > >>> >>>>>>>> first blocked request. NTP is running on all the
>>       servers so the logs
>>       > >>> >>>>>>>> should be close in terms of time. Logs for 12:50
>>       to 13:00 are
>>       > >>> >>>>>>>> available at
>>       http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from
>>       client
>>       > >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O
>>       to osd.13
>>       > >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O
>>       to osd.16
>>       > >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from
>>       osd.17
>>       > >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk
>>       result=0 from osd.16
>>       > >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to
>>       osd.17 ondisk result=0
>>       > >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow
>>       I/O > 30.439150 sec
>>       > >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from
>>       osd.17
>>       > >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk
>>       result=0 from osd.13
>>       > >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to
>>       osd.17 ondisk result=0
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> In the logs I can see that osd.17 dispatches the
>>       I/O to osd.13 and
>>       > >>> >>>>>>>> osd.16 almost simultaneously. osd.16 seems to get
>>       the I/O right away,
>>       > >>> >>>>>>>> but for some reason osd.13 doesn't get the
>>       message until 53 seconds
>>       > >>> >>>>>>>> later. osd.17 seems happy to just wait and
>>       doesn't resend the data
>>       > >>> >>>>>>>> (well, I'm not 100% sure how to tell which
>>       entries are the actual data
>>       > >>> >>>>>>>> transfer).
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> It looks like osd.17 is receiving responses to
>>       start the communication
>>       > >>> >>>>>>>> with osd.13, but the op is not acknowledged until
>>       almost a minute
>>       > >>> >>>>>>>> later. To me it seems that the message is getting
>>       received but not
>>       > >>> >>>>>>>> passed to another thread right away or something.
>>       This test was done
>>       > >>> >>>>>>>> with an idle cluster, a single fio client (rbd
>>       engine) with a single
>>       > >>> >>>>>>>> thread.
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> The OSD servers are almost 100% idle during these
>>       blocked I/O
>>       > >>> >>>>>>>> requests. I think I'm at the end of my
>>       troubleshooting, so I can use
>>       > >>> >>>>>>>> some help.
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> Single Test started about
>>       > >>> >>>>>>>> 2015-09-22 12:52:36
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17
>>       192.168.55.14:6800/16726 56 :
>>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
>>       oldest blocked for >
>>       > >>> >>>>>>>> 30.439150 secs
>>       > >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17
>>       192.168.55.14:6800/16726 57 :
>>       > >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old,
>>       received at
>>       > >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>>       > >>> >>>>>>>>   osd_op(client.250874.0:1388
>>       rbd_data.3380e2ae8944a.0000000000000545
>>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>>       4194304,write
>>       > >>> >>>>>>>> 0~4194304] 8.bbf3e8ff
>>       ack+ondisk+write+known_if_redirected e56785)
>>       > >>> >>>>>>>>   currently waiting for subops from 13,16
>>       > >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16
>>       192.168.55.13:6800/29410 7 : cluster
>>       > >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest
>>       blocked for >
>>       > >>> >>>>>>>> 30.379680 secs
>>       > >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16
>>       192.168.55.13:6800/29410 8 : cluster
>>       > >>> >>>>>>>> [WRN] slow request 30.291520 seconds old,
>>       received at 2015-09-22
>>       > >>> >>>>>>>> 12:55:06.406303:
>>       > >>> >>>>>>>>   osd_op(client.250874.0:1384
>>       rbd_data.3380e2ae8944a.0000000000000541
>>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>>       4194304,write
>>       > >>> >>>>>>>> 0~4194304] 8.5fb2123f
>>       ack+ondisk+write+known_if_redirected e56785)
>>       > >>> >>>>>>>>   currently waiting for subops from 13,17
>>       > >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16
>>       192.168.55.13:6800/29410 9 : cluster
>>       > >>> >>>>>>>> [WRN] slow request 30.379680 seconds old,
>>       received at 2015-09-22
>>       > >>> >>>>>>>> 12:55:06.318144:
>>       > >>> >>>>>>>>   osd_op(client.250874.0:1382
>>       rbd_data.3380e2ae8944a.000000000000053f
>>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>>       4194304,write
>>       > >>> >>>>>>>> 0~4194304] 8.312e69ca
>>       ack+ondisk+write+known_if_redirected e56785)
>>       > >>> >>>>>>>>   currently waiting for subops from 13,14
>>       > >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13
>>       192.168.55.12:6804/4574 130 :
>>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
>>       oldest blocked for >
>>       > >>> >>>>>>>> 30.954212 secs
>>       > >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13
>>       192.168.55.12:6804/4574 131 :
>>       > >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old,
>>       received at
>>       > >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>>       > >>> >>>>>>>>   osd_op(client.250874.0:1873
>>       rbd_data.3380e2ae8944a.000000000000070d
>>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>>       4194304,write
>>       > >>> >>>>>>>> 0~4194304] 8.e69870d4
>>       ack+ondisk+write+known_if_redirected e56785)
>>       > >>> >>>>>>>>   currently waiting for subops from 16,17
>>       > >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16
>>       192.168.55.13:6800/29410 10 :
>>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
>>       oldest blocked for >
>>       > >>> >>>>>>>> 30.704367 secs
>>       > >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16
>>       192.168.55.13:6800/29410 11 :
>>       > >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old,
>>       received at
>>       > >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>>       > >>> >>>>>>>>   osd_op(client.250874.0:1874
>>       rbd_data.3380e2ae8944a.000000000000070e
>>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>>       4194304,write
>>       > >>> >>>>>>>> 0~4194304] 8.f7635819
>>       ack+ondisk+write+known_if_redirected e56785)
>>       > >>> >>>>>>>>   currently waiting for subops from 13,17
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> Server   IP addr              OSD
>>       > >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>>       > >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>>       > >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>>       > >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>>       > >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>>       > >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> fio job:
>>       > >>> >>>>>>>> [rbd-test]
>>       > >>> >>>>>>>> readwrite=write
>>       > >>> >>>>>>>> blocksize=4M
>>       > >>> >>>>>>>> #runtime=60
>>       > >>> >>>>>>>> name=rbd-test
>>       > >>> >>>>>>>> #readwrite=randwrite
>>       > >>> >>>>>>>>
>>       #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>       > >>> >>>>>>>> #rwmixread=72
>>       > >>> >>>>>>>> #norandommap
>>       > >>> >>>>>>>> #size=1T
>>       > >>> >>>>>>>> #blocksize=4k
>>       > >>> >>>>>>>> ioengine=rbd
>>       > >>> >>>>>>>> rbdname=test2
>>       > >>> >>>>>>>> pool=rbd
>>       > >>> >>>>>>>> clientname=admin
>>       > >>> >>>>>>>> iodepth=8
>>       > >>> >>>>>>>> #numjobs=4
>>       > >>> >>>>>>>> #thread
>>       > >>> >>>>>>>> #group_reporting
>>       > >>> >>>>>>>> #time_based
>>       > >>> >>>>>>>> #direct=1
>>       > >>> >>>>>>>> #ramp_time=60
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> Thanks,
>>       > >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>>       > >>> >>>>>>>> Version: Mailvelope v1.1.0
>>       > >>> >>>>>>>> Comment: https://www.mailvelope.com
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>>
>>       wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>       > >>> >>>>>>>>
>>       tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>       > >>> >>>>>>>>
>>       h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>       > >>> >>>>>>>>
>>       903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>       > >>> >>>>>>>>
>>       sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>       > >>> >>>>>>>>
>>       FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>       > >>> >>>>>>>>
>>       pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>       > >>> >>>>>>>>
>>       5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>       > >>> >>>>>>>>
>>       B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>       > >>> >>>>>>>>
>>       4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>       > >>> >>>>>>>>
>>       o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>       > >>> >>>>>>>>
>>       gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>       > >>> >>>>>>>> J3hS
>>       > >>> >>>>>>>> =0J7F
>>       > >>> >>>>>>>> -----END PGP SIGNATURE-----
>>       > >>> >>>>>>>> ----------------
>>       > >>> >>>>>>>> Robert LeBlanc
>>       > >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E
>>       E654 3BB2 FA62 B9F1
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum
>>       wrote:
>>       > >>> >>>>>>>>>
>>       > >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc
>>       wrote:
>>       > >>> >>>>>>>>>>
>>       > >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>       > >>> >>>>>>>>>> Hash: SHA256
>>       > >>> >>>>>>>>>>
>>       > >>> >>>>>>>>>> Is there some way to tell in the logs that this
>>       is happening?
>>       > >>> >>>>>>>>>
>>       > >>> >>>>>>>>>
>>       > >>> >>>>>>>>> You can search for the (mangled) name
>>       _split_collection
>>       > >>> >>>>>>>>>>
>>       > >>> >>>>>>>>>> I'm not
>>       > >>> >>>>>>>>>> seeing much I/O, CPU usage during these times.
>>       Is there some way to
>>       > >>> >>>>>>>>>> prevent the splitting? Is there a negative side
>>       effect to doing so?
>>       > >>> >>>>>>>>>
>>       > >>> >>>>>>>>>
>>       > >>> >>>>>>>>> Bump up the split and merge thresholds. You can
>>       search the list for
>>       > >>> >>>>>>>>> this, it was discussed not too long ago.
>>       > >>> >>>>>>>>>
>>       > >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as
>>       soon as the sessions
>>       > >>> >>>>>>>>>> are aborted, they are reestablished and
>>       complete immediately.
>>       > >>> >>>>>>>>>>
>>       > >>> >>>>>>>>>> The fio test is just a seq write, starting it
>>       over (rewriting from
>>       > >>> >>>>>>>>>> the
>>       > >>> >>>>>>>>>> beginning) is still causing the issue. I was
>>       suspect that it is not
>>       > >>> >>>>>>>>>> having to create new file and therefore split
>>       collections. This is
>>       > >>> >>>>>>>>>> on
>>       > >>> >>>>>>>>>> my test cluster with no other load.
>>       > >>> >>>>>>>>>
>>       > >>> >>>>>>>>>
>>       > >>> >>>>>>>>> Hmm, that does make it seem less likely if
>>       you're really not creating
>>       > >>> >>>>>>>>> new objects, if you're actually running fio in
>>       such a way that it's
>>       > >>> >>>>>>>>> not allocating new FS blocks (this is probably
>>       hard to set up?).
>>       > >>> >>>>>>>>>
>>       > >>> >>>>>>>>>>
>>       > >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log
>>       options and depths
>>       > >>> >>>>>>>>>> would be the most helpful for tracking this
>>       issue down?
>>       > >>> >>>>>>>>>
>>       > >>> >>>>>>>>>
>>       > >>> >>>>>>>>> If you want to go log diving "debug osd = 20",
>>       "debug filestore =
>>       > >>> >>>>>>>>> 20",
>>       > >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to
>>       see. That should spit
>>       > >>> >>>>>>>>> out
>>       > >>> >>>>>>>>> everything you need to track exactly what each
>>       Op is doing.
>>       > >>> >>>>>>>>> -Greg
>>       > >>> >>>>>>>>
>>       > >>> >>>>>>>> --
>>       > >>> >>>>>>>> To unsubscribe from this list: send the line
>>       "unsubscribe ceph-devel"
>>       > >>> >>>>>>>> in
>>       > >>> >>>>>>>> the body of a message to
>>       majordomo@vger.kernel.org
>>       > >>> >>>>>>>> More majordomo info at
>>       http://vger.kernel.org/majordomo-info.html
>>       > >>> >>>>>>>
>>       > >>> >>>>>>>
>>       > >>> >>>>>>>
>>       > >>> >>>>>
>>       > >>> >>>>> -----BEGIN PGP SIGNATURE-----
>>       > >>> >>>>> Version: Mailvelope v1.1.0
>>       > >>> >>>>> Comment: https://www.mailvelope.com
>>       > >>> >>>>>
>>       > >>> >>>>>
>>       wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>       > >>> >>>>>
>>       a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>       > >>> >>>>>
>>       a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>       > >>> >>>>>
>>       s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>       > >>> >>>>>
>>       iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>       > >>> >>>>>
>>       izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>       > >>> >>>>>
>>       caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>       > >>> >>>>>
>>       efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>       > >>> >>>>>
>>       GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>       > >>> >>>>>
>>       glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>       > >>> >>>>>
>>       +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>       > >>> >>>>>
>>       pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>       > >>> >>>>> gcZm
>>       > >>> >>>>> =CjwB
>>       > >>> >>>>> -----END PGP SIGNATURE-----
>>       > >>> >>>>
>>       > >>> >>>> --
>>       > >>> >>>> To unsubscribe from this list: send the line
>>       "unsubscribe ceph-devel" in
>>       > >>> >>>> the body of a message to majordomo@vger.kernel.org
>>       > >>> >>>> More majordomo info at
>>       http://vger.kernel.org/majordomo-info.html
>>       > >>> >>>>
>>       > >>> >>>
>>       > >>> >>
>>       > >>> >> -----BEGIN PGP SIGNATURE-----
>>       > >>> >> Version: Mailvelope v1.1.0
>>       > >>> >> Comment: https://www.mailvelope.com
>>       > >>> >>
>>       > >>> >>
>>       wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>>       > >>> >>
>>       S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>>       > >>> >>
>>       lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>>       > >>> >>
>>       0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>>       > >>> >>
>>       JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>>       > >>> >>
>>       dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>>       > >>> >>
>>       nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>>       > >>> >>
>>       krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>>       > >>> >>
>>       FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>>       > >>> >>
>>       tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>>       > >>> >>
>>       hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>>       > >>> >>
>>       BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>>       > >>> >> ae22
>>       > >>> >> =AX+L
>>       > >>> >> -----END PGP SIGNATURE-----
>>       > >>> _______________________________________________
>>       > >>> ceph-users mailing list
>>       > >>> ceph-users@lists.ceph.com
>>       > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>       > >>>
>>       > >>>
>>       > _______________________________________________
>>       > ceph-users mailing list
>>       > ceph-users@lists.ceph.com
>>       > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>       >
>>       >
>>
>>
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                               ` <CAANLjFpEPQbvpnMREu-kcPORK28V1CdWBe7655wHp-74AwwQUg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-06 16:19                                                                                 ` Sage Weil
  2015-10-06 17:47                                                                                   ` [ceph-users] " Robert LeBlanc
  0 siblings, 1 reply; 45+ messages in thread
From: Sage Weil @ 2015-10-06 16:19 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-devel, ceph-users-idqoXFIVOFJgJs9I8MT0rw

On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> I downgraded to the hammer gitbuilder branch, but it looks like I've
> passed the point of no return:
> 
> 2015-10-06 09:44:52.210873 7fd3dd8b78c0 -1 ERROR: on disk data
> includes unsupported features:
> compat={},rocompat={},incompat={7=support shec erasure code}
> 2015-10-06 09:44:52.210922 7fd3dd8b78c0 -1 error checking features:
> (1) Operation not permitted

In that case, mark all osds down, upgrade again, and they'll be 
allowed to start.  The restriction is that each osd can't go backwards, 
and post-hammer osds can't talk to pre-hammer osds.

sage

> 
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Tue, Oct 6, 2015 at 8:38 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> >> Thanks for your time Sage. It sounds like a few people may be helped if you
> >> can find something.
> >>
> >> I did a recursive chown as in the instructions (although I didn't know about
> >> the doc at the time). I did an osd debug at 20/20 but didn't see anything.
> >> I'll also do ms and make the logs available. I'll also review the document
> >> to make sure I didn't miss anything else.
> >
> > Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build)
> > first.  They won't be allowed to boot until that happens... all upgrades
> > must stop at 0.94.4 first.  And that isn't released yet.. we'll try to
> > do that today.  In the meantime, you can use the hammer gitbuilder
> > build...
> >
> > sage
> >
> >
> >>
> >> Robert LeBlanc
> >>
> >> Sent from a mobile device please excuse any typos.
> >>
> >> On Oct 6, 2015 6:37 AM, "Sage Weil" <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >>       On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> >>       > -----BEGIN PGP SIGNED MESSAGE-----
> >>       > Hash: SHA256
> >>       >
> >>       > With some off-list help, we have adjusted
> >>       > osd_client_message_cap=10000. This seems to have helped a bit
> >>       and we
> >>       > have seen some OSDs have a value up to 4,000 for client
> >>       messages. But
> >>       > it does not solve the problem with the blocked I/O.
> >>       >
> >>       > One thing that I have noticed is that almost exactly 30
> >>       seconds elapse
> >>       > between an OSD boots and the first blocked I/O message. I
> >>       don't know
> >>       > if the OSD doesn't have time to get it's brain right about a
> >>       PG before
> >>       > it starts servicing it or what exactly.
> >>
> >>       I'm downloading the logs from yesterday now; sorry it's taking
> >>       so long.
> >>
> >>       > On another note, I tried upgrading our CentOS dev cluster from
> >>       Hammer
> >>       > to master and things didn't go so well. The OSDs would not
> >>       start
> >>       > because /var/lib/ceph was not owned by ceph. I chowned the
> >>       directory
> >>       > and all OSDs and the OSD then started, but never became active
> >>       in the
> >>       > cluster. It just sat there after reading all the PGs. There
> >>       were
> >>       > sockets open to the monitor, but no OSD to OSD sockets. I
> >>       tried
> >>       > downgrading to the Infernalis branch and still no luck getting
> >>       the
> >>       > OSDs to come up. The OSD processes were idle after the initial
> >>       boot.
> >>       > All packages were installed from gitbuilder.
> >>
> >>       Did you chown -R ?
> >>
> >>              https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgradin
> >>       g-from-hammer
> >>
> >>       My guess is you only chowned the root dir, and the OSD didn't
> >>       throw
> >>       an error when it encountered the other files?  If you can
> >>       generate a debug
> >>       osd = 20 log, that would be helpful.. thanks!
> >>
> >>       sage
> >>
> >>
> >>       >
> >>       > Thanks,
> >>       > -----BEGIN PGP SIGNATURE-----
> >>       > Version: Mailvelope v1.2.0
> >>       > Comment: https://www.mailvelope.com
> >>       >
> >>       > wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
> >>       > YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
> >>       > 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
> >>       > aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
> >>       > y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
> >>       > 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
> >>       > ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
> >>       > zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
> >>       > D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
> >>       > CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
> >>       > 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
> >>       > fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
> >>       > GdXC
> >>       > =Aigq
> >>       > -----END PGP SIGNATURE-----
> >>       > ----------------
> >>       > Robert LeBlanc
> >>       > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62
> >>       B9F1
> >>       >
> >>       >
> >>       > On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc
> >>       <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >>       > > -----BEGIN PGP SIGNED MESSAGE-----
> >>       > > Hash: SHA256
> >>       > >
> >>       > > I have eight nodes running the fio job rbd_test_real to
> >>       different RBD
> >>       > > volumes. I've included the CRUSH map in the tarball.
> >>       > >
> >>       > > I stopped one OSD process and marked it out. I let it
> >>       recover for a
> >>       > > few minutes and then I started the process again and marked
> >>       it in. I
> >>       > > started getting block I/O messages during the recovery.
> >>       > >
> >>       > > The logs are located at
> >>       http://162.144.87.113/files/ushou1.tar.xz
> >>       > >
> >>       > > Thanks,
> >>       > > -----BEGIN PGP SIGNATURE-----
> >>       > > Version: Mailvelope v1.2.0
> >>       > > Comment: https://www.mailvelope.com
> >>       > >
> >>       > > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
> >>       > > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
> >>       > > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
> >>       > > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
> >>       > > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
> >>       > > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
> >>       > > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
> >>       > > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
> >>       > > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
> >>       > > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
> >>       > > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
> >>       > > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
> >>       > > 3EPx
> >>       > > =UDIV
> >>       > > -----END PGP SIGNATURE-----
> >>       > >
> >>       > > ----------------
> >>       > > Robert LeBlanc
> >>       > > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
> >>       FA62 B9F1
> >>       > >
> >>       > >
> >>       > > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>       wrote:
> >>       > >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
> >>       > >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>       > >>> Hash: SHA256
> >>       > >>>
> >>       > >>> We are still struggling with this and have tried a lot of
> >>       different
> >>       > >>> things. Unfortunately, Inktank (now Red Hat) no longer
> >>       provides
> >>       > >>> consulting services for non-Red Hat systems. If there are
> >>       some
> >>       > >>> certified Ceph consultants in the US that we can do both
> >>       remote and
> >>       > >>> on-site engagements, please let us know.
> >>       > >>>
> >>       > >>> This certainly seems to be network related, but somewhere
> >>       in the
> >>       > >>> kernel. We have tried increasing the network and TCP
> >>       buffers, number
> >>       > >>> of TCP sockets, reduced the FIN_WAIT2 state. There is
> >>       about 25% idle
> >>       > >>> on the boxes, the disks are busy, but not constantly at
> >>       100% (they
> >>       > >>> cycle from <10% up to 100%, but not 100% for more than a
> >>       few seconds
> >>       > >>> at a time). There seems to be no reasonable explanation
> >>       why I/O is
> >>       > >>> blocked pretty frequently longer than 30 seconds. We have
> >>       verified
> >>       > >>> Jumbo frames by pinging from/to each node with 9000 byte
> >>       packets. The
> >>       > >>> network admins have verified that packets are not being
> >>       dropped in the
> >>       > >>> switches for these nodes. We have tried different kernels
> >>       including
> >>       > >>> the recent Google patch to cubic. This is showing up on
> >>       three cluster
> >>       > >>> (two Ethernet and one IPoIB). I booted one cluster into
> >>       Debian Jessie
> >>       > >>> (from CentOS 7.1) with similar results.
> >>       > >>>
> >>       > >>> The messages seem slightly different:
> >>       > >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425
> >>       439 :
> >>       > >>> cluster [WRN] 14 slow requests, 1 included below; oldest
> >>       blocked for >
> >>       > >>> 100.087155 secs
> >>       > >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425
> >>       440 :
> >>       > >>> cluster [WRN] slow request 30.041999 seconds old, received
> >>       at
> >>       > >>> 2015-10-03 14:37:53.151014:
> >>       osd_op(client.1328605.0:7082862
> >>       > >>> rbd_data.13fdcb2ae8944a.000000000001264f [read
> >>       975360~4096]
> >>       > >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently
> >>       no flag
> >>       > >>> points reached
> >>       > >>>
> >>       > >>> I don't know what "no flag points reached" means.
> >>       > >>
> >>       > >> Just that the op hasn't been marked as reaching any
> >>       interesting points
> >>       > >> (op->mark_*() calls).
> >>       > >>
> >>       > >> Is it possible to gather a lot with debug ms = 20 and debug
> >>       osd = 20?
> >>       > >> It's extremely verbose but it'll let us see where the op is
> >>       getting
> >>       > >> blocked.  If you see the "slow request" message it means
> >>       the op in
> >>       > >> received by ceph (that's when the clock starts), so I
> >>       suspect it's not
> >>       > >> something we can blame on the network stack.
> >>       > >>
> >>       > >> sage
> >>       > >>
> >>       > >>
> >>       > >>>
> >>       > >>> The problem is most pronounced when we have to reboot an
> >>       OSD node (1
> >>       > >>> of 13), we will have hundreds of I/O blocked for some
> >>       times up to 300
> >>       > >>> seconds. It takes a good 15 minutes for things to settle
> >>       down. The
> >>       > >>> production cluster is very busy doing normally 8,000 I/O
> >>       and peaking
> >>       > >>> at 15,000. This is all 4TB spindles with SSD journals and
> >>       the disks
> >>       > >>> are between 25-50% full. We are currently splitting PGs to
> >>       distribute
> >>       > >>> the load better across the disks, but we are having to do
> >>       this 10 PGs
> >>       > >>> at a time as we get blocked I/O. We have max_backfills and
> >>       > >>> max_recovery set to 1, client op priority is set higher
> >>       than recovery
> >>       > >>> priority. We tried increasing the number of op threads but
> >>       this didn't
> >>       > >>> seem to help. It seems as soon as PGs are finished being
> >>       checked, they
> >>       > >>> become active and could be the cause for slow I/O while
> >>       the other PGs
> >>       > >>> are being checked.
> >>       > >>>
> >>       > >>> What I don't understand is that the messages are delayed.
> >>       As soon as
> >>       > >>> the message is received by Ceph OSD process, it is very
> >>       quickly
> >>       > >>> committed to the journal and a response is sent back to
> >>       the primary
> >>       > >>> OSD which is received very quickly as well. I've adjust
> >>       > >>> min_free_kbytes and it seems to keep the OSDs from
> >>       crashing, but
> >>       > >>> doesn't solve the main problem. We don't have swap and
> >>       there is 64 GB
> >>       > >>> of RAM per nodes for 10 OSDs.
> >>       > >>>
> >>       > >>> Is there something that could cause the kernel to get a
> >>       packet but not
> >>       > >>> be able to dispatch it to Ceph such that it could be
> >>       explaining why we
> >>       > >>> are seeing these blocked I/O for 30+ seconds. Is there
> >>       some pointers
> >>       > >>> to tracing Ceph messages from the network buffer through
> >>       the kernel to
> >>       > >>> the Ceph process?
> >>       > >>>
> >>       > >>> We can really use some pointers no matter how outrageous.
> >>       We've have
> >>       > >>> over 6 people looking into this for weeks now and just
> >>       can't think of
> >>       > >>> anything else.
> >>       > >>>
> >>       > >>> Thanks,
> >>       > >>> -----BEGIN PGP SIGNATURE-----
> >>       > >>> Version: Mailvelope v1.1.0
> >>       > >>> Comment: https://www.mailvelope.com
> >>       > >>>
> >>       > >>>
> >>       wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
> >>       > >>>
> >>       NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
> >>       > >>>
> >>       prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
> >>       > >>>
> >>       K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
> >>       > >>>
> >>       h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
> >>       > >>>
> >>       iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
> >>       > >>>
> >>       Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
> >>       > >>>
> >>       Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
> >>       > >>>
> >>       JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
> >>       > >>>
> >>       8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
> >>       > >>>
> >>       lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
> >>       > >>>
> >>       4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
> >>       > >>> l7OF
> >>       > >>> =OI++
> >>       > >>> -----END PGP SIGNATURE-----
> >>       > >>> ----------------
> >>       > >>> Robert LeBlanc
> >>       > >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
> >>       FA62 B9F1
> >>       > >>>
> >>       > >>>
> >>       > >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc
> >>       <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >>       > >>> > We dropped the replication on our cluster from 4 to 3
> >>       and it looks
> >>       > >>> > like all the blocked I/O has stopped (no entries in the
> >>       log for the
> >>       > >>> > last 12 hours). This makes me believe that there is some
> >>       issue with
> >>       > >>> > the number of sockets or some other TCP issue. We have
> >>       not messed with
> >>       > >>> > Ephemeral ports and TIME_WAIT at this point. There are
> >>       130 OSDs, 8 KVM
> >>       > >>> > hosts hosting about 150 VMs. Open files is set at 32K
> >>       for the OSD
> >>       > >>> > processes and 16K system wide.
> >>       > >>> >
> >>       > >>> > Does this seem like the right spot to be looking? What
> >>       are some
> >>       > >>> > configuration items we should be looking at?
> >>       > >>> >
> >>       > >>> > Thanks,
> >>       > >>> > ----------------
> >>       > >>> > Robert LeBlanc
> >>       > >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
> >>       FA62 B9F1
> >>       > >>> >
> >>       > >>> >
> >>       > >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc
> >>       <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >>       > >>> >> -----BEGIN PGP SIGNED MESSAGE-----
> >>       > >>> >> Hash: SHA256
> >>       > >>> >>
> >>       > >>> >> We were able to only get ~17Gb out of the XL710
> >>       (heavily tweaked)
> >>       > >>> >> until we went to the 4.x kernel where we got ~36Gb (no
> >>       tweaking). It
> >>       > >>> >> seems that there were some major reworks in the network
> >>       handling in
> >>       > >>> >> the kernel to efficiently handle that network rate. If
> >>       I remember
> >>       > >>> >> right we also saw a drop in CPU utilization. I'm
> >>       starting to think
> >>       > >>> >> that we did see packet loss while congesting our ISLs
> >>       in our initial
> >>       > >>> >> testing, but we could not tell where the dropping was
> >>       happening. We
> >>       > >>> >> saw some on the switches, but it didn't seem to be bad
> >>       if we weren't
> >>       > >>> >> trying to congest things. We probably already saw this
> >>       issue, just
> >>       > >>> >> didn't know it.
> >>       > >>> >> - ----------------
> >>       > >>> >> Robert LeBlanc
> >>       > >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
> >>       3BB2 FA62 B9F1
> >>       > >>> >>
> >>       > >>> >>
> >>       > >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> >>       > >>> >>> FWIW, we've got some 40GbE Intel cards in the
> >>       community performance cluster
> >>       > >>> >>> on a Mellanox 40GbE switch that appear (knock on wood)
> >>       to be running fine
> >>       > >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback
> >>       from Intel that older
> >>       > >>> >>> drivers might cause problems though.
> >>       > >>> >>>
> >>       > >>> >>> Here's ifconfig from one of the nodes:
> >>       > >>> >>>
>>       > >>> >>> ens513f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >>       > >>> >>>         inet 10.0.10.101  netmask 255.255.255.0
> >>       broadcast 10.0.10.255
> >>       > >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64
>>       scopeid 0x20<link>
> >>       > >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000
> >>       (Ethernet)
> >>       > >>> >>>         RX packets 169232242875  bytes 229346261232279
> >>       (208.5 TiB)
> >>       > >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
> >>       > >>> >>>         TX packets 153491686361  bytes 203976410836881
> >>       (185.5 TiB)
> >>       > >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0
> >>       collisions 0
> >>       > >>> >>>
> >>       > >>> >>> Mark
> >>       > >>> >>>
> >>       > >>> >>>
> >>       > >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >>       > >>> >>>>
> >>       > >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>       > >>> >>>> Hash: SHA256
> >>       > >>> >>>>
> >>       > >>> >>>> OK, here is the update on the saga...
> >>       > >>> >>>>
> >>       > >>> >>>> I traced some more of blocked I/Os and it seems that
> >>       communication
> >>       > >>> >>>> between two hosts seemed worse than others. I did a
> >>       two way ping flood
> >>       > >>> >>>> between the two hosts using max packet sizes (1500).
> >>       After 1.5M
> >>       > >>> >>>> packets, no lost pings. Then then had the ping flood
> >>       running while I
> >>       > >>> >>>> put Ceph load on the cluster and the dropped pings
> >>       started increasing
> >>       > >>> >>>> after stopping the Ceph workload the pings stopped
> >>       dropping.
> >>       > >>> >>>>
> >>       > >>> >>>> I then ran iperf between all the nodes with the same
> >>       results, so that
> >>       > >>> >>>> ruled out Ceph to a large degree. I then booted in
> >>       the the
> >>       > >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour
> >>       test so far there
> >>       > >>> >>>> hasn't been any dropped pings or blocked I/O. Our 40
> >>       Gb NICs really
> >>       > >>> >>>> need the network enhancements in the 4.x series to
> >>       work well.
> >>       > >>> >>>>
> >>       > >>> >>>> Does this sound familiar to anyone? I'll probably
> >>       start bisecting the
> >>       > >>> >>>> kernel to see where this issue in introduced. Both of
> >>       the clusters
> >>       > >>> >>>> with this issue are running 4.x, other than that,
> >>       they are pretty
> >>       > >>> >>>> differing hardware and network configs.
> >>       > >>> >>>>
> >>       > >>> >>>> Thanks,
> >>       > >>> >>>> -----BEGIN PGP SIGNATURE-----
> >>       > >>> >>>> Version: Mailvelope v1.1.0
> >>       > >>> >>>> Comment: https://www.mailvelope.com
> >>       > >>> >>>>
> >>       > >>> >>>>
> >>       wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
> >>       > >>> >>>>
> >>       RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
> >>       > >>> >>>>
> >>       AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
> >>       > >>> >>>>
> >>       7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
> >>       > >>> >>>>
> >>       cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
> >>       > >>> >>>>
> >>       F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
> >>       > >>> >>>>
> >>       byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
> >>       > >>> >>>>
> >>       /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
> >>       > >>> >>>>
> >>       LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
> >>       > >>> >>>>
> >>       UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
> >>       > >>> >>>>
> >>       sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
> >>       > >>> >>>>
> >>       KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
> >>       > >>> >>>> 4OEo
> >>       > >>> >>>> =P33I
> >>       > >>> >>>> -----END PGP SIGNATURE-----
> >>       > >>> >>>> ----------------
> >>       > >>> >>>> Robert LeBlanc
> >>       > >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
> >>       3BB2 FA62 B9F1
> >>       > >>> >>>>
> >>       > >>> >>>>
> >>       > >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> >>       > >>> >>>> wrote:
> >>       > >>> >>>>>
> >>       > >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>       > >>> >>>>> Hash: SHA256
> >>       > >>> >>>>>
> >>       > >>> >>>>> This is IPoIB and we have the MTU set to 64K. There
> >>       was some issues
> >>       > >>> >>>>> pinging hosts with "No buffer space available"
> >>       (hosts are currently
> >>       > >>> >>>>> configured for 4GB to test SSD caching rather than
> >>       page cache). I
> >>       > >>> >>>>> found that MTU under 32K worked reliable for ping,
> >>       but still had the
> >>       > >>> >>>>> blocked I/O.
> >>       > >>> >>>>>
> >>       > >>> >>>>> I reduced the MTU to 1500 and checked pings (OK),
> >>       but I'm still seeing
> >>       > >>> >>>>> the blocked I/O.
> >>       > >>> >>>>> - ----------------
> >>       > >>> >>>>> Robert LeBlanc
> >>       > >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
> >>       3BB2 FA62 B9F1
> >>       > >>> >>>>>
> >>       > >>> >>>>>
> >>       > >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> >>       > >>> >>>>>>
> >>       > >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >>       > >>> >>>>>>>
> >>       > >>> >>>>>>> I looked at the logs, it looks like there was a 53
> >>       second delay
> >>       > >>> >>>>>>> between when osd.17 started sending the osd_repop
> >>       message and when
> >>       > >>> >>>>>>> osd.13 started reading it, which is pretty weird.
> >>       Sage, didn't we
> >>       > >>> >>>>>>> once see a kernel issue which caused some messages
> >>       to be mysteriously
> >>       > >>> >>>>>>> delayed for many 10s of seconds?
> >>       > >>> >>>>>>
> >>       > >>> >>>>>>
> >>       > >>> >>>>>> Every time we have seen this behavior and diagnosed
> >>       it in the wild it
> >>       > >>> >>>>>> has
> >>       > >>> >>>>>> been a network misconfiguration.  Usually related
> >>       to jumbo frames.
> >>       > >>> >>>>>>
> >>       > >>> >>>>>> sage
> >>       > >>> >>>>>>
> >>       > >>> >>>>>>
> >>       > >>> >>>>>>>
> >>       > >>> >>>>>>> What kernel are you running?
> >>       > >>> >>>>>>> -Sam
> >>       > >>> >>>>>>>
> >>       > >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc
> >>       wrote:
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>       > >>> >>>>>>>> Hash: SHA256
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> OK, looping in ceph-devel to see if I can get
> >>       some more eyes. I've
> >>       > >>> >>>>>>>> extracted what I think are important entries from
> >>       the logs for the
> >>       > >>> >>>>>>>> first blocked request. NTP is running all the
> >>       servers so the logs
> >>       > >>> >>>>>>>> should be close in terms of time. Logs for 12:50
> >>       to 13:00 are
> >>       > >>> >>>>>>>> available at
> >>       http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from
> >>       client
> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O
> >>       to osd.13
> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O
> >>       to osd.16
> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from
> >>       osd.17
> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk
> >>       result=0 from osd.16
> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to
> >>       osd.17 ondisk result=0
> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow
> >>       I/O > 30.439150 sec
> >>       > >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from
> >>       osd.17
> >>       > >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk
> >>       result=0 from osd.13
> >>       > >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to
> >>       osd.17 ondisk result=0
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> In the logs I can see that osd.17 dispatches the
> >>       I/O to osd.13 and
> >>       > >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get
> >>       the I/O right away,
> >>       > >>> >>>>>>>> but for some reason osd.13 doesn't get the
> >>       message until 53 seconds
> >>       > >>> >>>>>>>> later. osd.17 seems happy to just wait and
> >>       doesn't resend the data
> >>       > >>> >>>>>>>> (well, I'm not 100% sure how to tell which
> >>       entries are the actual data
> >>       > >>> >>>>>>>> transfer).
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> It looks like osd.17 is receiving responses to
> >>       start the communication
> >>       > >>> >>>>>>>> with osd.13, but the op is not acknowledged until
> >>       almost a minute
> >>       > >>> >>>>>>>> later. To me it seems that the message is getting
> >>       received but not
> >>       > >>> >>>>>>>> passed to another thread right away or something.
> >>       This test was done
> >>       > >>> >>>>>>>> with an idle cluster, a single fio client (rbd
> >>       engine) with a single
> >>       > >>> >>>>>>>> thread.
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> The OSD servers are almost 100% idle during these
> >>       blocked I/O
> >>       > >>> >>>>>>>> requests. I think I'm at the end of my
> >>       troubleshooting, so I can use
> >>       > >>> >>>>>>>> some help.
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> Single Test started about
> >>       > >>> >>>>>>>> 2015-09-22 12:52:36
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17
> >>       192.168.55.14:6800/16726 56 :
> >>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
> >>       oldest blocked for >
> >>       > >>> >>>>>>>> 30.439150 secs
> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17
> >>       192.168.55.14:6800/16726 57 :
> >>       > >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old,
> >>       received at
> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.487451:
> >>       > >>> >>>>>>>>   osd_op(client.250874.0:1388
> >>       rbd_data.3380e2ae8944a.0000000000000545
> >>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
> >>       4194304,write
> >>       > >>> >>>>>>>> 0~4194304] 8.bbf3e8ff
> >>       ack+ondisk+write+known_if_redirected e56785)
> >>       > >>> >>>>>>>>   currently waiting for subops from 13,16
> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16
> >>       192.168.55.13:6800/29410 7 : cluster
> >>       > >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest
> >>       blocked for >
> >>       > >>> >>>>>>>> 30.379680 secs
> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16
> >>       192.168.55.13:6800/29410 8 : cluster
> >>       > >>> >>>>>>>> [WRN] slow request 30.291520 seconds old,
> >>       received at 2015-09-22
> >>       > >>> >>>>>>>> 12:55:06.406303:
> >>       > >>> >>>>>>>>   osd_op(client.250874.0:1384
> >>       rbd_data.3380e2ae8944a.0000000000000541
> >>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
> >>       4194304,write
> >>       > >>> >>>>>>>> 0~4194304] 8.5fb2123f
> >>       ack+ondisk+write+known_if_redirected e56785)
> >>       > >>> >>>>>>>>   currently waiting for subops from 13,17
> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16
> >>       192.168.55.13:6800/29410 9 : cluster
> >>       > >>> >>>>>>>> [WRN] slow request 30.379680 seconds old,
> >>       received at 2015-09-22
> >>       > >>> >>>>>>>> 12:55:06.318144:
> >>       > >>> >>>>>>>>   osd_op(client.250874.0:1382
> >>       rbd_data.3380e2ae8944a.000000000000053f
> >>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
> >>       4194304,write
> >>       > >>> >>>>>>>> 0~4194304] 8.312e69ca
> >>       ack+ondisk+write+known_if_redirected e56785)
> >>       > >>> >>>>>>>>   currently waiting for subops from 13,14
> >>       > >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13
> >>       192.168.55.12:6804/4574 130 :
> >>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
> >>       oldest blocked for >
> >>       > >>> >>>>>>>> 30.954212 secs
> >>       > >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13
> >>       192.168.55.12:6804/4574 131 :
> >>       > >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old,
> >>       received at
> >>       > >>> >>>>>>>> 2015-09-22 12:57:33.044003:
> >>       > >>> >>>>>>>>   osd_op(client.250874.0:1873
> >>       rbd_data.3380e2ae8944a.000000000000070d
> >>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
> >>       4194304,write
> >>       > >>> >>>>>>>> 0~4194304] 8.e69870d4
> >>       ack+ondisk+write+known_if_redirected e56785)
> >>       > >>> >>>>>>>>   currently waiting for subops from 16,17
> >>       > >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16
> >>       192.168.55.13:6800/29410 10 :
> >>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
> >>       oldest blocked for >
> >>       > >>> >>>>>>>> 30.704367 secs
> >>       > >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16
> >>       192.168.55.13:6800/29410 11 :
> >>       > >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old,
> >>       received at
> >>       > >>> >>>>>>>> 2015-09-22 12:57:33.055404:
> >>       > >>> >>>>>>>>   osd_op(client.250874.0:1874
> >>       rbd_data.3380e2ae8944a.000000000000070e
> >>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
> >>       4194304,write
> >>       > >>> >>>>>>>> 0~4194304] 8.f7635819
> >>       ack+ondisk+write+known_if_redirected e56785)
> >>       > >>> >>>>>>>>   currently waiting for subops from 13,17
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> Server   IP addr              OSD
> >>       > >>> >>>>>>>> nodev  - 192.168.55.11 - 12
> >>       > >>> >>>>>>>> nodew  - 192.168.55.12 - 13
> >>       > >>> >>>>>>>> nodex  - 192.168.55.13 - 16
> >>       > >>> >>>>>>>> nodey  - 192.168.55.14 - 17
> >>       > >>> >>>>>>>> nodez  - 192.168.55.15 - 14
> >>       > >>> >>>>>>>> nodezz - 192.168.55.16 - 15
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> fio job:
> >>       > >>> >>>>>>>> [rbd-test]
> >>       > >>> >>>>>>>> readwrite=write
> >>       > >>> >>>>>>>> blocksize=4M
> >>       > >>> >>>>>>>> #runtime=60
> >>       > >>> >>>>>>>> name=rbd-test
> >>       > >>> >>>>>>>> #readwrite=randwrite
> >>       > >>> >>>>>>>>
> >>       #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> >>       > >>> >>>>>>>> #rwmixread=72
> >>       > >>> >>>>>>>> #norandommap
> >>       > >>> >>>>>>>> #size=1T
> >>       > >>> >>>>>>>> #blocksize=4k
> >>       > >>> >>>>>>>> ioengine=rbd
> >>       > >>> >>>>>>>> rbdname=test2
> >>       > >>> >>>>>>>> pool=rbd
> >>       > >>> >>>>>>>> clientname=admin
> >>       > >>> >>>>>>>> iodepth=8
> >>       > >>> >>>>>>>> #numjobs=4
> >>       > >>> >>>>>>>> #thread
> >>       > >>> >>>>>>>> #group_reporting
> >>       > >>> >>>>>>>> #time_based
> >>       > >>> >>>>>>>> #direct=1
> >>       > >>> >>>>>>>> #ramp_time=60
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> Thanks,
> >>       > >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
> >>       > >>> >>>>>>>> Version: Mailvelope v1.1.0
> >>       > >>> >>>>>>>> Comment: https://www.mailvelope.com
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>>
> >>       wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
> >>       > >>> >>>>>>>>
> >>       tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
> >>       > >>> >>>>>>>>
> >>       h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
> >>       > >>> >>>>>>>>
> >>       903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
> >>       > >>> >>>>>>>>
> >>       sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
> >>       > >>> >>>>>>>>
> >>       FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
> >>       > >>> >>>>>>>>
> >>       pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
> >>       > >>> >>>>>>>>
> >>       5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
> >>       > >>> >>>>>>>>
> >>       B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
> >>       > >>> >>>>>>>>
> >>       4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
> >>       > >>> >>>>>>>>
> >>       o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
> >>       > >>> >>>>>>>>
> >>       gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
> >>       > >>> >>>>>>>> J3hS
> >>       > >>> >>>>>>>> =0J7F
> >>       > >>> >>>>>>>> -----END PGP SIGNATURE-----
> >>       > >>> >>>>>>>> ----------------
> >>       > >>> >>>>>>>> Robert LeBlanc
> >>       > >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E
> >>       E654 3BB2 FA62 B9F1
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum
> >>       wrote:
> >>       > >>> >>>>>>>>>
> >>       > >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc
> >>       wrote:
> >>       > >>> >>>>>>>>>>
> >>       > >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>       > >>> >>>>>>>>>> Hash: SHA256
> >>       > >>> >>>>>>>>>>
> >>       > >>> >>>>>>>>>> Is there some way to tell in the logs that this
> >>       is happening?
> >>       > >>> >>>>>>>>>
> >>       > >>> >>>>>>>>>
> >>       > >>> >>>>>>>>> You can search for the (mangled) name
> >>       _split_collection
> >>       > >>> >>>>>>>>>>
> >>       > >>> >>>>>>>>>> I'm not
> >>       > >>> >>>>>>>>>> seeing much I/O, CPU usage during these times.
> >>       Is there some way to
> >>       > >>> >>>>>>>>>> prevent the splitting? Is there a negative side
> >>       effect to doing so?
> >>       > >>> >>>>>>>>>
> >>       > >>> >>>>>>>>>
> >>       > >>> >>>>>>>>> Bump up the split and merge thresholds. You can
> >>       search the list for
> >>       > >>> >>>>>>>>> this, it was discussed not too long ago.
> >>       > >>> >>>>>>>>>
> >>       > >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as
> >>       soon as the sessions
> >>       > >>> >>>>>>>>>> are aborted, they are reestablished and
> >>       complete immediately.
> >>       > >>> >>>>>>>>>>
> >>       > >>> >>>>>>>>>> The fio test is just a seq write, starting it
> >>       over (rewriting from
> >>       > >>> >>>>>>>>>> the
> >>       > >>> >>>>>>>>>> beginning) is still causing the issue. I was
> >>       suspect that it is not
> >>       > >>> >>>>>>>>>> having to create new file and therefore split
> >>       collections. This is
> >>       > >>> >>>>>>>>>> on
> >>       > >>> >>>>>>>>>> my test cluster with no other load.
> >>       > >>> >>>>>>>>>
> >>       > >>> >>>>>>>>>
> >>       > >>> >>>>>>>>> Hmm, that does make it seem less likely if
> >>       you're really not creating
> >>       > >>> >>>>>>>>> new objects, if you're actually running fio in
> >>       such a way that it's
> >>       > >>> >>>>>>>>> not allocating new FS blocks (this is probably
> >>       hard to set up?).
> >>       > >>> >>>>>>>>>
> >>       > >>> >>>>>>>>>>
> >>       > >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log
> >>       options and depths
> >>       > >>> >>>>>>>>>> would be the most helpful for tracking this
> >>       issue down?
> >>       > >>> >>>>>>>>>
> >>       > >>> >>>>>>>>>
> >>       > >>> >>>>>>>>> If you want to go log diving "debug osd = 20",
> >>       "debug filestore =
> >>       > >>> >>>>>>>>> 20",
> >>       > >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to
> >>       see. That should spit
> >>       > >>> >>>>>>>>> out
> >>       > >>> >>>>>>>>> everything you need to track exactly what each
> >>       Op is doing.
> >>       > >>> >>>>>>>>> -Greg
> >>       > >>> >>>>>>>>
> >>       > >>> >>>>>>>> --
> >>       > >>> >>>>>>>> To unsubscribe from this list: send the line
> >>       "unsubscribe ceph-devel"
> >>       > >>> >>>>>>>> in
> >>       > >>> >>>>>>>> the body of a message to
> >>       majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>       > >>> >>>>>>>> More majordomo info at
> >>       http://vger.kernel.org/majordomo-info.html
> >>       > >>> >>>>>>>
> >>       > >>> >>>>>>>
> >>       > >>> >>>>>>>
> >>       > >>> >>>>>
> >>       > >>> >>>>> -----BEGIN PGP SIGNATURE-----
> >>       > >>> >>>>> Version: Mailvelope v1.1.0
> >>       > >>> >>>>> Comment: https://www.mailvelope.com
> >>       > >>> >>>>>
> >>       > >>> >>>>>
> >>       wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
> >>       > >>> >>>>>
> >>       a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
> >>       > >>> >>>>>
> >>       a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
> >>       > >>> >>>>>
> >>       s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
> >>       > >>> >>>>>
> >>       iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
> >>       > >>> >>>>>
> >>       izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
> >>       > >>> >>>>>
> >>       caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
> >>       > >>> >>>>>
> >>       efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
> >>       > >>> >>>>>
> >>       GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
> >>       > >>> >>>>>
> >>       glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
> >>       > >>> >>>>>
> >>       +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
> >>       > >>> >>>>>
> >>       pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
> >>       > >>> >>>>> gcZm
> >>       > >>> >>>>> =CjwB
> >>       > >>> >>>>> -----END PGP SIGNATURE-----
> >>       > >>> >>>>
> >>       > >>> >>>> --
> >>       > >>> >>>> To unsubscribe from this list: send the line
> >>       "unsubscribe ceph-devel" in
> >>       > >>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>       > >>> >>>> More majordomo info at
> >>       http://vger.kernel.org/majordomo-info.html
> >>       > >>> >>>>
> >>       > >>> >>>
> >>       > >>> >>
> >>       > >>> >> -----BEGIN PGP SIGNATURE-----
> >>       > >>> >> Version: Mailvelope v1.1.0
> >>       > >>> >> Comment: https://www.mailvelope.com
> >>       > >>> >>
> >>       > >>> >>
> >>       wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
> >>       > >>> >>
> >>       S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
> >>       > >>> >>
> >>       lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
> >>       > >>> >>
> >>       0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
> >>       > >>> >>
> >>       JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
> >>       > >>> >>
> >>       dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
> >>       > >>> >>
> >>       nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
> >>       > >>> >>
> >>       krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
> >>       > >>> >>
> >>       FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
> >>       > >>> >>
> >>       tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
> >>       > >>> >>
> >>       hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
> >>       > >>> >>
> >>       BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
> >>       > >>> >> ae22
> >>       > >>> >> =AX+L
> >>       > >>> >> -----END PGP SIGNATURE-----
> >>       > >>> _______________________________________________
> >>       > >>> ceph-users mailing list
> >>       > >>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>       > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>       > >>>
> >>       > >>>
> >>       > _______________________________________________
> >>       > ceph-users mailing list
> >>       > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>       > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>       >
> >>       >
> >>
> >>
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [ceph-users] Potential OSD deadlock?
  2015-10-06 14:38                                                                           ` Sage Weil
  2015-10-06 15:51                                                                             ` [ceph-users] " Robert LeBlanc
@ 2015-10-06 16:26                                                                             ` Ken Dreyer
       [not found]                                                                               ` <CALqRxCw=tV5h-xfaxsCwwqhK=zLiP=m_iGTn+dF0Op=Uth_vPA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 45+ messages in thread
From: Ken Dreyer @ 2015-10-06 16:26 UTC (permalink / raw)
  To: Sage Weil; +Cc: Robert LeBlanc, ceph-devel, ceph-users

On Tue, Oct 6, 2015 at 8:38 AM, Sage Weil <sweil@redhat.com> wrote:
> Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build)
> first.  They won't be allowed to boot until that happens... all upgrades
> must stop at 0.94.4 first.

This sounds pretty crucial. Is there a Redmine ticket (or tickets) for it?

- Ken

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                               ` <CALqRxCw=tV5h-xfaxsCwwqhK=zLiP=m_iGTn+dF0Op=Uth_vPA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-06 16:40                                                                                 ` Sage Weil
  0 siblings, 0 replies; 45+ messages in thread
From: Sage Weil @ 2015-10-06 16:40 UTC (permalink / raw)
  To: Ken Dreyer; +Cc: ceph-devel, ceph-users-idqoXFIVOFJgJs9I8MT0rw

On Tue, 6 Oct 2015, Ken Dreyer wrote:
> On Tue, Oct 6, 2015 at 8:38 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build)
> > first.  They won't be allowed to boot until that happens... all upgrades
> > must stop at 0.94.4 first.
> 
> This sounds pretty crucial. is there Redmine ticket(s)?

It's documented in the (draft) release notes, which discuss how to 
upgrade.  There's no bug...

sage

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [ceph-users] Potential OSD deadlock?
  2015-10-06 16:19                                                                                 ` Sage Weil
@ 2015-10-06 17:47                                                                                   ` Robert LeBlanc
  0 siblings, 0 replies; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-06 17:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, ceph-users

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

This was from the monitor (I can't bring it up with Hammer now; the complete
cluster is down, but this is only my lab, so there's no urgency).

I got it up and running this way:
1. Upgraded the mon node to Infernalis and started the mon.
2. Downgraded the OSDs to the to-be-0.94.4 hammer build and started them up.
3. Upgraded the OSD node to Infernalis
4. Stopped the OSD processes
5. Chowned the files that were updated by downgrading (find
/var/lib/ceph -user root -exec chown ceph. {} +)
6. Ran ceph-disk activate

The OSDs then came up and into the cluster.
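If it helps, steps 4-6 boil down to something like this on each OSD node (a
rough sketch; the systemd target name and ceph-disk activate-all are
assumptions about a stock Infernalis install, not exactly what I typed):

  # Stop the OSD daemons before touching ownership.
  systemctl stop ceph-osd.target

  # Re-own anything the downgrade left behind as root ("ceph." == ceph:ceph).
  find /var/lib/ceph -user root -exec chown ceph. {} +

  # Bring the OSDs back up from their prepared partitions.
  ceph-disk activate-all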

I had tried ceph-disk activate on the nodes while they were downgraded to
0.94.4 and the monitor was down. It took a while to time out searching for
the monitor, but based on the last OSDs I started, it seemed to do enough to
allow the OSDs to join an Infernalis monitor. (My monitor is running on an
OSD node; I tried starting these OSDs just in case I could skip zapping the
disks and having to backfill them. It worked just fine.)

Hopefully this helps someone who runs into this after the fact.
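To double check that the OSDs actually rejoined, the usual status commands
are enough (nothing specific to this recovery):

  # Confirm the OSDs are up/in and the cluster has settled.
  ceph osd tree
  ceph -s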
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWFAkLCRDmVDuy+mK58QAAp3AP/RptIt2yPrmL1EXzvl4V
N3Q69NE17ac9xb5ruxN/LqNMyZAE85UKhzkTFi2NdSMJzYygL3hpgLBqEOpF
5VhaKoaW/H/gfrXVTGt5reFySMvDZEA/9hqF9KLQggemRRebAv6DIHb8wLTO
OLHF/XSsi+JALlIx2a04OSFZQ2M9rPTmOGneZ63T0YoPK5XQVJgQT9D4h60+
IeSn9Drh+HPJQag1E6cuh9ixOofJAP9grAnGBqy4XWznMFMDYxaKYovS5Nkg
yt1ukH6R23dYNnIklVnpK3MmnU6JSnWyCraiolVb/Ddjd6D/wart95aClwHo
EmvirdctCk3mbfG/2MjcUO8UII9Dk0xs7ck/nqyDBatRcOCGOdn1SVCOT+0Q
N3dDFeEY7FoLZf0g9YYmtTnYtE5TQ0fJGOAwvJQeupJESMlXohXAQHgxKg0H
ksjmLrY1OTFdFMeS5P3sHHzN6qNGDKJyG6aURB6rN2xexTITQjl9ZZ/o9bfZ
vEldAdXp4B1TBlPqkVfHGkTrMezEOwOi0kGAChkflFu2nQB6LvKDKLiHjpKv
87MCaB97FvoUdZpGnbcuft6NU4lWH/ynVLLY8fOKb2/x1GEKYNz3cGKx3M0u
S5wRtYvOOZEwb/B2bhvCSNtb2FgqpS4INfgKn+334ibd2X1o42oj52SA1lz/
baNh
=LmYc
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 10:19 AM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> I downgraded to the hammer gitbuilder branch, but it looks like I've
>> passed the point of no return:
>>
>> 2015-10-06 09:44:52.210873 7fd3dd8b78c0 -1 ERROR: on disk data
>> includes unsupported features:
>> compat={},rocompat={},incompat={7=support shec erasure code}
>> 2015-10-06 09:44:52.210922 7fd3dd8b78c0 -1 error checking features:
>> (1) Operation not permitted
>
> In that case, mark all osds down, upgrade again, and they'll be
> allowed to start.  The restriction is that each osd can't go backwards,
> and post-hammer osds can't talk to pre-hammer osds.
>
> sage
>
>>
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 8:38 AM, Sage Weil <sweil@redhat.com> wrote:
>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> >> Thanks for your time Sage. It sounds like a few people may be helped if you
>> >> can find something.
>> >>
>> >> I did a recursive chown as in the instructions (although I didn't know about
>> >> the doc at the time). I did an osd debug at 20/20 but didn't see anything.
>> >> I'll also do ms and make the logs available. I'll also review the document
>> >> to make sure I didn't miss anything else.
>> >
>> > Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build)
>> > first.  They won't be allowed to boot until that happens... all upgrades
>> > must stop at 0.94.4 first.  And that isn't released yet.. we'll try to
>> > do that today.  In the meantime, you can use the hammer gitbuilder
>> > build...
>> >
>> > sage
>> >
>> >
>> >>
>> >> Robert LeBlanc
>> >>
>> >> Sent from a mobile device please excuse any typos.
>> >>
>> >> On Oct 6, 2015 6:37 AM, "Sage Weil" <sweil@redhat.com> wrote:
>> >>       On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>> >>       > -----BEGIN PGP SIGNED MESSAGE-----
>> >>       > Hash: SHA256
>> >>       >
>> >>       > With some off-list help, we have adjusted
>> >>       > osd_client_message_cap=10000. This seems to have helped a bit
>> >>       and we
>> >>       > have seen some OSDs have a value up to 4,000 for client
>> >>       messages. But
>> >>       > it does not solve the problem with the blocked I/O.
>> >>       >
>> >>       > One thing that I have noticed is that almost exactly 30
>> >>       seconds elapse
>> >>       > between an OSD boots and the first blocked I/O message. I
>> >>       don't know
>> >>       > if the OSD doesn't have time to get it's brain right about a
>> >>       PG before
>> >>       > it starts servicing it or what exactly.
>> >>
>> >>       I'm downloading the logs from yesterday now; sorry it's taking
>> >>       so long.
>> >>
>> >>       > On another note, I tried upgrading our CentOS dev cluster from
>> >>       Hammer
>> >>       > to master and things didn't go so well. The OSDs would not
>> >>       start
>> >>       > because /var/lib/ceph was not owned by ceph. I chowned the
>> >>       directory
>> >>       > and all OSDs and the OSD then started, but never became active
>> >>       in the
>> >>       > cluster. It just sat there after reading all the PGs. There
>> >>       were
>> >>       > sockets open to the monitor, but no OSD to OSD sockets. I
>> >>       tried
>> >>       > downgrading to the Infernalis branch and still no luck getting
>> >>       the
>> >>       > OSDs to come up. The OSD processes were idle after the initial
>> >>       boot.
>> >>       > All packages were installed from gitbuilder.
>> >>
>> >>       Did you chown -R ?
>> >>
>> >>              https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>> >>
>> >>       My guess is you only chowned the root dir, and the OSD didn't
>> >>       throw
>> >>       an error when it encountered the other files?  If you can
>> >>       generate a debug
>> >>       osd = 20 log, that would be helpful.. thanks!
>> >>
>> >>       sage
>> >>
>> >>
>> >>       >
>> >>       > Thanks,
>> >>       > -----BEGIN PGP SIGNATURE-----
>> >>       > Version: Mailvelope v1.2.0
>> >>       > Comment: https://www.mailvelope.com
>> >>       >
>> >>       > wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>> >>       > YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>> >>       > 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>> >>       > aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>> >>       > y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>> >>       > 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>> >>       > ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>> >>       > zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>> >>       > D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>> >>       > CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>> >>       > 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>> >>       > fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>> >>       > GdXC
>> >>       > =Aigq
>> >>       > -----END PGP SIGNATURE-----
>> >>       > ----------------
>> >>       > Robert LeBlanc
>> >>       > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62
>> >>       B9F1
>> >>       >
>> >>       >
>> >>       > On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc
>> >>       <robert@leblancnet.us> wrote:
>> >>       > > -----BEGIN PGP SIGNED MESSAGE-----
>> >>       > > Hash: SHA256
>> >>       > >
>> >>       > > I have eight nodes running the fio job rbd_test_real to
>> >>       different RBD
>> >>       > > volumes. I've included the CRUSH map in the tarball.
>> >>       > >
>> >>       > > I stopped one OSD process and marked it out. I let it
>> >>       recover for a
>> >>       > > few minutes and then I started the process again and marked
>> >>       it in. I
>> >>       > > started getting block I/O messages during the recovery.
>> >>       > >
>> >>       > > The logs are located at
>> >>       http://162.144.87.113/files/ushou1.tar.xz
>> >>       > >
>> >>       > > Thanks,
>> >>       > > -----BEGIN PGP SIGNATURE-----
>> >>       > > Version: Mailvelope v1.2.0
>> >>       > > Comment: https://www.mailvelope.com
>> >>       > >
>> >>       > > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>> >>       > > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>> >>       > > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>> >>       > > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>> >>       > > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>> >>       > > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>> >>       > > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>> >>       > > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>> >>       > > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>> >>       > > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>> >>       > > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>> >>       > > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>> >>       > > 3EPx
>> >>       > > =UDIV
>> >>       > > -----END PGP SIGNATURE-----
>> >>       > >
>> >>       > > ----------------
>> >>       > > Robert LeBlanc
>> >>       > > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
>> >>       FA62 B9F1
>> >>       > >
>> >>       > >
>> >>       > > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil <sweil@redhat.com>
>> >>       wrote:
>> >>       > >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>> >>       > >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>       > >>> Hash: SHA256
>> >>       > >>>
>> >>       > >>> We are still struggling with this and have tried a lot of
>> >>       different
>> >>       > >>> things. Unfortunately, Inktank (now Red Hat) no longer
>> >>       provides
>> >>       > >>> consulting services for non-Red Hat systems. If there are
>> >>       some
>> >>       > >>> certified Ceph consultants in the US with whom we can do both
>> >>       remote and
>> >>       > >>> on-site engagements, please let us know.
>> >>       > >>>
>> >>       > >>> This certainly seems to be network related, but somewhere
>> >>       in the
>> >>       > >>> kernel. We have tried increasing the network and TCP
>> >>       buffers, number
>> >>       > >>> of TCP sockets, reduced the FIN_WAIT2 state. There is
>> >>       about 25% idle
>> >>       > >>> on the boxes, the disks are busy, but not constantly at
>> >>       100% (they
>> >>       > >>> cycle from <10% up to 100%, but not 100% for more than a
>> >>       few seconds
>> >>       > >>> at a time). There seems to be no reasonable explanation
>> >>       why I/O is
>> >>       > >>> blocked pretty frequently for longer than 30 seconds. We have
>> >>       verified
>> >>       > >>> Jumbo frames by pinging from/to each node with 9000 byte
>> >>       packets. The
>> >>       > >>> network admins have verified that packets are not being
>> >>       dropped in the
>> >>       > >>> switches for these nodes. We have tried different kernels
>> >>       including
>> >>       > >>> the recent Google patch to cubic. This is showing up on
>> >>       three clusters
>> >>       > >>> (two Ethernet and one IPoIB). I booted one cluster into
>> >>       Debian Jessie
>> >>       > >>> (from CentOS 7.1) with similar results.
>> >>       > >>>
>> >>       > >>> The messages seem slightly different:
>> >>       > >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425
>> >>       439 :
>> >>       > >>> cluster [WRN] 14 slow requests, 1 included below; oldest
>> >>       blocked for >
>> >>       > >>> 100.087155 secs
>> >>       > >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425
>> >>       440 :
>> >>       > >>> cluster [WRN] slow request 30.041999 seconds old, received
>> >>       at
>> >>       > >>> 2015-10-03 14:37:53.151014:
>> >>       osd_op(client.1328605.0:7082862
>> >>       > >>> rbd_data.13fdcb2ae8944a.000000000001264f [read
>> >>       975360~4096]
>> >>       > >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently
>> >>       no flag
>> >>       > >>> points reached
>> >>       > >>>
>> >>       > >>> I don't know what "no flag points reached" means.
>> >>       > >>
>> >>       > >> Just that the op hasn't been marked as reaching any
>> >>       interesting points
>> >>       > >> (op->mark_*() calls).
>> >>       > >>
>> >>       > >> Is it possible to gather a lot with debug ms = 20 and debug
>> >>       osd = 20?
>> >>       > >> It's extremely verbose but it'll let us see where the op is
>> >>       getting
>> >>       > >> blocked.  If you see the "slow request" message it means
>> >>       the op was
>> >>       > >> received by ceph (that's when the clock starts), so I
>> >>       suspect it's not
>> >>       > >> something we can blame on the network stack.
>> >>       > >>
>> >>       > >> sage
>> >>       > >>
>> >>       > >>
>> >>       > >>>
>> >>       > >>> The problem is most pronounced when we have to reboot an
>> >>       OSD node (1
>> >>       > >>> of 13): we will have hundreds of I/Os blocked, sometimes
>> >>       for up to 300
>> >>       > >>> seconds. It takes a good 15 minutes for things to settle
>> >>       down. The
>> >>       > >>> production cluster is very busy doing normally 8,000 I/O
>> >>       and peaking
>> >>       > >>> at 15,000. This is all 4TB spindles with SSD journals and
>> >>       the disks
>> >>       > >>> are between 25-50% full. We are currently splitting PGs to
>> >>       distribute
>> >>       > >>> the load better across the disks, but we are having to do
>> >>       this 10 PGs
>> >>       > >>> at a time as we get blocked I/O. We have max_backfills and
>> >>       > >>> max_recovery set to 1, client op priority is set higher
>> >>       than recovery
>> >>       > >>> priority. We tried increasing the number of op threads but
>> >>       this didn't
>> >>       > >>> seem to help. It seems as soon as PGs are finished being
>> >>       checked, they
>> >>       > >>> become active and could be the cause for slow I/O while
>> >>       the other PGs
>> >>       > >>> are being checked.
>> >>       > >>>
>> >>       > >>> What I don't understand is that the messages are delayed.
>> >>       As soon as
>> >>       > >>> the message is received by Ceph OSD process, it is very
>> >>       quickly
>> >>       > >>> committed to the journal and a response is sent back to
>> >>       the primary
>> >>       > >>> OSD, which is received very quickly as well. I've adjusted
>> >>       > >>> min_free_kbytes and it seems to keep the OSDs from
>> >>       crashing, but
>> >>       > >>> doesn't solve the main problem. We don't have swap and
>> >>       there is 64 GB
>> >>       > >>> of RAM per nodes for 10 OSDs.
>> >>       > >>> of RAM per node for 10 OSDs.
>> >>       > >>> Is there something that could cause the kernel to get a
>> >>       packet but not
>> >>       > >>> be able to dispatch it to Ceph, such that it could
>> >>       explain why we
>> >>       > >>> are seeing this blocked I/O for 30+ seconds? Are there
>> >>       some pointers
>> >>       > >>> to tracing Ceph messages from the network buffer through
>> >>       the kernel to
>> >>       > >>> the Ceph process?
>> >>       > >>>
>> >>       > >>> We can really use some pointers no matter how outrageous.
>> >>       We've have
>> >>       We've had
>> >>       can't think of
>> >>       > >>> anything else.
>> >>       > >>>
>> >>       > >>> Thanks,
>> >>       > >>> -----BEGIN PGP SIGNATURE-----
>> >>       > >>> Version: Mailvelope v1.1.0
>> >>       > >>> Comment: https://www.mailvelope.com
>> >>       > >>>
>> >>       > >>>
>> >>       wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>> >>       > >>>
>> >>       NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>> >>       > >>>
>> >>       prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>> >>       > >>>
>> >>       K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>> >>       > >>>
>> >>       h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>> >>       > >>>
>> >>       iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>> >>       > >>>
>> >>       Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>> >>       > >>>
>> >>       Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>> >>       > >>>
>> >>       JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>> >>       > >>>
>> >>       8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>> >>       > >>>
>> >>       lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>> >>       > >>>
>> >>       4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>> >>       > >>> l7OF
>> >>       > >>> =OI++
>> >>       > >>> -----END PGP SIGNATURE-----
>> >>       > >>> ----------------
>> >>       > >>> Robert LeBlanc
>> >>       > >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
>> >>       FA62 B9F1
>> >>       > >>>
>> >>       > >>>
>> >>       > >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc
>> >>       <robert@leblancnet.us> wrote:
>> >>       > >>> > We dropped the replication on our cluster from 4 to 3
>> >>       and it looks
>> >>       > >>> > like all the blocked I/O has stopped (no entries in the
>> >>       log for the
>> >>       > >>> > last 12 hours). This makes me believe that there is some
>> >>       issue with
>> >>       > >>> > the number of sockets or some other TCP issue. We have
>> >>       not messed with
>> >>       > >>> > Ephemeral ports and TIME_WAIT at this point. There are
>> >>       130 OSDs, 8 KVM
>> >>       > >>> > hosts hosting about 150 VMs. Open files is set at 32K
>> >>       for the OSD
>> >>       > >>> > processes and 16K system wide.
>> >>       > >>> >
>> >>       > >>> > Does this seem like the right spot to be looking? What
>> >>       are some
>> >>       > >>> > configuration items we should be looking at?
>> >>       > >>> >
>> >>       > >>> > Thanks,
>> >>       > >>> > ----------------
>> >>       > >>> > Robert LeBlanc
>> >>       > >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2
>> >>       FA62 B9F1
>> >>       > >>> >
>> >>       > >>> >
>> >>       > >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc
>> >>       <robert@leblancnet.us> wrote:
>> >>       > >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >>       > >>> >> Hash: SHA256
>> >>       > >>> >>
>> >>       > >>> >> We were able to only get ~17Gb out of the XL710
>> >>       (heavily tweaked)
>> >>       > >>> >> until we went to the 4.x kernel where we got ~36Gb (no
>> >>       tweaking). It
>> >>       > >>> >> seems that there were some major reworks in the network
>> >>       handling in
>> >>       > >>> >> the kernel to efficiently handle that network rate. If
>> >>       I remember
>> >>       > >>> >> right we also saw a drop in CPU utilization. I'm
>> >>       starting to think
>> >>       > >>> >> that we did see packet loss while congesting our ISLs
>> >>       in our initial
>> >>       > >>> >> testing, but we could not tell where the dropping was
>> >>       happening. We
>> >>       > >>> >> saw some on the switches, but it didn't seem to be bad
>> >>       if we weren't
>> >>       > >>> >> trying to congest things. We probably already saw this
>> >>       issue, just
>> >>       > >>> >> didn't know it.
>> >>       > >>> >> - ----------------
>> >>       > >>> >> Robert LeBlanc
>> >>       > >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
>> >>       3BB2 FA62 B9F1
>> >>       > >>> >>
>> >>       > >>> >>
>> >>       > >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>> >>       > >>> >>> FWIW, we've got some 40GbE Intel cards in the
>> >>       community performance cluster
>> >>       > >>> >>> on a Mellanox 40GbE switch that appear (knock on wood)
>> >>       to be running fine
>> >>       > >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback
>> >>       from Intel that older
>> >>       > >>> >>> drivers might cause problems though.
>> >>       > >>> >>>
>> >>       > >>> >>> Here's ifconfig from one of the nodes:
>> >>       > >>> >>>
>> >>       > >>> >>> ens513f1: flags=4163  mtu 1500
>> >>       > >>> >>>         inet 10.0.10.101  netmask 255.255.255.0
>> >>       broadcast 10.0.10.255
>> >>       > >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64
>> >>       scopeid 0x20
>> >>       > >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000
>> >>       (Ethernet)
>> >>       > >>> >>>         RX packets 169232242875  bytes 229346261232279
>> >>       (208.5 TiB)
>> >>       > >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>> >>       > >>> >>>         TX packets 153491686361  bytes 203976410836881
>> >>       (185.5 TiB)
>> >>       > >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0
>> >>       collisions 0
>> >>       > >>> >>>
>> >>       > >>> >>> Mark
>> >>       > >>> >>>
>> >>       > >>> >>>
>> >>       > >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>> >>       > >>> >>>>
>> >>       > >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>       > >>> >>>> Hash: SHA256
>> >>       > >>> >>>>
>> >>       > >>> >>>> OK, here is the update on the saga...
>> >>       > >>> >>>>
>> >>       > >>> >>>> I traced some more of blocked I/Os and it seems that
>> >>       communication
>> >>       > >>> >>>> between two hosts seemed worse than others. I did a
>> >>       two way ping flood
>> >>       > >>> >>>> between the two hosts using max packet sizes (1500).
>> >>       After 1.5M
>> >>       > >>> >>>> packets, no lost pings. Then I had the ping flood
>> >>       running while I
>> >>       > >>> >>>> put Ceph load on the cluster and the dropped pings
>> >>       started increasing
>> >>       > >>> >>>> after stopping the Ceph workload the pings stopped
>> >>       dropping.
>> >>       > >>> >>>>
>> >>       > >>> >>>> I then ran iperf between all the nodes with the same
>> >>       results, so that
>> >>       > >>> >>>> ruled out Ceph to a large degree. I then booted into
>> >>       the
>> >>       > >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour
>> >>       test so far there
>> >>       > >>> >>>> haven't been any dropped pings or blocked I/O. Our 40
>> >>       Gb NICs really
>> >>       > >>> >>>> need the network enhancements in the 4.x series to
>> >>       work well.
>> >>       > >>> >>>>
>> >>       > >>> >>>> Does this sound familiar to anyone? I'll probably
>> >>       start bisecting the
>> >>       > >>> >>>> kernel to see where this issue is introduced. Both of
>> >>       the clusters
>> >>       > >>> >>>> with this issue are running 4.x; other than that,
>> >>       they have pretty
>> >>       > >>> >>>> different hardware and network configs.
>> >>       > >>> >>>>
>> >>       > >>> >>>> Thanks,
>> >>       > >>> >>>> -----BEGIN PGP SIGNATURE-----
>> >>       > >>> >>>> Version: Mailvelope v1.1.0
>> >>       > >>> >>>> Comment: https://www.mailvelope.com
>> >>       > >>> >>>>
>> >>       > >>> >>>>
>> >>       wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>> >>       > >>> >>>>
>> >>       RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>> >>       > >>> >>>>
>> >>       AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>> >>       > >>> >>>>
>> >>       7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>> >>       > >>> >>>>
>> >>       cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>> >>       > >>> >>>>
>> >>       F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>> >>       > >>> >>>>
>> >>       byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>> >>       > >>> >>>>
>> >>       /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>> >>       > >>> >>>>
>> >>       LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>> >>       > >>> >>>>
>> >>       UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>> >>       > >>> >>>>
>> >>       sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>> >>       > >>> >>>>
>> >>       KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>> >>       > >>> >>>> 4OEo
>> >>       > >>> >>>> =P33I
>> >>       > >>> >>>> -----END PGP SIGNATURE-----
>> >>       > >>> >>>> ----------------
>> >>       > >>> >>>> Robert LeBlanc
>> >>       > >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
>> >>       3BB2 FA62 B9F1
>> >>       > >>> >>>>
>> >>       > >>> >>>>
>> >>       > >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>> >>       > >>> >>>> wrote:
>> >>       > >>> >>>>>
>> >>       > >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>       > >>> >>>>> Hash: SHA256
>> >>       > >>> >>>>>
>> >>       > >>> >>>>> This is IPoIB and we have the MTU set to 64K. There
>> >>       were some issues
>> >>       > >>> >>>>> pinging hosts with "No buffer space available"
>> >>       (hosts are currently
>> >>       > >>> >>>>> configured for 4GB to test SSD caching rather than
>> >>       page cache). I
>> >>       > >>> >>>>> found that an MTU under 32K worked reliably for ping,
>> >>       but still had the
>> >>       > >>> >>>>> blocked I/O.
>> >>       > >>> >>>>>
>> >>       > >>> >>>>> I reduced the MTU to 1500 and checked pings (OK),
>> >>       but I'm still seeing
>> >>       > >>> >>>>> the blocked I/O.
>> >>       > >>> >>>>> - ----------------
>> >>       > >>> >>>>> Robert LeBlanc
>> >>       > >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654
>> >>       3BB2 FA62 B9F1
>> >>       > >>> >>>>>
>> >>       > >>> >>>>>
>> >>       > >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>> >>       > >>> >>>>>>
>> >>       > >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>> >>       > >>> >>>>>>>
>> >>       > >>> >>>>>>> I looked at the logs, it looks like there was a 53
>> >>       second delay
>> >>       > >>> >>>>>>> between when osd.17 started sending the osd_repop
>> >>       message and when
>> >>       > >>> >>>>>>> osd.13 started reading it, which is pretty weird.
>> >>       Sage, didn't we
>> >>       > >>> >>>>>>> once see a kernel issue which caused some messages
>> >>       to be mysteriously
>> >>       > >>> >>>>>>> delayed for many 10s of seconds?
>> >>       > >>> >>>>>>
>> >>       > >>> >>>>>>
>> >>       > >>> >>>>>> Every time we have seen this behavior and diagnosed
>> >>       it in the wild it
>> >>       > >>> >>>>>> has
>> >>       > >>> >>>>>> been a network misconfiguration.  Usually related
>> >>       to jumbo frames.
>> >>       > >>> >>>>>>
>> >>       > >>> >>>>>> sage
>> >>       > >>> >>>>>>
>> >>       > >>> >>>>>>
>> >>       > >>> >>>>>>>
>> >>       > >>> >>>>>>> What kernel are you running?
>> >>       > >>> >>>>>>> -Sam
>> >>       > >>> >>>>>>>
>> >>       > >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc
>> >>       wrote:
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>       > >>> >>>>>>>> Hash: SHA256
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> OK, looping in ceph-devel to see if I can get
>> >>       some more eyes. I've
>> >>       > >>> >>>>>>>> extracted what I think are important entries from
>> >>       the logs for the
>> >>       > >>> >>>>>>>> first blocked request. NTP is running all the
>> >>       servers so the logs
>> >>       > >>> >>>>>>>> should be close in terms of time. Logs for 12:50
>> >>       to 13:00 are
>> >>       > >>> >>>>>>>> available at
>> >>       http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from
>> >>       client
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O
>> >>       to osd.13
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O
>> >>       to osd.16
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from
>> >>       osd.17
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk
>> >>       result=0 from osd.16
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to
>> >>       osd.17 ondisk result=0
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow
>> >>       I/O > 30.439150 sec
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from
>> >>       osd.17
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk
>> >>       result=0 from osd.13
>> >>       > >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to
>> >>       osd.17 ondisk result=0
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> In the logs I can see that osd.17 dispatches the
>> >>       I/O to osd.13 and
>> >>       > >>> >>>>>>>> osd.16 almost simultaneously. osd.16 seems to get
>> >>       the I/O right away,
>> >>       > >>> >>>>>>>> but for some reason osd.13 doesn't get the
>> >>       message until 53 seconds
>> >>       > >>> >>>>>>>> later. osd.17 seems happy to just wait and
>> >>       doesn't resend the data
>> >>       > >>> >>>>>>>> (well, I'm not 100% sure how to tell which
>> >>       entries are the actual data
>> >>       > >>> >>>>>>>> transfer).
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> It looks like osd.17 is receiving responses to
>> >>       start the communication
>> >>       > >>> >>>>>>>> with osd.13, but the op is not acknowledged until
>> >>       almost a minute
>> >>       > >>> >>>>>>>> later. To me it seems that the message is getting
>> >>       received but not
>> >>       > >>> >>>>>>>> passed to another thread right away or something.
>> >>       This test was done
>> >>       > >>> >>>>>>>> with an idle cluster, a single fio client (rbd
>> >>       engine) with a single
>> >>       > >>> >>>>>>>> thread.
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> The OSD servers are almost 100% idle during these
>> >>       blocked I/O
>> >>       > >>> >>>>>>>> requests. I think I'm at the end of my
>> >>       troubleshooting, so I can use
>> >>       > >>> >>>>>>>> some help.
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> Single Test started about
>> >>       > >>> >>>>>>>> 2015-09-22 12:52:36
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17
>> >>       192.168.55.14:6800/16726 56 :
>> >>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
>> >>       oldest blocked for >
>> >>       > >>> >>>>>>>> 30.439150 secs
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17
>> >>       192.168.55.14:6800/16726 57 :
>> >>       > >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old,
>> >>       received at
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>> >>       > >>> >>>>>>>>   osd_op(client.250874.0:1388
>> >>       rbd_data.3380e2ae8944a.0000000000000545
>> >>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>> >>       4194304,write
>> >>       > >>> >>>>>>>> 0~4194304] 8.bbf3e8ff
>> >>       ack+ondisk+write+known_if_redirected e56785)
>> >>       > >>> >>>>>>>>   currently waiting for subops from 13,16
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16
>> >>       192.168.55.13:6800/29410 7 : cluster
>> >>       > >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest
>> >>       blocked for >
>> >>       > >>> >>>>>>>> 30.379680 secs
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16
>> >>       192.168.55.13:6800/29410 8 : cluster
>> >>       > >>> >>>>>>>> [WRN] slow request 30.291520 seconds old,
>> >>       received at 2015-09-22
>> >>       > >>> >>>>>>>> 12:55:06.406303:
>> >>       > >>> >>>>>>>>   osd_op(client.250874.0:1384
>> >>       rbd_data.3380e2ae8944a.0000000000000541
>> >>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>> >>       4194304,write
>> >>       > >>> >>>>>>>> 0~4194304] 8.5fb2123f
>> >>       ack+ondisk+write+known_if_redirected e56785)
>> >>       > >>> >>>>>>>>   currently waiting for subops from 13,17
>> >>       > >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16
>> >>       192.168.55.13:6800/29410 9 : cluster
>> >>       > >>> >>>>>>>> [WRN] slow request 30.379680 seconds old,
>> >>       received at 2015-09-22
>> >>       > >>> >>>>>>>> 12:55:06.318144:
>> >>       > >>> >>>>>>>>   osd_op(client.250874.0:1382
>> >>       rbd_data.3380e2ae8944a.000000000000053f
>> >>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>> >>       4194304,write
>> >>       > >>> >>>>>>>> 0~4194304] 8.312e69ca
>> >>       ack+ondisk+write+known_if_redirected e56785)
>> >>       > >>> >>>>>>>>   currently waiting for subops from 13,14
>> >>       > >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13
>> >>       192.168.55.12:6804/4574 130 :
>> >>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
>> >>       oldest blocked for >
>> >>       > >>> >>>>>>>> 30.954212 secs
>> >>       > >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13
>> >>       192.168.55.12:6804/4574 131 :
>> >>       > >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old,
>> >>       received at
>> >>       > >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>> >>       > >>> >>>>>>>>   osd_op(client.250874.0:1873
>> >>       rbd_data.3380e2ae8944a.000000000000070d
>> >>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>> >>       4194304,write
>> >>       > >>> >>>>>>>> 0~4194304] 8.e69870d4
>> >>       ack+ondisk+write+known_if_redirected e56785)
>> >>       > >>> >>>>>>>>   currently waiting for subops from 16,17
>> >>       > >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16
>> >>       192.168.55.13:6800/29410 10 :
>> >>       > >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below;
>> >>       oldest blocked for >
>> >>       > >>> >>>>>>>> 30.704367 secs
>> >>       > >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16
>> >>       192.168.55.13:6800/29410 11 :
>> >>       > >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old,
>> >>       received at
>> >>       > >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>> >>       > >>> >>>>>>>>   osd_op(client.250874.0:1874
>> >>       rbd_data.3380e2ae8944a.000000000000070e
>> >>       > >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size
>> >>       4194304,write
>> >>       > >>> >>>>>>>> 0~4194304] 8.f7635819
>> >>       ack+ondisk+write+known_if_redirected e56785)
>> >>       > >>> >>>>>>>>   currently waiting for subops from 13,17
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> Server   IP addr              OSD
>> >>       > >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>> >>       > >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>> >>       > >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>> >>       > >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>> >>       > >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>> >>       > >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> fio job:
>> >>       > >>> >>>>>>>> [rbd-test]
>> >>       > >>> >>>>>>>> readwrite=write
>> >>       > >>> >>>>>>>> blocksize=4M
>> >>       > >>> >>>>>>>> #runtime=60
>> >>       > >>> >>>>>>>> name=rbd-test
>> >>       > >>> >>>>>>>> #readwrite=randwrite
>> >>       > >>> >>>>>>>>
>> >>       #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>> >>       > >>> >>>>>>>> #rwmixread=72
>> >>       > >>> >>>>>>>> #norandommap
>> >>       > >>> >>>>>>>> #size=1T
>> >>       > >>> >>>>>>>> #blocksize=4k
>> >>       > >>> >>>>>>>> ioengine=rbd
>> >>       > >>> >>>>>>>> rbdname=test2
>> >>       > >>> >>>>>>>> pool=rbd
>> >>       > >>> >>>>>>>> clientname=admin
>> >>       > >>> >>>>>>>> iodepth=8
>> >>       > >>> >>>>>>>> #numjobs=4
>> >>       > >>> >>>>>>>> #thread
>> >>       > >>> >>>>>>>> #group_reporting
>> >>       > >>> >>>>>>>> #time_based
>> >>       > >>> >>>>>>>> #direct=1
>> >>       > >>> >>>>>>>> #ramp_time=60
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> Thanks,
>> >>       > >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>> >>       > >>> >>>>>>>> Version: Mailvelope v1.1.0
>> >>       > >>> >>>>>>>> Comment: https://www.mailvelope.com
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>>
>> >>       wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>> >>       > >>> >>>>>>>>
>> >>       tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>> >>       > >>> >>>>>>>>
>> >>       h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>> >>       > >>> >>>>>>>>
>> >>       903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>> >>       > >>> >>>>>>>>
>> >>       sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>> >>       > >>> >>>>>>>>
>> >>       FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>> >>       > >>> >>>>>>>>
>> >>       pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>> >>       > >>> >>>>>>>>
>> >>       5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>> >>       > >>> >>>>>>>>
>> >>       B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>> >>       > >>> >>>>>>>>
>> >>       4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>> >>       > >>> >>>>>>>>
>> >>       o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>> >>       > >>> >>>>>>>>
>> >>       gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>> >>       > >>> >>>>>>>> J3hS
>> >>       > >>> >>>>>>>> =0J7F
>> >>       > >>> >>>>>>>> -----END PGP SIGNATURE-----
>> >>       > >>> >>>>>>>> ----------------
>> >>       > >>> >>>>>>>> Robert LeBlanc
>> >>       > >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E
>> >>       E654 3BB2 FA62 B9F1
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum
>> >>       wrote:
>> >>       > >>> >>>>>>>>>
>> >>       > >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc
>> >>       wrote:
>> >>       > >>> >>>>>>>>>>
>> >>       > >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>       > >>> >>>>>>>>>> Hash: SHA256
>> >>       > >>> >>>>>>>>>>
>> >>       > >>> >>>>>>>>>> Is there some way to tell in the logs that this
>> >>       is happening?
>> >>       > >>> >>>>>>>>>
>> >>       > >>> >>>>>>>>>
>> >>       > >>> >>>>>>>>> You can search for the (mangled) name
>> >>       _split_collection
>> >>       > >>> >>>>>>>>>>
>> >>       > >>> >>>>>>>>>> I'm not
>> >>       > >>> >>>>>>>>>> seeing much I/O, CPU usage during these times.
>> >>       Is there some way to
>> >>       > >>> >>>>>>>>>> prevent the splitting? Is there a negative side
>> >>       effect to doing so?
>> >>       > >>> >>>>>>>>>
>> >>       > >>> >>>>>>>>>
>> >>       > >>> >>>>>>>>> Bump up the split and merge thresholds. You can
>> >>       search the list for
>> >>       > >>> >>>>>>>>> this, it was discussed not too long ago.
>> >>       > >>> >>>>>>>>>
>> >>       > >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as
>> >>       soon as the sessions
>> >>       > >>> >>>>>>>>>> are aborted, they are reestablished and
>> >>       complete immediately.
>> >>       > >>> >>>>>>>>>>
>> >>       > >>> >>>>>>>>>> The fio test is just a seq write, starting it
>> >>       over (rewriting from
>> >>       > >>> >>>>>>>>>> the
>> >>       > >>> >>>>>>>>>> beginning) is still causing the issue. I was
>> >>       suspect that it is not
>> >>       suspecting that it is not
>> >>       > >>> >>>>>>>>>> having to create new files and therefore split
>> >>       > >>> >>>>>>>>>> on
>> >>       > >>> >>>>>>>>>> my test cluster with no other load.
>> >>       > >>> >>>>>>>>>
>> >>       > >>> >>>>>>>>>
>> >>       > >>> >>>>>>>>> Hmm, that does make it seem less likely if
>> >>       you're really not creating
>> >>       > >>> >>>>>>>>> new objects, if you're actually running fio in
>> >>       such a way that it's
>> >>       > >>> >>>>>>>>> not allocating new FS blocks (this is probably
>> >>       hard to set up?).
>> >>       > >>> >>>>>>>>>
>> >>       > >>> >>>>>>>>>>
>> >>       > >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log
>> >>       options and depths
>> >>       > >>> >>>>>>>>>> would be the most helpful for tracking this
>> >>       issue down?
>> >>       > >>> >>>>>>>>>
>> >>       > >>> >>>>>>>>>
>> >>       > >>> >>>>>>>>> If you want to go log diving "debug osd = 20",
>> >>       "debug filestore =
>> >>       > >>> >>>>>>>>> 20",
>> >>       > >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to
>> >>       see. That should spit
>> >>       > >>> >>>>>>>>> out
>> >>       > >>> >>>>>>>>> everything you need to track exactly what each
>> >>       Op is doing.
>> >>       > >>> >>>>>>>>> -Greg
>> >>       > >>> >>>>>>>>
>> >>       > >>> >>>>>>>
>> >>       > >>> >>>>>>>
>> >>       > >>> >>>>>>>
>> >>       > >>> >>>>>
>> >>       > >>> >>>>> -----BEGIN PGP SIGNATURE-----
>> >>       > >>> >>>>> Version: Mailvelope v1.1.0
>> >>       > >>> >>>>> Comment: https://www.mailvelope.com
>> >>       > >>> >>>>>
>> >>       > >>> >>>>>
>> >>       wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>> >>       > >>> >>>>>
>> >>       a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>> >>       > >>> >>>>>
>> >>       a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>> >>       > >>> >>>>>
>> >>       s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>> >>       > >>> >>>>>
>> >>       iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>> >>       > >>> >>>>>
>> >>       izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>> >>       > >>> >>>>>
>> >>       caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>> >>       > >>> >>>>>
>> >>       efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>> >>       > >>> >>>>>
>> >>       GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>> >>       > >>> >>>>>
>> >>       glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>> >>       > >>> >>>>>
>> >>       +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>> >>       > >>> >>>>>
>> >>       pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>> >>       > >>> >>>>> gcZm
>> >>       > >>> >>>>> =CjwB
>> >>       > >>> >>>>> -----END PGP SIGNATURE-----
>> >>       > >>> >>>>
>> >>       > >>> >>>>
>> >>       > >>> >>>
>> >>       > >>> >>
>> >>       > >>> >> -----BEGIN PGP SIGNATURE-----
>> >>       > >>> >> Version: Mailvelope v1.1.0
>> >>       > >>> >> Comment: https://www.mailvelope.com
>> >>       > >>> >>
>> >>       > >>> >>
>> >>       wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>> >>       > >>> >>
>> >>       S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>> >>       > >>> >>
>> >>       lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>> >>       > >>> >>
>> >>       0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>> >>       > >>> >>
>> >>       JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>> >>       > >>> >>
>> >>       dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>> >>       > >>> >>
>> >>       nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>> >>       > >>> >>
>> >>       krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>> >>       > >>> >>
>> >>       FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>> >>       > >>> >>
>> >>       tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>> >>       > >>> >>
>> >>       hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>> >>       > >>> >>
>> >>       BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>> >>       > >>> >> ae22
>> >>       > >>> >> =AX+L
>> >>       > >>> >> -----END PGP SIGNATURE-----
>> >>       > >>>
>> >>       > >>>
>> >>       >
>> >>       >
>> >>
>> >>
>> >>
>>
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [ceph-users] Potential OSD deadlock?
  2015-10-06 12:37                                                                   ` Sage Weil
       [not found]                                                                     ` <alpine.DEB.2.00.1510060534010.32037-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-10-06 18:03                                                                     ` Robert LeBlanc
       [not found]                                                                       ` <CAANLjFosPfynahiTmC2r=wPGWKg8YQAak58XGt0MfVXC9bmXuw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-06 18:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
(4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
messages when the OSD was marked out:

2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
34.476006 secs
2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
cluster [WRN] slow request 32.913474 seconds old, received at
2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
ack+read+known_if_redirected e58744) currently waiting for peered
2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
cluster [WRN] slow request 32.697545 seconds old, received at
2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
ack+read+known_if_redirected e58744) currently waiting for peered
2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
cluster [WRN] slow request 32.668006 seconds old, received at
2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
ack+read+known_if_redirected e58744) currently waiting for peered

But I'm not seeing the blocked messages now that the OSD has come back in.
The OSD spindles have been running at 100% during this test. I have seen
slower I/O from the clients, as expected with the extra load, but so
far no blocked messages. I'm going to run some more tests.
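
While those run I'm watching the cluster log on one of the mons for
anything that blocks; roughly like this (a rough sketch that assumes the
default cluster log location and stock grep/awk):

  # follow slow/blocked request warnings as they appear
  tail -F /var/log/ceph/ceph.log | grep --line-buffered 'slow request'

  # summarize which OSDs have reported slow requests so far in this run
  grep 'slow request' /var/log/ceph/ceph.log | awk '{print $3}' | sort | uniq -c | sort -rn

If a few more in/out cycles stay clean there, I'll report back with the
results.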

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
fo5a
=ahEi
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil <sweil@redhat.com> wrote:
> On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> With some off-list help, we have adjusted
>> osd_client_message_cap=10000. This seems to have helped a bit and we
>> have seen some OSDs have a value up to 4,000 for client messages. But
>> it does not solve the problem with the blocked I/O.
>>
>> One thing that I have noticed is that almost exactly 30 seconds elapse
>> between when an OSD boots and the first blocked I/O message. I don't know
>> if the OSD doesn't have time to get its brain right about a PG before
>> it starts servicing it or what exactly.
>
> I'm downloading the logs from yesterday now; sorry it's taking so long.
>
>> On another note, I tried upgrading our CentOS dev cluster from Hammer
>> to master and things didn't go so well. The OSDs would not start
>> because /var/lib/ceph was not owned by ceph. I chowned the directory
>> and all OSDs and the OSD then started, but never became active in the
>> cluster. It just sat there after reading all the PGs. There were
>> sockets open to the monitor, but no OSD to OSD sockets. I tried
>> downgrading to the Infernalis branch and still no luck getting the
>> OSDs to come up. The OSD processes were idle after the initial boot.
>> All packages were installed from gitbuilder.
>
> Did you chown -R ?
>
>         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>
> My guess is you only chowned the root dir, and the OSD didn't throw
> an error when it encountered the other files?  If you can generate a debug
> osd = 20 log, that would be helpful.. thanks!
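
I'll regenerate that log with the messenger logging turned on as well.
The rough plan for the dev node is below (a sketch only: the OSD id is a
placeholder, the paths assume the default locations, and these boxes are
on the systemd units from the gitbuilder packages):

  # add to the [osd] section of /etc/ceph/ceph.conf on the dev node:
  #   debug osd = 20
  #   debug ms = 1
  OSD_ID=0                               # whichever OSD lives on that node
  systemctl restart ceph-osd@${OSD_ID}
  # let it sit after it reads the PGs, then collect
  # /var/log/ceph/ceph-osd.${OSD_ID}.log

I'll put the resulting log up with the others.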
>
> sage
>
>
>>
>> Thanks,
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>> GdXC
>> =Aigq
>> -----END PGP SIGNATURE-----
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> > -----BEGIN PGP SIGNED MESSAGE-----
>> > Hash: SHA256
>> >
>> > I have eight nodes running the fio job rbd_test_real to different RBD
>> > volumes. I've included the CRUSH map in the tarball.
>> >
>> > I stopped one OSD process and marked it out. I let it recover for a
>> > few minutes and then I started the process again and marked it in. I
>> > started getting block I/O messages during the recovery.
>> >
>> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>> >
>> > Thanks,
>> > -----BEGIN PGP SIGNATURE-----
>> > Version: Mailvelope v1.2.0
>> > Comment: https://www.mailvelope.com
>> >
>> > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>> > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>> > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>> > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>> > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>> > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>> > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>> > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>> > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>> > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>> > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>> > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>> > 3EPx
>> > =UDIV
>> > -----END PGP SIGNATURE-----
>> >
>> > ----------------
>> > Robert LeBlanc
>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >
>> >
>> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil <sweil@redhat.com> wrote:
>> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> Hash: SHA256
>> >>>
>> >>> We are still struggling with this and have tried a lot of different
>> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>> >>> consulting services for non-Red Hat systems. If there are some
>> >>> certified Ceph consultants in the US with whom we can do both remote and
>> >>> on-site engagements, please let us know.
>> >>>
>> >>> This certainly seems to be network related, but somewhere in the
>> >>> kernel. We have tried increasing the network and TCP buffers, number
>> >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>> >>> on the boxes, the disks are busy, but not constantly at 100% (they
>> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>> >>> at a time). There seems to be no reasonable explanation why I/O is
>> >>> blocked pretty frequently for longer than 30 seconds. We have verified
>> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>> >>> network admins have verified that packets are not being dropped in the
>> >>> switches for these nodes. We have tried different kernels including
>> >>> the recent Google patch to cubic. This is showing up on three clusters
>> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>> >>> (from CentOS 7.1) with similar results.
>> >>>
>> >>> The messages seem slightly different:
>> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>> >>> 100.087155 secs
>> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>> >>> cluster [WRN] slow request 30.041999 seconds old, received at
>> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>> >>> points reached
>> >>>
>> >>> I don't know what "no flag points reached" means.
>> >>
>> >> Just that the op hasn't been marked as reaching any interesting points
>> >> (op->mark_*() calls).
>> >>
>> >> Is it possible to gather a lot with debug ms = 20 and debug osd = 20?
>> >> It's extremely verbose but it'll let us see where the op is getting
>> >> blocked.  If you see the "slow request" message it means the op was
>> >> received by ceph (that's when the clock starts), so I suspect it's not
>> >> something we can blame on the network stack.
>> >>
>> >> sage
>> >>
>> >>
>> >>>
>> >>> The problem is most pronounced when we have to reboot an OSD node (1
>> >>> of 13): we will have hundreds of I/Os blocked, sometimes for up to 300
>> >>> seconds. It takes a good 15 minutes for things to settle down. The
>> >>> production cluster is very busy doing normally 8,000 I/O and peaking
>> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>> >>> are between 25-50% full. We are currently splitting PGs to distribute
>> >>> the load better across the disks, but we are having to do this 10 PGs
>> >>> at a time as we get blocked I/O. We have max_backfills and
>> >>> max_recovery set to 1, client op priority is set higher than recovery
>> >>> priority. We tried increasing the number of op threads but this didn't
>> >>> seem to help. It seems as soon as PGs are finished being checked, they
>> >>> become active and could be the cause for slow I/O while the other PGs
>> >>> are being checked.
>> >>>
>> >>> What I don't understand is that the messages are delayed. As soon as
>> >>> the message is received by Ceph OSD process, it is very quickly
>> >>> committed to the journal and a response is sent back to the primary
>> >>> OSD, which is received very quickly as well. I've adjusted
>> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
>> >>> of RAM per node for 10 OSDs.
>> >>>
>> >>> Is there something that could cause the kernel to get a packet but not
>> >>> be able to dispatch it to Ceph, such that it could explain why we
>> >>> are seeing this blocked I/O for 30+ seconds? Are there some pointers
>> >>> to tracing Ceph messages from the network buffer through the kernel to
>> >>> the Ceph process?
>> >>>
>> >>> We can really use some pointers no matter how outrageous. We've had
>> >>> over 6 people looking into this for weeks now and just can't think of
>> >>> anything else.
>> >>>
>> >>> Thanks,
>> >>> -----BEGIN PGP SIGNATURE-----
>> >>> Version: Mailvelope v1.1.0
>> >>> Comment: https://www.mailvelope.com
>> >>>
>> >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>> >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>> >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>> >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>> >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>> >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>> >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>> >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>> >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>> >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>> >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>> >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>> >>> l7OF
>> >>> =OI++
>> >>> -----END PGP SIGNATURE-----
>> >>> ----------------
>> >>> Robert LeBlanc
>> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>
>> >>>
>> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
>> >>> > like all the blocked I/O has stopped (no entries in the log for the
>> >>> > last 12 hours). This makes me believe that there is some issue with
>> >>> > the number of sockets or some other TCP issue. We have not messed with
>> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>> >>> > processes and 16K system wide.
>> >>> >
>> >>> > Does this seem like the right spot to be looking? What are some
>> >>> > configuration items we should be looking at?
>> >>> >
>> >>> > Thanks,
>> >>> > ----------------
>> >>> > Robert LeBlanc
>> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >
>> >>> >
>> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >> Hash: SHA256
>> >>> >>
>> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>> >>> >> seems that there were some major reworks in the network handling in
>> >>> >> the kernel to efficiently handle that network rate. If I remember
>> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
>> >>> >> that we did see packet loss while congesting our ISLs in our initial
>> >>> >> testing, but we could not tell where the dropping was happening. We
>> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>> >>> >> trying to congest things. We probably already saw this issue, just
>> >>> >> didn't know it.
>> >>> >> - ----------------
>> >>> >> Robert LeBlanc
>> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>
>> >>> >>
>> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>> >>> >>> drivers might cause problems though.
>> >>> >>>
>> >>> >>> Here's ifconfig from one of the nodes:
>> >>> >>>
>> >>> >>> ens513f1: flags=4163  mtu 1500
>> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>> >>> >>>
>> >>> >>> Mark
>> >>> >>>
>> >>> >>>
>> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>> >>> >>>>
>> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>> Hash: SHA256
>> >>> >>>>
>> >>> >>>> OK, here is the update on the saga...
>> >>> >>>>
>> >>> >>>> I traced some more of blocked I/Os and it seems that communication
>> >>> >>>> between two hosts seemed worse than others. I did a two way ping flood
>> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>> >>> >>>> packets, no lost pings. Then I had the ping flood running while I
>> >>> >>>> put Ceph load on the cluster and the dropped pings started increasing;
>> >>> >>>> after stopping the Ceph workload the pings stopped dropping.
>> >>> >>>>
>> >>> >>>> I then ran iperf between all the nodes with the same results, so that
>> >>> >>>> ruled out Ceph to a large degree. I then booted into the
>> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>> >>> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>> >>> >>>> need the network enhancements in the 4.x series to work well.
>> >>> >>>>
>> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
>> >>> >>>> with this issue are running 4.x, other than that, they are pretty
>> >>> >>>> differing hardware and network configs.
>> >>> >>>>
>> >>> >>>> Thanks,
>> >>> >>>> -----BEGIN PGP SIGNATURE-----
>> >>> >>>> Version: Mailvelope v1.1.0
>> >>> >>>> Comment: https://www.mailvelope.com
>> >>> >>>>
>> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>> >>> >>>> 4OEo
>> >>> >>>> =P33I
>> >>> >>>> -----END PGP SIGNATURE-----
>> >>> >>>> ----------------
>> >>> >>>> Robert LeBlanc
>> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>> >>> >>>> wrote:
>> >>> >>>>>
>> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>> Hash: SHA256
>> >>> >>>>>
>> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>> >>> >>>>> found that an MTU under 32K worked reliably for ping, but still had the
>> >>> >>>>> blocked I/O.
>> >>> >>>>>
>> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>> >>> >>>>> the blocked I/O.
>> >>> >>>>> - ----------------
>> >>> >>>>> Robert LeBlanc
>> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>> >>> >>>>>>
>> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>> >>> >>>>>>>
>> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>> >>> >>>>>>> delayed for many 10s of seconds?
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>> >>> >>>>>> has
>> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>> >>> >>>>>>
>> >>> >>>>>> sage
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>> What kernel are you running?
>> >>> >>>>>>> -Sam
>> >>> >>>>>>>
>> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>> >>> >>>>>>>>
>> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>>> Hash: SHA256
>> >>> >>>>>>>>
>> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>> >>> >>>>>>>> extracted what I think are important entries from the logs for the
>> >>> >>>>>>>> first blocked request. NTP is running on all the servers so the logs
>> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>> >>> >>>>>>>>
>> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>> >>> >>>>>>>>
>> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>> >>> >>>>>>>> osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
>> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>> >>> >>>>>>>> transfer).
>> >>> >>>>>>>>
>> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>> >>> >>>>>>>> later. To me it seems that the message is getting received but not
>> >>> >>>>>>>> passed to another thread right away or something. This test was done
>> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>> >>> >>>>>>>> thread.
>> >>> >>>>>>>>
>> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>> >>> >>>>>>>> some help.
>> >>> >>>>>>>>
>> >>> >>>>>>>> Single Test started about
>> >>> >>>>>>>> 2015-09-22 12:52:36
>> >>> >>>>>>>>
>> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>> >>>>>>>> 30.439150 secs
>> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>> >>> >>>>>>>>   currently waiting for subops from 13,16
>> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>> >>> >>>>>>>> 30.379680 secs
>> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>> >>> >>>>>>>> 12:55:06.406303:
>> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>> >>> >>>>>>>>   currently waiting for subops from 13,17
>> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>> >>> >>>>>>>> 12:55:06.318144:
>> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>> >>> >>>>>>>>   currently waiting for subops from 13,14
>> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>> >>>>>>>> 30.954212 secs
>> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>> >>> >>>>>>>>   currently waiting for subops from 16,17
>> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>> >>>>>>>> 30.704367 secs
>> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>> >>> >>>>>>>>   currently waiting for subops from 13,17
>> >>> >>>>>>>>
>> >>> >>>>>>>> Server   IP addr              OSD
>> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>> >>> >>>>>>>>
>> >>> >>>>>>>> fio job:
>> >>> >>>>>>>> [rbd-test]
>> >>> >>>>>>>> readwrite=write
>> >>> >>>>>>>> blocksize=4M
>> >>> >>>>>>>> #runtime=60
>> >>> >>>>>>>> name=rbd-test
>> >>> >>>>>>>> #readwrite=randwrite
>> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>> >>> >>>>>>>> #rwmixread=72
>> >>> >>>>>>>> #norandommap
>> >>> >>>>>>>> #size=1T
>> >>> >>>>>>>> #blocksize=4k
>> >>> >>>>>>>> ioengine=rbd
>> >>> >>>>>>>> rbdname=test2
>> >>> >>>>>>>> pool=rbd
>> >>> >>>>>>>> clientname=admin
>> >>> >>>>>>>> iodepth=8
>> >>> >>>>>>>> #numjobs=4
>> >>> >>>>>>>> #thread
>> >>> >>>>>>>> #group_reporting
>> >>> >>>>>>>> #time_based
>> >>> >>>>>>>> #direct=1
>> >>> >>>>>>>> #ramp_time=60
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> Thanks,
>> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>> >>> >>>>>>>> Version: Mailvelope v1.1.0
>> >>> >>>>>>>> Comment: https://www.mailvelope.com
>> >>> >>>>>>>>
>> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>> >>> >>>>>>>> J3hS
>> >>> >>>>>>>> =0J7F
>> >>> >>>>>>>> -----END PGP SIGNATURE-----
>> >>> >>>>>>>> ----------------
>> >>> >>>>>>>> Robert LeBlanc
>> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>>>>> Hash: SHA256
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>>> I'm not
>> >>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>> >>> >>>>>>>>> this, it was discussed not too long ago.
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>>> The fio test is just a seq write; starting it over (rewriting from the
>> >>> >>>>>>>>>> beginning) is still causing the issue. I suspected that it is not
>> >>> >>>>>>>>>> having to create new files and therefore not splitting collections.
>> >>> >>>>>>>>>> This is on my test cluster with no other load.
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>>
>> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>> >>> >>>>>>>>> 20",
>> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>> >>> >>>>>>>>> out
>> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>> >>> >>>>>>>>> -Greg
>> >>> >>>>>>>>
>> >>> >>>>>>>> --
>> >>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> >>> >>>>>>>> in
>> >>> >>>>>>>> the body of a message to majordomo@vger.kernel.org
>> >>> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>
>> >>> >>>>> -----BEGIN PGP SIGNATURE-----
>> >>> >>>>> Version: Mailvelope v1.1.0
>> >>> >>>>> Comment: https://www.mailvelope.com
>> >>> >>>>>
>> >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>> >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>> >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>> >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>> >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>> >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>> >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>> >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>> >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>> >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>> >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>> >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>> >>> >>>>> gcZm
>> >>> >>>>> =CjwB
>> >>> >>>>> -----END PGP SIGNATURE-----
>> >>> >>>>
>> >>> >>>> --
>> >>> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >>> >>>> the body of a message to majordomo@vger.kernel.org
>> >>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >> -----BEGIN PGP SIGNATURE-----
>> >>> >> Version: Mailvelope v1.1.0
>> >>> >> Comment: https://www.mailvelope.com
>> >>> >>
>> >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>> >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>> >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>> >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>> >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>> >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>> >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>> >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>> >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>> >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>> >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>> >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>> >>> >> ae22
>> >>> >> =AX+L
>> >>> >> -----END PGP SIGNATURE-----
>> >>> _______________________________________________
>> >>> ceph-users mailing list
>> >>> ceph-users@lists.ceph.com
>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>
>> >>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                       ` <CAANLjFosPfynahiTmC2r=wPGWKg8YQAak58XGt0MfVXC9bmXuw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-06 18:32                                                                         ` Sage Weil
       [not found]                                                                           ` <alpine.DEB.2.00.1510061122370.32037-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Sage Weil @ 2015-10-06 18:32 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
> messages when the OSD was marked out:
> 
> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
> 34.476006 secs
> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
> cluster [WRN] slow request 32.913474 seconds old, received at
> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
> ack+read+known_if_redirected e58744) currently waiting for peered
> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
> cluster [WRN] slow request 32.697545 seconds old, received at
> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
> ack+read+known_if_redirected e58744) currently waiting for peered
> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
> cluster [WRN] slow request 32.668006 seconds old, received at
> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
> ack+read+known_if_redirected e58744) currently waiting for peered
> 
> But I'm not seeing the blocked messages when the OSD came back in. The
> OSD spindles have been running at 100% during this test. I have seen
> slowed I/O from the clients as expected from the extra load, but so
> far no blocked messages. I'm going to run some more tests.

Good to hear.

FWIW I looked through the logs and all of the 'no flag points reached' slow
request messages came from osd.163... and the logs don't show when they arrived.
My guess is this OSD has a slower disk than the others, or something else 
funny is going on?

I spot checked another OSD at random (60) where I saw a slow request.  It 
was stuck peering for 10s of seconds... waiting on a pg log message from 
osd.163.
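
If it helps narrow that down, a few things worth checking on osd.163's host
(a rough sketch; adjust the osd id and run the daemon commands locally):

ceph osd perf                          # compare fs_commit/fs_apply latency across OSDs
iostat -x 5                            # does that disk show higher await/%util than its peers?
ceph daemon osd.163 dump_ops_in_flight # ops currently stuck on that OSD
ceph daemon osd.163 dump_historic_ops  # slowest recent ops and where they spent their time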

sage


> 
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
> fo5a
> =ahEi
> -----END PGP SIGNATURE-----
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> With some off-list help, we have adjusted
> >> osd_client_message_cap=10000. This seems to have helped a bit and we
> >> have seen some OSDs have a value up to 4,000 for client messages. But
> >> it does not solve the problem with the blocked I/O.
> >>
> >> One thing that I have noticed is that almost exactly 30 seconds elapse
> >> between when an OSD boots and the first blocked I/O message. I don't know
> >> if the OSD doesn't have time to get its brain right about a PG before
> >> it starts servicing it or what exactly.
> >
> > I'm downloading the logs from yesterday now; sorry it's taking so long.
> >
> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
> >> to master and things didn't go so well. The OSDs would not start
> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
> >> and all OSDs and the OSD then started, but never became active in the
> >> cluster. It just sat there after reading all the PGs. There were
> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
> >> downgrading to the Infernalis branch and still no luck getting the
> >> OSDs to come up. The OSD processes were idle after the initial boot.
> >> All packages were installed from gitbuilder.
> >
> > Did you chown -R ?
> >
> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
> >
> > My guess is you only chowned the root dir, and the OSD didn't throw
> > an error when it encountered the other files?  If you can generate a debug
> > osd = 20 log, that would be helpful.. thanks!
> >
> > sage
> >
> >
> >>
> >> Thanks,
> >> -----BEGIN PGP SIGNATURE-----
> >> Version: Mailvelope v1.2.0
> >> Comment: https://www.mailvelope.com
> >>
> >> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
> >> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
> >> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
> >> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
> >> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
> >> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
> >> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
> >> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
> >> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
> >> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
> >> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
> >> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
> >> GdXC
> >> =Aigq
> >> -----END PGP SIGNATURE-----
> >> ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >> > -----BEGIN PGP SIGNED MESSAGE-----
> >> > Hash: SHA256
> >> >
> >> > I have eight nodes running the fio job rbd_test_real to different RBD
> >> > volumes. I've included the CRUSH map in the tarball.
> >> >
> >> > I stopped one OSD process and marked it out. I let it recover for a
> >> > few minutes and then I started the process again and marked it in. I
> >> > started getting blocked I/O messages during the recovery.
> >> >
> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
> >> >
> >> > Thanks,
> >> > -----BEGIN PGP SIGNATURE-----
> >> > Version: Mailvelope v1.2.0
> >> > Comment: https://www.mailvelope.com
> >> >
> >> > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
> >> > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
> >> > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
> >> > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
> >> > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
> >> > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
> >> > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
> >> > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
> >> > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
> >> > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
> >> > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
> >> > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
> >> > 3EPx
> >> > =UDIV
> >> > -----END PGP SIGNATURE-----
> >> >
> >> > ----------------
> >> > Robert LeBlanc
> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >
> >> >
> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >>> Hash: SHA256
> >> >>>
> >> >>> We are still struggling with this and have tried a lot of different
> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
> >> >>> consulting services for non-Red Hat systems. If there are some
> >> >>> certified Ceph consultants in the US with whom we can do both remote and
> >> >>> on-site engagements, please let us know.
> >> >>>
> >> >>> This certainly seems to be network related, but somewhere in the
> >> >>> kernel. We have tried increasing the network and TCP buffers and the number
> >> >>> of TCP sockets, and reducing the FIN_WAIT2 state. There is about 25% idle
> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
> >> >>> at a time). There seems to be no reasonable explanation why I/O is
> >> >>> blocked pretty frequently longer than 30 seconds. We have verified
> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> >> >>> network admins have verified that packets are not being dropped in the
> >> >>> switches for these nodes. We have tried different kernels including
> >> >>> the recent Google patch to cubic. This is showing up on three clusters
> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> >> >>> (from CentOS 7.1) with similar results.
> >> >>>
> >> >>> The messages seem slightly different:
> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> >> >>> 100.087155 secs
> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> >> >>> points reached
> >> >>>
> >> >>> I don't know what "no flag points reached" means.
> >> >>
> >> >> Just that the op hasn't been marked as reaching any interesting points
> >> >> (op->mark_*() calls).
> >> >>
> >> >> Is it possible to gather a lot with debug ms = 20 and debug osd = 20?
> >> >> It's extremely verbose but it'll let us see where the op is getting
> >> >> blocked.  If you see the "slow request" message it means the op is
> >> >> received by ceph (that's when the clock starts), so I suspect it's not
> >> >> something we can blame on the network stack.
> >> >>
> >> >> sage
> >> >>
> >> >>
> >> >>>
> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
> >> >>> of 13); we will have hundreds of I/Os blocked, sometimes for up to 300
> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
> >> >>> production cluster is very busy doing normally 8,000 I/O and peaking
> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
> >> >>> the load better across the disks, but we are having to do this 10 PGs
> >> >>> at a time as we get blocked I/O. We have max_backfills and
> >> >>> max_recovery set to 1, client op priority is set higher than recovery
> >> >>> priority. We tried increasing the number of op threads but this didn't
> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
> >> >>> become active and could be the cause for slow I/O while the other PGs
> >> >>> are being checked.
> >> >>>
> >> >>> What I don't understand is that the messages are delayed. As soon as
> >> >>> the message is received by the Ceph OSD process, it is very quickly
> >> >>> committed to the journal and a response is sent back to the primary
> >> >>> OSD, which is received very quickly as well. I've adjusted
> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
> >> >>> of RAM per nodes for 10 OSDs.
> >> >>>
> >> >>> Is there something that could cause the kernel to get a packet but not
> >> >>> be able to dispatch it to Ceph, which could explain why we
> >> >>> are seeing this blocked I/O for 30+ seconds? Are there any pointers
> >> >>> to tracing Ceph messages from the network buffer through the kernel to
> >> >>> the Ceph process?
> >> >>>
> >> >>> We can really use some pointers, no matter how outrageous. We have had
> >> >>> over 6 people looking into this for weeks now and just can't think of
> >> >>> anything else.
> >> >>>
> >> >>> Thanks,
> >> >>> -----BEGIN PGP SIGNATURE-----
> >> >>> Version: Mailvelope v1.1.0
> >> >>> Comment: https://www.mailvelope.com
> >> >>>
> >> >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
> >> >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
> >> >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
> >> >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
> >> >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
> >> >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
> >> >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
> >> >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
> >> >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
> >> >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
> >> >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
> >> >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
> >> >>> l7OF
> >> >>> =OI++
> >> >>> -----END PGP SIGNATURE-----
> >> >>> ----------------
> >> >>> Robert LeBlanc
> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >>>
> >> >>>
> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
> >> >>> > last 12 hours). This makes me believe that there is some issue with
> >> >>> > the number of sockets or some other TCP issue. We have not messed with
> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> >> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> >> >>> > processes and 16K system wide.
> >> >>> >
> >> >>> > Does this seem like the right spot to be looking? What are some
> >> >>> > configuration items we should be looking at?
> >> >>> >
> >> >>> > Thanks,
> >> >>> > ----------------
> >> >>> > Robert LeBlanc
> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >>> >
> >> >>> >
> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> >>> >> Hash: SHA256
> >> >>> >>
> >> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
> >> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
> >> >>> >> seems that there were some major reworks in the network handling in
> >> >>> >> the kernel to efficiently handle that network rate. If I remember
> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
> >> >>> >> testing, but we could not tell where the dropping was happening. We
> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
> >> >>> >> trying to congest things. We probably already saw this issue, just
> >> >>> >> didn't know it.
> >> >>> >> - ----------------
> >> >>> >> Robert LeBlanc
> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >>> >>
> >> >>> >>
> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
> >> >>> >>> drivers might cause problems though.
> >> >>> >>>
> >> >>> >>> Here's ifconfig from one of the nodes:
> >> >>> >>>
> >> >>> >>> ens513f1: flags=4163  mtu 1500
> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >> >>> >>>
> >> >>> >>> Mark
> >> >>> >>>
> >> >>> >>>
> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >> >>> >>>>
> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >>> >>>> Hash: SHA256
> >> >>> >>>>
> >> >>> >>>> OK, here is the update on the saga...
> >> >>> >>>>
> >> >>> >>>> I traced some more of blocked I/Os and it seems that communication
> >> >>> >>>> between two hosts seemed worse than others. I did a two way ping flood
> >> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
> >> >>> >>>> packets, no lost pings. Then I had the ping flood running while I
> >> >>> >>>> put Ceph load on the cluster and the dropped pings started increasing;
> >> >>> >>>> after stopping the Ceph workload the pings stopped dropping.
> >> >>> >>>>
> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
> >> >>> >>>> ruled out Ceph to a large degree. I then booted into the
> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
> >> >>> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
> >> >>> >>>> need the network enhancements in the 4.x series to work well.
> >> >>> >>>>
> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
> >> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
> >> >>> >>>> with this issue are running 4.x, other than that, they are pretty
> >> >>> >>>> differing hardware and network configs.
> >> >>> >>>>
> >> >>> >>>> Thanks,
> >> >>> >>>> -----BEGIN PGP SIGNATURE-----
> >> >>> >>>> Version: Mailvelope v1.1.0
> >> >>> >>>> Comment: https://www.mailvelope.com
> >> >>> >>>>
> >> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
> >> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
> >> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
> >> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
> >> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
> >> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
> >> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
> >> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
> >> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
> >> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
> >> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
> >> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
> >> >>> >>>> 4OEo
> >> >>> >>>> =P33I
> >> >>> >>>> -----END PGP SIGNATURE-----
> >> >>> >>>> ----------------
> >> >>> >>>> Robert LeBlanc
> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >>> >>>>
> >> >>> >>>>
> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> >> >>> >>>> wrote:
> >> >>> >>>>>
> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >>> >>>>> Hash: SHA256
> >> >>> >>>>>
> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
> >> >>> >>>>> found that an MTU under 32K worked reliably for ping, but still had the
> >> >>> >>>>> blocked I/O.
> >> >>> >>>>>
> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
> >> >>> >>>>> the blocked I/O.
> >> >>> >>>>> - ----------------
> >> >>> >>>>> Robert LeBlanc
> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >>> >>>>>
> >> >>> >>>>>
> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> >> >>> >>>>>>
> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >> >>> >>>>>>>
> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
> >> >>> >>>>>>> delayed for many 10s of seconds?
> >> >>> >>>>>>
> >> >>> >>>>>>
> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
> >> >>> >>>>>> has
> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
> >> >>> >>>>>>
> >> >>> >>>>>> sage
> >> >>> >>>>>>
> >> >>> >>>>>>
> >> >>> >>>>>>>
> >> >>> >>>>>>> What kernel are you running?
> >> >>> >>>>>>> -Sam
> >> >>> >>>>>>>
> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >>> >>>>>>>> Hash: SHA256
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
> >> >>> >>>>>>>> first blocked request. NTP is running on all the servers so the logs
> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> >> >>> >>>>>>>> osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
> >> >>> >>>>>>>> transfer).
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
> >> >>> >>>>>>>> thread.
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
> >> >>> >>>>>>>> some help.
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> Single Test started about
> >> >>> >>>>>>>> 2015-09-22 12:52:36
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >> >>> >>>>>>>> 30.439150 secs
> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >> >>> >>>>>>>>   currently waiting for subops from 13,16
> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> >> >>> >>>>>>>> 30.379680 secs
> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> >> >>> >>>>>>>> 12:55:06.406303:
> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >> >>> >>>>>>>>   currently waiting for subops from 13,17
> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> >> >>> >>>>>>>> 12:55:06.318144:
> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >> >>> >>>>>>>>   currently waiting for subops from 13,14
> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >> >>> >>>>>>>> 30.954212 secs
> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >> >>> >>>>>>>>   currently waiting for subops from 16,17
> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >> >>> >>>>>>>> 30.704367 secs
> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
> >> >>> >>>>>>>>   currently waiting for subops from 13,17
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> Server   IP addr              OSD
> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> fio job:
> >> >>> >>>>>>>> [rbd-test]
> >> >>> >>>>>>>> readwrite=write
> >> >>> >>>>>>>> blocksize=4M
> >> >>> >>>>>>>> #runtime=60
> >> >>> >>>>>>>> name=rbd-test
> >> >>> >>>>>>>> #readwrite=randwrite
> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> >> >>> >>>>>>>> #rwmixread=72
> >> >>> >>>>>>>> #norandommap
> >> >>> >>>>>>>> #size=1T
> >> >>> >>>>>>>> #blocksize=4k
> >> >>> >>>>>>>> ioengine=rbd
> >> >>> >>>>>>>> rbdname=test2
> >> >>> >>>>>>>> pool=rbd
> >> >>> >>>>>>>> clientname=admin
> >> >>> >>>>>>>> iodepth=8
> >> >>> >>>>>>>> #numjobs=4
> >> >>> >>>>>>>> #thread
> >> >>> >>>>>>>> #group_reporting
> >> >>> >>>>>>>> #time_based
> >> >>> >>>>>>>> #direct=1
> >> >>> >>>>>>>> #ramp_time=60
> >> >>> >>>>>>>>
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> Thanks,
> >> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
> >> >>> >>>>>>>> Version: Mailvelope v1.1.0
> >> >>> >>>>>>>> Comment: https://www.mailvelope.com
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
> >> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
> >> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
> >> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
> >> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
> >> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
> >> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
> >> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
> >> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
> >> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
> >> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
> >> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
> >> >>> >>>>>>>> J3hS
> >> >>> >>>>>>>> =0J7F
> >> >>> >>>>>>>> -----END PGP SIGNATURE-----
> >> >>> >>>>>>>> ----------------
> >> >>> >>>>>>>> Robert LeBlanc
> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >>> >>>>>>>>
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
> >> >>> >>>>>>>>>>
> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >>> >>>>>>>>>> Hash: SHA256
> >> >>> >>>>>>>>>>
> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
> >> >>> >>>>>>>>>>
> >> >>> >>>>>>>>>> I'm not
> >> >>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
> >> >>> >>>>>>>>> this, it was discussed not too long ago.
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
> >> >>> >>>>>>>>>>
> >> >>> >>>>>>>>>> The fio test is just a seq write; starting it over (rewriting from the
> >> >>> >>>>>>>>>> beginning) is still causing the issue. I suspected that it is not
> >> >>> >>>>>>>>>> having to create new files and therefore not splitting collections.
> >> >>> >>>>>>>>>> This is on my test cluster with no other load.
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>>>
> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
> >> >>> >>>>>>>>> 20",
> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
> >> >>> >>>>>>>>> out
> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
> >> >>> >>>>>>>>> -Greg
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> --
> >> >>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> >>> >>>>>>>> in
> >> >>> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> >>> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>> >>>>>>>
> >> >>> >>>>>>>
> >> >>> >>>>>>>
> >> >>> >>>>>
> >> >>> >>>>> -----BEGIN PGP SIGNATURE-----
> >> >>> >>>>> Version: Mailvelope v1.1.0
> >> >>> >>>>> Comment: https://www.mailvelope.com
> >> >>> >>>>>
> >> >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
> >> >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
> >> >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
> >> >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
> >> >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
> >> >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
> >> >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
> >> >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
> >> >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
> >> >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
> >> >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
> >> >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
> >> >>> >>>>> gcZm
> >> >>> >>>>> =CjwB
> >> >>> >>>>> -----END PGP SIGNATURE-----
> >> >>> >>>>
> >> >>> >>>> --
> >> >>> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> >>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> >>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>> >>>>
> >> >>> >>>
> >> >>> >>
> >> >>> >> -----BEGIN PGP SIGNATURE-----
> >> >>> >> Version: Mailvelope v1.1.0
> >> >>> >> Comment: https://www.mailvelope.com
> >> >>> >>
> >> >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
> >> >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
> >> >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
> >> >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
> >> >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
> >> >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
> >> >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
> >> >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
> >> >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
> >> >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
> >> >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
> >> >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
> >> >>> >> ae22
> >> >>> >> =AX+L
> >> >>> >> -----END PGP SIGNATURE-----
> >> >>> _______________________________________________
> >> >>> ceph-users mailing list
> >> >>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>>
> >> >>>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                           ` <alpine.DEB.2.00.1510061122370.32037-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-10-06 19:06                                                                             ` Robert LeBlanc
  2015-10-06 19:34                                                                               ` [ceph-users] " Sage Weil
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-06 19:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I can't think of anything. In my dev cluster the only thing that has
changed is the Ceph version (no reboot). What I like is that even though
the disks are 100% utilized, it is performing as I expect now. Client
I/O is slightly degraded during the recovery, but there is no blocked I/O
when the OSD boots or during the recovery period. This is with
max_backfills set to 20; with only one backfill allowed, OSD boot/recovery
in our production cluster is painful. I was able to reproduce this issue on
our dev cluster very easily and very quickly with these settings. So
far two tests and an hour later, only the blocked I/O when the OSD is
marked out. We would love to see that go away too, but this is far
better than what we have now. This dev cluster also has
osd_client_message_cap set to default (100).
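
For reference, this is roughly how the dev cluster is set up (a sketch; the
backfill value can be applied at runtime or pinned in ceph.conf):

ceph tell osd.* injectargs '--osd_max_backfills 20'
# osd_client_message_cap is left at its default of 100; it could be pinned in
# ceph.conf under [osd] as "osd client message cap = 100" if needed.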

We need to stay on the Hammer version of Ceph and I'm willing to take
the time to bisect this. If this is not a problem in Firefly/Giant,
would you prefer a bisect to find the introduction of the problem
(Firefly/Giant -> Hammer) or the introduction of the resolution
(Hammer -> Infernalis)? Do you have any hints for avoiding commits that
prevent a clean build, as that is my most limiting factor?
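
To make the bisect question concrete, what I have in mind is roughly the
following (the tags are just examples of a known-bad hammer build and an
assumed-good giant build; each step means rebuilding packages, installing
them on the dev cluster, and re-running the fio test):

  git clone https://github.com/ceph/ceph.git && cd ceph
  git bisect start
  git bisect bad v0.94.3     # a hammer build that shows the blocked I/O
  git bisect good v0.87.2    # giant, assuming it does not reproduce there
  # test the checked-out commit, then mark it:
  git bisect good            # or: git bisect bad
  # commits that will not build cleanly:
  git bisect skip

It is the 'git bisect skip' cases I am trying to minimize, hence the question.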

Thanks,
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>> messages when the OSD was marked out:
>>
>> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>> 34.476006 secs
>> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>> cluster [WRN] slow request 32.913474 seconds old, received at
>> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>> ack+read+known_if_redirected e58744) currently waiting for peered
>> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>> cluster [WRN] slow request 32.697545 seconds old, received at
>> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>> ack+read+known_if_redirected e58744) currently waiting for peered
>> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>> cluster [WRN] slow request 32.668006 seconds old, received at
>> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>> ack+read+known_if_redirected e58744) currently waiting for peered
>>
>> But I'm not seeing the blocked messages when the OSD came back in. The
>> OSD spindles have been running at 100% during this test. I have seen
>> slowed I/O from the clients as expected from the extra load, but so
>> far no blocked messages. I'm going to run some more tests.
>
> Good to hear.
>
> FWIW I looked through the logs and all of the slow request no flag point
> messages came from osd.163... and the logs don't show when they arrived.
> My guess is this OSD has a slower disk than the others, or something else
> funny is going on?
>
> I spot checked another OSD at random (60) where I saw a slow request.  It
> was stuck peering for 10s of seconds... waiting on a pg log message from
> osd.163.
>
> sage
>
>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>> fo5a
>> =ahEi
>> -----END PGP SIGNATURE-----
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >> Hash: SHA256
>> >>
>> >> With some off-list help, we have adjusted
>> >> osd_client_message_cap=10000. This seems to have helped a bit and we
>> >> have seen some OSDs have a value up to 4,000 for client messages. But
>> >> it does not solve the problem with the blocked I/O.
>> >>
>> >> One thing that I have noticed is that almost exactly 30 seconds elapse
>> >> between when an OSD boots and the first blocked I/O message. I don't know
>> >> if the OSD doesn't have time to get its brain right about a PG before
>> >> it starts servicing it or what exactly.
>> >
>> > I'm downloading the logs from yesterday now; sorry it's taking so long.
>> >
>> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
>> >> to master and things didn't go so well. The OSDs would not start
>> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
>> >> and all OSDs and the OSD then started, but never became active in the
>> >> cluster. It just sat there after reading all the PGs. There were
>> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
>> >> downgrading to the Infernalis branch and still no luck getting the
>> >> OSDs to come up. The OSD processes were idle after the initial boot.
>> >> All packages were installed from gitbuilder.
>> >
>> > Did you chown -R ?
>> >
>> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>> >
>> > My guess is you only chowned the root dir, and the OSD didn't throw
>> > an error when it encountered the other files?  If you can generate a debug
>> > osd = 20 log, that would be helpful.. thanks!
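
(For reference, the recursive ownership change in those release notes amounts
to roughly the following, run with the OSDs stopped:

  chown -R ceph:ceph /var/lib/ceph
  chown -R ceph:ceph /var/log/ceph

followed by restarting the daemons, ideally with "debug osd = 20" set in
ceph.conf so the log requested above gets captured.)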
>> >
>> > sage
>> >
>> >
>> >>
>> >> Thanks,
>> >> -----BEGIN PGP SIGNATURE-----
>> >> Version: Mailvelope v1.2.0
>> >> Comment: https://www.mailvelope.com
>> >>
>> >> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>> >> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>> >> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>> >> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>> >> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>> >> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>> >> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>> >> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>> >> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>> >> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>> >> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>> >> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>> >> GdXC
>> >> =Aigq
>> >> -----END PGP SIGNATURE-----
>> >> ----------------
>> >> Robert LeBlanc
>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>
>> >>
>> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>> >> > -----BEGIN PGP SIGNED MESSAGE-----
>> >> > Hash: SHA256
>> >> >
>> >> > I have eight nodes running the fio job rbd_test_real to different RBD
>> >> > volumes. I've included the CRUSH map in the tarball.
>> >> >
>> >> > I stopped one OSD process and marked it out. I let it recover for a
>> >> > few minutes and then I started the process again and marked it in. I
>> >> > started getting block I/O messages during the recovery.
>> >> >
>> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>> >> >
>> >> > Thanks,
>> >> > -----BEGIN PGP SIGNATURE-----
>> >> > Version: Mailvelope v1.2.0
>> >> > Comment: https://www.mailvelope.com
>> >> >
>> >> > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>> >> > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>> >> > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>> >> > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>> >> > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>> >> > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>> >> > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>> >> > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>> >> > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>> >> > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>> >> > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>> >> > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>> >> > 3EPx
>> >> > =UDIV
>> >> > -----END PGP SIGNATURE-----
>> >> >
>> >> > ----------------
>> >> > Robert LeBlanc
>> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >
>> >> >
>> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >>> Hash: SHA256
>> >> >>>
>> >> >>> We are still struggling with this and have tried a lot of different
>> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>> >> >>> consulting services for non-Red Hat systems. If there are some
>> >>> certified Ceph consultants in the US with whom we can do both remote and
>> >> >>> on-site engagements, please let us know.
>> >> >>>
>> >> >>> This certainly seems to be network related, but somewhere in the
>> >> >>> kernel. We have tried increasing the network and TCP buffers, number
>> >> >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
>> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>> >> >>> at a time). There seems to be no reasonable explanation why I/O is
>> >> >>> blocked pretty frequently longer than 30 seconds. We have verified
>> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>> >> >>> network admins have verified that packets are not being dropped in the
>> >> >>> switches for these nodes. We have tried different kernels including
>> >>> the recent Google patch to cubic. This is showing up on three clusters
>> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>> >> >>> (from CentOS 7.1) with similar results.
>> >> >>>
>> >> >>> The messages seem slightly different:
>> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>> >> >>> 100.087155 secs
>> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
>> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>> >> >>> points reached
>> >> >>>
>> >> >>> I don't know what "no flag points reached" means.
>> >> >>
>> >> >> Just that the op hasn't been marked as reaching any interesting points
>> >> >> (op->mark_*() calls).
>> >> >>
>> >> >> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
>> >> >> It's extremely verbose but it'll let us see where the op is getting
>> >> >> blocked.  If you see the "slow request" message it means the op is
>> >> >> received by ceph (that's when the clock starts), so I suspect it's not
>> >> >> something we can blame on the network stack.
>> >> >>
>> >> >> sage
>> >> >>
>> >> >>
>> >> >>>
>> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
>> >> >>> of 13), we will have hundreds of I/O blocked for some times up to 300
>> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
>> >> >>> production cluster is very busy doing normally 8,000 I/O and peaking
>> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
>> >> >>> the load better across the disks, but we are having to do this 10 PGs
>> >> >>> at a time as we get blocked I/O. We have max_backfills and
>> >> >>> max_recovery set to 1, client op priority is set higher than recovery
>> >> >>> priority. We tried increasing the number of op threads but this didn't
>> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
>> >> >>> become active and could be the cause for slow I/O while the other PGs
>> >> >>> are being checked.
>> >> >>>
>> >> >>> What I don't understand is that the messages are delayed. As soon as
>> >> >>> the message is received by Ceph OSD process, it is very quickly
>> >> >>> committed to the journal and a response is sent back to the primary
>> >>> OSD which is received very quickly as well. I've adjusted
>> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
>> >> >>> of RAM per nodes for 10 OSDs.
>> >> >>>
>> >> >>> Is there something that could cause the kernel to get a packet but not
>> >>> be able to dispatch it to Ceph, such that it could explain why we
>> >>> are seeing these blocked I/Os for 30+ seconds? Are there some pointers
>> >> >>> to tracing Ceph messages from the network buffer through the kernel to
>> >> >>> the Ceph process?
>> >> >>>
>> >>> We can really use some pointers, no matter how outrageous. We have
>> >> >>> over 6 people looking into this for weeks now and just can't think of
>> >> >>> anything else.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> -----BEGIN PGP SIGNATURE-----
>> >> >>> Version: Mailvelope v1.1.0
>> >> >>> Comment: https://www.mailvelope.com
>> >> >>>
>> >> >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>> >> >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>> >> >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>> >> >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>> >> >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>> >> >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>> >> >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>> >> >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>> >> >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>> >> >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>> >> >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>> >> >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>> >> >>> l7OF
>> >> >>> =OI++
>> >> >>> -----END PGP SIGNATURE-----
>> >> >>> ----------------
>> >> >>> Robert LeBlanc
>> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >>>
>> >> >>>
>> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
>> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
>> >> >>> > last 12 hours). This makes me believe that there is some issue with
>> >> >>> > the number of sockets or some other TCP issue. We have not messed with
>> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>> >> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>> >> >>> > processes and 16K system wide.
>> >> >>> >
>> >> >>> > Does this seem like the right spot to be looking? What are some
>> >> >>> > configuration items we should be looking at?
>> >> >>> >
>> >> >>> > Thanks,
>> >> >>> > ----------------
>> >> >>> > Robert LeBlanc
>> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >>> >
>> >> >>> >
>> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >>> >> Hash: SHA256
>> >> >>> >>
>> >> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>> >> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>> >> >>> >> seems that there were some major reworks in the network handling in
>> >> >>> >> the kernel to efficiently handle that network rate. If I remember
>> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
>> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
>> >> >>> >> testing, but we could not tell where the dropping was happening. We
>> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>> >> >>> >> trying to congest things. We probably already saw this issue, just
>> >> >>> >> didn't know it.
>> >> >>> >> - ----------------
>> >> >>> >> Robert LeBlanc
>> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>> >> >>> >>> drivers might cause problems though.
>> >> >>> >>>
>> >> >>> >>> Here's ifconfig from one of the nodes:
>> >> >>> >>>
>> >>> >>> ens513f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20<link>
>> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>> >> >>> >>>
>> >> >>> >>> Mark
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>> >> >>> >>>>
>> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >>> >>>> Hash: SHA256
>> >> >>> >>>>
>> >> >>> >>>> OK, here is the update on the saga...
>> >> >>> >>>>
>> >> >>> >>>> I traced some more of blocked I/Os and it seems that communication
>> >> >>> >>>> between two hosts seemed worse than others. I did a two way ping flood
>> >> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>> >>> >>>> packets, no lost pings. Then I had the ping flood running while I
>> >>> >>>> put Ceph load on the cluster and the dropped pings started increasing;
>> >>> >>>> after stopping the Ceph workload the pings stopped dropping.
>> >> >>> >>>>
>> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
>> >>> >>>> ruled out Ceph to a large degree. I then booted into the
>> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>> >> >>> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>> >> >>> >>>> need the network enhancements in the 4.x series to work well.
>> >> >>> >>>>
>> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
>> >>> >>>> with this issue are running 4.x; other than that, they have pretty
>> >>> >>>> different hardware and network configs.
>> >> >>> >>>>
>> >> >>> >>>> Thanks,
>> >> >>> >>>> -----BEGIN PGP SIGNATURE-----
>> >> >>> >>>> Version: Mailvelope v1.1.0
>> >> >>> >>>> Comment: https://www.mailvelope.com
>> >> >>> >>>>
>> >> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>> >> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>> >> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>> >> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>> >> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>> >> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>> >> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>> >> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>> >> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>> >> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>> >> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>> >> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>> >> >>> >>>> 4OEo
>> >> >>> >>>> =P33I
>> >> >>> >>>> -----END PGP SIGNATURE-----
>> >> >>> >>>> ----------------
>> >> >>> >>>> Robert LeBlanc
>> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>> >> >>> >>>> wrote:
>> >> >>> >>>>>
>> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >>> >>>>> Hash: SHA256
>> >> >>> >>>>>
>> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There was some issues
>> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>> >>> >>>>> found that MTU under 32K worked reliably for ping, but still had the
>> >> >>> >>>>> blocked I/O.
>> >> >>> >>>>>
>> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>> >> >>> >>>>> the blocked I/O.
>> >> >>> >>>>> - ----------------
>> >> >>> >>>>> Robert LeBlanc
>> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>> >> >>> >>>>>>
>> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>> >> >>> >>>>>>>
>> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>> >> >>> >>>>>>> delayed for many 10s of seconds?
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>> >> >>> >>>>>> has
>> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>> >> >>> >>>>>>
>> >> >>> >>>>>> sage
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>> What kernel are you running?
>> >> >>> >>>>>>> -Sam
>> >> >>> >>>>>>>
>> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >>> >>>>>>>> Hash: SHA256
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
>> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
>> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>> >> >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>> >> >>> >>>>>>>> transfer).
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
>> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
>> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>> >> >>> >>>>>>>> thread.
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>> >> >>> >>>>>>>> some help.
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> Single Test started about
>> >> >>> >>>>>>>> 2015-09-22 12:52:36
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >> >>> >>>>>>>> 30.439150 secs
>> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>> >> >>> >>>>>>>>   currently waiting for subops from 13,16
>> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>> >> >>> >>>>>>>> 30.379680 secs
>> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>> >> >>> >>>>>>>> 12:55:06.406303:
>> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>> >> >>> >>>>>>>> 12:55:06.318144:
>> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>> >> >>> >>>>>>>>   currently waiting for subops from 13,14
>> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >> >>> >>>>>>>> 30.954212 secs
>> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>> >> >>> >>>>>>>>   currently waiting for subops from 16,17
>> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >> >>> >>>>>>>> 30.704367 secs
>> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> Server   IP addr              OSD
>> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> fio job:
>> >> >>> >>>>>>>> [rbd-test]
>> >> >>> >>>>>>>> readwrite=write
>> >> >>> >>>>>>>> blocksize=4M
>> >> >>> >>>>>>>> #runtime=60
>> >> >>> >>>>>>>> name=rbd-test
>> >> >>> >>>>>>>> #readwrite=randwrite
>> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>> >> >>> >>>>>>>> #rwmixread=72
>> >> >>> >>>>>>>> #norandommap
>> >> >>> >>>>>>>> #size=1T
>> >> >>> >>>>>>>> #blocksize=4k
>> >> >>> >>>>>>>> ioengine=rbd
>> >> >>> >>>>>>>> rbdname=test2
>> >> >>> >>>>>>>> pool=rbd
>> >> >>> >>>>>>>> clientname=admin
>> >> >>> >>>>>>>> iodepth=8
>> >> >>> >>>>>>>> #numjobs=4
>> >> >>> >>>>>>>> #thread
>> >> >>> >>>>>>>> #group_reporting
>> >> >>> >>>>>>>> #time_based
>> >> >>> >>>>>>>> #direct=1
>> >> >>> >>>>>>>> #ramp_time=60
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> Thanks,
>> >> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>> >> >>> >>>>>>>> Version: Mailvelope v1.1.0
>> >> >>> >>>>>>>> Comment: https://www.mailvelope.com
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>> >> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>> >> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>> >> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>> >> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>> >> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>> >> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>> >> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>> >> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>> >> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>> >> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>> >> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>> >> >>> >>>>>>>> J3hS
>> >> >>> >>>>>>>> =0J7F
>> >> >>> >>>>>>>> -----END PGP SIGNATURE-----
>> >> >>> >>>>>>>> ----------------
>> >> >>> >>>>>>>> Robert LeBlanc
>> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>> >> >>> >>>>>>>>>>
>> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >>> >>>>>>>>>> Hash: SHA256
>> >> >>> >>>>>>>>>>
>> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
>> >> >>> >>>>>>>>>>
>> >> >>> >>>>>>>>>> I'm not
>> >>> >>>>>>>>>> seeing much I/O or CPU usage during these times. Is there some way to
>> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>> >> >>> >>>>>>>>> this, it was discussed not too long ago.
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>> >> >>> >>>>>>>>>>
>> >> >>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>> >> >>> >>>>>>>>>> the
>> >>> >>>>>>>>>> beginning) is still causing the issue. I suspect that it is not
>> >>> >>>>>>>>>> having to create new files and therefore split collections. This is
>> >> >>> >>>>>>>>>> on
>> >> >>> >>>>>>>>>> my test cluster with no other load.
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>>>
>> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>> >> >>> >>>>>>>>> 20",
>> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>> >> >>> >>>>>>>>> out
>> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>> >> >>> >>>>>>>>> -Greg
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> --
>> >> >>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> >> >>> >>>>>>>> in
>> >> >>> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >> >>> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>
>> >> >>> >>>>> -----BEGIN PGP SIGNATURE-----
>> >> >>> >>>>> Version: Mailvelope v1.1.0
>> >> >>> >>>>> Comment: https://www.mailvelope.com
>> >> >>> >>>>>
>> >> >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>> >> >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>> >> >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>> >> >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>> >> >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>> >> >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>> >> >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>> >> >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>> >> >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>> >> >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>> >> >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>> >> >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>> >> >>> >>>>> gcZm
>> >> >>> >>>>> =CjwB
>> >> >>> >>>>> -----END PGP SIGNATURE-----
>> >> >>> >>>>
>> >> >>> >>>> --
>> >> >>> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> >>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >> >>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >>> >>>>
>> >> >>> >>>
>> >> >>> >>
>> >> >>> >> -----BEGIN PGP SIGNATURE-----
>> >> >>> >> Version: Mailvelope v1.1.0
>> >> >>> >> Comment: https://www.mailvelope.com
>> >> >>> >>
>> >> >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>> >> >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>> >> >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>> >> >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>> >> >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>> >> >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>> >> >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>> >> >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>> >> >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>> >> >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>> >> >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>> >> >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>> >> >>> >> ae22
>> >> >>> >> =AX+L
>> >> >>> >> -----END PGP SIGNATURE-----
>> >> >>> _______________________________________________
>> >> >>> ceph-users mailing list
>> >> >>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >>>
>> >> >>>
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
/PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
6Kfk
=/gR6
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [ceph-users] Potential OSD deadlock?
  2015-10-06 19:06                                                                             ` Robert LeBlanc
@ 2015-10-06 19:34                                                                               ` Sage Weil
       [not found]                                                                                 ` <alpine.DEB.2.00.1510061232360.32037-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Sage Weil @ 2015-10-06 19:34 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-users, ceph-devel

On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> I can't think of anything. In my dev cluster the only thing that has
> changed is the Ceph versions (no reboot). What I like is even though
> the disks are 100% utilized, it is performing as I expect now. Client
> I/O is slightly degraded during the recovery, but no blocked I/O when
> the OSD boots or during the recovery period. This is with
> max_backfills set to 20; one backfill max in our production cluster is
> painful on OSD boot/recovery. I was able to reproduce this issue on
> our dev cluster very easily and very quickly with these settings. So
> far, two tests and an hour later, the only blocked I/O is when the OSD is
> marked out. We would love to see that go away too, but this is far
                                            (me too!)
> better than what we have now. This dev cluster also has
> osd_client_message_cap set to the default (100).
> 
> We need to stay on the Hammer version of Ceph and I'm willing to take
> the time to bisect this. If this is not a problem in Firefly/Giant,
> would you prefer a bisect to find the introduction of the problem
> (Firefly/Giant -> Hammer) or the introduction of the resolution
> (Hammer -> Infernalis)? Do you have any hints for avoiding commits that
> prevent a clean build, as that is my most limiting factor?

Nothing comes to mind.  I think the best way to find this is still to see 
it happen in the logs with hammer.  The frustrating thing with that log 
dump you sent is that although I see plenty of slow request warnings in 
the osd logs, I don't see the requests arriving.  Maybe the logs weren't 
turned up for long enough?
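
Something along these lines on the affected OSDs, turned on before the test
starts and left in place until after the first slow request warning appears,
should be enough (injectargs avoids a restart; exact levels are flexible):

  ceph tell osd.\* injectargs '--debug_osd 20 --debug_ms 20'
  # ... reproduce the blocked I/O and note the slow request time ...
  ceph tell osd.\* injectargs '--debug_osd 0/5 --debug_ms 0/5'   # back down

That way the arrival of the op should show up in the ms log on the replica.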

sage



> Thanks,
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
> >> messages when the OSD was marked out:
> >>
> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
> >> 34.476006 secs
> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
> >> cluster [WRN] slow request 32.913474 seconds old, received at
> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
> >> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
> >> cluster [WRN] slow request 32.697545 seconds old, received at
> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
> >> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
> >> cluster [WRN] slow request 32.668006 seconds old, received at
> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
> >> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >>
> >> But I'm not seeing the blocked messages when the OSD came back in. The
> >> OSD spindles have been running at 100% during this test. I have seen
> >> slowed I/O from the clients as expected from the extra load, but so
> >> far no blocked messages. I'm going to run some more tests.
> >
> > Good to hear.
> >
> > FWIW I looked through the logs and all of the slow request no flag point
> > messages came from osd.163... and the logs don't show when they arrived.
> > My guess is this OSD has a slower disk than the others, or something else
> > funny is going on?
> >
> > I spot checked another OSD at random (60) where I saw a slow request.  It
> > was stuck peering for 10s of seconds... waiting on a pg log message from
> > osd.163.
> >
> > sage
> >
> >
> >>
> >> -----BEGIN PGP SIGNATURE-----
> >> Version: Mailvelope v1.2.0
> >> Comment: https://www.mailvelope.com
> >>
> >> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
> >> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
> >> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
> >> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
> >> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
> >> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
> >> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
> >> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
> >> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
> >> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
> >> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
> >> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
> >> fo5a
> >> =ahEi
> >> -----END PGP SIGNATURE-----
> >> ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
> >> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> >> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> >> Hash: SHA256
> >> >>
> >> >> With some off-list help, we have adjusted
> >> >> osd_client_message_cap=10000. This seems to have helped a bit and we
> >> >> have seen some OSDs have a value up to 4,000 for client messages. But
> >> >> it does not solve the problem with the blocked I/O.
> >> >>
> >> >> One thing that I have noticed is that almost exactly 30 seconds elapse
> >> >> between when an OSD boots and the first blocked I/O message. I don't know
> >> >> if the OSD doesn't have time to get its brain right about a PG before
> >> >> it starts servicing it or what exactly.
> >> >
> >> > I'm downloading the logs from yesterday now; sorry it's taking so long.
> >> >
> >> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
> >> >> to master and things didn't go so well. The OSDs would not start
> >> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
> >> >> and all OSDs and the OSD then started, but never became active in the
> >> >> cluster. It just sat there after reading all the PGs. There were
> >> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
> >> >> downgrading to the Infernalis branch and still no luck getting the
> >> >> OSDs to come up. The OSD processes were idle after the initial boot.
> >> >> All packages were installed from gitbuilder.
> >> >
> >> > Did you chown -R ?
> >> >
> >> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
> >> >
> >> > My guess is you only chowned the root dir, and the OSD didn't throw
> >> > an error when it encountered the other files?  If you can generate a debug
> >> > osd = 20 log, that would be helpful.. thanks!
> >> >
> >> > sage
> >> >
> >> >
> >> >>
> >> >> Thanks,
> >> >> -----BEGIN PGP SIGNATURE-----
> >> >> Version: Mailvelope v1.2.0
> >> >> Comment: https://www.mailvelope.com
> >> >>
> >> >> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
> >> >> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
> >> >> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
> >> >> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
> >> >> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
> >> >> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
> >> >> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
> >> >> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
> >> >> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
> >> >> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
> >> >> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
> >> >> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
> >> >> GdXC
> >> >> =Aigq
> >> >> -----END PGP SIGNATURE-----
> >> >> ----------------
> >> >> Robert LeBlanc
> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >>
> >> >>
> >> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
> >> >> > Hash: SHA256
> >> >> >
> >> >> > I have eight nodes running the fio job rbd_test_real to different RBD
> >> >> > volumes. I've included the CRUSH map in the tarball.
> >> >> >
> >> >> > I stopped one OSD process and marked it out. I let it recover for a
> >> >> > few minutes and then I started the process again and marked it in. I
> >> >> > started getting block I/O messages during the recovery.
> >> >> >
> >> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
> >> >> >
> >> >> > Thanks,
> >> >> > -----BEGIN PGP SIGNATURE-----
> >> >> > Version: Mailvelope v1.2.0
> >> >> > Comment: https://www.mailvelope.com
> >> >> >
> >> >> > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
> >> >> > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
> >> >> > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
> >> >> > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
> >> >> > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
> >> >> > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
> >> >> > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
> >> >> > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
> >> >> > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
> >> >> > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
> >> >> > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
> >> >> > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
> >> >> > 3EPx
> >> >> > =UDIV
> >> >> > -----END PGP SIGNATURE-----
> >> >> >
> >> >> > ----------------
> >> >> > Robert LeBlanc
> >> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >> >
> >> >> >
> >> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
> >> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
> >> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >> >>> Hash: SHA256
> >> >> >>>
> >> >> >>> We are still struggling with this and have tried a lot of different
> >> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
> >> >> >>> consulting services for non-Red Hat systems. If there are some
> >> >> >>> certified Ceph consultants in the US with whom we can do both remote and
> >> >> >>> on-site engagements, please let us know.
> >> >> >>>
> >> >> >>> This certainly seems to be network related, but somewhere in the
> >> >> >>> kernel. We have tried increasing the network and TCP buffers, number
> >> >> >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
> >> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
> >> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
> >> >> >>> at a time). There seems to be no reasonable explanation why I/O is
> >> >> >>> blocked pretty frequently longer than 30 seconds. We have verified
> >> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> >> >> >>> network admins have verified that packets are not being dropped in the
> >> >> >>> switches for these nodes. We have tried different kernels including
> >> >> >>> the recent Google patch to cubic. This is showing up on three clusters
> >> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> >> >> >>> (from CentOS 7.1) with similar results.
> >> >> >>>
> >> >> >>> The messages seem slightly different:
> >> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> >> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> >> >> >>> 100.087155 secs
> >> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> >> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
> >> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> >> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
> >> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> >> >> >>> points reached
> >> >> >>>
> >> >> >>> I don't know what "no flag points reached" means.
> >> >> >>
> >> >> >> Just that the op hasn't been marked as reaching any interesting points
> >> >> >> (op->mark_*() calls).
> >> >> >>
> >> >> >> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
> >> >> >> It's extremely verbose but it'll let us see where the op is getting
> >> >> >> blocked.  If you see the "slow request" message it means the op is
> >> >> >> received by ceph (that's when the clock starts), so I suspect it's not
> >> >> >> something we can blame on the network stack.
> >> >> >>
> >> >> >> sage
> >> >> >>
> >> >> >>
> >> >> >>>
> >> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
> >> >> >>> of 13), we will have hundreds of I/O blocked for some times up to 300
> >> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
> >> >> >>> production cluster is very busy doing normally 8,000 I/O and peaking
> >> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
> >> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
> >> >> >>> the load better across the disks, but we are having to do this 10 PGs
> >> >> >>> at a time as we get blocked I/O. We have max_backfills and
> >> >> >>> max_recovery set to 1, client op priority is set higher than recovery
> >> >> >>> priority. We tried increasing the number of op threads but this didn't
> >> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
> >> >> >>> become active and could be the cause for slow I/O while the other PGs
> >> >> >>> are being checked.
> >> >> >>>
> >> >> >>> What I don't understand is that the messages are delayed. As soon as
> >> >> >>> the message is received by Ceph OSD process, it is very quickly
> >> >> >>> committed to the journal and a response is sent back to the primary
> >> >> >>> OSD which is received very quickly as well. I've adjusted
> >> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
> >> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
> >> >> >>> of RAM per nodes for 10 OSDs.
> >> >> >>>
> >> >> >>> Is there something that could cause the kernel to get a packet but not
> >> >> >>> be able to dispatch it to Ceph, such that it could explain why we
> >> >> >>> are seeing these blocked I/Os for 30+ seconds? Are there some pointers
> >> >> >>> to tracing Ceph messages from the network buffer through the kernel to
> >> >> >>> the Ceph process?
> >> >> >>>
> >> >> >>> We can really use some pointers, no matter how outrageous. We have
> >> >> >>> over 6 people looking into this for weeks now and just can't think of
> >> >> >>> anything else.
> >> >> >>>
> >> >> >>> Thanks,
> >> >> >>> -----BEGIN PGP SIGNATURE-----
> >> >> >>> Version: Mailvelope v1.1.0
> >> >> >>> Comment: https://www.mailvelope.com
> >> >> >>>
> >> >> >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
> >> >> >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
> >> >> >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
> >> >> >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
> >> >> >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
> >> >> >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
> >> >> >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
> >> >> >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
> >> >> >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
> >> >> >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
> >> >> >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
> >> >> >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
> >> >> >>> l7OF
> >> >> >>> =OI++
> >> >> >>> -----END PGP SIGNATURE-----
> >> >> >>> ----------------
> >> >> >>> Robert LeBlanc
> >> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >> >>>
> >> >> >>>
> >> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
> >> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
> >> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
> >> >> >>> > last 12 hours). This makes me believe that there is some issue with
> >> >> >>> > the number of sockets or some other TCP issue. We have not messed with
> >> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> >> >> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> >> >> >>> > processes and 16K system wide.
> >> >> >>> >
> >> >> >>> > Does this seem like the right spot to be looking? What are some
> >> >> >>> > configuration items we should be looking at?
> >> >> >>> >
> >> >> >>> > Thanks,
> >> >> >>> > ----------------
> >> >> >>> > Robert LeBlanc
> >> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
> >> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> >> >>> >> Hash: SHA256
> >> >> >>> >>
> >> >> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
> >> >> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
> >> >> >>> >> seems that there were some major reworks in the network handling in
> >> >> >>> >> the kernel to efficiently handle that network rate. If I remember
> >> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
> >> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
> >> >> >>> >> testing, but we could not tell where the dropping was happening. We
> >> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
> >> >> >>> >> trying to congest things. We probably already saw this issue, just
> >> >> >>> >> didn't know it.
> >> >> >>> >> - ----------------
> >> >> >>> >> Robert LeBlanc
> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> >> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
> >> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> >> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
> >> >> >>> >>> drivers might cause problems though.
> >> >> >>> >>>
> >> >> >>> >>> Here's ifconfig from one of the nodes:
> >> >> >>> >>>
> >> >> >>> >>> ens513f1: flags=4163  mtu 1500
> >> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
> >> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
> >> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> >> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
> >> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
> >> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
> >> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >> >> >>> >>>
> >> >> >>> >>> Mark
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >> >> >>> >>>>
> >> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >> >>> >>>> Hash: SHA256
> >> >> >>> >>>>
> >> >> >>> >>>> OK, here is the update on the saga...
> >> >> >>> >>>>
> >> >> >>> >>>> I traced some more of the blocked I/Os and it seems that communication
> >> >> >>> >>>> between two hosts was worse than between the others. I did a two-way ping flood
> >> >> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
> >> >> >>> >>>> packets, no lost pings. I then had the ping flood running while I
> >> >> >>> >>>> put Ceph load on the cluster and the dropped pings started increasing;
> >> >> >>> >>>> after stopping the Ceph workload the pings stopped dropping.
> >> >> >>> >>>>
> >> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
> >> >> >>> >>>> ruled out Ceph to a large degree. I then booted into the
> >> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far there
> >> >> >>> >>>> haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
> >> >> >>> >>>> need the network enhancements in the 4.x series to work well.
> >> >> >>> >>>>
> >> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
> >> >> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
> >> >> >>> >>>> with this issue are running 4.x; other than that, they have pretty
> >> >> >>> >>>> different hardware and network configs.
> >> >> >>> >>>>
> >> >> >>> >>>> Thanks,
> >> >> >>> >>>> -----BEGIN PGP SIGNATURE-----
> >> >> >>> >>>> Version: Mailvelope v1.1.0
> >> >> >>> >>>> Comment: https://www.mailvelope.com
> >> >> >>> >>>>
> >> >> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
> >> >> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
> >> >> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
> >> >> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
> >> >> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
> >> >> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
> >> >> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
> >> >> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
> >> >> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
> >> >> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
> >> >> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
> >> >> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
> >> >> >>> >>>> 4OEo
> >> >> >>> >>>> =P33I
> >> >> >>> >>>> -----END PGP SIGNATURE-----
> >> >> >>> >>>> ----------------
> >> >> >>> >>>> Robert LeBlanc
> >> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >> >>> >>>>
> >> >> >>> >>>>
> >> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> >> >> >>> >>>> wrote:
> >> >> >>> >>>>>
> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >> >>> >>>>> Hash: SHA256
> >> >> >>> >>>>>
> >> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
> >> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
> >> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
> >> >> >>> >>>>> found that an MTU under 32K worked reliably for ping, but still had the
> >> >> >>> >>>>> blocked I/O.
> >> >> >>> >>>>>
> >> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
> >> >> >>> >>>>> the blocked I/O.
> >> >> >>> >>>>> - ----------------
> >> >> >>> >>>>> Robert LeBlanc
> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >> >>> >>>>>
> >> >> >>> >>>>>
> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
> >> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
> >> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> >> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
> >> >> >>> >>>>>>> delayed for many 10s of seconds?
> >> >> >>> >>>>>>
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
> >> >> >>> >>>>>> has
> >> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> sage
> >> >> >>> >>>>>>
> >> >> >>> >>>>>>
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>> What kernel are you running?
> >> >> >>> >>>>>>> -Sam
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >> >>> >>>>>>>> Hash: SHA256
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
> >> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
> >> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
> >> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
> >> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> >> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> >> >> >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
> >> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
> >> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
> >> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
> >> >> >>> >>>>>>>> transfer).
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
> >> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
> >> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
> >> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
> >> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
> >> >> >>> >>>>>>>> thread.
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
> >> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
> >> >> >>> >>>>>>>> some help.
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> Single Test started about
> >> >> >>> >>>>>>>> 2015-09-22 12:52:36
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >> >> >>> >>>>>>>> 30.439150 secs
> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >> >> >>> >>>>>>>>   currently waiting for subops from 13,16
> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> >> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> >> >> >>> >>>>>>>> 30.379680 secs
> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> >> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> >> >> >>> >>>>>>>> 12:55:06.406303:
> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> >> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> >> >> >>> >>>>>>>> 12:55:06.318144:
> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >> >> >>> >>>>>>>>   currently waiting for subops from 13,14
> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >> >> >>> >>>>>>>> 30.954212 secs
> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >> >> >>> >>>>>>>>   currently waiting for subops from 16,17
> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >> >> >>> >>>>>>>> 30.704367 secs
> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> Server   IP addr              OSD
> >> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
> >> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
> >> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
> >> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
> >> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
> >> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> fio job:
> >> >> >>> >>>>>>>> [rbd-test]
> >> >> >>> >>>>>>>> readwrite=write
> >> >> >>> >>>>>>>> blocksize=4M
> >> >> >>> >>>>>>>> #runtime=60
> >> >> >>> >>>>>>>> name=rbd-test
> >> >> >>> >>>>>>>> #readwrite=randwrite
> >> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> >> >> >>> >>>>>>>> #rwmixread=72
> >> >> >>> >>>>>>>> #norandommap
> >> >> >>> >>>>>>>> #size=1T
> >> >> >>> >>>>>>>> #blocksize=4k
> >> >> >>> >>>>>>>> ioengine=rbd
> >> >> >>> >>>>>>>> rbdname=test2
> >> >> >>> >>>>>>>> pool=rbd
> >> >> >>> >>>>>>>> clientname=admin
> >> >> >>> >>>>>>>> iodepth=8
> >> >> >>> >>>>>>>> #numjobs=4
> >> >> >>> >>>>>>>> #thread
> >> >> >>> >>>>>>>> #group_reporting
> >> >> >>> >>>>>>>> #time_based
> >> >> >>> >>>>>>>> #direct=1
> >> >> >>> >>>>>>>> #ramp_time=60
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> Thanks,
> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
> >> >> >>> >>>>>>>> Version: Mailvelope v1.1.0
> >> >> >>> >>>>>>>> Comment: https://www.mailvelope.com
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
> >> >> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
> >> >> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
> >> >> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
> >> >> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
> >> >> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
> >> >> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
> >> >> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
> >> >> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
> >> >> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
> >> >> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
> >> >> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
> >> >> >>> >>>>>>>> J3hS
> >> >> >>> >>>>>>>> =0J7F
> >> >> >>> >>>>>>>> -----END PGP SIGNATURE-----
> >> >> >>> >>>>>>>> ----------------
> >> >> >>> >>>>>>>> Robert LeBlanc
> >> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
> >> >> >>> >>>>>>>>>>
> >> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >> >> >>> >>>>>>>>>> Hash: SHA256
> >> >> >>> >>>>>>>>>>
> >> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
> >> >> >>> >>>>>>>>>>
> >> >> >>> >>>>>>>>>> I'm not
> >> >> >>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
> >> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
> >> >> >>> >>>>>>>>> this, it was discussed not too long ago.
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
> >> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
> >> >> >>> >>>>>>>>>>
> >> >> >>> >>>>>>>>>> The fio test is just a seq write; starting it over (rewriting from
> >> >> >>> >>>>>>>>>> the beginning) is still causing the issue. I suspect that it is not
> >> >> >>> >>>>>>>>>> having to create new files and therefore not splitting collections.
> >> >> >>> >>>>>>>>>> This is on my test cluster with no other load.
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
> >> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
> >> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>>>
> >> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
> >> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
> >> >> >>> >>>>>>>>> 20",
> >> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
> >> >> >>> >>>>>>>>> out
> >> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
> >> >> >>> >>>>>>>>> -Greg
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> --
> >> >> >>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> >> >>> >>>>>>>> in
> >> >> >>> >>>>>>>> the body of a message to majordomo@vger.kernel.org
> >> >> >>> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>
> >> >> >>> >>>>> -----BEGIN PGP SIGNATURE-----
> >> >> >>> >>>>> Version: Mailvelope v1.1.0
> >> >> >>> >>>>> Comment: https://www.mailvelope.com
> >> >> >>> >>>>>
> >> >> >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
> >> >> >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
> >> >> >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
> >> >> >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
> >> >> >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
> >> >> >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
> >> >> >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
> >> >> >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
> >> >> >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
> >> >> >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
> >> >> >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
> >> >> >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
> >> >> >>> >>>>> gcZm
> >> >> >>> >>>>> =CjwB
> >> >> >>> >>>>> -----END PGP SIGNATURE-----
> >> >> >>> >>>>
> >> >> >>> >>>> --
> >> >> >>> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> >> >>> >>>> the body of a message to majordomo@vger.kernel.org
> >> >> >>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >>> >>>>
> >> >> >>> >>>
> >> >> >>> >>
> >> >> >>> >> -----BEGIN PGP SIGNATURE-----
> >> >> >>> >> Version: Mailvelope v1.1.0
> >> >> >>> >> Comment: https://www.mailvelope.com
> >> >> >>> >>
> >> >> >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
> >> >> >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
> >> >> >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
> >> >> >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
> >> >> >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
> >> >> >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
> >> >> >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
> >> >> >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
> >> >> >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
> >> >> >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
> >> >> >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
> >> >> >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
> >> >> >>> >> ae22
> >> >> >>> >> =AX+L
> >> >> >>> >> -----END PGP SIGNATURE-----
> >> >> >>> _______________________________________________
> >> >> >>> ceph-users mailing list
> >> >> >>> ceph-users@lists.ceph.com
> >> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >>>
> >> >> >>>
> >> >> _______________________________________________
> >> >> ceph-users mailing list
> >> >> ceph-users@lists.ceph.com
> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>
> >> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> 
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
> 6Kfk
> =/gR6
> -----END PGP SIGNATURE-----
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                 ` <alpine.DEB.2.00.1510061232360.32037-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-10-06 20:00                                                                                   ` Robert LeBlanc
       [not found]                                                                                     ` <CAANLjFqnhS5fYdS_2h-5hz0x_TiyYjhMUCZSdc9991kdAzxeqQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-06 20:00 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I'll capture another set of logs. Is there any other debugging you
want turned up? I've seen the same thing where I see the message
dispatched to the secondary OSD, but the message just doesn't show up
for 30+ seconds in the secondary OSD logs.
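
For the next capture I plan to turn the OSD logging up to the levels
suggested earlier in the thread, roughly like this (a sketch; adjust as
needed):

  # at runtime, on each OSD host
  ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 20 --debug_filestore 20'

  # or persistently in the [osd] section of ceph.conf before restarting
  debug osd = 20
  debug ms = 20
  debug filestore = 20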
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> I can't think of anything. In my dev cluster the only thing that has
>> changed is the Ceph version (no reboot). What I like is that even though
>> the disks are 100% utilized, it is performing as I expect now. Client
>> I/O is slightly degraded during the recovery, but there is no blocked I/O when
>> the OSD boots or during the recovery period. This is with
>> max_backfills set to 20; with a max of one backfill, our production cluster is
>> painful on OSD boot/recovery. I was able to reproduce this issue on
>> our dev cluster very easily and very quickly with these settings. So
>> far two tests and an hour later, only the blocked I/O when the OSD is
>> marked out. We would love to see that go away too, but this is far
>                                             (me too!)
>> better than what we have now. This dev cluster also has
>> osd_client_message_cap set to default (100).
>>
>> We need to stay on the Hammer version of Ceph and I'm willing to take
>> the time to bisect this. If this is not a problem in Firefly/Giant, would
>> you prefer a bisect to find the introduction of the problem
>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>> (Hammer -> Infernalis)? Do you have some hints for reducing the chance of
>> hitting a commit that prevents a clean build, as that is my most limiting factor?
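>>
>> In case it helps, this is the rough shape of the bisect I have in mind
>> (just a sketch; I still need to double-check the tag names against the
>> ceph tree):
>>
>>   git clone https://github.com/ceph/ceph.git && cd ceph
>>   git bisect start
>>   git bisect bad v0.94     # Hammer, where we see the blocked I/O
>>   git bisect good v0.87    # Giant, assuming it is clean there
>>   # build/install, run the fio workload, then mark each step:
>>   git bisect good          # or: git bisect bad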
>
> Nothing comes to mind.  I think the best way to find this is still to see
> it happen in the logs with hammer.  The frustrating thing with that log
> dump you sent is that although I see plenty of slow request warnings in
> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
> turned up for long enough?
>
> sage
>
>
>
>> Thanks,
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >> Hash: SHA256
>> >>
>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>> >> messages when the OSD was marked out:
>> >>
>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>> >> 34.476006 secs
>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>> >> cluster [WRN] slow request 32.913474 seconds old, received at
>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>> >> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>> >> cluster [WRN] slow request 32.697545 seconds old, received at
>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>> >> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>> >> cluster [WRN] slow request 32.668006 seconds old, received at
>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>> >> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >>
>> >> But I'm not seeing the blocked messages when the OSD came back in. The
>> >> OSD spindles have been running at 100% during this test. I have seen
>> >> slowed I/O from the clients as expected from the extra load, but so
>> >> far no blocked messages. I'm going to run some more tests.
>> >
>> > Good to hear.
>> >
>> > FWIW I looked through the logs and all of the slow request no flag point
>> > messages came from osd.163... and the logs don't show when they arrived.
>> > My guess is this OSD has a slower disk than the others, or something else
>> > funny is going on?
>> >
>> > I spot checked another OSD at random (60) where I saw a slow request.  It
>> > was stuck peering for 10s of seconds... waiting on a pg log message from
>> > osd.163.
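>> >
>> > (If it happens again while you're watching, "ceph health detail" plus
>> > "ceph pg <pgid> query" on one of the stuck PGs should show which OSD
>> > the peering is waiting on.)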
>> >
>> > sage
>> >
>> >
>> >>
>> >> -----BEGIN PGP SIGNATURE-----
>> >> Version: Mailvelope v1.2.0
>> >> Comment: https://www.mailvelope.com
>> >>
>> >> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>> >> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>> >> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>> >> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>> >> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>> >> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>> >> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>> >> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>> >> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>> >> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>> >> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>> >> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>> >> fo5a
>> >> =ahEi
>> >> -----END PGP SIGNATURE-----
>> >> ----------------
>> >> Robert LeBlanc
>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>
>> >>
>> >> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>> >> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>> >> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >> Hash: SHA256
>> >> >>
>> >> >> With some off-list help, we have adjusted
>> >> >> osd_client_message_cap=10000. This seems to have helped a bit and we
>> >> >> have seen some OSDs have a value up to 4,000 for client messages. But
>> >> >> it does not solve the problem with the blocked I/O.
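>> >> >>
>> >> >> (That is just the config option, e.g. in the [osd] section of
>> >> >> ceph.conf:
>> >> >>
>> >> >>     osd client message cap = 10000
>> >> >>
>> >> >> followed by an OSD restart -- a sketch of what we did rather than the
>> >> >> exact steps.)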
>> >> >>
>> >> >> One thing that I have noticed is that almost exactly 30 seconds elapse
>> >> >> between when an OSD boots and the first blocked I/O message. I don't know
>> >> >> if the OSD doesn't have time to get its brain right about a PG before
>> >> >> it starts servicing it or what exactly.
>> >> >
>> >> > I'm downloading the logs from yesterday now; sorry it's taking so long.
>> >> >
>> >> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
>> >> >> to master and things didn't go so well. The OSDs would not start
>> >> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
>> >> >> on all OSDs and the OSDs then started, but they never became active in the
>> >> >> cluster. They just sat there after reading all the PGs. There were
>> >> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
>> >> >> downgrading to the Infernalis branch and still no luck getting the
>> >> >> OSDs to come up. The OSD processes were idle after the initial boot.
>> >> >> All packages were installed from gitbuilder.
>> >> >
>> >> > Did you chown -R ?
>> >> >
>> >> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
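>> >> >
>> >> > (i.e. something along the lines of
>> >> >
>> >> >         chown -R ceph:ceph /var/lib/ceph
>> >> >
>> >> > on each OSD host, per that page)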
>> >> >
>> >> > My guess is you only chowned the root dir, and the OSD didn't throw
>> >> > an error when it encountered the other files?  If you can generate a debug
>> >> > osd = 20 log, that would be helpful.. thanks!
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >>
>> >> >> Thanks,
>> >> >> -----BEGIN PGP SIGNATURE-----
>> >> >> Version: Mailvelope v1.2.0
>> >> >> Comment: https://www.mailvelope.com
>> >> >>
>> >> >> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>> >> >> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>> >> >> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>> >> >> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>> >> >> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>> >> >> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>> >> >> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>> >> >> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>> >> >> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>> >> >> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>> >> >> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>> >> >> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>> >> >> GdXC
>> >> >> =Aigq
>> >> >> -----END PGP SIGNATURE-----
>> >> >> ----------------
>> >> >> Robert LeBlanc
>> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >>
>> >> >>
>> >> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
>> >> >> > Hash: SHA256
>> >> >> >
>> >> >> > I have eight nodes running the fio job rbd_test_real to different RBD
>> >> >> > volumes. I've included the CRUSH map in the tarball.
>> >> >> >
>> >> >> > I stopped one OSD process and marked it out. I let it recover for a
>> >> >> > few minutes and then I started the process again and marked it in. I
>> >> >> > started getting blocked I/O messages during the recovery.
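>> >> >> >
>> >> >> > (Roughly this sequence, from memory rather than a transcript:
>> >> >> > stop the ceph-osd daemon on that host, "ceph osd out N", wait a
>> >> >> > few minutes, start the daemon again, then "ceph osd in N".)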
>> >> >> >
>> >> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>> >> >> >
>> >> >> > Thanks,
>> >> >> > -----BEGIN PGP SIGNATURE-----
>> >> >> > Version: Mailvelope v1.2.0
>> >> >> > Comment: https://www.mailvelope.com
>> >> >> >
>> >> >> > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>> >> >> > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>> >> >> > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>> >> >> > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>> >> >> > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>> >> >> > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>> >> >> > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>> >> >> > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>> >> >> > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>> >> >> > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>> >> >> > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>> >> >> > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>> >> >> > 3EPx
>> >> >> > =UDIV
>> >> >> > -----END PGP SIGNATURE-----
>> >> >> >
>> >> >> > ----------------
>> >> >> > Robert LeBlanc
>> >> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >> >
>> >> >> >
>> >> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>> >> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>> >> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >> >>> Hash: SHA256
>> >> >> >>>
>> >> >> >>> We are still struggling with this and have tried a lot of different
>> >> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>> >> >> >>> consulting services for non-Red Hat systems. If there are some
>> >> >> >>> certified Ceph consultants in the US with whom we can do both remote and
>> >> >> >>> on-site engagements, please let us know.
>> >> >> >>>
>> >> >> >>> This certainly seems to be network related, but somewhere in the
>> >> >> >>> kernel. We have tried increasing the network and TCP buffers and the
>> >> >> >>> number of TCP sockets, and reducing the time spent in the FIN_WAIT2
>> >> >> >>> state. There is about 25% idle
>> >> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
>> >> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>> >> >> >>> at a time). There seems to be no reasonable explanation why I/O is
>> >> >> >>> blocked pretty frequently longer than 30 seconds. We have verified
>> >> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>> >> >> >>> network admins have verified that packets are not being dropped in the
>> >> >> >>> switches for these nodes. We have tried different kernels including
>> >> >> >>> the recent Google patch to cubic. This is showing up on three clusters
>> >> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>> >> >> >>> (from CentOS 7.1) with similar results.
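>> >> >> >>>
>> >> >> >>> (The jumbo frame check was something like "ping -M do -s 8972 <peer>"
>> >> >> >>> between each pair of nodes, so an MTU mismatch or fragmentation
>> >> >> >>> should have shown up as failures; the exact flags are from memory.)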
>> >> >> >>>
>> >> >> >>> The messages seem slightly different:
>> >> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>> >> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>> >> >> >>> 100.087155 secs
>> >> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>> >> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
>> >> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>> >> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>> >> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>> >> >> >>> points reached
>> >> >> >>>
>> >> >> >>> I don't know what "no flag points reached" means.
>> >> >> >>
>> >> >> >> Just that the op hasn't been marked as reaching any interesting points
>> >> >> >> (op->mark_*() calls).
>> >> >> >>
>> >> >> >> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
>> >> >> >> It's extremely verbose but it'll let us see where the op is getting
>> >> >> >> blocked.  If you see the "slow request" message it means the op is
>> >> >> >> received by ceph (that's when the clock starts), so I suspect it's not
>> >> >> >> something we can blame on the network stack.
>> >> >> >>
>> >> >> >> sage
>> >> >> >>
>> >> >> >>
>> >> >> >>>
>> >> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
>> >> >> >>> of 13); we will have hundreds of I/Os blocked, sometimes for up to 300
>> >> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
>> >> >> >>> production cluster is very busy doing normally 8,000 I/O and peaking
>> >> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>> >> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
>> >> >> >>> the load better across the disks, but we are having to do this 10 PGs
>> >> >> >>> at a time as we get blocked I/O. We have max_backfills and
>> >> >> >>> max_recovery set to 1, client op priority is set higher than recovery
>> >> >> >>> priority. We tried increasing the number of op threads but this didn't
>> >> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
>> >> >> >>> become active and could be the cause for slow I/O while the other PGs
>> >> >> >>> are being checked.
>> >> >> >>>
>> >> >> >>> What I don't understand is that the messages are delayed. As soon as
>> >> >> >>> the message is received by Ceph OSD process, it is very quickly
>> >> >> >>> committed to the journal and a response is sent back to the primary
>> >> >> >>> OSD, which is received very quickly as well. I've adjusted
>> >> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>> >> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
>> >> >> >>> of RAM per nodes for 10 OSDs.
>> >> >> >>>
>> >> >> >>> Is there something that could cause the kernel to get a packet but not
>> >> >> >>> be able to dispatch it to Ceph, such that it could explain why we
>> >> >> >>> are seeing these blocked I/Os for 30+ seconds? Are there some pointers
>> >> >> >>> to tracing Ceph messages from the network buffer through the kernel to
>> >> >> >>> the Ceph process?
>> >> >> >>>
>> >> >> >>> We could really use some pointers, no matter how outrageous. We have
>> >> >> >>> over six people looking into this for weeks now and just can't think of
>> >> >> >>> anything else.
>> >> >> >>>
>> >> >> >>> Thanks,
>> >> >> >>> -----BEGIN PGP SIGNATURE-----
>> >> >> >>> Version: Mailvelope v1.1.0
>> >> >> >>> Comment: https://www.mailvelope.com
>> >> >> >>>
>> >> >> >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>> >> >> >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>> >> >> >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>> >> >> >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>> >> >> >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>> >> >> >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>> >> >> >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>> >> >> >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>> >> >> >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>> >> >> >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>> >> >> >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>> >> >> >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>> >> >> >>> l7OF
>> >> >> >>> =OI++
>> >> >> >>> -----END PGP SIGNATURE-----
>> >> >> >>> ----------------
>> >> >> >>> Robert LeBlanc
>> >> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>> >> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
>> >> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
>> >> >> >>> > last 12 hours). This makes me believe that there is some issue with
>> >> >> >>> > the number of sockets or some other TCP issue. We have not messed with
>> >> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>> >> >> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>> >> >> >>> > processes and 16K system wide.
>> >> >> >>> >
>> >> >> >>> > Does this seem like the right spot to be looking? What are some
>> >> >> >>> > configuration items we should be looking at?
>> >> >> >>> >
>> >> >> >>> > Thanks,
>> >> >> >>> > ----------------
>> >> >> >>> > Robert LeBlanc
>> >> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>> >> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >> >>> >> Hash: SHA256
>> >> >> >>> >>
>> >> >> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>> >> >> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>> >> >> >>> >> seems that there were some major reworks in the network handling in
>> >> >> >>> >> the kernel to efficiently handle that network rate. If I remember
>> >> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
>> >> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
>> >> >> >>> >> testing, but we could not tell where the dropping was happening. We
>> >> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>> >> >> >>> >> trying to congest things. We probably already saw this issue, just
>> >> >> >>> >> didn't know it.
>> >> >> >>> >> - ----------------
>> >> >> >>> >> Robert LeBlanc
>> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >> >>> >>
>> >> >> >>> >>
>> >> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>> >> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>> >> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>> >> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>> >> >> >>> >>> drivers might cause problems though.
>> >> >> >>> >>>
>> >> >> >>> >>> Here's ifconfig from one of the nodes:
>> >> >> >>> >>>
>> >> >> >>> >>> ens513f1: flags=4163  mtu 1500
>> >> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>> >> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>> >> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>> >> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>> >> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>> >> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>> >> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>> >> >> >>> >>>
>> >> >> >>> >>> Mark
>> >> >> >>> >>>
>> >> >> >>> >>>
>> >> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>> >> >> >>> >>>>
>> >> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >> >>> >>>> Hash: SHA256
>> >> >> >>> >>>>
>> >> >> >>> >>>> OK, here is the update on the saga...
>> >> >> >>> >>>>
>> >> >> >>> >>>> I traced some more of the blocked I/Os and it seems that communication
>> >> >> >>> >>>> between two hosts was worse than between the others. I did a two-way ping flood
>> >> >> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>> >> >> >>> >>>> packets, no lost pings. I then had the ping flood running while I
>> >> >> >>> >>>> put Ceph load on the cluster and the dropped pings started increasing;
>> >> >> >>> >>>> after stopping the Ceph workload the pings stopped dropping.
>> >> >> >>> >>>>
>> >> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
>> >> >> >>> >>>> ruled out Ceph to a large degree. I then booted into the
>> >> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far there
>> >> >> >>> >>>> haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>> >> >> >>> >>>> need the network enhancements in the 4.x series to work well.
>> >> >> >>> >>>>
>> >> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>> >> >> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
>> >> >> >>> >>>> with this issue are running 4.x; other than that, they have pretty
>> >> >> >>> >>>> different hardware and network configs.
>> >> >> >>> >>>>
>> >> >> >>> >>>> Thanks,
>> >> >> >>> >>>> -----BEGIN PGP SIGNATURE-----
>> >> >> >>> >>>> Version: Mailvelope v1.1.0
>> >> >> >>> >>>> Comment: https://www.mailvelope.com
>> >> >> >>> >>>>
>> >> >> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>> >> >> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>> >> >> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>> >> >> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>> >> >> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>> >> >> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>> >> >> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>> >> >> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>> >> >> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>> >> >> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>> >> >> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>> >> >> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>> >> >> >>> >>>> 4OEo
>> >> >> >>> >>>> =P33I
>> >> >> >>> >>>> -----END PGP SIGNATURE-----
>> >> >> >>> >>>> ----------------
>> >> >> >>> >>>> Robert LeBlanc
>> >> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >> >>> >>>>
>> >> >> >>> >>>>
>> >> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>> >> >> >>> >>>> wrote:
>> >> >> >>> >>>>>
>> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >> >>> >>>>> Hash: SHA256
>> >> >> >>> >>>>>
>> >> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>> >> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>> >> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>> >> >> >>> >>>>> found that an MTU under 32K worked reliably for ping, but still had the
>> >> >> >>> >>>>> blocked I/O.
>> >> >> >>> >>>>>
>> >> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>> >> >> >>> >>>>> the blocked I/O.
>> >> >> >>> >>>>> - ----------------
>> >> >> >>> >>>>> Robert LeBlanc
>> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >> >>> >>>>>
>> >> >> >>> >>>>>
>> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>> >> >> >>> >>>>>>
>> >> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>> >> >> >>> >>>>>>>
>> >> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>> >> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>> >> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>> >> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>> >> >> >>> >>>>>>> delayed for many 10s of seconds?
>> >> >> >>> >>>>>>
>> >> >> >>> >>>>>>
>> >> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>> >> >> >>> >>>>>> has
>> >> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>> >> >> >>> >>>>>>
>> >> >> >>> >>>>>> sage
>> >> >> >>> >>>>>>
>> >> >> >>> >>>>>>
>> >> >> >>> >>>>>>>
>> >> >> >>> >>>>>>> What kernel are you running?
>> >> >> >>> >>>>>>> -Sam
>> >> >> >>> >>>>>>>
>> >> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >> >>> >>>>>>>> Hash: SHA256
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>> >> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
>> >> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
>> >> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>> >> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>> >> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>> >> >> >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>> >> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>> >> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>> >> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>> >> >> >>> >>>>>>>> transfer).
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>> >> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>> >> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
>> >> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
>> >> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>> >> >> >>> >>>>>>>> thread.
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>> >> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>> >> >> >>> >>>>>>>> some help.
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> Single Test started about
>> >> >> >>> >>>>>>>> 2015-09-22 12:52:36
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >> >> >>> >>>>>>>> 30.439150 secs
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,16
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>> >> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>> >> >> >>> >>>>>>>> 30.379680 secs
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>> >> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>> >> >> >>> >>>>>>>> 12:55:06.406303:
>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>> >> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>> >> >> >>> >>>>>>>> 12:55:06.318144:
>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,14
>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >> >> >>> >>>>>>>> 30.954212 secs
>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>> >> >> >>> >>>>>>>>   currently waiting for subops from 16,17
>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >> >> >>> >>>>>>>> 30.704367 secs
>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> Server   IP addr              OSD
>> >> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>> >> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>> >> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>> >> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>> >> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>> >> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> fio job:
>> >> >> >>> >>>>>>>> [rbd-test]
>> >> >> >>> >>>>>>>> readwrite=write
>> >> >> >>> >>>>>>>> blocksize=4M
>> >> >> >>> >>>>>>>> #runtime=60
>> >> >> >>> >>>>>>>> name=rbd-test
>> >> >> >>> >>>>>>>> #readwrite=randwrite
>> >> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>> >> >> >>> >>>>>>>> #rwmixread=72
>> >> >> >>> >>>>>>>> #norandommap
>> >> >> >>> >>>>>>>> #size=1T
>> >> >> >>> >>>>>>>> #blocksize=4k
>> >> >> >>> >>>>>>>> ioengine=rbd
>> >> >> >>> >>>>>>>> rbdname=test2
>> >> >> >>> >>>>>>>> pool=rbd
>> >> >> >>> >>>>>>>> clientname=admin
>> >> >> >>> >>>>>>>> iodepth=8
>> >> >> >>> >>>>>>>> #numjobs=4
>> >> >> >>> >>>>>>>> #thread
>> >> >> >>> >>>>>>>> #group_reporting
>> >> >> >>> >>>>>>>> #time_based
>> >> >> >>> >>>>>>>> #direct=1
>> >> >> >>> >>>>>>>> #ramp_time=60
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> Thanks,
>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>> >> >> >>> >>>>>>>> Version: Mailvelope v1.1.0
>> >> >> >>> >>>>>>>> Comment: https://www.mailvelope.com
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>> >> >> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>> >> >> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>> >> >> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>> >> >> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>> >> >> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>> >> >> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>> >> >> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>> >> >> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>> >> >> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>> >> >> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>> >> >> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>> >> >> >>> >>>>>>>> J3hS
>> >> >> >>> >>>>>>>> =0J7F
>> >> >> >>> >>>>>>>> -----END PGP SIGNATURE-----
>> >> >> >>> >>>>>>>> ----------------
>> >> >> >>> >>>>>>>> Robert LeBlanc
>> >> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>> >> >> >>> >>>>>>>>>
>> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>> >> >> >>> >>>>>>>>>>
>> >> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >> >> >>> >>>>>>>>>> Hash: SHA256
>> >> >> >>> >>>>>>>>>>
>> >> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>> >> >> >>> >>>>>>>>>
>> >> >> >>> >>>>>>>>>
>> >> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
>> >> >> >>> >>>>>>>>>>
>> >> >> >>> >>>>>>>>>> I'm not
>> >> >> >>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>> >> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>> >> >> >>> >>>>>>>>>
>> >> >> >>> >>>>>>>>>
>> >> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>> >> >> >>> >>>>>>>>> this, it was discussed not too long ago.
>> >> >> >>> >>>>>>>>>
>> >> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>> >> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>> >> >> >>> >>>>>>>>>>
>> >> >> >>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>> >> >> >>> >>>>>>>>>> the
>> >> >> >>> >>>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
>> >> >> >>> >>>>>>>>>> having to create new file and therefore split collections. This is
>> >> >> >>> >>>>>>>>>> on
>> >> >> >>> >>>>>>>>>> my test cluster with no other load.
>> >> >> >>> >>>>>>>>>
>> >> >> >>> >>>>>>>>>
>> >> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>> >> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>> >> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>> >> >> >>> >>>>>>>>>
>> >> >> >>> >>>>>>>>>>
>> >> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>> >> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>> >> >> >>> >>>>>>>>>
>> >> >> >>> >>>>>>>>>
>> >> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>> >> >> >>> >>>>>>>>> 20",
>> >> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>> >> >> >>> >>>>>>>>> out
>> >> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>> >> >> >>> >>>>>>>>> -Greg
>> >> >> >>> >>>>>>>>
>> >> >> >>> >>>>>>>> --
>> >> >> >>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> >> >> >>> >>>>>>>> in
>> >> >> >>> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >> >> >>> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >>> >>>>>>>
>> >> >> >>> >>>>>>>
>> >> >> >>> >>>>>>>
>> >> >> >>> >>>>>
>> >> >> >>> >>>>> -----BEGIN PGP SIGNATURE-----
>> >> >> >>> >>>>> Version: Mailvelope v1.1.0
>> >> >> >>> >>>>> Comment: https://www.mailvelope.com
>> >> >> >>> >>>>>
>> >> >> >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>> >> >> >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>> >> >> >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>> >> >> >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>> >> >> >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>> >> >> >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>> >> >> >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>> >> >> >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>> >> >> >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>> >> >> >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>> >> >> >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>> >> >> >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>> >> >> >>> >>>>> gcZm
>> >> >> >>> >>>>> =CjwB
>> >> >> >>> >>>>> -----END PGP SIGNATURE-----
>> >> >> >>> >>>>
>> >> >> >>> >>>> --
>> >> >> >>> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> >> >>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >> >> >>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >>> >>>>
>> >> >> >>> >>>
>> >> >> >>> >>
>> >> >> >>> >> -----BEGIN PGP SIGNATURE-----
>> >> >> >>> >> Version: Mailvelope v1.1.0
>> >> >> >>> >> Comment: https://www.mailvelope.com
>> >> >> >>> >>
>> >> >> >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>> >> >> >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>> >> >> >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>> >> >> >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>> >> >> >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>> >> >> >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>> >> >> >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>> >> >> >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>> >> >> >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>> >> >> >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>> >> >> >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>> >> >> >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>> >> >> >>> >> ae22
>> >> >> >>> >> =AX+L
>> >> >> >>> >> -----END PGP SIGNATURE-----
>> >> >> >>> _______________________________________________
>> >> >> >>> ceph-users mailing list
>> >> >> >>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> >> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >> >>>
>> >> >> >>>
>> >> >> _______________________________________________
>> >> >> ceph-users mailing list
>> >> >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >>
>> >> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>
>> >>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
>> 6Kfk
>> =/gR6
>> -----END PGP SIGNATURE-----
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
JFPi
=ofgq
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                     ` <CAANLjFqnhS5fYdS_2h-5hz0x_TiyYjhMUCZSdc9991kdAzxeqQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-06 20:36                                                                                       ` Robert LeBlanc
       [not found]                                                                                         ` <CAANLjFqXvWdHBVZUMVFMiQg_-55_ZQ_jxsFr9YquohnHR7M1cg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-06 20:36 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On my second test (a much longer one), it took nearly an hour, but a
few messages have popped up over a 20 window. Still far less than I
have been seeing.
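
In case it helps, I am just watching the cluster log for the slow
request warnings while the test runs, along the lines of (run on a
monitor host; the log path may differ on your install):

  ceph -w | grep 'slow request'
  grep 'slow request' /var/log/ceph/ceph.log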
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> I'll capture another set of logs. Is there any other debugging you
> want turned up? I've seen the same thing where I see the message
> dispatched to the secondary OSD, but the message just doesn't show up
> for 30+ seconds in the secondary OSD logs.
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> I can't think of anything. In my dev cluster the only thing that has
>>> changed is the Ceph versions (no reboot). What I like is even though
>>> the disks are 100% utilized, it is preforming as I expect now. Client
>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>> the OSD boots or during the recovery period. This is with
>>> max_backfills set to 20, one backfill max in our production cluster is
>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>> our dev cluster very easily and very quickly with these settings. So
>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>> marked out. We would love to see that go away too, but this is far
>>                                             (me too!)
>>> better than what we have now. This dev cluster also has
>>> osd_client_message_cap set to default (100).
>>>
>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>> you you prefer a bisect to find the introduction of the problem
>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>>> commit that prevents a clean build as that is my most limiting factor?
>>
>> Nothing comes to mind.  I think the best way to find this is still to see
>> it happen in the logs with hammer.  The frustrating thing with that log
>> dump you sent is that although I see plenty of slow request warnings in
>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>> turned up for long enough?
>>
>> sage
>>
>>
>>
>>> Thanks,
>>> - ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>> >> Hash: SHA256
>>> >>
>>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>> >> messages when the OSD was marked out:
>>> >>
>>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>> >> 34.476006 secs
>>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>> >> cluster [WRN] slow request 32.913474 seconds old, received at
>>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>> >> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>> >> cluster [WRN] slow request 32.697545 seconds old, received at
>>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>> >> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>> >> cluster [WRN] slow request 32.668006 seconds old, received at
>>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>> >> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>> >>
>>> >> But I'm not seeing the blocked messages when the OSD came back in. The
>>> >> OSD spindles have been running at 100% during this test. I have seen
>>> >> slowed I/O from the clients as expected from the extra load, but so
>>> >> far no blocked messages. I'm going to run some more tests.
>>> >
>>> > Good to hear.
>>> >
>>> > FWIW I looked through the logs and all of the slow request no flag point
>>> > messages came from osd.163... and the logs don't show when they arrived.
>>> > My guess is this OSD has a slower disk than the others, or something else
>>> > funny is going on?
>>> >
>>> > I spot checked another OSD at random (60) where I saw a slow request.  It
>>> > was stuck peering for 10s of seconds... waiting on a pg log message from
>>> > osd.163.
>>> >
>>> > sage
>>> >
>>> >
>>> >>
>>> >> -----BEGIN PGP SIGNATURE-----
>>> >> Version: Mailvelope v1.2.0
>>> >> Comment: https://www.mailvelope.com
>>> >>
>>> >> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>>> >> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>>> >> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>>> >> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>>> >> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>>> >> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>>> >> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>>> >> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>>> >> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>>> >> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>>> >> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>>> >> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>>> >> fo5a
>>> >> =ahEi
>>> >> -----END PGP SIGNATURE-----
>>> >> ----------------
>>> >> Robert LeBlanc
>>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>
>>> >>
>>> >> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>>> >> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>> >> >> -----BEGIN PGP SIGNED MESSAGE-----
>>> >> >> Hash: SHA256
>>> >> >>
>>> >> >> With some off-list help, we have adjusted
>>> >> >> osd_client_message_cap=10000. This seems to have helped a bit and we
>>> >> >> have seen some OSDs have a value up to 4,000 for client messages. But
>>> >> >> it does not solve the problem with the blocked I/O.
>>> >> >>
>>> >> >> One thing that I have noticed is that almost exactly 30 seconds elapse
>>> >> >> between an OSD boots and the first blocked I/O message. I don't know
>>> >> >> if the OSD doesn't have time to get it's brain right about a PG before
>>> >> >> it starts servicing it or what exactly.
>>> >> >
>>> >> > I'm downloading the logs from yesterday now; sorry it's taking so long.
>>> >> >
>>> >> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
>>> >> >> to master and things didn't go so well. The OSDs would not start
>>> >> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
>>> >> >> and all OSDs and the OSD then started, but never became active in the
>>> >> >> cluster. It just sat there after reading all the PGs. There were
>>> >> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
>>> >> >> downgrading to the Infernalis branch and still no luck getting the
>>> >> >> OSDs to come up. The OSD processes were idle after the initial boot.
>>> >> >> All packages were installed from gitbuilder.
>>> >> >
>>> >> > Did you chown -R ?
>>> >> >
>>> >> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>>> >> >
>>> >> > My guess is you only chowned the root dir, and the OSD didn't throw
>>> >> > an error when it encountered the other files?  If you can generate a debug
>>> >> > osd = 20 log, that would be helpful.. thanks!
>>> >> >
>>> >> > sage
>>> >> >
>>> >> >
>>> >> >>
>>> >> >> Thanks,
>>> >> >> -----BEGIN PGP SIGNATURE-----
>>> >> >> Version: Mailvelope v1.2.0
>>> >> >> Comment: https://www.mailvelope.com
>>> >> >>
>>> >> >> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>>> >> >> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>>> >> >> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>>> >> >> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>>> >> >> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>>> >> >> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>>> >> >> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>>> >> >> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>>> >> >> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>>> >> >> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>>> >> >> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>>> >> >> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>>> >> >> GdXC
>>> >> >> =Aigq
>>> >> >> -----END PGP SIGNATURE-----
>>> >> >> ----------------
>>> >> >> Robert LeBlanc
>>> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >> >>
>>> >> >>
>>> >> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>>> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
>>> >> >> > Hash: SHA256
>>> >> >> >
>>> >> >> > I have eight nodes running the fio job rbd_test_real to different RBD
>>> >> >> > volumes. I've included the CRUSH map in the tarball.
>>> >> >> >
>>> >> >> > I stopped one OSD process and marked it out. I let it recover for a
>>> >> >> > few minutes and then I started the process again and marked it in. I
>>> >> >> > started getting block I/O messages during the recovery.
>>> >> >> >
>>> >> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>>> >> >> >
>>> >> >> > Thanks,
>>> >> >> > -----BEGIN PGP SIGNATURE-----
>>> >> >> > Version: Mailvelope v1.2.0
>>> >> >> > Comment: https://www.mailvelope.com
>>> >> >> >
>>> >> >> > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>>> >> >> > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>>> >> >> > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>>> >> >> > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>>> >> >> > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>>> >> >> > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>>> >> >> > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>>> >> >> > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>>> >> >> > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>>> >> >> > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>>> >> >> > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>>> >> >> > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>>> >> >> > 3EPx
>>> >> >> > =UDIV
>>> >> >> > -----END PGP SIGNATURE-----
>>> >> >> >
>>> >> >> > ----------------
>>> >> >> > Robert LeBlanc
>>> >> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >> >> >
>>> >> >> >
>>> >> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>>> >> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>> >> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >> >> >>> Hash: SHA256
>>> >> >> >>>
>>> >> >> >>> We are still struggling with this and have tried a lot of different
>>> >> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>> >> >> >>> consulting services for non-Red Hat systems. If there are some
>>> >> >> >>> certified Ceph consultants in the US that we can do both remote and
>>> >> >> >>> on-site engagements, please let us know.
>>> >> >> >>>
>>> >> >> >>> This certainly seems to be network related, but somewhere in the
>>> >> >> >>> kernel. We have tried increasing the network and TCP buffers, number
>>> >> >> >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>>> >> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
>>> >> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>> >> >> >>> at a time). There seems to be no reasonable explanation why I/O is
>>> >> >> >>> blocked pretty frequently longer than 30 seconds. We have verified
>>> >> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>> >> >> >>> network admins have verified that packets are not being dropped in the
>>> >> >> >>> switches for these nodes. We have tried different kernels including
>>> >> >> >>> the recent Google patch to cubic. This is showing up on three cluster
>>> >> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>>> >> >> >>> (from CentOS 7.1) with similar results.
>>> >> >> >>>
>>> >> >> >>> The messages seem slightly different:
>>> >> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>> >> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>> >> >> >>> 100.087155 secs
>>> >> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>> >> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
>>> >> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>> >> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>> >> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>> >> >> >>> points reached
>>> >> >> >>>
>>> >> >> >>> I don't know what "no flag points reached" means.
>>> >> >> >>
>>> >> >> >> Just that the op hasn't been marked as reaching any interesting points
>>> >> >> >> (op->mark_*() calls).
>>> >> >> >>
>>> >> >> >> Is it possible to gather a lot with debug ms = 20 and debug osd = 20?
>>> >> >> >> It's extremely verbose but it'll let us see where the op is getting
>>> >> >> >> blocked.  If you see the "slow request" message it means the op in
>>> >> >> >> received by ceph (that's when the clock starts), so I suspect it's not
>>> >> >> >> something we can blame on the network stack.
>>> >> >> >>
>>> >> >> >> sage
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>>
>>> >> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
>>> >> >> >>> of 13), we will have hundreds of I/O blocked for some times up to 300
>>> >> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
>>> >> >> >>> production cluster is very busy doing normally 8,000 I/O and peaking
>>> >> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>>> >> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
>>> >> >> >>> the load better across the disks, but we are having to do this 10 PGs
>>> >> >> >>> at a time as we get blocked I/O. We have max_backfills and
>>> >> >> >>> max_recovery set to 1, client op priority is set higher than recovery
>>> >> >> >>> priority. We tried increasing the number of op threads but this didn't
>>> >> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
>>> >> >> >>> become active and could be the cause for slow I/O while the other PGs
>>> >> >> >>> are being checked.
>>> >> >> >>>
>>> >> >> >>> What I don't understand is that the messages are delayed. As soon as
>>> >> >> >>> the message is received by Ceph OSD process, it is very quickly
>>> >> >> >>> committed to the journal and a response is sent back to the primary
>>> >> >> >>> OSD which is received very quickly as well. I've adjust
>>> >> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>>> >> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
>>> >> >> >>> of RAM per nodes for 10 OSDs.
>>> >> >> >>>
>>> >> >> >>> Is there something that could cause the kernel to get a packet but not
>>> >> >> >>> be able to dispatch it to Ceph such that it could be explaining why we
>>> >> >> >>> are seeing these blocked I/O for 30+ seconds. Is there some pointers
>>> >> >> >>> to tracing Ceph messages from the network buffer through the kernel to
>>> >> >> >>> the Ceph process?
>>> >> >> >>>
>>> >> >> >>> We can really use some pointers no matter how outrageous. We've have
>>> >> >> >>> over 6 people looking into this for weeks now and just can't think of
>>> >> >> >>> anything else.
>>> >> >> >>>
>>> >> >> >>> Thanks,
>>> >> >> >>> -----BEGIN PGP SIGNATURE-----
>>> >> >> >>> Version: Mailvelope v1.1.0
>>> >> >> >>> Comment: https://www.mailvelope.com
>>> >> >> >>>
>>> >> >> >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>>> >> >> >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>>> >> >> >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>>> >> >> >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>>> >> >> >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>>> >> >> >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>>> >> >> >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>>> >> >> >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>>> >> >> >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>>> >> >> >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>>> >> >> >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>>> >> >> >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>>> >> >> >>> l7OF
>>> >> >> >>> =OI++
>>> >> >> >>> -----END PGP SIGNATURE-----
>>> >> >> >>> ----------------
>>> >> >> >>> Robert LeBlanc
>>> >> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>>> >> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
>>> >> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
>>> >> >> >>> > last 12 hours). This makes me believe that there is some issue with
>>> >> >> >>> > the number of sockets or some other TCP issue. We have not messed with
>>> >> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>>> >> >> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>>> >> >> >>> > processes and 16K system wide.
>>> >> >> >>> >
>>> >> >> >>> > Does this seem like the right spot to be looking? What are some
>>> >> >> >>> > configuration items we should be looking at?
>>> >> >> >>> >
>>> >> >> >>> > Thanks,
>>> >> >> >>> > ----------------
>>> >> >> >>> > Robert LeBlanc
>>> >> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >> >> >>> >
>>> >> >> >>> >
>>> >> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>>> >> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>> >> >> >>> >> Hash: SHA256
>>> >> >> >>> >>
>>> >> >> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>>> >> >> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>>> >> >> >>> >> seems that there were some major reworks in the network handling in
>>> >> >> >>> >> the kernel to efficiently handle that network rate. If I remember
>>> >> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
>>> >> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
>>> >> >> >>> >> testing, but we could not tell where the dropping was happening. We
>>> >> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>>> >> >> >>> >> trying to congest things. We probably already saw this issue, just
>>> >> >> >>> >> didn't know it.
>>> >> >> >>> >> - ----------------
>>> >> >> >>> >> Robert LeBlanc
>>> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >> >> >>> >>
>>> >> >> >>> >>
>>> >> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>> >> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>> >> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>> >> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>> >> >> >>> >>> drivers might cause problems though.
>>> >> >> >>> >>>
>>> >> >> >>> >>> Here's ifconfig from one of the nodes:
>>> >> >> >>> >>>
>>> >> >> >>> >>> ens513f1: flags=4163  mtu 1500
>>> >> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>> >> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>> >> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>> >> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>> >> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>>> >> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>> >> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>> >> >> >>> >>>
>>> >> >> >>> >>> Mark
>>> >> >> >>> >>>
>>> >> >> >>> >>>
>>> >> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >> >> >>> >>>> Hash: SHA256
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> OK, here is the update on the saga...
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> I traced some more of blocked I/Os and it seems that communication
>>> >> >> >>> >>>> between two hosts seemed worse than others. I did a two way ping flood
>>> >> >> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>>> >> >> >>> >>>> packets, no lost pings. Then then had the ping flood running while I
>>> >> >> >>> >>>> put Ceph load on the cluster and the dropped pings started increasing
>>> >> >> >>> >>>> after stopping the Ceph workload the pings stopped dropping.
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
>>> >> >> >>> >>>> ruled out Ceph to a large degree. I then booted in the the
>>> >> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>>> >> >> >>> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>> >> >> >>> >>>> need the network enhancements in the 4.x series to work well.
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>> >> >> >>> >>>> kernel to see where this issue in introduced. Both of the clusters
>>> >> >> >>> >>>> with this issue are running 4.x, other than that, they are pretty
>>> >> >> >>> >>>> differing hardware and network configs.
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> Thanks,
>>> >> >> >>> >>>> -----BEGIN PGP SIGNATURE-----
>>> >> >> >>> >>>> Version: Mailvelope v1.1.0
>>> >> >> >>> >>>> Comment: https://www.mailvelope.com
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>> >> >> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>> >> >> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>> >> >> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>> >> >> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>> >> >> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>> >> >> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>> >> >> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>> >> >> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>> >> >> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>> >> >> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>> >> >> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>> >> >> >>> >>>> 4OEo
>>> >> >> >>> >>>> =P33I
>>> >> >> >>> >>>> -----END PGP SIGNATURE-----
>>> >> >> >>> >>>> ----------------
>>> >> >> >>> >>>> Robert LeBlanc
>>> >> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >> >> >>> >>>>
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>> >> >> >>> >>>> wrote:
>>> >> >> >>> >>>>>
>>> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >> >> >>> >>>>> Hash: SHA256
>>> >> >> >>> >>>>>
>>> >> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There was some issues
>>> >> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>>> >> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>>> >> >> >>> >>>>> found that MTU under 32K worked reliable for ping, but still had the
>>> >> >> >>> >>>>> blocked I/O.
>>> >> >> >>> >>>>>
>>> >> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>> >> >> >>> >>>>> the blocked I/O.
>>> >> >> >>> >>>>> - ----------------
>>> >> >> >>> >>>>> Robert LeBlanc
>>> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >> >> >>> >>>>>
>>> >> >> >>> >>>>>
>>> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>> >> >> >>> >>>>>>
>>> >> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>> >> >> >>> >>>>>>>
>>> >> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>> >> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>>> >> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>> >> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>> >> >> >>> >>>>>>> delayed for many 10s of seconds?
>>> >> >> >>> >>>>>>
>>> >> >> >>> >>>>>>
>>> >> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>> >> >> >>> >>>>>> has
>>> >> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>> >> >> >>> >>>>>>
>>> >> >> >>> >>>>>> sage
>>> >> >> >>> >>>>>>
>>> >> >> >>> >>>>>>
>>> >> >> >>> >>>>>>>
>>> >> >> >>> >>>>>>> What kernel are you running?
>>> >> >> >>> >>>>>>> -Sam
>>> >> >> >>> >>>>>>>
>>> >> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >> >> >>> >>>>>>>> Hash: SHA256
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>> >> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
>>> >> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
>>> >> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>> >> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>> >> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>> >> >> >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>> >> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>> >> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>> >> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>> >> >> >>> >>>>>>>> transfer).
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>> >> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>> >> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
>>> >> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
>>> >> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>> >> >> >>> >>>>>>>> thread.
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>> >> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>> >> >> >>> >>>>>>>> some help.
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> Single Test started about
>>> >> >> >>> >>>>>>>> 2015-09-22 12:52:36
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>> >> >> >>> >>>>>>>> 30.439150 secs
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,16
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>> >> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>> >> >> >>> >>>>>>>> 30.379680 secs
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>> >> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>> >> >> >>> >>>>>>>> 12:55:06.406303:
>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>> >> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>> >> >> >>> >>>>>>>> 12:55:06.318144:
>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,14
>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>> >> >> >>> >>>>>>>> 30.954212 secs
>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>> >> >> >>> >>>>>>>>   currently waiting for subops from 16,17
>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>> >> >> >>> >>>>>>>> 30.704367 secs
>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> Server   IP addr              OSD
>>> >> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>>> >> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>>> >> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>>> >> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>>> >> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>>> >> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> fio job:
>>> >> >> >>> >>>>>>>> [rbd-test]
>>> >> >> >>> >>>>>>>> readwrite=write
>>> >> >> >>> >>>>>>>> blocksize=4M
>>> >> >> >>> >>>>>>>> #runtime=60
>>> >> >> >>> >>>>>>>> name=rbd-test
>>> >> >> >>> >>>>>>>> #readwrite=randwrite
>>> >> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>> >> >> >>> >>>>>>>> #rwmixread=72
>>> >> >> >>> >>>>>>>> #norandommap
>>> >> >> >>> >>>>>>>> #size=1T
>>> >> >> >>> >>>>>>>> #blocksize=4k
>>> >> >> >>> >>>>>>>> ioengine=rbd
>>> >> >> >>> >>>>>>>> rbdname=test2
>>> >> >> >>> >>>>>>>> pool=rbd
>>> >> >> >>> >>>>>>>> clientname=admin
>>> >> >> >>> >>>>>>>> iodepth=8
>>> >> >> >>> >>>>>>>> #numjobs=4
>>> >> >> >>> >>>>>>>> #thread
>>> >> >> >>> >>>>>>>> #group_reporting
>>> >> >> >>> >>>>>>>> #time_based
>>> >> >> >>> >>>>>>>> #direct=1
>>> >> >> >>> >>>>>>>> #ramp_time=60
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> Thanks,
>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>>> >> >> >>> >>>>>>>> Version: Mailvelope v1.1.0
>>> >> >> >>> >>>>>>>> Comment: https://www.mailvelope.com
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>> >> >> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>> >> >> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>> >> >> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>> >> >> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>> >> >> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>> >> >> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>> >> >> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>> >> >> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>> >> >> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>> >> >> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>> >> >> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>> >> >> >>> >>>>>>>> J3hS
>>> >> >> >>> >>>>>>>> =0J7F
>>> >> >> >>> >>>>>>>> -----END PGP SIGNATURE-----
>>> >> >> >>> >>>>>>>> ----------------
>>> >> >> >>> >>>>>>>> Robert LeBlanc
>>> >> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>> >> >> >>> >>>>>>>>>
>>> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>> >> >> >>> >>>>>>>>>>
>>> >> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >> >> >>> >>>>>>>>>> Hash: SHA256
>>> >> >> >>> >>>>>>>>>>
>>> >> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>> >> >> >>> >>>>>>>>>
>>> >> >> >>> >>>>>>>>>
>>> >> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
>>> >> >> >>> >>>>>>>>>>
>>> >> >> >>> >>>>>>>>>> I'm not
>>> >> >> >>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>> >> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>> >> >> >>> >>>>>>>>>
>>> >> >> >>> >>>>>>>>>
>>> >> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>> >> >> >>> >>>>>>>>> this, it was discussed not too long ago.
>>> >> >> >>> >>>>>>>>>
>>> >> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>> >> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>> >> >> >>> >>>>>>>>>>
>>> >> >> >>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>> >> >> >>> >>>>>>>>>> the
>>> >> >> >>> >>>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
>>> >> >> >>> >>>>>>>>>> having to create new file and therefore split collections. This is
>>> >> >> >>> >>>>>>>>>> on
>>> >> >> >>> >>>>>>>>>> my test cluster with no other load.
>>> >> >> >>> >>>>>>>>>
>>> >> >> >>> >>>>>>>>>
>>> >> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>> >> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>> >> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>> >> >> >>> >>>>>>>>>
>>> >> >> >>> >>>>>>>>>>
>>> >> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>> >> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>>> >> >> >>> >>>>>>>>>
>>> >> >> >>> >>>>>>>>>
>>> >> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>> >> >> >>> >>>>>>>>> 20",
>>> >> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>> >> >> >>> >>>>>>>>> out
>>> >> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>>> >> >> >>> >>>>>>>>> -Greg
>>> >> >> >>> >>>>>>>>
>>> >> >> >>> >>>>>>>> --
>>> >> >> >>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> >> >> >>> >>>>>>>> in
>>> >> >> >>> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> >> >> >>> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >> >>> >>>>>>>
>>> >> >> >>> >>>>>>>
>>> >> >> >>> >>>>>>>
>>> >> >> >>> >>>>>
>>> >> >> >>> >>>>> -----BEGIN PGP SIGNATURE-----
>>> >> >> >>> >>>>> Version: Mailvelope v1.1.0
>>> >> >> >>> >>>>> Comment: https://www.mailvelope.com
>>> >> >> >>> >>>>>
>>> >> >> >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>> >> >> >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>> >> >> >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>> >> >> >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>> >> >> >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>> >> >> >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>> >> >> >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>> >> >> >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>> >> >> >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>> >> >> >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>> >> >> >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>> >> >> >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>> >> >> >>> >>>>> gcZm
>>> >> >> >>> >>>>> =CjwB
>>> >> >> >>> >>>>> -----END PGP SIGNATURE-----
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> --
>>> >> >> >>> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> >> >> >>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> >> >> >>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >> >>> >>>>
>>> >> >> >>> >>>
>>> >> >> >>> >>
>>> >> >> >>> >> -----BEGIN PGP SIGNATURE-----
>>> >> >> >>> >> Version: Mailvelope v1.1.0
>>> >> >> >>> >> Comment: https://www.mailvelope.com
>>> >> >> >>> >>
>>> >> >> >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>>> >> >> >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>>> >> >> >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>>> >> >> >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>>> >> >> >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>>> >> >> >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>>> >> >> >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>>> >> >> >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>>> >> >> >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>>> >> >> >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>>> >> >> >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>>> >> >> >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>>> >> >> >>> >> ae22
>>> >> >> >>> >> =AX+L
>>> >> >> >>> >> -----END PGP SIGNATURE-----
>>> >> >> >>> _______________________________________________
>>> >> >> >>> ceph-users mailing list
>>> >> >> >>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>> >> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> _______________________________________________
>>> >> >> ceph-users mailing list
>>> >> >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >> >>
>>> >> >>
>>> >> --
>>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >>
>>> >>
>>>
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.2.0
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
>>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
>>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
>>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
>>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
>>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
>>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
>>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
>>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
>>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
>>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
>>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
>>> 6Kfk
>>> =/gR6
>>> -----END PGP SIGNATURE-----
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
> x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
> Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
> SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
> QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
> mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
> 7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
> QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
> MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
> rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
> gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
> RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
> JFPi
> =ofgq
> -----END PGP SIGNATURE-----

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWFDDOCRDmVDuy+mK58QAA0kUP/1rfRQa5Us9b/VCvKrhk
BYrde1/FBybKBVXsuXVU8Dq124A1e4L682AhmQPUeVP8PQLoqS/VFSl0h7i6
28AzydDaBTTjnrp6ZzVbtmKtm8WhmtSTFvWTlu/yJmRXAht9YozmFCByBfIY
GYvOhZzjvbxBKfwnwq97QkS7xfY2tss/BmaOvSVTX7naYaOF+HRwZMSt+BF4
9vg9BLSL3Aic0BnvdM64TWkDaHp/3gwGSmyMn8Q2Sa9CqUTddKQx2HXN6doo
gIyxCj+dIw2Pt73u2NoiYv8ZhTuS3QYM4n0rRBxj8Wr/EeNwGAOwdDSgbOxf
OvDyozzmCpQyW3h/nkdQJW5mWsJmyDIiGxHDdUn7Vgemg+Bbod0ACdoJiwct
/BIRVQe2Ee1nZQFoKBOhvaWO6+ePJR7CVfLjMkZBTzKZBjt2tfkq17G5KTdS
EsehvG/+vfFJkANL5Xh6eo9ptlHbFW8I/44pvUtGi2JwsN487l56XR9DqEKM
7Cmj9Ox205YxjqcBjhWIJQTok99lvrhDX9d7HHxIeTcmouvqPz4LTcCySRtC
xE/GcEGAAYWGPTwf9u8ULm9Rh2Z90OnKpqtCtuuWiwRRL9VU/tLlvqmHvEZM
73qhiLQZka5I72B2SAEtJnDt2sX3NJ4unvH4zWKLRFTTm4M0qk6xUL1JfqNz
JYNo
=msX2
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                         ` <CAANLjFqXvWdHBVZUMVFMiQg_-55_ZQ_jxsFr9YquohnHR7M1cg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-06 23:40                                                                                           ` Robert LeBlanc
  2015-10-07 19:25                                                                                             ` [ceph-users] " Robert LeBlanc
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-06 23:40 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I upped the debug on just about everything and ran the test for about
40 minutes. I took OSD.19 on ceph1 down and then brought it back in.
There was at least one op on osd.19 that was blocked for over 1,000
seconds. Hopefully this will have something that sheds some light on
what is going on.
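
For reference, this is roughly the sequence I used (a rough sketch from
memory; the service commands and log paths are specific to this
CentOS/Hammer setup, so adjust the OSD id and syntax for yours):

  # raise logging on all OSDs without restarting them
  ceph tell osd.\* injectargs '--debug_osd 20 --debug_filestore 20 --debug_ms 1'

  # take osd.19 out and stop the daemon, let recovery run for a while
  ceph osd out 19
  /etc/init.d/ceph stop osd.19      # or: systemctl stop ceph-osd@19

  # bring it back in
  /etc/init.d/ceph start osd.19     # or: systemctl start ceph-osd@19
  ceph osd in 19

  # look for blocked ops on that OSD (run on the host carrying osd.19)
  ceph daemon osd.19 dump_ops_in_flight
  ceph daemon osd.19 dump_historic_ops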

We are going to upgrade this cluster to Infernalis tomorrow and rerun
the test to verify the results from the dev cluster. This cluster
matches the hardware of our production cluster but is not yet in
production so we can safely wipe it to downgrade back to Hammer.

Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/

Let me know what else we can do to help.

Thanks,
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
EDrG
=BZVw
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> On my second test (a much longer one), it took nearly an hour, but a
> few messages have popped up over a 20 window. Still far less than I
> have been seeing.
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> I'll capture another set of logs. Is there any other debugging you
>> want turned up? I've seen the same thing where I see the message
>> dispatched to the secondary OSD, but the message just doesn't show up
>> for 30+ seconds in the secondary OSD logs.
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA256
>>>>
>>>> I can't think of anything. In my dev cluster the only thing that has
>>>> changed is the Ceph versions (no reboot). What I like is even though
>>>> the disks are 100% utilized, it is performing as I expect now. Client
>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>>> the OSD boots or during the recovery period. This is with
>>>> max_backfills set to 20, one backfill max in our production cluster is
>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>>> our dev cluster very easily and very quickly with these settings. So
>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>>> marked out. We would love to see that go away too, but this is far
>>>                                             (me too!)
>>>> better than what we have now. This dev cluster also has
>>>> osd_client_message_cap set to default (100).
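
For context, a sketch of how the two knobs mentioned above can be set at
runtime (Hammer option names; the values are just the ones described here,
not recommendations, and a ceph.conf [osd] entry is the usual way to make
them stick across restarts):

  ceph tell 'osd.*' injectargs '--osd_max_backfills 20'
  # osd_client_message_cap was left at its default of 100 on this dev cluster;
  # raising it would look like:
  #   ceph tell 'osd.*' injectargs '--osd_client_message_cap 10000'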
>>>>
>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>>> would you prefer a bisect to find the introduction of the problem
>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>>>> commit that prevents a clean build as that is my most limiting factor?
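
A rough sketch of the bisect workflow being proposed, assuming the upstream
ceph.git tree; the tags below are only examples of a "bad" Hammer build and
a "good" Giant build:

  git clone https://github.com/ceph/ceph.git && cd ceph
  git bisect start
  git bisect bad v0.94.3     # a Hammer build showing the blocked I/O
  git bisect good v0.87.2    # a Giant build that (presumably) does not
  # for each candidate: update submodules, build, install, rerun the fio test,
  # then mark it with 'git bisect good' or 'git bisect bad';
  # 'git bisect skip' steps over commits that will not build cleanly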
>>>
>>> Nothing comes to mind.  I think the best way to find this is still to see
>>> it happen in the logs with hammer.  The frustrating thing with that log
>>> dump you sent is that although I see plenty of slow request warnings in
>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>>> turned up for long enough?
>>>
>>> sage
>>>
>>>
>>>
>>>> Thanks,
>>>> - ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>
>>>>
>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>>> >> Hash: SHA256
>>>> >>
>>>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>>> >> messages when the OSD was marked out:
>>>> >>
>>>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>>> >> 34.476006 secs
>>>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>>> >> cluster [WRN] slow request 32.913474 seconds old, received at
>>>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>>> >> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>>> >> cluster [WRN] slow request 32.697545 seconds old, received at
>>>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>>> >> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>>> >> cluster [WRN] slow request 32.668006 seconds old, received at
>>>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>>> >> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>>> >>
>>>> >> But I'm not seeing the blocked messages when the OSD came back in. The
>>>> >> OSD spindles have been running at 100% during this test. I have seen
>>>> >> slowed I/O from the clients as expected from the extra load, but so
>>>> >> far no blocked messages. I'm going to run some more tests.
>>>> >
>>>> > Good to hear.
>>>> >
>>>> > FWIW I looked through the logs and all of the slow request no flag point
>>>> > messages came from osd.163... and the logs don't show when they arrived.
>>>> > My guess is this OSD has a slower disk than the others, or something else
>>>> > funny is going on?
>>>> >
>>>> > I spot checked another OSD at random (60) where I saw a slow request.  It
>>>> > was stuck peering for 10s of seconds... waiting on a pg log message from
>>>> > osd.163.
>>>> >
>>>> > sage
>>>> >
>>>> >
>>>> >>
>>>> >> -----BEGIN PGP SIGNATURE-----
>>>> >> Version: Mailvelope v1.2.0
>>>> >> Comment: https://www.mailvelope.com
>>>> >>
>>>> >> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>>>> >> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>>>> >> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>>>> >> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>>>> >> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>>>> >> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>>>> >> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>>>> >> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>>>> >> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>>>> >> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>>>> >> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>>>> >> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>>>> >> fo5a
>>>> >> =ahEi
>>>> >> -----END PGP SIGNATURE-----
>>>> >> ----------------
>>>> >> Robert LeBlanc
>>>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> >>
>>>> >>
>>>> >> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>>>> >> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>>> >> >> -----BEGIN PGP SIGNED MESSAGE-----
>>>> >> >> Hash: SHA256
>>>> >> >>
>>>> >> >> With some off-list help, we have adjusted
>>>> >> >> osd_client_message_cap=10000. This seems to have helped a bit and we
>>>> >> >> have seen some OSDs have a value up to 4,000 for client messages. But
>>>> >> >> it does not solve the problem with the blocked I/O.
>>>> >> >>
>>>> >> >> One thing that I have noticed is that almost exactly 30 seconds elapse
>>>> >> >> between when an OSD boots and the first blocked I/O message. I don't know
>>>> >> >> if the OSD doesn't have time to get its brain right about a PG before
>>>> >> >> it starts servicing it or what exactly.
>>>> >> >
>>>> >> > I'm downloading the logs from yesterday now; sorry it's taking so long.
>>>> >> >
>>>> >> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
>>>> >> >> to master and things didn't go so well. The OSDs would not start
>>>> >> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
>>>> >> >> and all the OSD directories, and the OSDs then started, but never became active in the
>>>> >> >> cluster. It just sat there after reading all the PGs. There were
>>>> >> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
>>>> >> >> downgrading to the Infernalis branch and still no luck getting the
>>>> >> >> OSDs to come up. The OSD processes were idle after the initial boot.
>>>> >> >> All packages were installed from gitbuilder.
>>>> >> >
>>>> >> > Did you chown -R ?
>>>> >> >
>>>> >> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>>>> >> >
>>>> >> > My guess is you only chowned the root dir, and the OSD didn't throw
>>>> >> > an error when it encountered the other files?  If you can generate a debug
>>>> >> > osd = 20 log, that would be helpful.. thanks!
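
For reference, the recursive ownership fix being suggested is roughly the
following (a sketch; stop the Ceph daemons on the node first, and ceph:ceph
is the user/group introduced with Infernalis):

  chown -R ceph:ceph /var/lib/ceph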
>>>> >> >
>>>> >> > sage
>>>> >> >
>>>> >> >
>>>> >> >>
>>>> >> >> Thanks,
>>>> >> >> -----BEGIN PGP SIGNATURE-----
>>>> >> >> Version: Mailvelope v1.2.0
>>>> >> >> Comment: https://www.mailvelope.com
>>>> >> >>
>>>> >> >> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>>>> >> >> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>>>> >> >> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>>>> >> >> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>>>> >> >> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>>>> >> >> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>>>> >> >> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>>>> >> >> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>>>> >> >> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>>>> >> >> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>>>> >> >> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>>>> >> >> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>>>> >> >> GdXC
>>>> >> >> =Aigq
>>>> >> >> -----END PGP SIGNATURE-----
>>>> >> >> ----------------
>>>> >> >> Robert LeBlanc
>>>> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> >> >>
>>>> >> >>
>>>> >> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>>>> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
>>>> >> >> > Hash: SHA256
>>>> >> >> >
>>>> >> >> > I have eight nodes running the fio job rbd_test_real to different RBD
>>>> >> >> > volumes. I've included the CRUSH map in the tarball.
>>>> >> >> >
>>>> >> >> > I stopped one OSD process and marked it out. I let it recover for a
>>>> >> >> > few minutes and then I started the process again and marked it in. I
>>>> >> >> > started getting blocked I/O messages during the recovery.
>>>> >> >> >
>>>> >> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>>>> >> >> >
>>>> >> >> > Thanks,
>>>> >> >> > -----BEGIN PGP SIGNATURE-----
>>>> >> >> > Version: Mailvelope v1.2.0
>>>> >> >> > Comment: https://www.mailvelope.com
>>>> >> >> >
>>>> >> >> > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>>>> >> >> > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>>>> >> >> > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>>>> >> >> > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>>>> >> >> > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>>>> >> >> > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>>>> >> >> > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>>>> >> >> > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>>>> >> >> > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>>>> >> >> > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>>>> >> >> > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>>>> >> >> > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>>>> >> >> > 3EPx
>>>> >> >> > =UDIV
>>>> >> >> > -----END PGP SIGNATURE-----
>>>> >> >> >
>>>> >> >> > ----------------
>>>> >> >> > Robert LeBlanc
>>>> >> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>>>> >> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>>> >> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> >> >> >>> Hash: SHA256
>>>> >> >> >>>
>>>> >> >> >>> We are still struggling with this and have tried a lot of different
>>>> >> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>>> >> >> >>> consulting services for non-Red Hat systems. If there are some
>>>> >> >> >>> certified Ceph consultants in the US who can do both remote and
>>>> >> >> >>> on-site engagements, please let us know.
>>>> >> >> >>>
>>>> >> >> >>> This certainly seems to be network related, but somewhere in the
>>>> >> >> >>> kernel. We have tried increasing the network and TCP buffers and the number
>>>> >> >> >>> of TCP sockets, and reducing the FIN_WAIT2 state. There is about 25% idle
>>>> >> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
>>>> >> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>>> >> >> >>> at a time). There seems to be no reasonable explanation why I/O is
>>>> >> >> >>> blocked pretty frequently longer than 30 seconds. We have verified
>>>> >> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>>> >> >> >>> network admins have verified that packets are not being dropped in the
>>>> >> >> >>> switches for these nodes. We have tried different kernels including
>>>> >> >> >>> the recent Google patch to cubic. This is showing up on three cluster
>>>> >> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>>>> >> >> >>> (from CentOS 7.1) with similar results.
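
As an aside, the jumbo-frame check described above amounts to a
do-not-fragment ping sized for a 9000-byte MTU; a sketch (the address is
illustrative, and 8972 = 9000 minus the 20-byte IP and 8-byte ICMP headers):

  ping -M do -s 8972 -c 10 192.168.55.12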
>>>> >> >> >>>
>>>> >> >> >>> The messages seem slightly different:
>>>> >> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>>> >> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>>> >> >> >>> 100.087155 secs
>>>> >> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>>> >> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
>>>> >> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>>> >> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>>> >> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>>> >> >> >>> points reached
>>>> >> >> >>>
>>>> >> >> >>> I don't know what "no flag points reached" means.
>>>> >> >> >>
>>>> >> >> >> Just that the op hasn't been marked as reaching any interesting points
>>>> >> >> >> (op->mark_*() calls).
>>>> >> >> >>
>>>> >> >> >> Is it possible to gather a lot with debug ms = 20 and debug osd = 20?
>>>> >> >> >> It's extremely verbose but it'll let us see where the op is getting
>>>> >> >> >> blocked.  If you see the "slow request" message it means the op is
>>>> >> >> >> received by ceph (that's when the clock starts), so I suspect it's not
>>>> >> >> >> something we can blame on the network stack.
>>>> >> >> >>
>>>> >> >> >> sage
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >>>
>>>> >> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
>>>> >> >> >>> of 13); we will have hundreds of I/Os blocked, sometimes for up to 300
>>>> >> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
>>>> >> >> >>> production cluster is very busy doing normally 8,000 I/O and peaking
>>>> >> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>>>> >> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
>>>> >> >> >>> the load better across the disks, but we are having to do this 10 PGs
>>>> >> >> >>> at a time as we get blocked I/O. We have max_backfills and
>>>> >> >> >>> max_recovery set to 1, client op priority is set higher than recovery
>>>> >> >> >>> priority. We tried increasing the number of op threads but this didn't
>>>> >> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
>>>> >> >> >>> become active and could be the cause for slow I/O while the other PGs
>>>> >> >> >>> are being checked.
>>>> >> >> >>>
>>>> >> >> >>> What I don't understand is that the messages are delayed. As soon as
>>>> >> >> >>> the message is received by the Ceph OSD process, it is very quickly
>>>> >> >> >>> committed to the journal and a response is sent back to the primary
>>>> >> >> >>> OSD which is received very quickly as well. I've adjusted
>>>> >> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>>>> >> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
>>>> >> >> >>> of RAM per node for 10 OSDs.
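
The vm.min_free_kbytes adjustment mentioned above is a plain sysctl; a sketch
with an illustrative value (not the one actually used here):

  sysctl -w vm.min_free_kbytes=262144
  # add the same setting to /etc/sysctl.conf (or /etc/sysctl.d/) to persist it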
>>>> >> >> >>>
>>>> >> >> >>> Is there something that could cause the kernel to get a packet but not
>>>> >> >> >>> be able to dispatch it to Ceph, which could explain why we
>>>> >> >> >>> are seeing these blocked I/Os for 30+ seconds? Are there some pointers
>>>> >> >> >>> to tracing Ceph messages from the network buffer through the kernel to
>>>> >> >> >>> the Ceph process?
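
One way to start on that question is to watch the socket receive queues on
the OSD connections and capture the messenger traffic for correlation with
the OSD logs; a hedged sketch (the interface name and the default 6800-7300
OSD port range are assumptions):

  # a persistently non-zero Recv-Q means the kernel has the data but the
  # OSD process has not yet read it off the socket
  ss -tn | awk '$2 > 0'
  # capture the messenger traffic to line up against the OSD log timestamps
  tcpdump -i eth0 -w osd_traffic.pcap tcp portrange 6800-7300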
>>>> >> >> >>>
>>>> >> >> >>> We can really use some pointers, no matter how outrageous. We've had
>>>> >> >> >>> over 6 people looking into this for weeks now and just can't think of
>>>> >> >> >>> anything else.
>>>> >> >> >>>
>>>> >> >> >>> Thanks,
>>>> >> >> >>> -----BEGIN PGP SIGNATURE-----
>>>> >> >> >>> Version: Mailvelope v1.1.0
>>>> >> >> >>> Comment: https://www.mailvelope.com
>>>> >> >> >>>
>>>> >> >> >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>>>> >> >> >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>>>> >> >> >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>>>> >> >> >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>>>> >> >> >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>>>> >> >> >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>>>> >> >> >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>>>> >> >> >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>>>> >> >> >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>>>> >> >> >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>>>> >> >> >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>>>> >> >> >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>>>> >> >> >>> l7OF
>>>> >> >> >>> =OI++
>>>> >> >> >>> -----END PGP SIGNATURE-----
>>>> >> >> >>> ----------------
>>>> >> >> >>> Robert LeBlanc
>>>> >> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> >> >> >>>
>>>> >> >> >>>
>>>> >> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>>>> >> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
>>>> >> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
>>>> >> >> >>> > last 12 hours). This makes me believe that there is some issue with
>>>> >> >> >>> > the number of sockets or some other TCP issue. We have not messed with
>>>> >> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>>>> >> >> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>>>> >> >> >>> > processes and 16K system wide.
>>>> >> >> >>> >
>>>> >> >> >>> > Does this seem like the right spot to be looking? What are some
>>>> >> >> >>> > configuration items we should be looking at?
>>>> >> >> >>> >
>>>> >> >> >>> > Thanks,
>>>> >> >> >>> > ----------------
>>>> >> >> >>> > Robert LeBlanc
>>>> >> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> >> >> >>> >
>>>> >> >> >>> >
>>>> >> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>>>> >> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>>> >> >> >>> >> Hash: SHA256
>>>> >> >> >>> >>
>>>> >> >> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>>>> >> >> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>>>> >> >> >>> >> seems that there were some major reworks in the network handling in
>>>> >> >> >>> >> the kernel to efficiently handle that network rate. If I remember
>>>> >> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
>>>> >> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
>>>> >> >> >>> >> testing, but we could not tell where the dropping was happening. We
>>>> >> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>>>> >> >> >>> >> trying to congest things. We probably already saw this issue, just
>>>> >> >> >>> >> didn't know it.
>>>> >> >> >>> >> - ----------------
>>>> >> >> >>> >> Robert LeBlanc
>>>> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> >> >> >>> >>
>>>> >> >> >>> >>
>>>> >> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>>> >> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>>> >> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>>> >> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>>> >> >> >>> >>> drivers might cause problems though.
>>>> >> >> >>> >>>
>>>> >> >> >>> >>> Here's ifconfig from one of the nodes:
>>>> >> >> >>> >>>
>>>> >> >> >>> >>> ens513f1: flags=4163  mtu 1500
>>>> >> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>>> >> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>>> >> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>>> >> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>>> >> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>> >> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>>> >> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>> >> >> >>> >>>
>>>> >> >> >>> >>> Mark
>>>> >> >> >>> >>>
>>>> >> >> >>> >>>
>>>> >> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>> >> >> >>> >>>>
>>>> >> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> >> >> >>> >>>> Hash: SHA256
>>>> >> >> >>> >>>>
>>>> >> >> >>> >>>> OK, here is the update on the saga...
>>>> >> >> >>> >>>>
>>>> >> >> >>> >>>> I traced some more of blocked I/Os and it seems that communication
>>>> >> >> >>> >>>> between two hosts seemed worse than others. I did a two way ping flood
>>>> >> >> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>>>> >> >> >>> >>>> packets, no lost pings. I then had the ping flood running while I
>>>> >> >> >>> >>>> put Ceph load on the cluster, and the dropped pings started increasing;
>>>> >> >> >>> >>>> after stopping the Ceph workload the pings stopped dropping.
>>>> >> >> >>> >>>>
>>>> >> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
>>>> >> >> >>> >>>> ruled out Ceph to a large degree. I then booted into the
>>>> >> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far there
>>>> >> >> >>> >>>> haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>>> >> >> >>> >>>> need the network enhancements in the 4.x series to work well.
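
The load-dependent drops described above can be reproduced with something
along these lines (host and options are illustrative; flood ping needs root):

  # two-way flood ping with a full 1500-byte MTU payload while Ceph is loaded
  ping -f -s 1472 192.168.55.13
  # raw TCP throughput between the same pair of nodes, a few parallel streams
  iperf -c 192.168.55.13 -t 60 -P 4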
>>>> >> >> >>> >>>>
>>>> >> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>>> >> >> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
>>>> >> >> >>> >>>> with this issue are running 4.x, other than that, they are pretty
>>>> >> >> >>> >>>> differing hardware and network configs.
>>>> >> >> >>> >>>>
>>>> >> >> >>> >>>> Thanks,
>>>> >> >> >>> >>>> -----BEGIN PGP SIGNATURE-----
>>>> >> >> >>> >>>> Version: Mailvelope v1.1.0
>>>> >> >> >>> >>>> Comment: https://www.mailvelope.com
>>>> >> >> >>> >>>>
>>>> >> >> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>>> >> >> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>>> >> >> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>>> >> >> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>>> >> >> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>>> >> >> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>>> >> >> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>>> >> >> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>>> >> >> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>>> >> >> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>>> >> >> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>>> >> >> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>>> >> >> >>> >>>> 4OEo
>>>> >> >> >>> >>>> =P33I
>>>> >> >> >>> >>>> -----END PGP SIGNATURE-----
>>>> >> >> >>> >>>> ----------------
>>>> >> >> >>> >>>> Robert LeBlanc
>>>> >> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> >> >> >>> >>>>
>>>> >> >> >>> >>>>
>>>> >> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>>> >> >> >>> >>>> wrote:
>>>> >> >> >>> >>>>>
>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> >> >> >>> >>>>> Hash: SHA256
>>>> >> >> >>> >>>>>
>>>> >> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>>>> >> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>> >> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>> >> >> >>> >>>>> found that an MTU under 32K worked reliably for ping, but still had the
>>>> >> >> >>> >>>>> blocked I/O.
>>>> >> >> >>> >>>>>
>>>> >> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>> >> >> >>> >>>>> the blocked I/O.
>>>> >> >> >>> >>>>> - ----------------
>>>> >> >> >>> >>>>> Robert LeBlanc
>>>> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> >> >> >>> >>>>>
>>>> >> >> >>> >>>>>
>>>> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>> >> >> >>> >>>>>>
>>>> >> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>> >> >> >>> >>>>>>>
>>>> >> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>> >> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>>>> >> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>> >> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>> >> >> >>> >>>>>>> delayed for many 10s of seconds?
>>>> >> >> >>> >>>>>>
>>>> >> >> >>> >>>>>>
>>>> >> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>> >> >> >>> >>>>>> has
>>>> >> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>> >> >> >>> >>>>>>
>>>> >> >> >>> >>>>>> sage
>>>> >> >> >>> >>>>>>
>>>> >> >> >>> >>>>>>
>>>> >> >> >>> >>>>>>>
>>>> >> >> >>> >>>>>>> What kernel are you running?
>>>> >> >> >>> >>>>>>> -Sam
>>>> >> >> >>> >>>>>>>
>>>> >> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> >> >> >>> >>>>>>>> Hash: SHA256
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>> >> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
>>>> >> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>> >> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>> >> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>> >> >> >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>> >> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>> >> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>> >> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>> >> >> >>> >>>>>>>> transfer).
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>> >> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>> >> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
>>>> >> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
>>>> >> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>> >> >> >>> >>>>>>>> thread.
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>> >> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>> >> >> >>> >>>>>>>> some help.
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> Single Test started about
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:52:36
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>> >> >> >>> >>>>>>>> 30.439150 secs
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>> >> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,16
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>> >> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>> >> >> >>> >>>>>>>> 30.379680 secs
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>> >> >> >>> >>>>>>>> 12:55:06.406303:
>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>> >> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>> >> >> >>> >>>>>>>> 12:55:06.318144:
>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>> >> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,14
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>> >> >> >>> >>>>>>>> 30.954212 secs
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>> >> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 16,17
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>> >> >> >>> >>>>>>>> 30.704367 secs
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>> >> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> Server   IP addr              OSD
>>>> >> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>>>> >> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>>>> >> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>>>> >> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>>>> >> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>>>> >> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> fio job:
>>>> >> >> >>> >>>>>>>> [rbd-test]
>>>> >> >> >>> >>>>>>>> readwrite=write
>>>> >> >> >>> >>>>>>>> blocksize=4M
>>>> >> >> >>> >>>>>>>> #runtime=60
>>>> >> >> >>> >>>>>>>> name=rbd-test
>>>> >> >> >>> >>>>>>>> #readwrite=randwrite
>>>> >> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>> >> >> >>> >>>>>>>> #rwmixread=72
>>>> >> >> >>> >>>>>>>> #norandommap
>>>> >> >> >>> >>>>>>>> #size=1T
>>>> >> >> >>> >>>>>>>> #blocksize=4k
>>>> >> >> >>> >>>>>>>> ioengine=rbd
>>>> >> >> >>> >>>>>>>> rbdname=test2
>>>> >> >> >>> >>>>>>>> pool=rbd
>>>> >> >> >>> >>>>>>>> clientname=admin
>>>> >> >> >>> >>>>>>>> iodepth=8
>>>> >> >> >>> >>>>>>>> #numjobs=4
>>>> >> >> >>> >>>>>>>> #thread
>>>> >> >> >>> >>>>>>>> #group_reporting
>>>> >> >> >>> >>>>>>>> #time_based
>>>> >> >> >>> >>>>>>>> #direct=1
>>>> >> >> >>> >>>>>>>> #ramp_time=60
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> Thanks,
>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>> >> >> >>> >>>>>>>> Version: Mailvelope v1.1.0
>>>> >> >> >>> >>>>>>>> Comment: https://www.mailvelope.com
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>> >> >> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>> >> >> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>> >> >> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>> >> >> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>> >> >> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>> >> >> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>> >> >> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>> >> >> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>> >> >> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>> >> >> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>> >> >> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>> >> >> >>> >>>>>>>> J3hS
>>>> >> >> >>> >>>>>>>> =0J7F
>>>> >> >> >>> >>>>>>>> -----END PGP SIGNATURE-----
>>>> >> >> >>> >>>>>>>> ----------------
>>>> >> >> >>> >>>>>>>> Robert LeBlanc
>>>> >> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>> >> >> >>> >>>>>>>>>
>>>> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>> >> >> >>> >>>>>>>>>>
>>>> >> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> >> >> >>> >>>>>>>>>> Hash: SHA256
>>>> >> >> >>> >>>>>>>>>>
>>>> >> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>> >> >> >>> >>>>>>>>>
>>>> >> >> >>> >>>>>>>>>
>>>> >> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
>>>> >> >> >>> >>>>>>>>>>
>>>> >> >> >>> >>>>>>>>>> I'm not
>>>> >> >> >>> >>>>>>>>>> seeing much I/O or CPU usage during these times. Is there some way to
>>>> >> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>> >> >> >>> >>>>>>>>>
>>>> >> >> >>> >>>>>>>>>
>>>> >> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>> >> >> >>> >>>>>>>>> this, it was discussed not too long ago.
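
For reference, the thresholds being referred to are the FileStore collection
split/merge knobs; a sketch of raising them (values are illustrative, and
setting them under [osd] in ceph.conf and restarting the OSDs is the usual
route):

  ceph tell 'osd.*' injectargs '--filestore_merge_threshold 40 --filestore_split_multiple 8'
  # equivalent ceph.conf lines:
  #   filestore merge threshold = 40
  #   filestore split multiple = 8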
>>>> >> >> >>> >>>>>>>>>
>>>> >> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>> >> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>> >> >> >>> >>>>>>>>>>
>>>> >> >> >>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>>> >> >> >>> >>>>>>>>>> the
>>>> >> >> >>> >>>>>>>>>> beginning) is still causing the issue. I suspect that it is not
>>>> >> >> >>> >>>>>>>>>> having to create new files and therefore split collections. This is
>>>> >> >> >>> >>>>>>>>>> on
>>>> >> >> >>> >>>>>>>>>> my test cluster with no other load.
>>>> >> >> >>> >>>>>>>>>
>>>> >> >> >>> >>>>>>>>>
>>>> >> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>> >> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>> >> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>> >> >> >>> >>>>>>>>>
>>>> >> >> >>> >>>>>>>>>>
>>>> >> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>> >> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>>>> >> >> >>> >>>>>>>>>
>>>> >> >> >>> >>>>>>>>>
>>>> >> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>> >> >> >>> >>>>>>>>> 20",
>>>> >> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>> >> >> >>> >>>>>>>>> out
>>>> >> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>>>> >> >> >>> >>>>>>>>> -Greg
>>>> >> >> >>> >>>>>>>>
>>>> >> >> >>> >>>>>>>> --
>>>> >> >> >>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> >> >> >>> >>>>>>>> in
>>>> >> >> >>> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> >> >> >>> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> >> >> >>> >>>>>>>
>>>> >> >> >>> >>>>>>>
>>>> >> >> >>> >>>>>>>
>>>> >> >> >>> >>>>>
>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNATURE-----
>>>> >> >> >>> >>>>> Version: Mailvelope v1.1.0
>>>> >> >> >>> >>>>> Comment: https://www.mailvelope.com
>>>> >> >> >>> >>>>>
>>>> >> >> >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>>> >> >> >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>>> >> >> >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>>> >> >> >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>>> >> >> >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>>> >> >> >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>>> >> >> >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>>> >> >> >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>>> >> >> >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>>> >> >> >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>>> >> >> >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>>> >> >> >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>>> >> >> >>> >>>>> gcZm
>>>> >> >> >>> >>>>> =CjwB
>>>> >> >> >>> >>>>> -----END PGP SIGNATURE-----
>>>> >> >> >>> >>>>
>>>> >> >> >>> >>>> --
>>>> >> >> >>> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> >> >> >>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> >> >> >>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> >> >> >>> >>>>
>>>> >> >> >>> >>>
>>>> >> >> >>> >>
>>>> >> >> >>> >> -----BEGIN PGP SIGNATURE-----
>>>> >> >> >>> >> Version: Mailvelope v1.1.0
>>>> >> >> >>> >> Comment: https://www.mailvelope.com
>>>> >> >> >>> >>
>>>> >> >> >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>>>> >> >> >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>>>> >> >> >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>>>> >> >> >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>>>> >> >> >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>>>> >> >> >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>>>> >> >> >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>>>> >> >> >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>>>> >> >> >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>>>> >> >> >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>>>> >> >> >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>>>> >> >> >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>>>> >> >> >>> >> ae22
>>>> >> >> >>> >> =AX+L
>>>> >> >> >>> >> -----END PGP SIGNATURE-----
>>>> >> >> >>> _______________________________________________
>>>> >> >> >>> ceph-users mailing list
>>>> >> >> >>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>> >> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> >> >> >>>
>>>> >> >> >>>
>>>> >> >> _______________________________________________
>>>> >> >> ceph-users mailing list
>>>> >> >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> >> >>
>>>> >> >>
>>>> >> --
>>>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> >>
>>>> >>
>>>>
>>>> -----BEGIN PGP SIGNATURE-----
>>>> Version: Mailvelope v1.2.0
>>>> Comment: https://www.mailvelope.com
>>>>
>>>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
>>>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
>>>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
>>>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
>>>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
>>>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
>>>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
>>>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
>>>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
>>>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
>>>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
>>>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
>>>> 6Kfk
>>>> =/gR6
>>>> -----END PGP SIGNATURE-----
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
>> x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
>> Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
>> SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
>> QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
>> mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
>> 7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
>> QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
>> MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
>> rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
>> gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
>> RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
>> JFPi
>> =ofgq
>> -----END PGP SIGNATURE-----
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWFDDOCRDmVDuy+mK58QAA0kUP/1rfRQa5Us9b/VCvKrhk
> BYrde1/FBybKBVXsuXVU8Dq124A1e4L682AhmQPUeVP8PQLoqS/VFSl0h7i6
> 28AzydDaBTTjnrp6ZzVbtmKtm8WhmtSTFvWTlu/yJmRXAht9YozmFCByBfIY
> GYvOhZzjvbxBKfwnwq97QkS7xfY2tss/BmaOvSVTX7naYaOF+HRwZMSt+BF4
> 9vg9BLSL3Aic0BnvdM64TWkDaHp/3gwGSmyMn8Q2Sa9CqUTddKQx2HXN6doo
> gIyxCj+dIw2Pt73u2NoiYv8ZhTuS3QYM4n0rRBxj8Wr/EeNwGAOwdDSgbOxf
> OvDyozzmCpQyW3h/nkdQJW5mWsJmyDIiGxHDdUn7Vgemg+Bbod0ACdoJiwct
> /BIRVQe2Ee1nZQFoKBOhvaWO6+ePJR7CVfLjMkZBTzKZBjt2tfkq17G5KTdS
> EsehvG/+vfFJkANL5Xh6eo9ptlHbFW8I/44pvUtGi2JwsN487l56XR9DqEKM
> 7Cmj9Ox205YxjqcBjhWIJQTok99lvrhDX9d7HHxIeTcmouvqPz4LTcCySRtC
> xE/GcEGAAYWGPTwf9u8ULm9Rh2Z90OnKpqtCtuuWiwRRL9VU/tLlvqmHvEZM
> 73qhiLQZka5I72B2SAEtJnDt2sX3NJ4unvH4zWKLRFTTm4M0qk6xUL1JfqNz
> JYNo
> =msX2
> -----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [ceph-users] Potential OSD deadlock?
  2015-10-06 23:40                                                                                           ` Robert LeBlanc
@ 2015-10-07 19:25                                                                                             ` Robert LeBlanc
       [not found]                                                                                               ` <CAANLjFr=MEOyqUqjUVkZcNcW7KeN4rMUe9oJxu7ZiL64pGmT4Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-07 19:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

We forgot to upload the ceph.log yesterday. It is there now.
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> I upped the debug on about everything and ran the test for about 40
> minutes. I took OSD.19 on ceph1 down and then brought it back in.
> There was at least one op on osd.19 that was blocked for over 1,000
> seconds. Hopefully this will have something that will cast a light on
> what is going on.
>
> We are going to upgrade this cluster to Infernalis tomorrow and rerun
> the test to verify the results from the dev cluster. This cluster
> matches the hardware of our production cluster but is not yet in
> production so we can safely wipe it to downgrade back to Hammer.
>
> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>
> Let me know what else we can do to help.
>
> Thanks,
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
> EDrG
> =BZVw
> -----END PGP SIGNATURE-----
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> On my second test (a much longer one), it took nearly an hour, but a
>> few messages have popped up over a 20 window. Still far less than I
>> have been seeing.
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> I'll capture another set of logs. Is there any other debugging you
>>> want turned up? I've seen the same thing where I see the message
>>> dispatched to the secondary OSD, but the message just doesn't show up
>>> for 30+ seconds in the secondary OSD logs.
>>> - ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>>>
>>>>> I can't think of anything. In my dev cluster the only thing that has
>>>>> changed is the Ceph versions (no reboot). What I like is even though
>>>>> the disks are 100% utilized, it is performing as I expect now. Client
>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>>>> the OSD boots or during the recovery period. This is with
>>>>> max_backfills set to 20, one backfill max in our production cluster is
>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>>>> our dev cluster very easily and very quickly with these settings. So
>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>>>> marked out. We would love to see that go away too, but this is far
>>>>                                             (me too!)
>>>>> better than what we have now. This dev cluster also has
>>>>> osd_client_message_cap set to default (100).
>>>>>
>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>>>> would you prefer a bisect to find the introduction of the problem
>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>>>>> commit that prevents a clean build as that is my most limiting factor?
>>>>
>>>> Nothing comes to mind.  I think the best way to find this is still to see
>>>> it happen in the logs with hammer.  The frustrating thing with that log
>>>> dump you sent is that although I see plenty of slow request warnings in
>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>>>> turned up for long enough?
>>>>
>>>> sage
>>>>
>>>>
>>>>
>>>>> Thanks,
>>>>> - ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>
>>>>>
>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>>>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> >> Hash: SHA256
>>>>> >>
>>>>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>>>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>>>> >> messages when the OSD was marked out:
>>>>> >>
>>>>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>>>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>>>> >> 34.476006 secs
>>>>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>>>> >> cluster [WRN] slow request 32.913474 seconds old, received at
>>>>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>>>> >> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>>>> >> cluster [WRN] slow request 32.697545 seconds old, received at
>>>>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>>>> >> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>>>> >> cluster [WRN] slow request 32.668006 seconds old, received at
>>>>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>>>> >> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>> >>
>>>>> >> But I'm not seeing the blocked messages when the OSD came back in. The
>>>>> >> OSD spindles have been running at 100% during this test. I have seen
>>>>> >> slowed I/O from the clients as expected from the extra load, but so
>>>>> >> far no blocked messages. I'm going to run some more tests.
>>>>> >
>>>>> > Good to hear.
>>>>> >
>>>>> > FWIW I looked through the logs and all of the slow request no flag point
>>>>> > messages came from osd.163... and the logs don't show when they arrived.
>>>>> > My guess is this OSD has a slower disk than the others, or something else
>>>>> > funny is going on?
>>>>> >
>>>>> > I spot checked another OSD at random (60) where I saw a slow request.  It
>>>>> > was stuck peering for 10s of seconds... waiting on a pg log message from
>>>>> > osd.163.
>>>>> >
>>>>> > sage
>>>>> >
>>>>> >
>>>>> >>
>>>>> >> -----BEGIN PGP SIGNATURE-----
>>>>> >> Version: Mailvelope v1.2.0
>>>>> >> Comment: https://www.mailvelope.com
>>>>> >>
>>>>> >> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>>>>> >> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>>>>> >> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>>>>> >> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>>>>> >> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>>>>> >> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>>>>> >> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>>>>> >> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>>>>> >> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>>>>> >> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>>>>> >> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>>>>> >> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>>>>> >> fo5a
>>>>> >> =ahEi
>>>>> >> -----END PGP SIGNATURE-----
>>>>> >> ----------------
>>>>> >> Robert LeBlanc
>>>>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> >>
>>>>> >>
>>>>> >> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>>>>> >> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>>>> >> >> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> >> >> Hash: SHA256
>>>>> >> >>
>>>>> >> >> With some off-list help, we have adjusted
>>>>> >> >> osd_client_message_cap=10000. This seems to have helped a bit and we
>>>>> >> >> have seen some OSDs have a value up to 4,000 for client messages. But
>>>>> >> >> it does not solve the problem with the blocked I/O.
>>>>> >> >>
>>>>> >> >> One thing that I have noticed is that almost exactly 30 seconds elapse
>>>>> >> >> between when an OSD boots and the first blocked I/O message. I don't know
>>>>> >> >> if the OSD doesn't have time to get its brain right about a PG before
>>>>> >> >> it starts servicing it or what exactly.
>>>>> >> >
>>>>> >> > I'm downloading the logs from yesterday now; sorry it's taking so long.
>>>>> >> >
>>>>> >> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
>>>>> >> >> to master and things didn't go so well. The OSDs would not start
>>>>> >> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
>>>>> >> >> and all OSDs and the OSD then started, but never became active in the
>>>>> >> >> cluster. It just sat there after reading all the PGs. There were
>>>>> >> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
>>>>> >> >> downgrading to the Infernalis branch and still no luck getting the
>>>>> >> >> OSDs to come up. The OSD processes were idle after the initial boot.
>>>>> >> >> All packages were installed from gitbuilder.
>>>>> >> >
>>>>> >> > Did you chown -R ?
>>>>> >> >
>>>>> >> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>>>>> >> >
>>>>> >> > My guess is you only chowned the root dir, and the OSD didn't throw
>>>>> >> > an error when it encountered the other files?  If you can generate a debug
>>>>> >> > osd = 20 log, that would be helpful.. thanks!
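A minimal sketch of the ownership fix those release notes describe, assuming a
systemd host and the default /var/lib/ceph and /var/log/ceph paths:

  systemctl stop ceph.target            # stop the daemons on this host first
  chown -R ceph:ceph /var/lib/ceph      # recurse into every OSD/mon directory
  chown -R ceph:ceph /var/log/ceph      # the ceph user must be able to write logs too
  systemctl start ceph.target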
>>>>> >> >
>>>>> >> > sage
>>>>> >> >
>>>>> >> >
>>>>> >> >>
>>>>> >> >> Thanks,
>>>>> >> >> ----------------
>>>>> >> >> Robert LeBlanc
>>>>> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>>>>> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
>>>>> >> >> > Hash: SHA256
>>>>> >> >> >
>>>>> >> >> > I have eight nodes running the fio job rbd_test_real to different RBD
>>>>> >> >> > volumes. I've included the CRUSH map in the tarball.
>>>>> >> >> >
>>>>> >> >> > I stopped one OSD process and marked it out. I let it recover for a
>>>>> >> >> > few minutes and then I started the process again and marked it in. I
>>>>> >> >> > started getting block I/O messages during the recovery.
>>>>> >> >> >
>>>>> >> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>>>>> >> >> >
>>>>> >> >> > Thanks,
>>>>> >> >> >
>>>>> >> >> > ----------------
>>>>> >> >> > Robert LeBlanc
>>>>> >> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>>>>> >> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>>>> >> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> >> >> >>> Hash: SHA256
>>>>> >> >> >>>
>>>>> >> >> >>> We are still struggling with this and have tried a lot of different
>>>>> >> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>>>> >> >> >>> consulting services for non-Red Hat systems. If there are some
>>>>> >> >> >>> certified Ceph consultants in the US that we can do both remote and
>>>>> >> >> >>> on-site engagements, please let us know.
>>>>> >> >> >>>
>>>>> >> >> >>> This certainly seems to be network related, but somewhere in the
>>>>> >> >> >>> kernel. We have tried increasing the network and TCP buffers, number
>>>>> >> >> >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>>>>> >> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
>>>>> >> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>>>> >> >> >>> at a time). There seems to be no reasonable explanation why I/O is
>>>>> >> >> >>> blocked pretty frequently longer than 30 seconds. We have verified
>>>>> >> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>>>> >> >> >>> network admins have verified that packets are not being dropped in the
>>>>> >> >> >>> switches for these nodes. We have tried different kernels including
>>>>> >> >> >>> the recent Google patch to cubic. This is showing up on three clusters
>>>>> >> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>>>>> >> >> >>> (from CentOS 7.1) with similar results.
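For reference, the jumbo-frame check described above amounts to a don't-fragment
ping sized to fill a 9000-byte frame (8972 bytes of payload plus 28 bytes of
IP/ICMP headers); the peer name here is just a placeholder:

  ping -M do -s 8972 -c 10 <peer-node>   # must succeed between every pair of OSD/client nodes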
>>>>> >> >> >>>
>>>>> >> >> >>> The messages seem slightly different:
>>>>> >> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>>>> >> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>>>> >> >> >>> 100.087155 secs
>>>>> >> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>>>> >> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
>>>>> >> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>>>> >> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>>>> >> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>>>> >> >> >>> points reached
>>>>> >> >> >>>
>>>>> >> >> >>> I don't know what "no flag points reached" means.
>>>>> >> >> >>
>>>>> >> >> >> Just that the op hasn't been marked as reaching any interesting points
>>>>> >> >> >> (op->mark_*() calls).
>>>>> >> >> >>
>>>>> >> >> >> Is it possible to gather a lot with debug ms = 20 and debug osd = 20?
>>>>> >> >> >> It's extremely verbose but it'll let us see where the op is getting
>>>>> >> >> >> blocked.  If you see the "slow request" message it means the op in
>>>>> >> >> >> received by ceph (that's when the clock starts), so I suspect it's not
>>>>> >> >> >> something we can blame on the network stack.
>>>>> >> >> >>
>>>>> >> >> >> sage
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >> >> >>>
>>>>> >> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
>>>>> >> >> >>> of 13), we will have hundreds of I/O blocked for some times up to 300
>>>>> >> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
>>>>> >> >> >>> production cluster is very busy doing normally 8,000 I/O and peaking
>>>>> >> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>>>>> >> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
>>>>> >> >> >>> the load better across the disks, but we are having to do this 10 PGs
>>>>> >> >> >>> at a time as we get blocked I/O. We have max_backfills and
>>>>> >> >> >>> max_recovery set to 1, client op priority is set higher than recovery
>>>>> >> >> >>> priority. We tried increasing the number of op threads but this didn't
>>>>> >> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
>>>>> >> >> >>> become active and could be the cause for slow I/O while the other PGs
>>>>> >> >> >>> are being checked.
>>>>> >> >> >>>
>>>>> >> >> >>> What I don't understand is that the messages are delayed. As soon as
>>>>> >> >> >>> the message is received by the Ceph OSD process, it is very quickly
>>>>> >> >> >>> committed to the journal and a response is sent back to the primary
>>>>> >> >> >>> OSD, which is received very quickly as well. I've adjusted
>>>>> >> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>>>>> >> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
>>>>> >> >> >>> of RAM per nodes for 10 OSDs.
>>>>> >> >> >>>
>>>>> >> >> >>> Is there something that could cause the kernel to get a packet but not
>>>>> >> >> >>> be able to dispatch it to Ceph, which could explain why we
>>>>> >> >> >>> are seeing this blocked I/O for 30+ seconds? Are there some pointers
>>>>> >> >> >>> to tracing Ceph messages from the network buffer through the kernel to
>>>>> >> >> >>> the Ceph process?
>>>>> >> >> >>>
>>>>> >> >> >>> We can really use some pointers, no matter how outrageous. We've had
>>>>> >> >> >>> over 6 people looking into this for weeks now and just can't think of
>>>>> >> >> >>> anything else.
>>>>> >> >> >>>
>>>>> >> >> >>> Thanks,
>>>>> >> >> >>> ----------------
>>>>> >> >> >>> Robert LeBlanc
>>>>> >> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> >> >> >>>
>>>>> >> >> >>>
>>>>> >> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>>>>> >> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
>>>>> >> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
>>>>> >> >> >>> > last 12 hours). This makes me believe that there is some issue with
>>>>> >> >> >>> > the number of sockets or some other TCP issue. We have not messed with
>>>>> >> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>>>>> >> >> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>>>>> >> >> >>> > processes and 16K system wide.
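A quick way to confirm which open-file limits are actually in effect (a sketch;
the values shown are only the ones quoted above, and 'max open files' is the
ceph.conf knob the init scripts apply at daemon start):

  cat /proc/$(pidof -s ceph-osd)/limits | grep -i 'open files'   # limit on one running OSD
  sysctl fs.file-max                                             # system-wide ceiling
  # ceph.conf equivalent:
  # [global]
  #     max open files = 32768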
>>>>> >> >> >>> >
>>>>> >> >> >>> > Does this seem like the right spot to be looking? What are some
>>>>> >> >> >>> > configuration items we should be looking at?
>>>>> >> >> >>> >
>>>>> >> >> >>> > Thanks,
>>>>> >> >> >>> > ----------------
>>>>> >> >> >>> > Robert LeBlanc
>>>>> >> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> >> >> >>> >
>>>>> >> >> >>> >
>>>>> >> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>>>>> >> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> >> >> >>> >> Hash: SHA256
>>>>> >> >> >>> >>
>>>>> >> >> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>>>>> >> >> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>>>>> >> >> >>> >> seems that there were some major reworks in the network handling in
>>>>> >> >> >>> >> the kernel to efficiently handle that network rate. If I remember
>>>>> >> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
>>>>> >> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
>>>>> >> >> >>> >> testing, but we could not tell where the dropping was happening. We
>>>>> >> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>>>>> >> >> >>> >> trying to congest things. We probably already saw this issue, just
>>>>> >> >> >>> >> didn't know it.
>>>>> >> >> >>> >> - ----------------
>>>>> >> >> >>> >> Robert LeBlanc
>>>>> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> >> >> >>> >>
>>>>> >> >> >>> >>
>>>>> >> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>>>> >> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>>>> >> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>>>> >> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>>>> >> >> >>> >>> drivers might cause problems though.
>>>>> >> >> >>> >>>
>>>>> >> >> >>> >>> Here's ifconfig from one of the nodes:
>>>>> >> >> >>> >>>
>>>>> >> >> >>> >>> ens513f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>>>>> >> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>>>> >> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20<link>
>>>>> >> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>>>> >> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>>>> >> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>>> >> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>>>> >> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>> >> >> >>> >>>
>>>>> >> >> >>> >>> Mark
>>>>> >> >> >>> >>>
>>>>> >> >> >>> >>>
>>>>> >> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>>> >> >> >>> >>>>
>>>>> >> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> >> >> >>> >>>> Hash: SHA256
>>>>> >> >> >>> >>>>
>>>>> >> >> >>> >>>> OK, here is the update on the saga...
>>>>> >> >> >>> >>>>
>>>>> >> >> >>> >>>> I traced some more of the blocked I/Os and it seems that communication
>>>>> >> >> >>> >>>> between two hosts seemed worse than the others. I did a two-way ping flood
>>>>> >> >> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>>>>> >> >> >>> >>>> packets, no lost pings. I then had the ping flood running while I
>>>>> >> >> >>> >>>> put Ceph load on the cluster and the dropped pings started increasing;
>>>>> >> >> >>> >>>> after stopping the Ceph workload, the pings stopped dropping.
>>>>> >> >> >>> >>>>
>>>>> >> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
>>>>> >> >> >>> >>>> ruled out Ceph to a large degree. I then booted into the
>>>>> >> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and, with an hour of testing so far, there
>>>>> >> >> >>> >>>> haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>>>> >> >> >>> >>>> need the network enhancements in the 4.x series to work well.
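For anyone wanting to repeat that test, a rough sketch (the peer address is a
placeholder and the flood ping needs root):

  ping -f -s 1472 <peer>          # 1472 B payload + 28 B headers = full 1500-byte frames
  iperf -s                        # on the peer node
  iperf -c <peer> -P 4 -t 600     # from this node: 4 parallel streams for 10 minutes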
>>>>> >> >> >>> >>>>
>>>>> >> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>>>> >> >> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
>>>>> >> >> >>> >>>> with this issue are running 4.x; other than that, they have pretty
>>>>> >> >> >>> >>>> different hardware and network configs.
>>>>> >> >> >>> >>>>
>>>>> >> >> >>> >>>> Thanks,
>>>>> >> >> >>> >>>> ----------------
>>>>> >> >> >>> >>>> Robert LeBlanc
>>>>> >> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> >> >> >>> >>>>
>>>>> >> >> >>> >>>>
>>>>> >> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>>>> >> >> >>> >>>> wrote:
>>>>> >> >> >>> >>>>>
>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> >> >> >>> >>>>> Hash: SHA256
>>>>> >> >> >>> >>>>>
>>>>> >> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>>>>> >> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>>> >> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>>> >> >> >>> >>>>> found that an MTU under 32K worked reliably for ping, but still had the
>>>>> >> >> >>> >>>>> blocked I/O.
>>>>> >> >> >>> >>>>>
>>>>> >> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>>> >> >> >>> >>>>> the blocked I/O.
>>>>> >> >> >>> >>>>> - ----------------
>>>>> >> >> >>> >>>>> Robert LeBlanc
>>>>> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> >> >> >>> >>>>>
>>>>> >> >> >>> >>>>>
>>>>> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>> >> >> >>> >>>>>>
>>>>> >> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>> >> >> >>> >>>>>>>
>>>>> >> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>>> >> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>>>>> >> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>> >> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>> >> >> >>> >>>>>>> delayed for many 10s of seconds?
>>>>> >> >> >>> >>>>>>
>>>>> >> >> >>> >>>>>>
>>>>> >> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>>> >> >> >>> >>>>>> has
>>>>> >> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>>> >> >> >>> >>>>>>
>>>>> >> >> >>> >>>>>> sage
>>>>> >> >> >>> >>>>>>
>>>>> >> >> >>> >>>>>>
>>>>> >> >> >>> >>>>>>>
>>>>> >> >> >>> >>>>>>> What kernel are you running?
>>>>> >> >> >>> >>>>>>> -Sam
>>>>> >> >> >>> >>>>>>>
>>>>> >> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> >> >> >>> >>>>>>>> Hash: SHA256
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>> >> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
>>>>> >> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>> >> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>> >> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>> >> >> >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>> >> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>> >> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>> >> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>> >> >> >>> >>>>>>>> transfer).
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>> >> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>> >> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
>>>>> >> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
>>>>> >> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>> >> >> >>> >>>>>>>> thread.
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>> >> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>> >> >> >>> >>>>>>>> some help.
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> Single Test started about
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:52:36
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>> >> >> >>> >>>>>>>> 30.439150 secs
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,16
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>> >> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>> >> >> >>> >>>>>>>> 30.379680 secs
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>> >> >> >>> >>>>>>>> 12:55:06.406303:
>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>> >> >> >>> >>>>>>>> 12:55:06.318144:
>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,14
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>> >> >> >>> >>>>>>>> 30.954212 secs
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 16,17
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>> >> >> >>> >>>>>>>> 30.704367 secs
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> Server   IP addr              OSD
>>>>> >> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>>>>> >> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>>>>> >> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>>>>> >> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>>>>> >> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>>>>> >> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> fio job:
>>>>> >> >> >>> >>>>>>>> [rbd-test]
>>>>> >> >> >>> >>>>>>>> readwrite=write
>>>>> >> >> >>> >>>>>>>> blocksize=4M
>>>>> >> >> >>> >>>>>>>> #runtime=60
>>>>> >> >> >>> >>>>>>>> name=rbd-test
>>>>> >> >> >>> >>>>>>>> #readwrite=randwrite
>>>>> >> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>> >> >> >>> >>>>>>>> #rwmixread=72
>>>>> >> >> >>> >>>>>>>> #norandommap
>>>>> >> >> >>> >>>>>>>> #size=1T
>>>>> >> >> >>> >>>>>>>> #blocksize=4k
>>>>> >> >> >>> >>>>>>>> ioengine=rbd
>>>>> >> >> >>> >>>>>>>> rbdname=test2
>>>>> >> >> >>> >>>>>>>> pool=rbd
>>>>> >> >> >>> >>>>>>>> clientname=admin
>>>>> >> >> >>> >>>>>>>> iodepth=8
>>>>> >> >> >>> >>>>>>>> #numjobs=4
>>>>> >> >> >>> >>>>>>>> #thread
>>>>> >> >> >>> >>>>>>>> #group_reporting
>>>>> >> >> >>> >>>>>>>> #time_based
>>>>> >> >> >>> >>>>>>>> #direct=1
>>>>> >> >> >>> >>>>>>>> #ramp_time=60
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> Thanks,
>>>>> >> >> >>> >>>>>>>> ----------------
>>>>> >> >> >>> >>>>>>>> Robert LeBlanc
>>>>> >> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>> >> >> >>> >>>>>>>>>
>>>>> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>> >> >> >>> >>>>>>>>>>
>>>>> >> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> >> >> >>> >>>>>>>>>> Hash: SHA256
>>>>> >> >> >>> >>>>>>>>>>
>>>>> >> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>> >> >> >>> >>>>>>>>>
>>>>> >> >> >>> >>>>>>>>>
>>>>> >> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
>>>>> >> >> >>> >>>>>>>>>>
>>>>> >> >> >>> >>>>>>>>>> I'm not
>>>>> >> >> >>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>>>> >> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>> >> >> >>> >>>>>>>>>
>>>>> >> >> >>> >>>>>>>>>
>>>>> >> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>> >> >> >>> >>>>>>>>> this, it was discussed not too long ago.
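For completeness, the thresholds Greg means are the filestore ones; a sketch
with illustrative values only (tune to your object counts):

  [osd]
  filestore merge threshold = 40
  filestore split multiple = 8
  # a PG directory splits at roughly split_multiple * abs(merge_threshold) * 16 objects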
>>>>> >> >> >>> >>>>>>>>>
>>>>> >> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>> >> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>> >> >> >>> >>>>>>>>>>
>>>>> >> >> >>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>>>> >> >> >>> >>>>>>>>>> the
>>>>> >> >> >>> >>>>>>>>>> beginning) is still causing the issue. I suspect that it is not
>>>>> >> >> >>> >>>>>>>>>> having to create new files and therefore split collections. This is
>>>>> >> >> >>> >>>>>>>>>> on
>>>>> >> >> >>> >>>>>>>>>> my test cluster with no other load.
>>>>> >> >> >>> >>>>>>>>>
>>>>> >> >> >>> >>>>>>>>>
>>>>> >> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>> >> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>> >> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>> >> >> >>> >>>>>>>>>
>>>>> >> >> >>> >>>>>>>>>>
>>>>> >> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>> >> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>>>>> >> >> >>> >>>>>>>>>
>>>>> >> >> >>> >>>>>>>>>
>>>>> >> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>>> >> >> >>> >>>>>>>>> 20",
>>>>> >> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>>> >> >> >>> >>>>>>>>> out
>>>>> >> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>>>>> >> >> >>> >>>>>>>>> -Greg
>>>>> >> >> >>> >>>>>>>>
>>>>> >> >> >>> >>>>>>>
>>>>> >> >> >>> >>>>>>>
>>>>> >> >> >>> >>>>>>>
>>>>> >> >> >>> >>>>>
>>>>> >> >> >>> >>>>
>>>>> >> >> >>> >>>>
>>>>> >> >> >>> >>>
>>>>> >> >> >>> >>
>>>>> >> >> >>>
>>>>> >> >> >>>
>>>>> >> >>
>>>>> >> >>
>>>>> >>
>>>>> >>
>>>>>
>>>>>
>>>>>
>>>
>>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                               ` <CAANLjFr=MEOyqUqjUVkZcNcW7KeN4rMUe9oJxu7ZiL64pGmT4Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-09  5:44                                                                                                 ` Robert LeBlanc
       [not found]                                                                                                   ` <CAANLjFoL2+wvP12v-ryg7Va6d7Cix_JFdVQ3ysSEtfxobkoCVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]                                                                                                   ` <CAANLjFquEvjDDT94ZL2mXQh5r_XWCxw3X=eFZ=c29gNHKt=2tw@mail.gmail.com>
  0 siblings, 2 replies; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-09  5:44 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Sage,

After trying to bisect this issue (all tests moved the bisect towards
Infernalis) and eventually testing the Infernalis branch again, it
looks like the problem still exists, although it is handled a tad
better in Infernalis. I'm going to test against Firefly/Giant next
week and then try to dive into the code to see if I can expose
anything.
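In case it helps anyone following along, the bisect can be driven with git
between release tags; a sketch only (the tags are examples, and every step
needs a rebuild plus a run of the fio reproducer):

  git clone https://github.com/ceph/ceph.git && cd ceph
  git bisect start
  git bisect bad v0.94.3      # a Hammer build that shows the blocked I/O
  git bisect good v0.87.2     # a Giant build that does not
  # build/install the commit git suggests, run the reproducer, then:
  git bisect good             # or: git bisect bad
  git bisect reset            # when done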

If I can do anything to provide you with information, please let me know.

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> We forgot to upload the ceph.log yesterday. It is there now.
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> I upped the debug on about everything and ran the test for about 40
>> minutes. I took OSD.19 on ceph1 down and then brought it back in.
>> There was at least one op on osd.19 that was blocked for over 1,000
>> seconds. Hopefully this will have something that will cast a light on
>> what is going on.
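For the record, those debug levels can be raised (and later restored) on running
OSDs without a restart, assuming the admin keyring is available on the node:

  ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 1 --debug_filestore 20'
  # ...reproduce the blocked I/O and collect the logs, then back down:
  ceph tell osd.* injectargs '--debug_osd 0/5 --debug_ms 0/5 --debug_filestore 0/5'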
>>
>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>> the test to verify the results from the dev cluster. This cluster
>> matches the hardware of our production cluster but is not yet in
>> production so we can safely wipe it to downgrade back to Hammer.
>>
>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>
>> Let me know what else we can do to help.
>>
>> Thanks,
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> On my second test (a much longer one), it took nearly an hour, but a
>>> few messages have popped up over a 20-minute window. Still far less than I
>>> have been seeing.
>>> - ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA256
>>>>
>>>> I'll capture another set of logs. Is there any other debugging you
>>>> want turned up? I've seen the same thing where I see the message
>>>> dispatched to the secondary OSD, but the message just doesn't show up
>>>> for 30+ seconds in the secondary OSD logs.
>>>> - ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>
>>>>
>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> Hash: SHA256
>>>>>>
>>>>>> I can't think of anything. In my dev cluster the only thing that has
>>>>>> changed is the Ceph versions (no reboot). What I like is even though
>>>>>> the disks are 100% utilized, it is performing as I expect now. Client
>>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>>>>> the OSD boots or during the recovery period. This is with
>>>>>> max_backfills set to 20, one backfill max in our production cluster is
>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>>>>> our dev cluster very easily and very quickly with these settings. So
>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>>>>> marked out. We would love to see that go away too, but this is far
>>>>>                                             (me too!)
>>>>>> better than what we have now. This dev cluster also has
>>>>>> osd_client_message_cap set to default (100).
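For context, the settings being compared between the dev and production clusters
look roughly like this in ceph.conf (the values are just the ones mentioned in
this thread, not recommendations):

  [osd]
  osd max backfills = 20            # the dev test; production runs 1
  osd recovery max active = 1
  osd client op priority = 63       # kept above the recovery priority
  osd recovery op priority = 1      # illustrative; the thread only says it is lower
  osd client message cap = 10000    # raised from the default of 100 earlier in the thread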
>>>>>>
>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>>>>> would you prefer a bisect to find the introduction of the problem
>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>>>>>> commit that prevents a clean build as that is my most limiting factor?
>>>>>
>>>>> Nothing comes to mind.  I think the best way to find this is still to see
>>>>> it happen in the logs with hammer.  The frustrating thing with that log
>>>>> dump you sent is that although I see plenty of slow request warnings in
>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>>>>> turned up for long enough?
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> - ----------------
>>>>>> Robert LeBlanc
>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>>>>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> Hash: SHA256
>>>>>> >>
>>>>>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>>>>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>>>>> >> messages when the OSD was marked out:
>>>>>> >>
>>>>>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>>>>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>>>>> >> 34.476006 secs
>>>>>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>>>>> >> cluster [WRN] slow request 32.913474 seconds old, received at
>>>>>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>>>>> >> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>>>>> >> cluster [WRN] slow request 32.697545 seconds old, received at
>>>>>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>>>>> >> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>>>>> >> cluster [WRN] slow request 32.668006 seconds old, received at
>>>>>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>>>>> >> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>> >>
>>>>>> >> But I'm not seeing the blocked messages when the OSD came back in. The
>>>>>> >> OSD spindles have been running at 100% during this test. I have seen
>>>>>> >> slowed I/O from the clients as expected from the extra load, but so
>>>>>> >> far no blocked messages. I'm going to run some more tests.
>>>>>> >
>>>>>> > Good to hear.
>>>>>> >
>>>>>> > FWIW I looked through the logs and all of the slow request no flag point
>>>>>> > messages came from osd.163... and the logs don't show when they arrived.
>>>>>> > My guess is this OSD has a slower disk than the others, or something else
>>>>>> > funny is going on?
>>>>>> >
>>>>>> > I spot checked another OSD at random (60) where I saw a slow request.  It
>>>>>> > was stuck peering for 10s of seconds... waiting on a pg log message from
>>>>>> > osd.163.
>>>>>> >
>>>>>> > sage
>>>>>> >
>>>>>> >
>>>>>> >>
>>>>>> >> ----------------
>>>>>> >> Robert LeBlanc
>>>>>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >>
>>>>>> >>
>>>>>> >> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>>>>>> >> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>>>>> >> >> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> Hash: SHA256
>>>>>> >> >>
>>>>>> >> >> With some off-list help, we have adjusted
>>>>>> >> >> osd_client_message_cap=10000. This seems to have helped a bit and we
>>>>>> >> >> have seen some OSDs have a value up to 4,000 for client messages. But
>>>>>> >> >> it does not solve the problem with the blocked I/O.
>>>>>> >> >>
>>>>>> >> >> One thing that I have noticed is that almost exactly 30 seconds elapse
>>>>>> >> >> between when an OSD boots and the first blocked I/O message. I don't know
>>>>>> >> >> if the OSD doesn't have time to get its brain right about a PG before
>>>>>> >> >> it starts servicing it or what exactly.
>>>>>> >> >
>>>>>> >> > I'm downloading the logs from yesterday now; sorry it's taking so long.
>>>>>> >> >
>>>>>> >> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
>>>>>> >> >> to master and things didn't go so well. The OSDs would not start
>>>>>> >> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
>>>>>> >> >> and all OSDs and the OSD then started, but never became active in the
>>>>>> >> >> cluster. It just sat there after reading all the PGs. There were
>>>>>> >> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
>>>>>> >> >> downgrading to the Infernalis branch and still no luck getting the
>>>>>> >> >> OSDs to come up. The OSD processes were idle after the initial boot.
>>>>>> >> >> All packages were installed from gitbuilder.
>>>>>> >> >
>>>>>> >> > Did you chown -R ?
>>>>>> >> >
>>>>>> >> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>>>>>> >> >
>>>>>> >> > My guess is you only chowned the root dir, and the OSD didn't throw
>>>>>> >> > an error when it encountered the other files?  If you can generate a debug
>>>>>> >> > osd = 20 log, that would be helpful.. thanks!
>>>>>> >> >
>>>>>> >> > sage
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >>
>>>>>> >> >> Thanks,
>>>>>> >> >> ----------------
>>>>>> >> >> Robert LeBlanc
>>>>>> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>>>>>> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> > Hash: SHA256
>>>>>> >> >> >
>>>>>> >> >> > I have eight nodes running the fio job rbd_test_real to different RBD
>>>>>> >> >> > volumes. I've included the CRUSH map in the tarball.
>>>>>> >> >> >
>>>>>> >> >> > I stopped one OSD process and marked it out. I let it recover for a
>>>>>> >> >> > few minutes and then I started the process again and marked it in. I
>>>>>> >> >> > started getting block I/O messages during the recovery.
>>>>>> >> >> >
>>>>>> >> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>>>>>> >> >> >
>>>>>> >> >> > Thanks,
>>>>>> >> >> >
>>>>>> >> >> > ----------------
>>>>>> >> >> > Robert LeBlanc
>>>>>> >> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>>>>>> >> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>>>>> >> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>> Hash: SHA256
>>>>>> >> >> >>>
>>>>>> >> >> >>> We are still struggling with this and have tried a lot of different
>>>>>> >> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>>>>> >> >> >>> consulting services for non-Red Hat systems. If there are some
>>>>>> >> >> >>> certified Ceph consultants in the US with whom we can do both remote and
>>>>>> >> >> >>> on-site engagements, please let us know.
>>>>>> >> >> >>>
>>>>>> >> >> >>> This certainly seems to be network related, but somewhere in the
>>>>>> >> >> >>> kernel. We have tried increasing the network and TCP buffers, number
>>>>>> >> >> >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>>>>>> >> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
>>>>>> >> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>>>>> >> >> >>> at a time). There seems to be no reasonable explanation why I/O is
>>>>>> >> >> >>> blocked pretty frequently longer than 30 seconds. We have verified
>>>>>> >> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>>>>> >> >> >>> network admins have verified that packets are not being dropped in the
>>>>>> >> >> >>> switches for these nodes. We have tried different kernels including
>>>>>> >> >> >>> the recent Google patch to cubic. This is showing up on three clusters
>>>>>> >> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>>>>>> >> >> >>> (from CentOS 7.1) with similar results.
>>>>>> >> >> >>>
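For reference, a quick way to verify jumbo frames end to end is a do-not-fragment ping at the full payload size; a sketch (the hostname is a placeholder):

    # 8972 = 9000 MTU - 20 byte IP header - 8 byte ICMP header; -M do forbids fragmentation
    ping -M do -s 8972 -c 10 <peer-node>
    # if this fails while 'ping -M do -s 1472 <peer-node>' works, some hop is not passing jumbo frames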
>>>>>> >> >> >>> The messages seem slightly different:
>>>>>> >> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>>>>> >> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>>>>> >> >> >>> 100.087155 secs
>>>>>> >> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>>>>> >> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
>>>>>> >> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>>>>> >> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>>>>> >> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>>>>> >> >> >>> points reached
>>>>>> >> >> >>>
>>>>>> >> >> >>> I don't know what "no flag points reached" means.
>>>>>> >> >> >>
>>>>>> >> >> >> Just that the op hasn't been marked as reaching any interesting points
>>>>>> >> >> >> (op->mark_*() calls).
>>>>>> >> >> >>
>>>>>> >> >> >> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
>>>>>> >> >> >> It's extremely verbose but it'll let us see where the op is getting
>>>>>> >> >> >> blocked.  If you see the "slow request" message it means the op is
>>>>>> >> >> >> received by ceph (that's when the clock starts), so I suspect it's not
>>>>>> >> >> >> something we can blame on the network stack.
>>>>>> >> >> >>
>>>>>> >> >> >> sage
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >>>
>>>>>> >> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
>>>>>> >> >> >>> of 13), we will have hundreds of I/O blocked for some times up to 300
>>>>>> >> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
>>>>>> >> >> >>> production cluster is very busy doing normally 8,000 I/O and peaking
>>>>>> >> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>>>>>> >> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
>>>>>> >> >> >>> the load better across the disks, but we are having to do this 10 PGs
>>>>>> >> >> >>> at a time as we get blocked I/O. We have max_backfills and
>>>>>> >> >> >>> max_recovery set to 1, client op priority is set higher than recovery
>>>>>> >> >> >>> priority. We tried increasing the number of op threads but this didn't
>>>>>> >> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
>>>>>> >> >> >>> become active and could be the cause for slow I/O while the other PGs
>>>>>> >> >> >>> are being checked.
>>>>>> >> >> >>>
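The throttling described above corresponds to ceph.conf settings along these lines (a sketch; the "1" values are the ones mentioned in the thread, the priorities are only illustrative):

    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd client op priority = 63     # keep client ops above recovery
    osd recovery op priority = 1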
>>>>>> >> >> >>> What I don't understand is that the messages are delayed. As soon as
>>>>>> >> >> >>> the message is received by Ceph OSD process, it is very quickly
>>>>>> >> >> >>> committed to the journal and a response is sent back to the primary
>>>>>> >> >> >>> OSD which is received very quickly as well. I've adjusted
>>>>>> >> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>>>>>> >> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
>>>>>> >> >> >>> of RAM per node for 10 OSDs.
>>>>>> >> >> >>>
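min_free_kbytes is a plain sysctl; an adjustment like the one mentioned would look roughly like this (the value is only an example):

    # reserve more memory for atomic allocations (value in KiB; example only)
    sysctl -w vm.min_free_kbytes=262144
    echo 'vm.min_free_kbytes = 262144' >> /etc/sysctl.conf   # persist across reboots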
>>>>>> >> >> >>> Is there something that could cause the kernel to get a packet but not
>>>>>> >> >> >>> be able to dispatch it to Ceph, such that it could explain why we
>>>>>> >> >> >>> are seeing this blocked I/O for 30+ seconds? Are there any pointers
>>>>>> >> >> >>> to tracing Ceph messages from the network buffer through the kernel to
>>>>>> >> >> >>> the Ceph process?
>>>>>> >> >> >>>
>>>>>> >> >> >>> We could really use some pointers, no matter how outrageous. We've had
>>>>>> >> >> >>> over 6 people looking into this for weeks now and just can't think of
>>>>>> >> >> >>> anything else.
>>>>>> >> >> >>>
>>>>>> >> >> >>> Thanks,
>>>>>> >> >> >>> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >> >>> Version: Mailvelope v1.1.0
>>>>>> >> >> >>> Comment: https://www.mailvelope.com
>>>>>> >> >> >>>
>>>>>> >> >> >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>>>>>> >> >> >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>>>>>> >> >> >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>>>>>> >> >> >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>>>>>> >> >> >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>>>>>> >> >> >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>>>>>> >> >> >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>>>>>> >> >> >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>>>>>> >> >> >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>>>>>> >> >> >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>>>>>> >> >> >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>>>>>> >> >> >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>>>>>> >> >> >>> l7OF
>>>>>> >> >> >>> =OI++
>>>>>> >> >> >>> -----END PGP SIGNATURE-----
>>>>>> >> >> >>> ----------------
>>>>>> >> >> >>> Robert LeBlanc
>>>>>> >> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>>
>>>>>> >> >> >>>
>>>>>> >> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>>>>>> >> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
>>>>>> >> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
>>>>>> >> >> >>> > last 12 hours). This makes me believe that there is some issue with
>>>>>> >> >> >>> > the number of sockets or some other TCP issue. We have not messed with
>>>>>> >> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>>>>>> >> >> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>>>>>> >> >> >>> > processes and 16K system wide.
>>>>>> >> >> >>> >
>>>>>> >> >> >>> > Does this seem like the right spot to be looking? What are some
>>>>>> >> >> >>> > configuration items we should be looking at?
>>>>>> >> >> >>> >
>>>>>> >> >> >>> > Thanks,
>>>>>> >> >> >>> > ----------------
>>>>>> >> >> >>> > Robert LeBlanc
>>>>>> >> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>> >
>>>>>> >> >> >>> >
>>>>>> >> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>>>>>> >> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>> >> Hash: SHA256
>>>>>> >> >> >>> >>
>>>>>> >> >> >>> >> We were only able to get ~17 Gb/s out of the XL710 (heavily tweaked)
>>>>>> >> >> >>> >> until we went to the 4.x kernel, where we got ~36 Gb/s (no tweaking). It
>>>>>> >> >> >>> >> seems that there were some major reworks in the network handling in
>>>>>> >> >> >>> >> the kernel to efficiently handle that network rate. If I remember
>>>>>> >> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
>>>>>> >> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
>>>>>> >> >> >>> >> testing, but we could not tell where the dropping was happening. We
>>>>>> >> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>>>>>> >> >> >>> >> trying to congest things. We probably already saw this issue, just
>>>>>> >> >> >>> >> didn't know it.
>>>>>> >> >> >>> >> - ----------------
>>>>>> >> >> >>> >> Robert LeBlanc
>>>>>> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>> >>
>>>>>> >> >> >>> >>
>>>>>> >> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>>>>> >> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>>>>> >> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>>>>> >> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>>>>> >> >> >>> >>> drivers might cause problems though.
>>>>>> >> >> >>> >>>
>>>>>> >> >> >>> >>> Here's ifconfig from one of the nodes:
>>>>>> >> >> >>> >>>
>>>>>> >> >> >>> >>> ens513f1: flags=4163  mtu 1500
>>>>>> >> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>>>>> >> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>>>>> >> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>>>>> >> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>>>>> >> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>>>> >> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>>>>> >> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>>> >> >> >>> >>>
>>>>>> >> >> >>> >>> Mark
>>>>>> >> >> >>> >>>
>>>>>> >> >> >>> >>>
>>>>>> >> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>> >>>> Hash: SHA256
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>> OK, here is the update on the saga...
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>> I traced some more of blocked I/Os and it seems that communication
>>>>>> >> >> >>> >>>> between two hosts seemed worse than others. I did a two way ping flood
>>>>>> >> >> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>>>>>> >> >> >>> >>>> packets, no lost pings. I then had the ping flood running while I
>>>>>> >> >> >>> >>>> put Ceph load on the cluster, and the dropped pings started increasing;
>>>>>> >> >> >>> >>>> after stopping the Ceph workload the pings stopped dropping.
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
>>>>>> >> >> >>> >>>> ruled out Ceph to a large degree. I then booted into the
>>>>>> >> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>>>>>> >> >> >>> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>>>>> >> >> >>> >>>> need the network enhancements in the 4.x series to work well.
>>>>>> >> >> >>> >>>>
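For anyone reproducing that kind of check, the combination above amounts to roughly the following between a pair of nodes (iperf2 syntax; hostnames are placeholders, and flood ping requires root):

    # on the receiver
    iperf -s
    # on the sender, with the flood ping running in a second shell
    iperf -c <peer-node> -t 60
    ping -f -s 1472 <peer-node>    # 1472 + 28 bytes of headers = a full 1500-byte frame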
>>>>>> >> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>>>>> >> >> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
>>>>>> >> >> >>> >>>> with this issue are running 4.x; other than that, they have pretty
>>>>>> >> >> >>> >>>> different hardware and network configs.
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>> Thanks,
>>>>>> >> >> >>> >>>> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >> >>> >>>> Version: Mailvelope v1.1.0
>>>>>> >> >> >>> >>>> Comment: https://www.mailvelope.com
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>>>>> >> >> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>>>>> >> >> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>>>>> >> >> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>>>>> >> >> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>>>>> >> >> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>>>>> >> >> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>>>>> >> >> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>>>>> >> >> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>>>>> >> >> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>>>>> >> >> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>>>>> >> >> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>>>>> >> >> >>> >>>> 4OEo
>>>>>> >> >> >>> >>>> =P33I
>>>>>> >> >> >>> >>>> -----END PGP SIGNATURE-----
>>>>>> >> >> >>> >>>> ----------------
>>>>>> >> >> >>> >>>> Robert LeBlanc
>>>>>> >> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>>>>> >> >> >>> >>>> wrote:
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>> >>>>> Hash: SHA256
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>>>>>> >> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>>>> >> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>>>> >> >> >>> >>>>> found that MTU under 32K worked reliably for ping, but still had the
>>>>>> >> >> >>> >>>>> blocked I/O.
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>>>> >> >> >>> >>>>> the blocked I/O.
>>>>>> >> >> >>> >>>>> - ----------------
>>>>>> >> >> >>> >>>>> Robert LeBlanc
>>>>>> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>>>> >> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>>>>>> >> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>>> >> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>>> >> >> >>> >>>>>>> delayed for many 10s of seconds?
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>>>> >> >> >>> >>>>>> has
>>>>>> >> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>> sage
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>>> What kernel are you running?
>>>>>> >> >> >>> >>>>>>> -Sam
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>> >>>>>>>> Hash: SHA256
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>>> >> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
>>>>>> >> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>>> >> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>>> >> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>>> >> >> >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>>> >> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>>> >> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>>> >> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>>> >> >> >>> >>>>>>>> transfer).
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>>> >> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>>> >> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
>>>>>> >> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
>>>>>> >> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>>> >> >> >>> >>>>>>>> thread.
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>>> >> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>>> >> >> >>> >>>>>>>> some help.
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> Single Test started about
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:52:36
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>> >> >> >>> >>>>>>>> 30.439150 secs
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,16
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>>> >> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>>> >> >> >>> >>>>>>>> 30.379680 secs
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>>> >> >> >>> >>>>>>>> 12:55:06.406303:
>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>>> >> >> >>> >>>>>>>> 12:55:06.318144:
>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,14
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>> >> >> >>> >>>>>>>> 30.954212 secs
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 16,17
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>> >> >> >>> >>>>>>>> 30.704367 secs
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> Server   IP addr              OSD
>>>>>> >> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>>>>>> >> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>>>>>> >> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>>>>>> >> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>>>>>> >> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>>>>>> >> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> fio job:
>>>>>> >> >> >>> >>>>>>>> [rbd-test]
>>>>>> >> >> >>> >>>>>>>> readwrite=write
>>>>>> >> >> >>> >>>>>>>> blocksize=4M
>>>>>> >> >> >>> >>>>>>>> #runtime=60
>>>>>> >> >> >>> >>>>>>>> name=rbd-test
>>>>>> >> >> >>> >>>>>>>> #readwrite=randwrite
>>>>>> >> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>>> >> >> >>> >>>>>>>> #rwmixread=72
>>>>>> >> >> >>> >>>>>>>> #norandommap
>>>>>> >> >> >>> >>>>>>>> #size=1T
>>>>>> >> >> >>> >>>>>>>> #blocksize=4k
>>>>>> >> >> >>> >>>>>>>> ioengine=rbd
>>>>>> >> >> >>> >>>>>>>> rbdname=test2
>>>>>> >> >> >>> >>>>>>>> pool=rbd
>>>>>> >> >> >>> >>>>>>>> clientname=admin
>>>>>> >> >> >>> >>>>>>>> iodepth=8
>>>>>> >> >> >>> >>>>>>>> #numjobs=4
>>>>>> >> >> >>> >>>>>>>> #thread
>>>>>> >> >> >>> >>>>>>>> #group_reporting
>>>>>> >> >> >>> >>>>>>>> #time_based
>>>>>> >> >> >>> >>>>>>>> #direct=1
>>>>>> >> >> >>> >>>>>>>> #ramp_time=60
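Assuming the job file above is saved as rbd-test.fio, it runs straight through librbd with no kernel mapping of the image (requires an fio build with rbd support and a readable client.admin keyring):

    fio rbd-test.fio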
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> Thanks,
>>>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >> >>> >>>>>>>> Version: Mailvelope v1.1.0
>>>>>> >> >> >>> >>>>>>>> Comment: https://www.mailvelope.com
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>>>> >> >> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>>>> >> >> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>>>> >> >> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>>>> >> >> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>>>> >> >> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>>>> >> >> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>>>> >> >> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>>>> >> >> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>>>> >> >> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>>>> >> >> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>>>> >> >> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>>>> >> >> >>> >>>>>>>> J3hS
>>>>>> >> >> >>> >>>>>>>> =0J7F
>>>>>> >> >> >>> >>>>>>>> -----END PGP SIGNATURE-----
>>>>>> >> >> >>> >>>>>>>> ----------------
>>>>>> >> >> >>> >>>>>>>> Robert LeBlanc
>>>>>> >> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>> >>>>>>>>>> Hash: SHA256
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>> I'm not
>>>>>> >> >> >>> >>>>>>>>>> seeing much I/O or CPU usage during these times. Is there some way to
>>>>>> >> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>> >> >> >>> >>>>>>>>> this, it was discussed not too long ago.
>>>>>> >> >> >>> >>>>>>>>>
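The thresholds in question are the filestore directory split/merge settings; bumping them looks something like this (the values are only an example — see the list discussion referenced above for guidance):

    [osd]
    filestore merge threshold = 40
    filestore split multiple = 8
    # directories split at roughly merge_threshold * 16 * split_multiple objects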
>>>>>> >> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>> >> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>>>>> >> >> >>> >>>>>>>>>> the
>>>>>> >> >> >>> >>>>>>>>>> beginning) is still causing the issue. I suspect that it is not
>>>>>> >> >> >>> >>>>>>>>>> having to create new files and therefore split collections. This is
>>>>>> >> >> >>> >>>>>>>>>> on
>>>>>> >> >> >>> >>>>>>>>>> my test cluster with no other load.
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>> >> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>> >> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>> >> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>>>> >> >> >>> >>>>>>>>> 20",
>>>>>> >> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>>>> >> >> >>> >>>>>>>>> out
>>>>>> >> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>>>>>> >> >> >>> >>>>>>>>> -Greg
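Those levels can go into ceph.conf before restarting the OSDs, or be injected into running daemons; a sketch:

    # ceph.conf, [osd] section:
    #   debug osd = 20
    #   debug filestore = 20
    #   debug ms = 1
    # or, without a restart:
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'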
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >> >>> >>>>> Version: Mailvelope v1.1.0
>>>>>> >> >> >>> >>>>> Comment: https://www.mailvelope.com
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>>>>> >> >> >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>>>>> >> >> >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>>>>> >> >> >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>>>>> >> >> >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>>>>> >> >> >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>>>>> >> >> >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>>>>> >> >> >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>>>>> >> >> >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>>>>> >> >> >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>>>>> >> >> >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>>>>> >> >> >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>>>>> >> >> >>> >>>>> gcZm
>>>>>> >> >> >>> >>>>> =CjwB
>>>>>> >> >> >>> >>>>> -----END PGP SIGNATURE-----
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>
>>>>>> >> >> >>> >>
>>>>>> >> >> >>> >> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >> >>> >> Version: Mailvelope v1.1.0
>>>>>> >> >> >>> >> Comment: https://www.mailvelope.com
>>>>>> >> >> >>> >>
>>>>>> >> >> >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>>>>>> >> >> >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>>>>>> >> >> >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>>>>>> >> >> >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>>>>>> >> >> >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>>>>>> >> >> >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>>>>>> >> >> >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>>>>>> >> >> >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>>>>>> >> >> >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>>>>>> >> >> >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>>>>>> >> >> >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>>>>>> >> >> >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>>>>>> >> >> >>> >> ae22
>>>>>> >> >> >>> >> =AX+L
>>>>>> >> >> >>> >> -----END PGP SIGNATURE-----
>>>>>> >> >> >>>
>>>>>> >> >> >>>
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >>
>>>>>> >>
>>>>>>
>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> Version: Mailvelope v1.2.0
>>>>>> Comment: https://www.mailvelope.com
>>>>>>
>>>>>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
>>>>>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
>>>>>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
>>>>>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
>>>>>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
>>>>>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
>>>>>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
>>>>>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
>>>>>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
>>>>>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
>>>>>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
>>>>>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
>>>>>> 6Kfk
>>>>>> =/gR6
>>>>>> -----END PGP SIGNATURE-----
>>>>>>
>>>>>>
>>>>
>>>> -----BEGIN PGP SIGNATURE-----
>>>> Version: Mailvelope v1.2.0
>>>> Comment: https://www.mailvelope.com
>>>>
>>>> wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
>>>> x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
>>>> Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
>>>> SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
>>>> QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
>>>> mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
>>>> 7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
>>>> QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
>>>> MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
>>>> rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
>>>> gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
>>>> RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
>>>> JFPi
>>>> =ofgq
>>>> -----END PGP SIGNATURE-----
>>>
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.2.0
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWFDDOCRDmVDuy+mK58QAA0kUP/1rfRQa5Us9b/VCvKrhk
>>> BYrde1/FBybKBVXsuXVU8Dq124A1e4L682AhmQPUeVP8PQLoqS/VFSl0h7i6
>>> 28AzydDaBTTjnrp6ZzVbtmKtm8WhmtSTFvWTlu/yJmRXAht9YozmFCByBfIY
>>> GYvOhZzjvbxBKfwnwq97QkS7xfY2tss/BmaOvSVTX7naYaOF+HRwZMSt+BF4
>>> 9vg9BLSL3Aic0BnvdM64TWkDaHp/3gwGSmyMn8Q2Sa9CqUTddKQx2HXN6doo
>>> gIyxCj+dIw2Pt73u2NoiYv8ZhTuS3QYM4n0rRBxj8Wr/EeNwGAOwdDSgbOxf
>>> OvDyozzmCpQyW3h/nkdQJW5mWsJmyDIiGxHDdUn7Vgemg+Bbod0ACdoJiwct
>>> /BIRVQe2Ee1nZQFoKBOhvaWO6+ePJR7CVfLjMkZBTzKZBjt2tfkq17G5KTdS
>>> EsehvG/+vfFJkANL5Xh6eo9ptlHbFW8I/44pvUtGi2JwsN487l56XR9DqEKM
>>> 7Cmj9Ox205YxjqcBjhWIJQTok99lvrhDX9d7HHxIeTcmouvqPz4LTcCySRtC
>>> xE/GcEGAAYWGPTwf9u8ULm9Rh2Z90OnKpqtCtuuWiwRRL9VU/tLlvqmHvEZM
>>> 73qhiLQZka5I72B2SAEtJnDt2sX3NJ4unvH4zWKLRFTTm4M0qk6xUL1JfqNz
>>> JYNo
>>> =msX2
>>> -----END PGP SIGNATURE-----
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWFXGPCRDmVDuy+mK58QAAx38P/1sn6TA8hH+F2kd1A2Pq
> IU2cg1pFcH+kw21G8VO+BavfBaBoSETHEEuMXg5SszTIcL/HyziBLJos0C0j
> Vu9I0/YtblQ15enzFqKFPosdc7qij9DPJxXRkx41sJZsxvSVky+URcPpcKk6
> w8Lwuq9IupesQ19ZeJkCEWFVhKz/i2E9/VXfylBgFVlkICD+5pfx6/Aq7nCP
> 4gboyha07zpPlDqoA7xgT+6v2zlYC80saGcA1m2XaAUdPF/17l6Mq9+Glv7E
> 3KeUf7jmMTJQRGBZSInFgUpPwUQKvF5OSGb3YQlzofUy5Es+wH3ccqZ+mlIY
> szuBLAtN6zhFFPCs6016hiragiUhLk97PItXaKdDJKecuyRdShlJrXJmtX+j
> NdM14TkBPTiLtAd/IZEEhIIpdvQH8YSl3LnEZ5gywggaY4Pk3JLFIJPgLpEb
> T8hJnuiaQaYxERQ0nRoBL4LAXARseSrOuVt2EAD50Yb/5JEwB9FQlN758rb1
> AE/xhpK6d53+RlkPODKxXx816hXvDP6NADaC78XGmx+A4FfepdxBijGBsmOQ
> 7SxAZe469K0E6EAfClc664VzwuvBEZjwTg1eK5Z6VS/FDTH/RxTKeFhlbUIT
> XpezlP7XZ1/YRrJ/Eg7nb1Dv0MYQdu18tQ6QBv+C1ZsmxYLlHlcf6BZ3gNar
> rZW5
> =dKn9
> -----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                                   ` <CAANLjFoL2+wvP12v-ryg7Va6d7Cix_JFdVQ3ysSEtfxobkoCVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-09  8:48                                                                                                     ` Max A. Krasilnikov
       [not found]                                                                                                       ` <20151009084843.GL86022-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Max A. Krasilnikov @ 2015-10-09  8:48 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Sage Weil, ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

Hello!

On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256

> Sage,

> After trying to bisect this issue (all tests moved the bisect towards
> Infernalis) and eventually testing the Infernalis branch again, it
> looks like the problem still exists although it is handled a tad
> better in Infernalis. I'm going to test against Firefly/Giant next
> week and then try to dive into the code to see if I can expose
> anything.

> If I can do anything to provide you with information, please let me know.

I have fixed my troubles by setting the MTU back to 1500 from 9000 on the 2x10G network
between nodes (2x Cisco Nexus 5020, one link per switch, LACP, Linux bonding
driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 82599ES
adapter, non-Intel SFP+). When setting it to 9000 on the nodes and 9216 on the Nexus 5020
switches with jumbo frames enabled, I get a performance drop and slow requests. When
setting 1500 on the nodes and not touching the Nexus, all problems are fixed.

I restarted all my Ceph services when changing the MTU, and I switched between
9000 and 1500 several times in order to be sure. It is reproducible in my
environment.
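Reverting the MTU on a bonded setup like that is just a matter of setting it on the slaves and the bond; a sketch with placeholder interface names (persist the change in the distro's network configuration as well):

    ip link set dev eth0 mtu 1500
    ip link set dev eth1 mtu 1500
    ip link set dev bond0 mtu 1500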

> Thanks,
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com

> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
> BCFo
> =GJL4
> -----END PGP SIGNATURE-----
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> We forgot to upload the ceph.log yesterday. It is there now.
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> I upped the debug on about everything and ran the test for about 40
>>> minutes. I took OSD.19 on ceph1 down and then brought it back in.
>>> There was at least one op on osd.19 that was blocked for over 1,000
>>> seconds. Hopefully this will have something that will cast a light on
>>> what is going on.
>>>
>>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>>> the test to verify the results from the dev cluster. This cluster
>>> matches the hardware of our production cluster but is not yet in
>>> production so we can safely wipe it to downgrade back to Hammer.
>>>
>>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>>
>>> Let me know what else we can do to help.
>>>
>>> Thanks,
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.2.0
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
>>> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
>>> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
>>> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
>>> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
>>> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
>>> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
>>> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
>>> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
>>> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
>>> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
>>> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
>>> EDrG
>>> =BZVw
>>> -----END PGP SIGNATURE-----
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA256
>>>>
>>>> On my second test (a much longer one), it took nearly an hour, but a
>>>> few messages have popped up over a 20 minute window. Still far fewer than I
>>>> have been seeing.
>>>> - ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>
>>>>
>>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>>>
>>>>> I'll capture another set of logs. Is there any other debugging you
>>>>> want turned up? I've seen the same thing where I see the message
>>>>> dispatched to the secondary OSD, but the message just doesn't show up
>>>>> for 30+ seconds in the secondary OSD logs.
>>>>> - ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>
>>>>>
>>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA256
>>>>>>>
>>>>>>> I can't think of anything. In my dev cluster the only thing that has
>>>>>>> changed is the Ceph versions (no reboot). What I like is even though
>>>>>>> the disks are 100% utilized, it is performing as I expect now. Client
>>>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>>>>>> the OSD boots or during the recovery period. This is with
>>>>>>> max_backfills set to 20; one backfill max in our production cluster is
>>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>>>>>> our dev cluster very easily and very quickly with these settings. So
>>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>>>>>> marked out. We would love to see that go away too, but this is far
>>>>>>                                             (me too!)
>>>>>>> better than what we have now. This dev cluster also has
>>>>>>> osd_client_message_cap set to default (100).
>>>>>>>
>>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>>>>>> would you prefer a bisect to find the introduction of the problem
>>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>>>>>> (Hammer -> Infernalis)? Do you have any hints for avoiding commits that
>>>>>>> prevent a clean build, as that is my most limiting factor?
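For what it's worth, git bisect can step around commits that won't build; a sketch of the usual workflow (the tags are examples only):

    git bisect start v0.94.3 v0.87.2   # bad first, then good
    # build and test each candidate the tool checks out; then either
    git bisect skip                    # if the commit will not build cleanly
    git bisect good                    # or: git bisect bad, based on the test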
>>>>>>
>>>>>> Nothing comes to mind.  I think the best way to find this is still to see
>>>>>> it happen in the logs with hammer.  The frustrating thing with that log
>>>>>> dump you sent is that although I see plenty of slow request warnings in
>>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>>>>>> turned up for long enough?
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> - ----------------
>>>>>>> Robert LeBlanc
>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>>>>> >> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >>> Hash: SHA256
>>>>>> >>>
>>>>>> >>> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>>>>> >>> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>>>>> >>> messages when the OSD was marked out:
>>>>>> >>>
>>>>>> >>> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>>>>> >>> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>>>>> >>> 34.476006 secs
>>>>>> >>> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>>>>> >>> cluster [WRN] slow request 32.913474 seconds old, received at
>>>>>> >>> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>>>>> >>> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>>>>>> >>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>> >>> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>>>>> >>> cluster [WRN] slow request 32.697545 seconds old, received at
>>>>>> >>> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>>>>> >>> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>>>>>> >>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>> >>> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>>>>> >>> cluster [WRN] slow request 32.668006 seconds old, received at
>>>>>> >>> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>>>>> >>> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>>>>>> >>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>> >>>
>>>>>> >>> But I'm not seeing the blocked messages when the OSD came back in. The
>>>>>> >>> OSD spindles have been running at 100% during this test. I have seen
>>>>>> >>> slowed I/O from the clients as expected from the extra load, but so
>>>>>> >>> far no blocked messages. I'm going to run some more tests.
>>>>>> >>
>>>>>> >> Good to hear.
>>>>>> >>
>>>>>> >> FWIW I looked through the logs and all of the slow request no flag point
>>>>>> >> messages came from osd.163... and the logs don't show when they arrived.
>>>>>> >> My guess is this OSD has a slower disk than the others, or something else
>>>>>> >> funny is going on?
>>>>>> >>
>>>>>> >> I spot checked another OSD at random (60) where I saw a slow request.  It
>>>>>> >> was stuck peering for 10s of seconds... waiting on a pg log message from
>>>>>> >> osd.163.
>>>>>> >>
>>>>>> >> sage
>>>>>> >>
>>>>>> >>
>>>>>> >>>
>>>>>> >>> -----BEGIN PGP SIGNATURE-----
>>>>>> >>> Version: Mailvelope v1.2.0
>>>>>> >>> Comment: https://www.mailvelope.com
>>>>>> >>>
>>>>>> >>> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>>>>>> >>> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>>>>>> >>> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>>>>>> >>> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>>>>>> >>> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>>>>>> >>> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>>>>>> >>> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>>>>>> >>> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>>>>>> >>> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>>>>>> >>> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>>>>>> >>> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>>>>>> >>> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>>>>>> >>> fo5a
>>>>>> >>> =ahEi
>>>>>> >>> -----END PGP SIGNATURE-----
>>>>>> >>> ----------------
>>>>>> >>> Robert LeBlanc
>>>>>> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>>>>>> >> >> On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>>>>> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >>> Hash: SHA256
>>>>>> >> >>>
>>>>>> >> >>> With some off-list help, we have adjusted
>>>>>> >> >>> osd_client_message_cap=10000. This seems to have helped a bit and we
>>>>>> >> >>> have seen some OSDs have a value up to 4,000 for client messages. But
>>>>>> >> >>> it does not solve the problem with the blocked I/O.
>>>>>> >> >>>
>>>>>> >> >>> One thing that I have noticed is that almost exactly 30 seconds elapse
>>>>>> >> >>> between when an OSD boots and the first blocked I/O message. I don't know
>>>>>> >> >>> if the OSD doesn't have time to get its brain right about a PG before
>>>>>> >> >>> it starts servicing it or what exactly.
>>>>>> >> >>
>>>>>> >> >> I'm downloading the logs from yesterday now; sorry it's taking so long.
>>>>>> >> >>
>>>>>> >> >>> On another note, I tried upgrading our CentOS dev cluster from Hammer
>>>>>> >> >>> to master and things didn't go so well. The OSDs would not start
>>>>>> >> >>> because /var/lib/ceph was not owned by ceph. I chowned the directory
>>>>>> >> >>> and all OSDs, and the OSDs then started, but never became active in the
>>>>>> >> >>> cluster. It just sat there after reading all the PGs. There were
>>>>>> >> >>> sockets open to the monitor, but no OSD to OSD sockets. I tried
>>>>>> >> >>> downgrading to the Infernalis branch and still no luck getting the
>>>>>> >> >>> OSDs to come up. The OSD processes were idle after the initial boot.
>>>>>> >> >>> All packages were installed from gitbuilder.
>>>>>> >> >>
>>>>>> >> >> Did you chown -R ?
>>>>>> >> >>
>>>>>> >> >>         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>>>>>> >> >>
>>>>>> >> >> My guess is you only chowned the root dir, and the OSD didn't throw
>>>>>> >> >> an error when it encountered the other files?  If you can generate a debug
>>>>>> >> >> osd = 20 log, that would be helpful.. thanks!
>>>>>> >> >>
>>>>>> >> >> sage
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >>>
>>>>>> >> >>> Thanks,
>>>>>> >> >>> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >>> Version: Mailvelope v1.2.0
>>>>>> >> >>> Comment: https://www.mailvelope.com
>>>>>> >> >>>
>>>>>> >> >>> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>>>>>> >> >>> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>>>>>> >> >>> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>>>>>> >> >>> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>>>>>> >> >>> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>>>>>> >> >>> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>>>>>> >> >>> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>>>>>> >> >>> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>>>>>> >> >>> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>>>>>> >> >>> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>>>>>> >> >>> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>>>>>> >> >>> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>>>>>> >> >>> GdXC
>>>>>> >> >>> =Aigq
>>>>>> >> >>> -----END PGP SIGNATURE-----
>>>>>> >> >>> ----------------
>>>>>> >> >>> Robert LeBlanc
>>>>>> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >>>
>>>>>> >> >>>
>>>>>> >> >>> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>>>>>> >> >> >> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >> Hash: SHA256
>>>>>> >> >> >>
>>>>>> >> >> >> I have eight nodes running the fio job rbd_test_real to different RBD
>>>>>> >> >> >> volumes. I've included the CRUSH map in the tarball.
>>>>>> >> >> >>
>>>>>> >> >> >> I stopped one OSD process and marked it out. I let it recover for a
>>>>>> >> >> >> few minutes and then I started the process again and marked it in. I
>>>>>> >> >> >> started getting block I/O messages during the recovery.
>>>>>> >> >> >>
>>>>>> >> >> >> The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>>>>>> >> >> >>
>>>>>> >> >> >> Thanks,
>>>>>> >> >> >> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >> >> Version: Mailvelope v1.2.0
>>>>>> >> >> >> Comment: https://www.mailvelope.com
>>>>>> >> >> >>
>>>>>> >> >> >> wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>>>>>> >> >> >> 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>>>>>> >> >> >> jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>>>>>> >> >> >> 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>>>>>> >> >> >> OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>>>>>> >> >> >> ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>>>>>> >> >> >> R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>>>>>> >> >> >> boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>>>>>> >> >> >> sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>>>>>> >> >> >> GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>>>>>> >> >> >> SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>>>>>> >> >> >> PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>>>>>> >> >> >> 3EPx
>>>>>> >> >> >> =UDIV
>>>>>> >> >> >> -----END PGP SIGNATURE-----
>>>>>> >> >> >>
>>>>>> >> >> >> ----------------
>>>>>> >> >> >> Robert LeBlanc
>>>>>> >> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >> On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>>>>>> >> >> >>> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>>>>> >> >> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>>> Hash: SHA256
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> We are still struggling with this and have tried a lot of different
>>>>>> >> >> >>>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>>>>> >> >> >>>> consulting services for non-Red Hat systems. If there are some
>>>>>> >> >> >>>> certified Ceph consultants in the US with whom we can do both remote and
>>>>>> >> >> >>>> on-site engagements, please let us know.
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> This certainly seems to be network related, but somewhere in the
>>>>>> >> >> >>>> kernel. We have tried increasing the network and TCP buffers, number
>>>>>> >> >> >>>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>>>>>> >> >> >>>> on the boxes, the disks are busy, but not constantly at 100% (they
>>>>>> >> >> >>>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>>>>> >> >> >>>> at a time). There seems to be no reasonable explanation why I/O is
>>>>>> >> >> >>>> blocked pretty frequently longer than 30 seconds. We have verified
>>>>>> >> >> >>>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>>>>> >> >> >>>> network admins have verified that packets are not being dropped in the
>>>>>> >> >> >>>> switches for these nodes. We have tried different kernels including
>>>>>> >> >> >>>> the recent Google patch to cubic. This is showing up on three clusters
>>>>>> >> >> >>>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>>>>>> >> >> >>>> (from CentOS 7.1) with similar results.
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> The messages seem slightly different:
>>>>>> >> >> >>>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>>>>> >> >> >>>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>>>>> >> >> >>>> 100.087155 secs
>>>>>> >> >> >>>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>>>>> >> >> >>>> cluster [WRN] slow request 30.041999 seconds old, received at
>>>>>> >> >> >>>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>>>>> >> >> >>>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>>>>> >> >> >>>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>>>>> >> >> >>>> points reached
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> I don't know what "no flag points reached" means.
>>>>>> >> >> >>>
>>>>>> >> >> >>> Just that the op hasn't been marked as reaching any interesting points
>>>>>> >> >> >>> (op->mark_*() calls).
>>>>>> >> >> >>>
>>>>>> >> >> >>> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
>>>>>> >> >> >>> It's extremely verbose but it'll let us see where the op is getting
>>>>>> >> >> >>> blocked.  If you see the "slow request" message it means the op is
>>>>>> >> >> >>> received by ceph (that's when the clock starts), so I suspect it's not
>>>>>> >> >> >>> something we can blame on the network stack.
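>>>>>> >> >> >>> For example (one possible way to do it), on the OSDs involved:
>>>>>> >> >> >>>   ceph tell osd.13 injectargs '--debug-ms 20 --debug-osd 20'
>>>>>> >> >> >>>   ceph tell osd.16 injectargs '--debug-ms 20 --debug-osd 20'
>>>>>> >> >> >>>   ceph tell osd.17 injectargs '--debug-ms 20 --debug-osd 20'
>>>>>> >> >> >>> or the equivalent [osd] settings in ceph.conf, reverted after the capture.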
>>>>>> >> >> >>>
>>>>>> >> >> >>> sage
>>>>>> >> >> >>>
>>>>>> >> >> >>>
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> The problem is most pronounced when we have to reboot an OSD node (1
>>>>>> >> >> >>>> of 13); we will have hundreds of I/Os blocked, sometimes for up to 300
>>>>>> >> >> >>>> seconds. It takes a good 15 minutes for things to settle down. The
>>>>>> >> >> >>>> production cluster is very busy doing normally 8,000 I/O and peaking
>>>>>> >> >> >>>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>>>>>> >> >> >>>> are between 25-50% full. We are currently splitting PGs to distribute
>>>>>> >> >> >>>> the load better across the disks, but we are having to do this 10 PGs
>>>>>> >> >> >>>> at a time as we get blocked I/O. We have max_backfills and
>>>>>> >> >> >>>> max_recovery set to 1, client op priority is set higher than recovery
>>>>>> >> >> >>>> priority. We tried increasing the number of op threads but this didn't
>>>>>> >> >> >>>> seem to help. It seems as soon as PGs are finished being checked, they
>>>>>> >> >> >>>> become active and could be the cause for slow I/O while the other PGs
>>>>>> >> >> >>>> are being checked.
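>>>>>> >> >> >>>> (Roughly, the relevant bits of ceph.conf look something like this --
>>>>>> >> >> >>>> option names from memory, the priority values only illustrative:
>>>>>> >> >> >>>>   [osd]
>>>>>> >> >> >>>>   osd max backfills = 1
>>>>>> >> >> >>>>   osd recovery max active = 1
>>>>>> >> >> >>>>   osd client op priority = 63
>>>>>> >> >> >>>>   osd recovery op priority = 1
>>>>>> >> >> >>>> )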
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> What I don't understand is that the messages are delayed. As soon as
>>>>>> >> >> >>>> the message is received by the Ceph OSD process, it is very quickly
>>>>>> >> >> >>>> committed to the journal and a response is sent back to the primary
>>>>>> >> >> >>>> OSD, which is received very quickly as well. I've adjusted
>>>>>> >> >> >>>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>>>>>> >> >> >>>> doesn't solve the main problem. We don't have swap and there is 64 GB
>>>>>> >> >> >>>> of RAM per nodes for 10 OSDs.
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> Is there something that could cause the kernel to get a packet but not
>>>>>> >> >> >>>> be able to dispatch it to Ceph, which could explain why we are seeing
>>>>>> >> >> >>>> this blocked I/O for 30+ seconds? Are there any pointers to tracing
>>>>>> >> >> >>>> Ceph messages from the network buffer through the kernel to the Ceph
>>>>>> >> >> >>>> process?
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> We can really use some pointers, no matter how outrageous. We've had
>>>>>> >> >> >>>> over 6 people looking into this for weeks now and just can't think of
>>>>>> >> >> >>>> anything else.
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> Thanks,
>>>>>> >> >> >>>> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >> >>>> Version: Mailvelope v1.1.0
>>>>>> >> >> >>>> Comment: https://www.mailvelope.com
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>>>>>> >> >> >>>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>>>>>> >> >> >>>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>>>>>> >> >> >>>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>>>>>> >> >> >>>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>>>>>> >> >> >>>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>>>>>> >> >> >>>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>>>>>> >> >> >>>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>>>>>> >> >> >>>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>>>>>> >> >> >>>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>>>>>> >> >> >>>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>>>>>> >> >> >>>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>>>>>> >> >> >>>> l7OF
>>>>>> >> >> >>>> =OI++
>>>>>> >> >> >>>> -----END PGP SIGNATURE-----
>>>>>> >> >> >>>> ----------------
>>>>>> >> >> >>>> Robert LeBlanc
>>>>>> >> >> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>>>
>>>>>> >> >> >>>>
>>>>>> >> >> >>>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>>>>>> >> >> >>> >> We dropped the replication on our cluster from 4 to 3 and it looks
>>>>>> >> >> >>> >> like all the blocked I/O has stopped (no entries in the log for the
>>>>>> >> >> >>> >> last 12 hours). This makes me believe that there is some issue with
>>>>>> >> >> >>> >> the number of sockets or some other TCP issue. We have not messed with
>>>>>> >> >> >>> >> Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>>>>>> >> >> >>> >> hosts hosting about 150 VMs. The open files limit is set at 32K for
>>>>>> >> >> >>> >> the OSD processes and 16K system-wide.
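>>>>>> >> >> >>> >> (To double-check what limits are actually in effect, something like:
>>>>>> >> >> >>> >>   grep 'Max open files' /proc/<pid-of-a-ceph-osd>/limits
>>>>>> >> >> >>> >>   ulimit -n
>>>>>> >> >> >>> >>   sysctl fs.file-max
>>>>>> >> >> >>> >> where <pid-of-a-ceph-osd> is a placeholder for an actual OSD pid.)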
>>>>>> >> >> >>> >>
>>>>>> >> >> >>> >> Does this seem like the right spot to be looking? What are some
>>>>>> >> >> >>> >> configuration items we should be looking at?
>>>>>> >> >> >>> >>
>>>>>> >> >> >>> >> Thanks,
>>>>>> >> >> >>> >> ----------------
>>>>>> >> >> >>> >> Robert LeBlanc
>>>>>> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>> >>
>>>>>> >> >> >>> >>
>>>>>> >> >> >>> >> On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>>>>>> >> >> >>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>> >>> Hash: SHA256
>>>>>> >> >> >>> >>>
>>>>>> >> >> >>> >>> We were only able to get ~17 Gb/s out of the XL710 (heavily tweaked)
>>>>>> >> >> >>> >>> until we went to the 4.x kernel, where we got ~36 Gb/s (no tweaking). It
>>>>>> >> >> >>> >>> seems that there were some major reworks in the network handling in
>>>>>> >> >> >>> >>> the kernel to efficiently handle that network rate. If I remember
>>>>>> >> >> >>> >>> right we also saw a drop in CPU utilization. I'm starting to think
>>>>>> >> >> >>> >>> that we did see packet loss while congesting our ISLs in our initial
>>>>>> >> >> >>> >>> testing, but we could not tell where the dropping was happening. We
>>>>>> >> >> >>> >>> saw some on the switches, but it didn't seem to be bad if we weren't
>>>>>> >> >> >>> >>> trying to congest things. We probably already saw this issue, just
>>>>>> >> >> >>> >>> didn't know it.
>>>>>> >> >> >>> >>> - ----------------
>>>>>> >> >> >>> >>> Robert LeBlanc
>>>>>> >> >> >>> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>> >>>
>>>>>> >> >> >>> >>>
>>>>>> >> >> >>> >>> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>>>>> >> >> >>> >>>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>>>>> >> >> >>> >>>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>>>>> >> >> >>> >>>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>>>>> >> >> >>> >>>> drivers might cause problems though.
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>> Here's ifconfig from one of the nodes:
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>> ens513f1: flags=4163  mtu 1500
>>>>>> >> >> >>> >>>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>>>>> >> >> >>> >>>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>>>>> >> >> >>> >>>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>>>>> >> >> >>> >>>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>>>>> >> >> >>> >>>>         RX errors 0  dropped 0  overruns 0  frame 0
>>>>>> >> >> >>> >>>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>>>>> >> >> >>> >>>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>> Mark
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>> >>>>> Hash: SHA256
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> OK, here is the update on the saga...
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> I traced some more of the blocked I/Os and it seems that communication
>>>>>> >> >> >>> >>>>> between two hosts was worse than between the others. I did a two-way
>>>>>> >> >> >>> >>>>> ping flood between the two hosts using the max packet size (1500). After
>>>>>> >> >> >>> >>>>> 1.5M packets, no lost pings. Then I had the ping flood running while I
>>>>>> >> >> >>> >>>>> put Ceph load on the cluster and the dropped pings started increasing;
>>>>>> >> >> >>> >>>>> after stopping the Ceph workload the pings stopped dropping.
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> I then ran iperf between all the nodes with the same results, so that
>>>>>> >> >> >>> >>>>> ruled out Ceph to a large degree. I then booted into the
>>>>>> >> >> >>> >>>>> 3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far
>>>>>> >> >> >>> >>>>> there haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>>>>> >> >> >>> >>>>> need the network enhancements in the 4.x series to work well.
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>>>>> >> >> >>> >>>>> kernel to see where this issue is introduced. Both of the clusters
>>>>>> >> >> >>> >>>>> with this issue are running 4.x; other than that, they have pretty
>>>>>> >> >> >>> >>>>> different hardware and network configs.
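>>>>>> >> >> >>> >>>>> (The plan is the standard kernel bisect workflow, roughly:
>>>>>> >> >> >>> >>>>>   git bisect start
>>>>>> >> >> >>> >>>>>   git bisect bad v4.2     # a kernel that shows the problem (example tag)
>>>>>> >> >> >>> >>>>>   git bisect good v3.10   # a kernel that does not (example tag)
>>>>>> >> >> >>> >>>>>   # build, boot, rerun the fio test, then mark good/bad until it converges
>>>>>> >> >> >>> >>>>> )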
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> Thanks,
>>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >> >>> >>>>> Version: Mailvelope v1.1.0
>>>>>> >> >> >>> >>>>> Comment: https://www.mailvelope.com
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>>>>> >> >> >>> >>>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>>>>> >> >> >>> >>>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>>>>> >> >> >>> >>>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>>>>> >> >> >>> >>>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>>>>> >> >> >>> >>>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>>>>> >> >> >>> >>>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>>>>> >> >> >>> >>>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>>>>> >> >> >>> >>>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>>>>> >> >> >>> >>>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>>>>> >> >> >>> >>>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>>>>> >> >> >>> >>>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>>>>> >> >> >>> >>>>> 4OEo
>>>>>> >> >> >>> >>>>> =P33I
>>>>>> >> >> >>> >>>>> -----END PGP SIGNATURE-----
>>>>>> >> >> >>> >>>>> ----------------
>>>>>> >> >> >>> >>>>> Robert LeBlanc
>>>>>> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>>>>> >> >> >>> >>>>> wrote:
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>> >>>>>> Hash: SHA256
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>>>>>> >> >> >>> >>>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>>>> >> >> >>> >>>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>>>> >> >> >>> >>>>>> found that an MTU under 32K worked reliably for ping, but we still had
>>>>>> >> >> >>> >>>>>> the blocked I/O.
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>>>> >> >> >>> >>>>>> the blocked I/O.
>>>>>> >> >> >>> >>>>>> - ----------------
>>>>>> >> >> >>> >>>>>> Robert LeBlanc
>>>>>> >> >> >>> >>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>>>> >> >> >>> >>>>>>>> between when osd.17 started sending the osd_repop message and when
>>>>>> >> >> >>> >>>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>>> >> >> >>> >>>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>>> >> >> >>> >>>>>>>> delayed for many 10s of seconds?
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>>>> >> >> >>> >>>>>>> has
>>>>>> >> >> >>> >>>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>>> sage
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>>>
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> What kernel are you running?
>>>>>> >> >> >>> >>>>>>>> -Sam
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>> >>>>>>>>> Hash: SHA256
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>>> >> >> >>> >>>>>>>>> extracted what I think are important entries from the logs for the
>>>>>> >> >> >>> >>>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>>> >> >> >>> >>>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>>> >> >> >>> >>>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>>> >> >> >>> >>>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>>> >> >> >>> >>>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>>> >> >> >>> >>>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>>> >> >> >>> >>>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>>> >> >> >>> >>>>>>>>> transfer).
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>>> >> >> >>> >>>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>>> >> >> >>> >>>>>>>>> later. To me it seems that the message is getting received but not
>>>>>> >> >> >>> >>>>>>>>> passed to another thread right away or something. This test was done
>>>>>> >> >> >>> >>>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>>> >> >> >>> >>>>>>>>> thread.
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>>> >> >> >>> >>>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>>> >> >> >>> >>>>>>>>> some help.
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> Single Test started about
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:52:36
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>>> >> >> >>> >>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>> >> >> >>> >>>>>>>>> 30.439150 secs
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>>> >> >> >>> >>>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:06.487451:
>>>>>> >> >> >>> >>>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>>> >> >> >>> >>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> >> >> >>> >>>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>> >> >> >>> >>>>>>>>>   currently waiting for subops from 13,16
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>>> >> >> >>> >>>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>>> >> >> >>> >>>>>>>>> 30.379680 secs
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>>> >> >> >>> >>>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>>> >> >> >>> >>>>>>>>> 12:55:06.406303:
>>>>>> >> >> >>> >>>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>>> >> >> >>> >>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> >> >> >>> >>>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>> >> >> >>> >>>>>>>>>   currently waiting for subops from 13,17
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>>> >> >> >>> >>>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>>> >> >> >>> >>>>>>>>> 12:55:06.318144:
>>>>>> >> >> >>> >>>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>>> >> >> >>> >>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> >> >> >>> >>>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>> >> >> >>> >>>>>>>>>   currently waiting for subops from 13,14
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>>> >> >> >>> >>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>> >> >> >>> >>>>>>>>> 30.954212 secs
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>>> >> >> >>> >>>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:57:33.044003:
>>>>>> >> >> >>> >>>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>>> >> >> >>> >>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> >> >> >>> >>>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>> >> >> >>> >>>>>>>>>   currently waiting for subops from 16,17
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>>> >> >> >>> >>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>> >> >> >>> >>>>>>>>> 30.704367 secs
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>>> >> >> >>> >>>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>>> >> >> >>> >>>>>>>>> 2015-09-22 12:57:33.055404:
>>>>>> >> >> >>> >>>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>>> >> >> >>> >>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>> >> >> >>> >>>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>> >> >> >>> >>>>>>>>>   currently waiting for subops from 13,17
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> Server   IP addr              OSD
>>>>>> >> >> >>> >>>>>>>>> nodev  - 192.168.55.11 - 12
>>>>>> >> >> >>> >>>>>>>>> nodew  - 192.168.55.12 - 13
>>>>>> >> >> >>> >>>>>>>>> nodex  - 192.168.55.13 - 16
>>>>>> >> >> >>> >>>>>>>>> nodey  - 192.168.55.14 - 17
>>>>>> >> >> >>> >>>>>>>>> nodez  - 192.168.55.15 - 14
>>>>>> >> >> >>> >>>>>>>>> nodezz - 192.168.55.16 - 15
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> fio job:
>>>>>> >> >> >>> >>>>>>>>> [rbd-test]
>>>>>> >> >> >>> >>>>>>>>> readwrite=write
>>>>>> >> >> >>> >>>>>>>>> blocksize=4M
>>>>>> >> >> >>> >>>>>>>>> #runtime=60
>>>>>> >> >> >>> >>>>>>>>> name=rbd-test
>>>>>> >> >> >>> >>>>>>>>> #readwrite=randwrite
>>>>>> >> >> >>> >>>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>>> >> >> >>> >>>>>>>>> #rwmixread=72
>>>>>> >> >> >>> >>>>>>>>> #norandommap
>>>>>> >> >> >>> >>>>>>>>> #size=1T
>>>>>> >> >> >>> >>>>>>>>> #blocksize=4k
>>>>>> >> >> >>> >>>>>>>>> ioengine=rbd
>>>>>> >> >> >>> >>>>>>>>> rbdname=test2
>>>>>> >> >> >>> >>>>>>>>> pool=rbd
>>>>>> >> >> >>> >>>>>>>>> clientname=admin
>>>>>> >> >> >>> >>>>>>>>> iodepth=8
>>>>>> >> >> >>> >>>>>>>>> #numjobs=4
>>>>>> >> >> >>> >>>>>>>>> #thread
>>>>>> >> >> >>> >>>>>>>>> #group_reporting
>>>>>> >> >> >>> >>>>>>>>> #time_based
>>>>>> >> >> >>> >>>>>>>>> #direct=1
>>>>>> >> >> >>> >>>>>>>>> #ramp_time=60
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> Thanks,
>>>>>> >> >> >>> >>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >> >>> >>>>>>>>> Version: Mailvelope v1.1.0
>>>>>> >> >> >>> >>>>>>>>> Comment: https://www.mailvelope.com
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>>>> >> >> >>> >>>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>>>> >> >> >>> >>>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>>>> >> >> >>> >>>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>>>> >> >> >>> >>>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>>>> >> >> >>> >>>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>>>> >> >> >>> >>>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>>>> >> >> >>> >>>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>>>> >> >> >>> >>>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>>>> >> >> >>> >>>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>>>> >> >> >>> >>>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>>>> >> >> >>> >>>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>>>> >> >> >>> >>>>>>>>> J3hS
>>>>>> >> >> >>> >>>>>>>>> =0J7F
>>>>>> >> >> >>> >>>>>>>>> -----END PGP SIGNATURE-----
>>>>>> >> >> >>> >>>>>>>>> ----------------
>>>>>> >> >> >>> >>>>>>>>> Robert LeBlanc
>>>>>> >> >> >>> >>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>> >> >> >>> >>>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >> >> >>> >>>>>>>>>>> Hash: SHA256
>>>>>> >> >> >>> >>>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>> You can search for the (mangled) name _split_collection
>>>>>> >> >> >>> >>>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>> I'm not
>>>>>> >> >> >>> >>>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>>>>> >> >> >>> >>>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>> >> >> >>> >>>>>>>>>> this, it was discussed not too long ago.
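>>>>>> >> >> >>> >>>>>>>>>> (That is, the FileStore directory split/merge options, something like
>>>>>> >> >> >>> >>>>>>>>>> the following in ceph.conf -- the values here are just an example:
>>>>>> >> >> >>> >>>>>>>>>>   [osd]
>>>>>> >> >> >>> >>>>>>>>>>   filestore merge threshold = 40
>>>>>> >> >> >>> >>>>>>>>>>   filestore split multiple = 8
>>>>>> >> >> >>> >>>>>>>>>> )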
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>> >> >> >>> >>>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>> >> >> >>> >>>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>>>>> >> >> >>> >>>>>>>>>>> the
>>>>>> >> >> >>> >>>>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
>>>>>> >> >> >>> >>>>>>>>>>> having to create new file and therefore split collections. This is
>>>>>> >> >> >>> >>>>>>>>>>> on
>>>>>> >> >> >>> >>>>>>>>>>> my test cluster with no other load.
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>> >> >> >>> >>>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>> >> >> >>> >>>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>> >> >> >>> >>>>>>>>>>> would be the most helpful for tracking this issue down?
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>>>> >> >> >>> >>>>>>>>>> 20",
>>>>>> >> >> >>> >>>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>>>> >> >> >>> >>>>>>>>>> out
>>>>>> >> >> >>> >>>>>>>>>> everything you need to track exactly what each Op is doing.
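>>>>>> >> >> >>> >>>>>>>>>> In ceph.conf form that would be something like:
>>>>>> >> >> >>> >>>>>>>>>>   [osd]
>>>>>> >> >> >>> >>>>>>>>>>   debug osd = 20
>>>>>> >> >> >>> >>>>>>>>>>   debug filestore = 20
>>>>>> >> >> >>> >>>>>>>>>>   debug ms = 1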
>>>>>> >> >> >>> >>>>>>>>>> -Greg
>>>>>> >> >> >>> >>>>>>>>>
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>>>
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >> >>> >>>>>> Version: Mailvelope v1.1.0
>>>>>> >> >> >>> >>>>>> Comment: https://www.mailvelope.com
>>>>>> >> >> >>> >>>>>>
>>>>>> >> >> >>> >>>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>>>>> >> >> >>> >>>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>>>>> >> >> >>> >>>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>>>>> >> >> >>> >>>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>>>>> >> >> >>> >>>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>>>>> >> >> >>> >>>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>>>>> >> >> >>> >>>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>>>>> >> >> >>> >>>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>>>>> >> >> >>> >>>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>>>>> >> >> >>> >>>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>>>>> >> >> >>> >>>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>>>>> >> >> >>> >>>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>>>>> >> >> >>> >>>>>> gcZm
>>>>>> >> >> >>> >>>>>> =CjwB
>>>>>> >> >> >>> >>>>>> -----END PGP SIGNATURE-----
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>>
>>>>>> >> >> >>> >>>>
>>>>>> >> >> >>> >>>
>>>>>> >> >> >>> >>> -----BEGIN PGP SIGNATURE-----
>>>>>> >> >> >>> >>> Version: Mailvelope v1.1.0
>>>>>> >> >> >>> >>> Comment: https://www.mailvelope.com
>>>>>> >> >> >>> >>>
>>>>>> >> >> >>> >>> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>>>>>> >> >> >>> >>> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>>>>>> >> >> >>> >>> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>>>>>> >> >> >>> >>> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>>>>>> >> >> >>> >>> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>>>>>> >> >> >>> >>> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>>>>>> >> >> >>> >>> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>>>>>> >> >> >>> >>> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>>>>>> >> >> >>> >>> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>>>>>> >> >> >>> >>> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>>>>>> >> >> >>> >>> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>>>>>> >> >> >>> >>> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>>>>>> >> >> >>> >>> ae22
>>>>>> >> >> >>> >>> =AX+L
>>>>>> >> >> >>> >>> -----END PGP SIGNATURE-----
>>>>>> >> >> >>>>
>>>>>> >> >> >>>>
>>>>>> >> >>>
>>>>>> >> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>>>
>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>> Version: Mailvelope v1.2.0
>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>
>>>>>>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
>>>>>>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
>>>>>>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
>>>>>>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
>>>>>>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
>>>>>>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
>>>>>>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
>>>>>>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
>>>>>>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
>>>>>>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
>>>>>>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
>>>>>>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
>>>>>>> 6Kfk
>>>>>>> =/gR6
>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>
>>>>>>>
>>>>>
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: Mailvelope v1.2.0
>>>>> Comment: https://www.mailvelope.com
>>>>>
>>>>> wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
>>>>> x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
>>>>> Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
>>>>> SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
>>>>> QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
>>>>> mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
>>>>> 7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
>>>>> QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
>>>>> MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
>>>>> rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
>>>>> gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
>>>>> RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
>>>>> JFPi
>>>>> =ofgq
>>>>> -----END PGP SIGNATURE-----
>>>>
>>>> -----BEGIN PGP SIGNATURE-----
>>>> Version: Mailvelope v1.2.0
>>>> Comment: https://www.mailvelope.com
>>>>
>>>> wsFcBAEBCAAQBQJWFDDOCRDmVDuy+mK58QAA0kUP/1rfRQa5Us9b/VCvKrhk
>>>> BYrde1/FBybKBVXsuXVU8Dq124A1e4L682AhmQPUeVP8PQLoqS/VFSl0h7i6
>>>> 28AzydDaBTTjnrp6ZzVbtmKtm8WhmtSTFvWTlu/yJmRXAht9YozmFCByBfIY
>>>> GYvOhZzjvbxBKfwnwq97QkS7xfY2tss/BmaOvSVTX7naYaOF+HRwZMSt+BF4
>>>> 9vg9BLSL3Aic0BnvdM64TWkDaHp/3gwGSmyMn8Q2Sa9CqUTddKQx2HXN6doo
>>>> gIyxCj+dIw2Pt73u2NoiYv8ZhTuS3QYM4n0rRBxj8Wr/EeNwGAOwdDSgbOxf
>>>> OvDyozzmCpQyW3h/nkdQJW5mWsJmyDIiGxHDdUn7Vgemg+Bbod0ACdoJiwct
>>>> /BIRVQe2Ee1nZQFoKBOhvaWO6+ePJR7CVfLjMkZBTzKZBjt2tfkq17G5KTdS
>>>> EsehvG/+vfFJkANL5Xh6eo9ptlHbFW8I/44pvUtGi2JwsN487l56XR9DqEKM
>>>> 7Cmj9Ox205YxjqcBjhWIJQTok99lvrhDX9d7HHxIeTcmouvqPz4LTcCySRtC
>>>> xE/GcEGAAYWGPTwf9u8ULm9Rh2Z90OnKpqtCtuuWiwRRL9VU/tLlvqmHvEZM
>>>> 73qhiLQZka5I72B2SAEtJnDt2sX3NJ4unvH4zWKLRFTTm4M0qk6xUL1JfqNz
>>>> JYNo
>>>> =msX2
>>>> -----END PGP SIGNATURE-----
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWFXGPCRDmVDuy+mK58QAAx38P/1sn6TA8hH+F2kd1A2Pq
>> IU2cg1pFcH+kw21G8VO+BavfBaBoSETHEEuMXg5SszTIcL/HyziBLJos0C0j
>> Vu9I0/YtblQ15enzFqKFPosdc7qij9DPJxXRkx41sJZsxvSVky+URcPpcKk6
>> w8Lwuq9IupesQ19ZeJkCEWFVhKz/i2E9/VXfylBgFVlkICD+5pfx6/Aq7nCP
>> 4gboyha07zpPlDqoA7xgT+6v2zlYC80saGcA1m2XaAUdPF/17l6Mq9+Glv7E
>> 3KeUf7jmMTJQRGBZSInFgUpPwUQKvF5OSGb3YQlzofUy5Es+wH3ccqZ+mlIY
>> szuBLAtN6zhFFPCs6016hiragiUhLk97PItXaKdDJKecuyRdShlJrXJmtX+j
>> NdM14TkBPTiLtAd/IZEEhIIpdvQH8YSl3LnEZ5gywggaY4Pk3JLFIJPgLpEb
>> T8hJnuiaQaYxERQ0nRoBL4LAXARseSrOuVt2EAD50Yb/5JEwB9FQlN758rb1
>> AE/xhpK6d53+RlkPODKxXx816hXvDP6NADaC78XGmx+A4FfepdxBijGBsmOQ
>> 7SxAZe469K0E6EAfClc664VzwuvBEZjwTg1eK5Z6VS/FDTH/RxTKeFhlbUIT
>> XpezlP7XZ1/YRrJ/Eg7nb1Dv0MYQdu18tQ6QBv+C1ZsmxYLlHlcf6BZ3gNar
>> rZW5
>> =dKn9
>> -----END PGP SIGNATURE-----

-- 
WBR, Max A. Krasilnikov
ColoCall Data Center

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                                       ` <20151009084843.GL86022-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org>
@ 2015-10-09  9:05                                                                                                         ` Jan Schermer
       [not found]                                                                                                           ` <F2832DEB-FB8D-47EE-B364-F92DAF711D35-SB6/BxVxTjHtwjQa/ONI9g@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Schermer @ 2015-10-09  9:05 UTC (permalink / raw)
  To: Max A. Krasilnikov
  Cc: Sage Weil, ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

Are there any errors on the NICs? (ethtool -S ethX)
Also take a look at the switch and look for flow control statistics - do you have flow control enabled or disabled?
We had to disable flow control as it would pause all I/O on the port whenever any path got congested, which you don't want to happen with a cluster like Ceph. It's better to let the frame drop/retransmit in this case (and you should size the network so it doesn't happen in any case).
And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't put my money on that...
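Roughly, on the Linux side that means something like the following (eth0 is a placeholder for the actual interface; the switch-side counters are vendor-specific):
  ethtool -S eth0 | egrep -i 'err|drop|pause|miss'   # per-queue error/drop/pause counters
  ethtool -a eth0                                    # show current pause (flow control) settings
  ethtool -A eth0 rx off tx off                      # disable flow control, if the NIC supports it
  ethtool -k eth0                                    # list which offloads (TSO/GSO/GRO/LRO) are on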

Jan


> On 09 Oct 2015, at 10:48, Max A. Krasilnikov <pseudo-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org> wrote:
> 
> Hello!
> 
> On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
> 
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
> 
>> Sage,
> 
>> After trying to bisect this issue (all tests moved the bisect towards
>> Infernalis) and eventually testing the Infernalis branch again, it
>> looks like the problem still exists although it is handled a tad
>> better in Infernalis. I'm going to test against Firefly/Giant next
>> week and then try to dive into the code to see if I can expose
>> anything.
> 
>> If I can do anything to provide you with information, please let me know.
> 
> I have fixed my troubles by setting MTU back to 1500 from 9000 in 2x10G network
> between nodes (2x Cisco Nexus 5020, one link per switch, LACP, linux bonding
> driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 82599ES
> Adapter, non-intel sfp+). When setting it to 9000 on nodes and 9216 on Nexus 5020
> switch with Jumbo frames enabled I have a performance drop and slow requests. When
> setting 1500 on the nodes and not touching the Nexus, all problems are fixed.
> 
> I have rebooted all my ceph services when changing MTU and changing things to
> 9000 and 1500 several times in order to be sure. It is reproducible in my
> environment.
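> (For anyone checking the same thing: the path MTU can be verified end to end
> with a don't-fragment ping, and the interface MTU set per node, e.g.
>   ping -M do -s 8972 <peer>        # 9000 MTU minus 28 bytes of IP+ICMP headers
>   ping -M do -s 1472 <peer>        # the 1500-byte equivalent
>   ip link set dev bond0 mtu 1500   # bond0 is a placeholder for the real interface
> where <peer> is the other node's storage address.)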
> 
>> Thanks,
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
> 
>> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
>> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
>> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
>> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
>> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
>> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
>> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
>> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
>> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
>> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
>> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
>> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
>> BCFo
>> =GJL4
>> -----END PGP SIGNATURE-----
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
>> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>> 
>>> We forgot to upload the ceph.log yesterday. It is there now.
>>> - ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> 
>>> 
>>> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA256
>>>> 
>>>> I upped the debug on about everything and ran the test for about 40
>>>> minutes. I took OSD.19 on ceph1 down and then brought it back in.
>>>> There was at least one op on osd.19 that was blocked for over 1,000
>>>> seconds. Hopefully this will have something that will cast a light on
>>>> what is going on.
>>>> 
>>>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>>>> the test to verify the results from the dev cluster. This cluster
>>>> matches the hardware of our production cluster but is not yet in
>>>> production so we can safely wipe it to downgrade back to Hammer.
>>>> 
>>>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>>> 
>>>> Let me know what else we can do to help.
>>>> 
>>>> Thanks,
>>>> -----BEGIN PGP SIGNATURE-----
>>>> Version: Mailvelope v1.2.0
>>>> Comment: https://www.mailvelope.com
>>>> 
>>>> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
>>>> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
>>>> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
>>>> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
>>>> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
>>>> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
>>>> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
>>>> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
>>>> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
>>>> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
>>>> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
>>>> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
>>>> EDrG
>>>> =BZVw
>>>> -----END PGP SIGNATURE-----
>>>> ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> 
>>>> 
>>>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>>> 
>>>>> On my second test (a much longer one), it took nearly an hour, but a
>>>>> few messages have popped up over a 20 minute window. Still far less than I
>>>>> have been seeing.
>>>>> - ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> 
>>>>> 
>>>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> Hash: SHA256
>>>>>> 
>>>>>> I'll capture another set of logs. Is there any other debugging you
>>>>>> want turned up? I've seen the same thing where I see the message
>>>>>> dispatched to the secondary OSD, but the message just doesn't show up
>>>>>> for 30+ seconds in the secondary OSD logs.
>>>>>> - ----------------
>>>>>> Robert LeBlanc
>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>>>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>> Hash: SHA256
>>>>>>>> 
>>>>>>>> I can't think of anything. In my dev cluster the only thing that has
>>>>>>>> changed is the Ceph versions (no reboot). What I like is even though
>>>>>>>> the disks are 100% utilized, it is preforming as I expect now. Client
>>>>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>>>>>>> the OSD boots or during the recovery period. This is with
>>>>>>>> max_backfills set to 20, one backfill max in our production cluster is
>>>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>>>>>>> our dev cluster very easily and very quickly with these settings. So
>>>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>>>>>>> marked out. We would love to see that go away too, but this is far
>>>>>>>                                            (me too!)
>>>>>>>> better than what we have now. This dev cluster also has
>>>>>>>> osd_client_message_cap set to default (100).
>>>>>>>> 
>>>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>>>>>>> do you prefer a bisect to find the introduction of the problem
>>>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>>>>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>>>>>>>> commit that prevents a clean build as that is my most limiting factor?
>>>>>>> 
>>>>>>> Nothing comes to mind.  I think the best way to find this is still to see
>>>>>>> it happen in the logs with hammer.  The frustrating thing with that log
>>>>>>> dump you sent is that although I see plenty of slow request warnings in
>>>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>>>>>>> turned up for long enough?
>>>>>>> 
>>>>>>> sage
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> - ----------------
>>>>>>>> Robert LeBlanc
>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>>>>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>> Hash: SHA256
>>>>>>>>>> 
>>>>>>>>>> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>>>>>>>>> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>>>>>>>>> messages when the OSD was marked out:
>>>>>>>>>> 
>>>>>>>>>> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>>>>>>>>> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>>>>>>>>> 34.476006 secs
>>>>>>>>>> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>>>>>>>>> cluster [WRN] slow request 32.913474 seconds old, received at
>>>>>>>>>> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>>>>>>>>> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>>>>>>>>> cluster [WRN] slow request 32.697545 seconds old, received at
>>>>>>>>>> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>>>>>>>>> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>>>>>>>>> cluster [WRN] slow request 32.668006 seconds old, received at
>>>>>>>>>> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>>>>>>>>> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>> 
>>>>>>>>>> But I'm not seeing the blocked messages when the OSD came back in. The
>>>>>>>>>> OSD spindles have been running at 100% during this test. I have seen
>>>>>>>>>> slowed I/O from the clients as expected from the extra load, but so
>>>>>>>>>> far no blocked messages. I'm going to run some more tests.
>>>>>>>>> 
>>>>>>>>> Good to hear.
>>>>>>>>> 
>>>>>>>>> FWIW I looked through the logs and all of the slow request no flag point
>>>>>>>>> messages came from osd.163... and the logs don't show when they arrived.
>>>>>>>>> My guess is this OSD has a slower disk than the others, or something else
>>>>>>>>> funny is going on?
>>>>>>>>> 
>>>>>>>>> I spot checked another OSD at random (60) where I saw a slow request.  It
>>>>>>>>> was stuck peering for 10s of seconds... waiting on a pg log message from
>>>>>>>>> osd.163.
>>>>>>>>> 
>>>>>>>>> sage
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>> 
>>>>>>>>>> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>>>>>>>>>> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>>>>>>>>>> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>>>>>>>>>> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>>>>>>>>>> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>>>>>>>>>> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>>>>>>>>>> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>>>>>>>>>> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>>>>>>>>>> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>>>>>>>>>> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>>>>>>>>>> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>>>>>>>>>> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>>>>>>>>>> fo5a
>>>>>>>>>> =ahEi
>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>> ----------------
>>>>>>>>>> Robert LeBlanc
>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>>>>>>>>>>> On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>> 
>>>>>>>>>>>> With some off-list help, we have adjusted
>>>>>>>>>>>> osd_client_message_cap=10000. This seems to have helped a bit and we
>>>>>>>>>>>> have seen some OSDs have a value up to 4,000 for client messages. But
>>>>>>>>>>>> it does not solve the problem with the blocked I/O.
>>>>>>>>>>>> 
>>>>>>>>>>>> One thing that I have noticed is that almost exactly 30 seconds elapse
>>>>>>>>>>>> between when an OSD boots and the first blocked I/O message. I don't know
>>>>>>>>>>>> if the OSD doesn't have time to get its brain right about a PG before
>>>>>>>>>>>> it starts servicing it or what exactly.
>>>>>>>>>>> 
>>>>>>>>>>> I'm downloading the logs from yesterday now; sorry it's taking so long.
>>>>>>>>>>> 
>>>>>>>>>>>> On another note, I tried upgrading our CentOS dev cluster from Hammer
>>>>>>>>>>>> to master and things didn't go so well. The OSDs would not start
>>>>>>>>>>>> because /var/lib/ceph was not owned by ceph. I chowned the directory
>>>>>>>>>>>> and all OSDs and the OSD then started, but never became active in the
>>>>>>>>>>>> cluster. It just sat there after reading all the PGs. There were
>>>>>>>>>>>> sockets open to the monitor, but no OSD to OSD sockets. I tried
>>>>>>>>>>>> downgrading to the Infernalis branch and still no luck getting the
>>>>>>>>>>>> OSDs to come up. The OSD processes were idle after the initial boot.
>>>>>>>>>>>> All packages were installed from gitbuilder.
>>>>>>>>>>> 
>>>>>>>>>>> Did you chown -R ?
>>>>>>>>>>> 
>>>>>>>>>>>        https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
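>>>>>>>>>>> i.e., with the daemons stopped, something like:
>>>>>>>>>>>   chown -R ceph:ceph /var/lib/ceph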
>>>>>>>>>>> 
>>>>>>>>>>> My guess is you only chowned the root dir, and the OSD didn't throw
>>>>>>>>>>> an error when it encountered the other files?  If you can generate a debug
>>>>>>>>>>> osd = 20 log, that would be helpful.. thanks!
>>>>>>>>>>> 
>>>>>>>>>>> sage
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>> 
>>>>>>>>>>>> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>>>>>>>>>>>> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>>>>>>>>>>>> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>>>>>>>>>>>> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>>>>>>>>>>>> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>>>>>>>>>>>> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>>>>>>>>>>>> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>>>>>>>>>>>> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>>>>>>>>>>>> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>>>>>>>>>>>> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>>>>>>>>>>>> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>>>>>>>>>>>> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>>>>>>>>>>>> GdXC
>>>>>>>>>>>> =Aigq
>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>> ----------------
>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I have eight nodes running the fio job rbd_test_real to different RBD
>>>>>>>>>>>>> volumes. I've included the CRUSH map in the tarball.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I stopped one OSD process and marked it out. I let it recover for a
>>>>>>>>>>>>> few minutes and then I started the process again and marked it in. I
>>>>>>>>>>>>> started getting blocked I/O messages during the recovery.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>>>>>>>>>>>>> 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>>>>>>>>>>>>> jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>>>>>>>>>>>>> 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>>>>>>>>>>>>> OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>>>>>>>>>>>>> ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>>>>>>>>>>>>> R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>>>>>>>>>>>>> boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>>>>>>>>>>>>> sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>>>>>>>>>>>>> GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>>>>>>>>>>>>> SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>>>>>>>>>>>>> PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>>>>>>>>>>>>> 3EPx
>>>>>>>>>>>>> =UDIV
>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>> 
>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>>>>>>>>>>>>>> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We are still struggling with this and have tried a lot of different
>>>>>>>>>>>>>>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>>>>>>>>>>>>>> consulting services for non-Red Hat systems. If there are any
>>>>>>>>>>>>>>> certified Ceph consultants in the US who can do both remote and
>>>>>>>>>>>>>>> on-site engagements, please let us know.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This certainly seems to be network related, but somewhere in the
>>>>>>>>>>>>>>> kernel. We have tried increasing the network and TCP buffers and the number
>>>>>>>>>>>>>>> of TCP sockets, and reducing the FIN_WAIT2 timeout. There is about 25% idle
>>>>>>>>>>>>>>> on the boxes, the disks are busy, but not constantly at 100% (they
>>>>>>>>>>>>>>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>>>>>>>>>>>>>> at a time). There seems to be no reasonable explanation why I/O is
>>>>>>>>>>>>>>> blocked pretty frequently longer than 30 seconds. We have verified
>>>>>>>>>>>>>>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>>>>>>>>>>>>>> network admins have verified that packets are not being dropped in the
>>>>>>>>>>>>>>> switches for these nodes. We have tried different kernels including
>>>>>>>>>>>>>>> the recent Google patch to cubic. This is showing up on three clusters
>>>>>>>>>>>>>>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>>>>>>>>>>>>>>> (from CentOS 7.1) with similar results.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The messages seem slightly different:
>>>>>>>>>>>>>>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>>>>>>>>>>>>>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>> 100.087155 secs
>>>>>>>>>>>>>>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>>>>>>>>>>>>>> cluster [WRN] slow request 30.041999 seconds old, received at
>>>>>>>>>>>>>>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>>>>>>>>>>>>>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>>>>>>>>>>>>>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>>>>>>>>>>>>>> points reached
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I don't know what "no flag points reached" means.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Just that the op hasn't been marked as reaching any interesting points
>>>>>>>>>>>>>> (op->mark_*() calls).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
>>>>>>>>>>>>>> It's extremely verbose but it'll let us see where the op is getting
>>>>>>>>>>>>>> blocked.  If you see the "slow request" message it means the op is
>>>>>>>>>>>>>> received by ceph (that's when the clock starts), so I suspect it's not
>>>>>>>>>>>>>> something we can blame on the network stack.
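As a sketch of how those levels can be raised on a running cluster (assuming a Hammer-era setup where injectargs is available; remember to turn them back down afterwards, since this is extremely verbose):

  # raise messenger and OSD debugging on every OSD (very verbose)
  ceph tell osd.* injectargs '--debug_ms 20 --debug_osd 20'

  # ...reproduce the blocked I/O, then collect /var/log/ceph/ceph-osd.*.log...

  # restore the defaults afterwards
  ceph tell osd.* injectargs '--debug_ms 0/5 --debug_osd 0/5'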
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The problem is most pronounced when we have to reboot an OSD node (1
>>>>>>>>>>>>>>> of 13), we will have hundreds of I/O blocked for some times up to 300
>>>>>>>>>>>>>>> seconds. It takes a good 15 minutes for things to settle down. The
>>>>>>>>>>>>>>> production cluster is very busy doing normally 8,000 I/O and peaking
>>>>>>>>>>>>>>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>>>>>>>>>>>>>>> are between 25-50% full. We are currently splitting PGs to distribute
>>>>>>>>>>>>>>> the load better across the disks, but we are having to do this 10 PGs
>>>>>>>>>>>>>>> at a time as we get blocked I/O. We have max_backfills and
>>>>>>>>>>>>>>> max_recovery set to 1, client op priority is set higher than recovery
>>>>>>>>>>>>>>> priority. We tried increasing the number of op threads but this didn't
>>>>>>>>>>>>>>> seem to help. It seems as soon as PGs are finished being checked, they
>>>>>>>>>>>>>>> become active and could be the cause for slow I/O while the other PGs
>>>>>>>>>>>>>>> are being checked.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What I don't understand is that the messages are delayed. As soon as
>>>>>>>>>>>>>>> the message is received by Ceph OSD process, it is very quickly
>>>>>>>>>>>>>>> committed to the journal and a response is sent back to the primary
>>>>>>>>>>>>>>> OSD which is received very quickly as well. I've adjust
>>>>>>>>>>>>>>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>>>>>>>>>>>>>>> doesn't solve the main problem. We don't have swap and there is 64 GB
>>>>>>>>>>>>>>> of RAM per nodes for 10 OSDs.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Is there something that could cause the kernel to get a packet but not
>>>>>>>>>>>>>>> be able to dispatch it to Ceph such that it could be explaining why we
>>>>>>>>>>>>>>> are seeing these blocked I/O for 30+ seconds. Is there some pointers
>>>>>>>>>>>>>>> to tracing Ceph messages from the network buffer through the kernel to
>>>>>>>>>>>>>>> the Ceph process?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We can really use some pointers, no matter how outrageous. We have
>>>>>>>>>>>>>>> over 6 people looking into this for weeks now and just can't think of
>>>>>>>>>>>>>>> anything else.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>>>>>>>>>>>>>>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>>>>>>>>>>>>>>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>>>>>>>>>>>>>>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>>>>>>>>>>>>>>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>>>>>>>>>>>>>>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>>>>>>>>>>>>>>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>>>>>>>>>>>>>>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>>>>>>>>>>>>>>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>>>>>>>>>>>>>>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>>>>>>>>>>>>>>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>>>>>>>>>>>>>>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>>>>>>>>>>>>>>> l7OF
>>>>>>>>>>>>>>> =OI++
>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>> We dropped the replication on our cluster from 4 to 3 and it looks
>>>>>>>>>>>>>>>> like all the blocked I/O has stopped (no entries in the log for the
>>>>>>>>>>>>>>>> last 12 hours). This makes me believe that there is some issue with
>>>>>>>>>>>>>>>> the number of sockets or some other TCP issue. We have not messed with
>>>>>>>>>>>>>>>> Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>>>>>>>>>>>>>>>> hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>>>>>>>>>>>>>>>> processes and 16K system wide.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Does this seem like the right spot to be looking? What are some
>>>>>>>>>>>>>>>> configuration items we should be looking at?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>>>>>>>>>>>>>>>>> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>>>>>>>>>>>>>>>>> seems that there were some major reworks in the network handling in
>>>>>>>>>>>>>>>>> the kernel to efficiently handle that network rate. If I remember
>>>>>>>>>>>>>>>>> right we also saw a drop in CPU utilization. I'm starting to think
>>>>>>>>>>>>>>>>> that we did see packet loss while congesting our ISLs in our initial
>>>>>>>>>>>>>>>>> testing, but we could not tell where the dropping was happening. We
>>>>>>>>>>>>>>>>> saw some on the switches, but it didn't seem to be bad if we weren't
>>>>>>>>>>>>>>>>> trying to congest things. We probably already saw this issue, just
>>>>>>>>>>>>>>>>> didn't know it.
>>>>>>>>>>>>>>>>> - ----------------
>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>>>>>>>>>>>>>>>>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>>>>>>>>>>>>>>>>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>>>>>>>>>>>>>>>>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>>>>>>>>>>>>>>>>> drivers might cause problems though.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Here's ifconfig from one of the nodes:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ens513f1: flags=4163  mtu 1500
>>>>>>>>>>>>>>>>>>        inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>>>>>>>>>>>>>>>>>        inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>>>>>>>>>>>>>>>>>        ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>>>>>>>>>>>>>>>>>        RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>>>>>>>>>>>>>>>>>        RX errors 0  dropped 0  overruns 0  frame 0
>>>>>>>>>>>>>>>>>>        TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>>>>>>>>>>>>>>>>>        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> OK, here is the update on the saga...
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I traced some more of blocked I/Os and it seems that communication
>>>>>>>>>>>>>>>>>>> between two hosts seemed worse than others. I did a two way ping flood
>>>>>>>>>>>>>>>>>>> between the two hosts using max packet sizes (1500). After 1.5M
>>>>>>>>>>>>>>>>>>> packets, no lost pings. Then I had the ping flood running while I
>>>>>>>>>>>>>>>>>>> put Ceph load on the cluster and the dropped pings started increasing;
>>>>>>>>>>>>>>>>>>> after stopping the Ceph workload the pings stopped dropping.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I then ran iperf between all the nodes with the same results, so that
>>>>>>>>>>>>>>>>>>> ruled out Ceph to a large degree. I then booted into the
>>>>>>>>>>>>>>>>>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>>>>>>>>>>>>>>>>>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>>>>>>>>>>>>>>>>>> need the network enhancements in the 4.x series to work well.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>>>>>>>>>>>>>>>>>> kernel to see where this issue is introduced. Both of the clusters
>>>>>>>>>>>>>>>>>>> with this issue are running 4.x; other than that, they have pretty
>>>>>>>>>>>>>>>>>>> different hardware and network configs.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>>>>>>>>>>>>>>>>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>>>>>>>>>>>>>>>>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>>>>>>>>>>>>>>>>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>>>>>>>>>>>>>>>>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>>>>>>>>>>>>>>>>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>>>>>>>>>>>>>>>>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>>>>>>>>>>>>>>>>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>>>>>>>>>>>>>>>>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>>>>>>>>>>>>>>>>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>>>>>>>>>>>>>>>>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>>>>>>>>>>>>>>>>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>>>>>>>>>>>>>>>>>> 4OEo
>>>>>>>>>>>>>>>>>>> =P33I
>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>>>>>>>>>>>>>>>>>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>>>>>>>>>>>>>>>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>>>>>>>>>>>>>>>>>> found that an MTU under 32K worked reliably for ping, but still had the
>>>>>>>>>>>>>>>>>>>> blocked I/O.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>>>>>>>>>>>>>>>>>> the blocked I/O.
>>>>>>>>>>>>>>>>>>>> - ----------------
>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>>>>>>>>>>>>>>>>>>>> between when osd.17 started sending the osd_repop message and when
>>>>>>>>>>>>>>>>>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>>>>>>>>>>>>>>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>>>>>>>>>>>>>>>>>>> delayed for many 10s of seconds?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>>>>>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> What kernel are you running?
>>>>>>>>>>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>>>>>>>>>>>>>>>>>>>> extracted what I think are important entries from the logs for the
>>>>>>>>>>>>>>>>>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>>>>>>>>>>>>>>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>>>>>>>>>>>>>>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>>>>>>>>>>>>>>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>>>>>>>>>>>>>>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>>>>>>>>>>>>>>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>>>>>>>>>>>>>>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>>>>>>>>>>>>>>>>>>>> transfer).
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>>>>>>>>>>>>>>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>>>>>>>>>>>>>>>>>>>> later. To me it seems that the message is getting received but not
>>>>>>>>>>>>>>>>>>>>>>> passed to another thread right away or something. This test was done
>>>>>>>>>>>>>>>>>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>>>>>>>>>>>>>>>>>>>> thread.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>>>>>>>>>>>>>>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>>>>>>>>>>>>>>>>>>>> some help.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Single Test started about
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:52:36
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>> 30.439150 secs
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.487451:
>>>>>>>>>>>>>>>>>>>>>>>  osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>  currently waiting for subops from 13,16
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>>>>>>>>>>>>>>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>> 30.379680 secs
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>>>>>>>>>>>>>>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>>>>>>>>>>>>>>>>>>>> 12:55:06.406303:
>>>>>>>>>>>>>>>>>>>>>>>  osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>  currently waiting for subops from 13,17
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>>>>>>>>>>>>>>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>>>>>>>>>>>>>>>>>>>> 12:55:06.318144:
>>>>>>>>>>>>>>>>>>>>>>>  osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>  currently waiting for subops from 13,14
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>> 30.954212 secs
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:57:33.044003:
>>>>>>>>>>>>>>>>>>>>>>>  osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>  currently waiting for subops from 16,17
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>> 30.704367 secs
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:57:33.055404:
>>>>>>>>>>>>>>>>>>>>>>>  osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>  currently waiting for subops from 13,17
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Server   IP addr              OSD
>>>>>>>>>>>>>>>>>>>>>>> nodev  - 192.168.55.11 - 12
>>>>>>>>>>>>>>>>>>>>>>> nodew  - 192.168.55.12 - 13
>>>>>>>>>>>>>>>>>>>>>>> nodex  - 192.168.55.13 - 16
>>>>>>>>>>>>>>>>>>>>>>> nodey  - 192.168.55.14 - 17
>>>>>>>>>>>>>>>>>>>>>>> nodez  - 192.168.55.15 - 14
>>>>>>>>>>>>>>>>>>>>>>> nodezz - 192.168.55.16 - 15
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> fio job:
>>>>>>>>>>>>>>>>>>>>>>> [rbd-test]
>>>>>>>>>>>>>>>>>>>>>>> readwrite=write
>>>>>>>>>>>>>>>>>>>>>>> blocksize=4M
>>>>>>>>>>>>>>>>>>>>>> ##runtime=60
>>>>>>>>>>>>>>>>>>>>>>> name=rbd-test
>>>>>>>>>>>>>>>>>>>>>> ##readwrite=randwrite
>>>>>>>>>>>>>>>>>>>>>> ##bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>>>>>>>>>>>>>>>>>>> ##rwmixread=72
>>>>>>>>>>>>>>>>>>>>>> ##norandommap
>>>>>>>>>>>>>>>>>>>>>> ##size=1T
>>>>>>>>>>>>>>>>>>>>>> ##blocksize=4k
>>>>>>>>>>>>>>>>>>>>>>> ioengine=rbd
>>>>>>>>>>>>>>>>>>>>>>> rbdname=test2
>>>>>>>>>>>>>>>>>>>>>>> pool=rbd
>>>>>>>>>>>>>>>>>>>>>>> clientname=admin
>>>>>>>>>>>>>>>>>>>>>>> iodepth=8
>>>>>>>>>>>>>>>>>>>>>> ##numjobs=4
>>>>>>>>>>>>>>>>>>>>>> ##thread
>>>>>>>>>>>>>>>>>>>>>> ##group_reporting
>>>>>>>>>>>>>>>>>>>>>> ##time_based
>>>>>>>>>>>>>>>>>>>>>> ##direct=1
>>>>>>>>>>>>>>>>>>>>>> ##ramp_time=60
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>>>>>>>>>>>>>>>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>>>>>>>>>>>>>>>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>>>>>>>>>>>>>>>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>>>>>>>>>>>>>>>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>>>>>>>>>>>>>>>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>>>>>>>>>>>>>>>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>>>>>>>>>>>>>>>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>>>>>>>>>>>>>>>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>>>>>>>>>>>>>>>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>>>>>>>>>>>>>>>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>>>>>>>>>>>>>>>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>>>>>>>>>>>>>>>>>>>>> J3hS
>>>>>>>>>>>>>>>>>>>>>>> =0J7F
>>>>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> You can search for the (mangled) name _split_collection
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I'm not
>>>>>>>>>>>>>>>>>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>>>>>>>>>>>>>>>>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>>>>>>>>>>>>>>>>>>>> this, it was discussed not too long ago.
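As a sketch of what bumping those thresholds can look like (FileStore option names of this era; the values below are purely illustrative, not a recommendation from this thread):

  # ceph.conf, [osd] section -- illustrative values only
  #   filestore merge threshold = 40
  #   filestore split multiple = 8

  # or injected at runtime (affects future splits/merges)
  ceph tell osd.* injectargs '--filestore_merge_threshold 40 --filestore_split_multiple 8'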
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>>>>>>>>>>>>>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> beginning) is still causing the issue. I suspect that it is not
>>>>>>>>>>>>>>>>>>>>>>>>> having to create new files and therefore split collections. This is
>>>>>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>> my test cluster with no other load.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>>>>>>>>>>>>>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>>>>>>>>>>>>>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>>>>>>>>>>>>>>>>>>>>> would be the most helpful for tracking this issue down?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>>>>>>>>>>>>>>>>>>>>>> 20",
>>>>>>>>>>>>>>>>>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>>>>>>>>>>>>>>>>>>>>>> out
>>>>>>>>>>>>>>>>>>>>>>>> everything you need to track exactly what each Op is doing.
>>>>>>>>>>>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>>>>>>>>>>>>>>>>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>>>>>>>>>>>>>>>>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>>>>>>>>>>>>>>>>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>>>>>>>>>>>>>>>>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>>>>>>>>>>>>>>>>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>>>>>>>>>>>>>>>>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>>>>>>>>>>>>>>>>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>>>>>>>>>>>>>>>>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>>>>>>>>>>>>>>>>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>>>>>>>>>>>>>>>>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>>>>>>>>>>>>>>>>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>>>>>>>>>>>>>>>>>>> gcZm
>>>>>>>>>>>>>>>>>>>> =CjwB
>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>>>>>>>>>>>>>>>>> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>>>>>>>>>>>>>>>>> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>>>>>>>>>>>>>>>>> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>>>>>>>>>>>>>>>>> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>>>>>>>>>>>>>>>>> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>>>>>>>>>>>>>>>>> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>>>>>>>>>>>>>>>>> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>>>>>>>>>>>>>>>>> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>>>>>>>>>>>>>>>>> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>>>>>>>>>>>>>>>>> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>>>>>>>>>>>>>>>>> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>>>>>>>>>>>>>>>>> ae22
>>>>>>>>>>>>>>>>> =AX+L
>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>> 
>>>>>>>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
>>>>>>>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
>>>>>>>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
>>>>>>>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
>>>>>>>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
>>>>>>>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
>>>>>>>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
>>>>>>>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
>>>>>>>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
>>>>>>>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
>>>>>>>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
>>>>>>>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
>>>>>>>> 6Kfk
>>>>>>>> =/gR6
>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> Version: Mailvelope v1.2.0
>>>>>> Comment: https://www.mailvelope.com
>>>>>> 
>>>>>> wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
>>>>>> x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
>>>>>> Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
>>>>>> SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
>>>>>> QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
>>>>>> mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
>>>>>> 7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
>>>>>> QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
>>>>>> MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
>>>>>> rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
>>>>>> gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
>>>>>> RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
>>>>>> JFPi
>>>>>> =ofgq
>>>>>> -----END PGP SIGNATURE-----
>>>>> 
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: Mailvelope v1.2.0
>>>>> Comment: https://www.mailvelope.com
>>>>> 
>>>>> wsFcBAEBCAAQBQJWFDDOCRDmVDuy+mK58QAA0kUP/1rfRQa5Us9b/VCvKrhk
>>>>> BYrde1/FBybKBVXsuXVU8Dq124A1e4L682AhmQPUeVP8PQLoqS/VFSl0h7i6
>>>>> 28AzydDaBTTjnrp6ZzVbtmKtm8WhmtSTFvWTlu/yJmRXAht9YozmFCByBfIY
>>>>> GYvOhZzjvbxBKfwnwq97QkS7xfY2tss/BmaOvSVTX7naYaOF+HRwZMSt+BF4
>>>>> 9vg9BLSL3Aic0BnvdM64TWkDaHp/3gwGSmyMn8Q2Sa9CqUTddKQx2HXN6doo
>>>>> gIyxCj+dIw2Pt73u2NoiYv8ZhTuS3QYM4n0rRBxj8Wr/EeNwGAOwdDSgbOxf
>>>>> OvDyozzmCpQyW3h/nkdQJW5mWsJmyDIiGxHDdUn7Vgemg+Bbod0ACdoJiwct
>>>>> /BIRVQe2Ee1nZQFoKBOhvaWO6+ePJR7CVfLjMkZBTzKZBjt2tfkq17G5KTdS
>>>>> EsehvG/+vfFJkANL5Xh6eo9ptlHbFW8I/44pvUtGi2JwsN487l56XR9DqEKM
>>>>> 7Cmj9Ox205YxjqcBjhWIJQTok99lvrhDX9d7HHxIeTcmouvqPz4LTcCySRtC
>>>>> xE/GcEGAAYWGPTwf9u8ULm9Rh2Z90OnKpqtCtuuWiwRRL9VU/tLlvqmHvEZM
>>>>> 73qhiLQZka5I72B2SAEtJnDt2sX3NJ4unvH4zWKLRFTTm4M0qk6xUL1JfqNz
>>>>> JYNo
>>>>> =msX2
>>>>> -----END PGP SIGNATURE-----
>>> 
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.2.0
>>> Comment: https://www.mailvelope.com
>>> 
>>> wsFcBAEBCAAQBQJWFXGPCRDmVDuy+mK58QAAx38P/1sn6TA8hH+F2kd1A2Pq
>>> IU2cg1pFcH+kw21G8VO+BavfBaBoSETHEEuMXg5SszTIcL/HyziBLJos0C0j
>>> Vu9I0/YtblQ15enzFqKFPosdc7qij9DPJxXRkx41sJZsxvSVky+URcPpcKk6
>>> w8Lwuq9IupesQ19ZeJkCEWFVhKz/i2E9/VXfylBgFVlkICD+5pfx6/Aq7nCP
>>> 4gboyha07zpPlDqoA7xgT+6v2zlYC80saGcA1m2XaAUdPF/17l6Mq9+Glv7E
>>> 3KeUf7jmMTJQRGBZSInFgUpPwUQKvF5OSGb3YQlzofUy5Es+wH3ccqZ+mlIY
>>> szuBLAtN6zhFFPCs6016hiragiUhLk97PItXaKdDJKecuyRdShlJrXJmtX+j
>>> NdM14TkBPTiLtAd/IZEEhIIpdvQH8YSl3LnEZ5gywggaY4Pk3JLFIJPgLpEb
>>> T8hJnuiaQaYxERQ0nRoBL4LAXARseSrOuVt2EAD50Yb/5JEwB9FQlN758rb1
>>> AE/xhpK6d53+RlkPODKxXx816hXvDP6NADaC78XGmx+A4FfepdxBijGBsmOQ
>>> 7SxAZe469K0E6EAfClc664VzwuvBEZjwTg1eK5Z6VS/FDTH/RxTKeFhlbUIT
>>> XpezlP7XZ1/YRrJ/Eg7nb1Dv0MYQdu18tQ6QBv+C1ZsmxYLlHlcf6BZ3gNar
>>> rZW5
>>> =dKn9
>>> -----END PGP SIGNATURE-----
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> -- 
> WBR, Max A. Krasilnikov
> ColoCall Data Center
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                                           ` <F2832DEB-FB8D-47EE-B364-F92DAF711D35-SB6/BxVxTjHtwjQa/ONI9g@public.gmane.org>
@ 2015-10-09 11:21                                                                                                             ` Max A. Krasilnikov
       [not found]                                                                                                               ` <20151009112124.GM86022-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Max A. Krasilnikov @ 2015-10-09 11:21 UTC (permalink / raw)
  To: Jan Schermer; +Cc: Sage Weil, ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

Hello!

On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:

> Are there any errors on the NICs? (ethtool -s ethX)

No errors. Neither on nodes, nor on switches.
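For reference, a quick sketch of the counters worth re-checking on each node (interface names are placeholders; note that it is ethtool -S, capital S, that prints statistics, while lowercase -s changes settings):

  # driver-level counters: errors, drops and pause frames
  ethtool -S eth0 | grep -iE 'err|drop|pause'

  # kernel-level RX/TX statistics, including drops and overruns
  ip -s link show dev eth0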

> Also take a look at the switch and look for flow control statistics - do you have flow control enabled or disabled?

flow control disabled everywhere.

> We had to disable flow control as it would pause all IO on the port whenever any path got congested which you don't want to happen with a cluster like Ceph. It's better to let the frame drop/retransmit in this case (and you should size it so it doesn't happen in any case).
> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't put my money on that...

I tried completely disabling all offloads and then setting the MTU back to 9000.
No luck.
I am speaking with my NOC about the MTU in the 10G network. If I have an update, I will
write here. I can hardly believe that it is on the Ceph side, but nothing is
impossible.
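A sketch of the checks and changes being described (interface names are placeholders; which offloads can actually be toggled depends on the driver):

  # inspect current flow-control (pause) and offload settings
  ethtool -a eth0
  ethtool -k eth0

  # disable flow control and the common offloads, then set the MTU
  ethtool -A eth0 rx off tx off autoneg off
  ethtool -K eth0 tso off gso off gro off lro off
  ip link set dev eth0 mtu 9000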

> Jan


>> On 09 Oct 2015, at 10:48, Max A. Krasilnikov <pseudo-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org> wrote:
>> 
>> Hello!
>> 
>> On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
>> 
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>> 
>>> Sage,
>> 
>>> After trying to bisect this issue (all tests moved the bisect towards
>>> Infernalis) and eventually testing the Infernalis branch again, it
>>> looks like the problem still exists, although it is handled a tad
>>> better in Infernalis. I'm going to test against Firefly/Giant next
>>> week and then try to dive into the code to see if I can expose
>>> anything.
>> 
>>> If I can do anything to provide you with information, please let me know.
>> 
>> I have fixed my troubles by setting the MTU back to 1500 from 9000 on the 2x10G network
>> between nodes (2x Cisco Nexus 5020, one link per switch, LACP, Linux bonding
>> driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 82599ES
>> adapter, non-Intel SFP+). When setting it to 9000 on the nodes and 9216 on the Nexus 5020
>> switches with jumbo frames enabled, I get a performance drop and slow requests. When
>> setting 1500 on the nodes and not touching the Nexus, all problems are fixed.
>> 
>> I have rebooted all my Ceph services when changing the MTU, and I have changed it between
>> 9000 and 1500 several times in order to be sure. It is reproducible in my
>> environment.
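One way to confirm whether jumbo frames really survive the whole path (bond, switch, peer) is to ping with fragmentation forbidden; a sketch, with the peer address as a placeholder (8972 = 9000 bytes minus 28 bytes of IP/ICMP headers):

  # must succeed node-to-node in both directions if MTU 9000 is usable end to end
  ping -M do -s 8972 -c 10 192.168.55.12

  # baseline with a standard-MTU payload for comparison
  ping -M do -s 1472 -c 10 192.168.55.12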
>> 
>>> Thanks,
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.2.0
>>> Comment: https://www.mailvelope.com
>> 
>>> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
>>> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
>>> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
>>> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
>>> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
>>> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
>>> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
>>> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
>>> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
>>> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
>>> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
>>> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
>>> BCFo
>>> =GJL4
>>> -----END PGP SIGNATURE-----
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> 
>> 
>>> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA256
>>>> 
>>>> We forgot to upload the ceph.log yesterday. It is there now.
>>>> - ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> 
>>>> 
>>>> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>>> 
>>>>> I upped the debug on about everything and ran the test for about 40
>>>>> minutes. I took OSD.19 on ceph1 down and then brought it back in.
>>>>> There was at least one op on osd.19 that was blocked for over 1,000
>>>>> seconds. Hopefully this will have something that will cast a light on
>>>>> what is going on.
>>>>> 
>>>>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>>>>> the test to verify the results from the dev cluster. This cluster
>>>>> matches the hardware of our production cluster but is not yet in
>>>>> production so we can safely wipe it to downgrade back to Hammer.
>>>>> 
>>>>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>>>> 
>>>>> Let me know what else we can do to help.
>>>>> 
>>>>> Thanks,
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: Mailvelope v1.2.0
>>>>> Comment: https://www.mailvelope.com
>>>>> 
>>>>> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
>>>>> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
>>>>> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
>>>>> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
>>>>> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
>>>>> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
>>>>> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
>>>>> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
>>>>> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
>>>>> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
>>>>> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
>>>>> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
>>>>> EDrG
>>>>> =BZVw
>>>>> -----END PGP SIGNATURE-----
>>>>> ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> 
>>>>> 
>>>>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> Hash: SHA256
>>>>>> 
>>>>>> On my second test (a much longer one), it took nearly an hour, but a
>>>>>> few messages have popped up over a 20 window. Still far less than I
>>>>>> have been seeing.
>>>>>> - ----------------
>>>>>> Robert LeBlanc
>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA256
>>>>>>> 
>>>>>>> I'll capture another set of logs. Is there any other debugging you
>>>>>>> want turned up? I've seen the same thing where I see the message
>>>>>>> dispatched to the secondary OSD, but the message just doesn't show up
>>>>>>> for 30+ seconds in the secondary OSD logs.
>>>>>>> - ----------------
>>>>>>> Robert LeBlanc
>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>>>>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>> Hash: SHA256
>>>>>>>>> 
>>>>>>>>> I can't think of anything. In my dev cluster the only thing that has
>>>>>>>>> changed is the Ceph versions (no reboot). What I like is even though
>>>>>>>>> the disks are 100% utilized, it is performing as I expect now. Client
>>>>>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>>>>>>>> the OSD boots or during the recovery period. This is with
>>>>>>>>> max_backfills set to 20; one backfill max in our production cluster is
>>>>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>>>>>>>> our dev cluster very easily and very quickly with these settings. So
>>>>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>>>>>>>> marked out. We would love to see that go away too, but this is far
>>>>>>>>                                            (me too!)
>>>>>>>>> better than what we have now. This dev cluster also has
>>>>>>>>> osd_client_message_cap set to default (100).
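For reference, a sketch of the throttles being compared here (Hammer-era option names; "max_recovery" is taken to mean osd_recovery_max_active, and the priority values shown are only illustrative of "client op priority higher than recovery priority"):

  # conservative production-style settings
  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
  ceph tell osd.* injectargs '--osd_client_op_priority 63 --osd_recovery_op_priority 1'

  # the much more aggressive value used on the dev cluster for this test
  ceph tell osd.* injectargs '--osd_max_backfills 20'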
>>>>>>>>> 
>>>>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>>>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>>>>>>>> you you prefer a bisect to find the introduction of the problem
>>>>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>>>>>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>>>>>>>>> commit that prevents a clean build as that is my most limiting factor?
>>>>>>>> 
>>>>>>>> Nothing comes to mind.  I think the best way to find this is still to see
>>>>>>>> it happen in the logs with hammer.  The frustrating thing with that log
>>>>>>>> dump you sent is that although I see plenty of slow request warnings in
>>>>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>>>>>>>> turned up for long enough?
>>>>>>>> 
>>>>>>>> sage
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> - ----------------
>>>>>>>>> Robert LeBlanc
>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>>>>>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>> 
>>>>>>>>>>> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>>>>>>>>>> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>>>>>>>>>> messages when the OSD was marked out:
>>>>>>>>>>> 
>>>>>>>>>>> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>>>>>>>>>> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>>>>>>>>>> 34.476006 secs
>>>>>>>>>>> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>>>>>>>>>> cluster [WRN] slow request 32.913474 seconds old, received at
>>>>>>>>>>> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>>>>>>>>>> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>>>>>>>>>> cluster [WRN] slow request 32.697545 seconds old, received at
>>>>>>>>>>> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>>>>>>>>>> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>>>>>>>>>> cluster [WRN] slow request 32.668006 seconds old, received at
>>>>>>>>>>> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>>>>>>>>>> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>> 
>>>>>>>>>>> But I'm not seeing the blocked messages when the OSD came back in. The
>>>>>>>>>>> OSD spindles have been running at 100% during this test. I have seen
>>>>>>>>>>> slowed I/O from the clients as expected from the extra load, but so
>>>>>>>>>>> far no blocked messages. I'm going to run some more tests.
>>>>>>>>>> 
>>>>>>>>>> Good to hear.
>>>>>>>>>> 
>>>>>>>>>> FWIW I looked through the logs and all of the slow request no flag point
>>>>>>>>>> messages came from osd.163... and the logs don't show when they arrived.
>>>>>>>>>> My guess is this OSD has a slower disk than the others, or something else
>>>>>>>>>> funny is going on?
>>>>>>>>>> 
>>>>>>>>>> I spot checked another OSD at random (60) where I saw a slow request.  It
>>>>>>>>>> was stuck peering for 10s of seconds... waiting on a pg log message from
>>>>>>>>>> osd.163.
>>>>>>>>>> 
>>>>>>>>>> sage
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>> 
>>>>>>>>>>> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>>>>>>>>>>> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>>>>>>>>>>> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>>>>>>>>>>> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>>>>>>>>>>> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>>>>>>>>>>> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>>>>>>>>>>> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>>>>>>>>>>> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>>>>>>>>>>> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>>>>>>>>>>> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>>>>>>>>>>> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>>>>>>>>>>> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>>>>>>>>>>> fo5a
>>>>>>>>>>> =ahEi
>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>> ----------------
>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>>>>>>>>>>>> On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>> 
>>>>>>>>>>>>> With some off-list help, we have adjusted
>>>>>>>>>>>>> osd_client_message_cap=10000. This seems to have helped a bit and we
>>>>>>>>>>>>> have seen some OSDs have a value up to 4,000 for client messages. But
>>>>>>>>>>>>> it does not solve the problem with the blocked I/O.
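A sketch of how that cap is applied (10000 being the value mentioned above):

  # ceph.conf, [osd] section:
  #   osd client message cap = 10000

  # or at runtime:
  ceph tell osd.* injectargs '--osd_client_message_cap 10000'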
>>>>>>>>>>>>> 
>>>>>>>>>>>>> One thing that I have noticed is that almost exactly 30 seconds elapse
>>>>>>>>>>>>> between an OSD boots and the first blocked I/O message. I don't know
>>>>>>>>>>>>> if the OSD doesn't have time to get its brain right about a PG before
>>>>>>>>>>>>> it starts servicing it or what exactly.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm downloading the logs from yesterday now; sorry it's taking so long.
>>>>>>>>>>>> 
>>>>>>>>>>>>> On another note, I tried upgrading our CentOS dev cluster from Hammer
>>>>>>>>>>>>> to master and things didn't go so well. The OSDs would not start
>>>>>>>>>>>>> because /var/lib/ceph was not owned by ceph. I chowned the directory
>>>>>>>>>>>>> and all OSDs and the OSD then started, but never became active in the
>>>>>>>>>>>>> cluster. It just sat there after reading all the PGs. There were
>>>>>>>>>>>>> sockets open to the monitor, but no OSD to OSD sockets. I tried
>>>>>>>>>>>>> downgrading to the Infernalis branch and still no luck getting the
>>>>>>>>>>>>> OSDs to come up. The OSD processes were idle after the initial boot.
>>>>>>>>>>>>> All packages were installed from gitbuilder.
>>>>>>>>>>>> 
>>>>>>>>>>>> Did you chown -R ?
>>>>>>>>>>>> 
>>>>>>>>>>>>        https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>>>>>>>>>>>> 
>>>>>>>>>>>> My guess is you only chowned the root dir, and the OSD didn't throw
>>>>>>>>>>>> an error when it encountered the other files?  If you can generate a debug
>>>>>>>>>>>> osd = 20 log, that would be helpful.. thanks!
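For reference, the step the linked release notes describe amounts to something like the following sketch (exact service-management commands depend on the init system in use):

  # stop the daemons first, then fix ownership recursively -- not just the top-level directory
  systemctl stop ceph.target      # or: service ceph stop
  chown -R ceph:ceph /var/lib/ceph
  systemctl start ceph.target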
>>>>>>>>>>>> 
>>>>>>>>>>>> sage
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>>>>>>>>>>>>> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>>>>>>>>>>>>> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>>>>>>>>>>>>> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>>>>>>>>>>>>> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>>>>>>>>>>>>> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>>>>>>>>>>>>> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>>>>>>>>>>>>> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>>>>>>>>>>>>> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>>>>>>>>>>>>> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>>>>>>>>>>>>> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>>>>>>>>>>>>> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>>>>>>>>>>>>> GdXC
>>>>>>>>>>>>> =Aigq
>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I have eight nodes running the fio job rbd_test_real to different RBD
>>>>>>>>>>>>>> volumes. I've included the CRUSH map in the tarball.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I stopped one OSD process and marked it out. I let it recover for a
>>>>>>>>>>>>>> few minutes and then I started the process again and marked it in. I
>>>>>>>>>>>>>> started getting block I/O messages during the recovery.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>>>>>>>>>>>>>> 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>>>>>>>>>>>>>> jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>>>>>>>>>>>>>> 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>>>>>>>>>>>>>> OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>>>>>>>>>>>>>> ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>>>>>>>>>>>>>> R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>>>>>>>>>>>>>> boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>>>>>>>>>>>>>> sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>>>>>>>>>>>>>> GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>>>>>>>>>>>>>> SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>>>>>>>>>>>>>> PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>>>>>>>>>>>>>> 3EPx
>>>>>>>>>>>>>> =UDIV
>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>>>>>>>>>>>>>>> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We are still struggling with this and have tried a lot of different
>>>>>>>>>>>>>>>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>>>>>>>>>>>>>>> consulting services for non-Red Hat systems. If there are some
>>>>>>>>>>>>>>>> certified Ceph consultants in the US that we can do both remote and
>>>>>>>>>>>>>>>> on-site engagements, please let us know.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This certainly seems to be network related, but somewhere in the
>>>>>>>>>>>>>>>> kernel. We have tried increasing the network and TCP buffers, number
>>>>>>>>>>>>>>>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>>>>>>>>>>>>>>>> on the boxes, the disks are busy, but not constantly at 100% (they
>>>>>>>>>>>>>>>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>>>>>>>>>>>>>>> at a time). There seems to be no reasonable explanation why I/O is
>>>>>>>>>>>>>>>> blocked pretty frequently longer than 30 seconds. We have verified
>>>>>>>>>>>>>>>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>>>>>>>>>>>>>>> network admins have verified that packets are not being dropped in the
>>>>>>>>>>>>>>>> switches for these nodes. We have tried different kernels including
>>>>>>>>>>>>>>>> the recent Google patch to cubic. This is showing up on three cluster
>>>>>>>>>>>>>>>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>>>>>>>>>>>>>>>> (from CentOS 7.1) with similar results.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The messages seem slightly different:
>>>>>>>>>>>>>>>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>>>>>>>>>>>>>>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>> 100.087155 secs
>>>>>>>>>>>>>>>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.041999 seconds old, received at
>>>>>>>>>>>>>>>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>>>>>>>>>>>>>>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>>>>>>>>>>>>>>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>>>>>>>>>>>>>>> points reached
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I don't know what "no flag points reached" means.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Just that the op hasn't been marked as reaching any interesting points
>>>>>>>>>>>>>>> (op->mark_*() calls).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Is it possible to gather a lot with debug ms = 20 and debug osd = 20?
>>>>>>>>>>>>>>> It's extremely verbose but it'll let us see where the op is getting
>>>>>>>>>>>>>>> blocked.  If you see the "slow request" message it means the op is
>>>>>>>>>>>>>>> received by ceph (that's when the clock starts), so I suspect it's not
>>>>>>>>>>>>>>> something we can blame on the network stack.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The problem is most pronounced when we have to reboot an OSD node (1
>>>>>>>>>>>>>>>> of 13), we will have hundreds of I/O blocked for some times up to 300
>>>>>>>>>>>>>>>> seconds. It takes a good 15 minutes for things to settle down. The
>>>>>>>>>>>>>>>> production cluster is very busy doing normally 8,000 I/O and peaking
>>>>>>>>>>>>>>>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>>>>>>>>>>>>>>>> are between 25-50% full. We are currently splitting PGs to distribute
>>>>>>>>>>>>>>>> the load better across the disks, but we are having to do this 10 PGs
>>>>>>>>>>>>>>>> at a time as we get blocked I/O. We have max_backfills and
>>>>>>>>>>>>>>>> max_recovery set to 1, client op priority is set higher than recovery
>>>>>>>>>>>>>>>> priority. We tried increasing the number of op threads but this didn't
>>>>>>>>>>>>>>>> seem to help. It seems as soon as PGs are finished being checked, they
>>>>>>>>>>>>>>>> become active and could be the cause for slow I/O while the other PGs
>>>>>>>>>>>>>>>> are being checked.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> What I don't understand is that the messages are delayed. As soon as
>>>>>>>>>>>>>>>> the message is received by Ceph OSD process, it is very quickly
>>>>>>>>>>>>>>>> committed to the journal and a response is sent back to the primary
>>>>>>>>>>>>>>>> OSD which is received very quickly as well. I've adjusted
>>>>>>>>>>>>>>>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>>>>>>>>>>>>>>>> doesn't solve the main problem. We don't have swap and there is 64 GB
>>>>>>>>>>>>>>>> of RAM per node for 10 OSDs.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Is there something that could cause the kernel to receive a packet but not
>>>>>>>>>>>>>>>> be able to dispatch it to Ceph, which would explain why we are seeing
>>>>>>>>>>>>>>>> this blocked I/O for 30+ seconds? Are there any pointers to tracing Ceph
>>>>>>>>>>>>>>>> messages from the network buffer through the kernel to the Ceph process?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We can really use some pointers, no matter how outrageous. We've had
>>>>>>>>>>>>>>>> over 6 people looking into this for weeks now and just can't think of
>>>>>>>>>>>>>>>> anything else.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>>>>>>>>>>>>>>>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>>>>>>>>>>>>>>>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>>>>>>>>>>>>>>>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>>>>>>>>>>>>>>>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>>>>>>>>>>>>>>>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>>>>>>>>>>>>>>>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>>>>>>>>>>>>>>>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>>>>>>>>>>>>>>>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>>>>>>>>>>>>>>>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>>>>>>>>>>>>>>>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>>>>>>>>>>>>>>>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>>>>>>>>>>>>>>>> l7OF
>>>>>>>>>>>>>>>> =OI++
>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>> We dropped the replication on our cluster from 4 to 3 and it looks
>>>>>>>>>>>>>>>>> like all the blocked I/O has stopped (no entries in the log for the
>>>>>>>>>>>>>>>>> last 12 hours). This makes me believe that there is some issue with
>>>>>>>>>>>>>>>>> the number of sockets or some other TCP issue. We have not messed with
>>>>>>>>>>>>>>>>> Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>>>>>>>>>>>>>>>>> hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>>>>>>>>>>>>>>>>> processes and 16K system wide.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Does this seem like the right spot to be looking? What are some
>>>>>>>>>>>>>>>>> configuration items we should be looking at?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>>>>>>>>>>>>>>>>>> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>>>>>>>>>>>>>>>>>> seems that there were some major reworks in the network handling in
>>>>>>>>>>>>>>>>>> the kernel to efficiently handle that network rate. If I remember
>>>>>>>>>>>>>>>>>> right we also saw a drop in CPU utilization. I'm starting to think
>>>>>>>>>>>>>>>>>> that we did see packet loss while congesting our ISLs in our initial
>>>>>>>>>>>>>>>>>> testing, but we could not tell where the dropping was happening. We
>>>>>>>>>>>>>>>>>> saw some on the switches, but it didn't seem to be bad if we weren't
>>>>>>>>>>>>>>>>>> trying to congest things. We probably already saw this issue, just
>>>>>>>>>>>>>>>>>> didn't know it.
>>>>>>>>>>>>>>>>>> - ----------------
>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>>>>>>>>>>>>>>>>>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>>>>>>>>>>>>>>>>>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>>>>>>>>>>>>>>>>>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>>>>>>>>>>>>>>>>>> drivers might cause problems though.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Here's ifconfig from one of the nodes:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> ens513f1: flags=4163  mtu 1500
>>>>>>>>>>>>>>>>>>>        inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>>>>>>>>>>>>>>>>>>        inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>>>>>>>>>>>>>>>>>>        ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>>>>>>>>>>>>>>>>>>        RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>>>>>>>>>>>>>>>>>>        RX errors 0  dropped 0  overruns 0  frame 0
>>>>>>>>>>>>>>>>>>>        TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>>>>>>>>>>>>>>>>>>        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> OK, here is the update on the saga...
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I traced some more of blocked I/Os and it seems that communication
>>>>>>>>>>>>>>>>>>>> between two hosts seemed worse than others. I did a two way ping flood
>>>>>>>>>>>>>>>>>>>> between the two hosts using max packet sizes (1500). After 1.5M
>>>>>>>>>>>>>>>>>>>> packets, no lost pings. Then then had the ping flood running while I
>>>>>>>>>>>>>>>>>>>> put Ceph load on the cluster and the dropped pings started increasing;
>>>>>>>>>>>>>>>>>>>> after stopping the Ceph workload, the pings stopped dropping.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I then ran iperf between all the nodes with the same results, so that
>>>>>>>>>>>>>>>>>>>> ruled out Ceph to a large degree. I then booted into the
>>>>>>>>>>>>>>>>>>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>>>>>>>>>>>>>>>>>>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>>>>>>>>>>>>>>>>>>> need the network enhancements in the 4.x series to work well.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>>>>>>>>>>>>>>>>>>> kernel to see where this issue is introduced. Both of the clusters
>>>>>>>>>>>>>>>>>>>> with this issue are running 4.x; other than that, they have pretty
>>>>>>>>>>>>>>>>>>>> different hardware and network configs.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>>>>>>>>>>>>>>>>>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>>>>>>>>>>>>>>>>>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>>>>>>>>>>>>>>>>>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>>>>>>>>>>>>>>>>>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>>>>>>>>>>>>>>>>>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>>>>>>>>>>>>>>>>>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>>>>>>>>>>>>>>>>>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>>>>>>>>>>>>>>>>>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>>>>>>>>>>>>>>>>>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>>>>>>>>>>>>>>>>>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>>>>>>>>>>>>>>>>>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>>>>>>>>>>>>>>>>>>> 4OEo
>>>>>>>>>>>>>>>>>>>> =P33I
>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> This is IPoIB and we have the MTU set to 64K. There was some issues
>>>>>>>>>>>>>>>>>>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>>>>>>>>>>>>>>>>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>>>>>>>>>>>>>>>>>>> found that MTU under 32K worked reliable for ping, but still had the
>>>>>>>>>>>>>>>>>>>>> blocked I/O.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>>>>>>>>>>>>>>>>>>> the blocked I/O.
>>>>>>>>>>>>>>>>>>>>> - ----------------
>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>>>>>>>>>>>>>>>>>>>>> between when osd.17 started sending the osd_repop message and when
>>>>>>>>>>>>>>>>>>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>>>>>>>>>>>>>>>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>>>>>>>>>>>>>>>>>>>> delayed for many 10s of seconds?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>>>>>>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> What kernel are you running?
>>>>>>>>>>>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>>>>>>>>>>>>>>>>>>>>> extracted what I think are important entries from the logs for the
>>>>>>>>>>>>>>>>>>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>>>>>>>>>>>>>>>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>>>>>>>>>>>>>>>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>>>>>>>>>>>>>>>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>>>>>>>>>>>>>>>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>>>>>>>>>>>>>>>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>>>>>>>>>>>>>>>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>>>>>>>>>>>>>>>>>>>>> transfer).
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>>>>>>>>>>>>>>>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>>>>>>>>>>>>>>>>>>>>> later. To me it seems that the message is getting received but not
>>>>>>>>>>>>>>>>>>>>>>>> passed to another thread right away or something. This test was done
>>>>>>>>>>>>>>>>>>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>>>>>>>>>>>>>>>>>>>>> thread.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>>>>>>>>>>>>>>>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>>>>>>>>>>>>>>>>>>>>> some help.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Single Test started about
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:52:36
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>> 30.439150 secs
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.487451:
>>>>>>>>>>>>>>>>>>>>>>>>  osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>  currently waiting for subops from 13,16
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>>>>>>>>>>>>>>>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>> 30.379680 secs
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>>>>>>>>>>>>>>>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>>>>>>>>>>>>>>>>>>>>> 12:55:06.406303:
>>>>>>>>>>>>>>>>>>>>>>>>  osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>  currently waiting for subops from 13,17
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>>>>>>>>>>>>>>>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>>>>>>>>>>>>>>>>>>>>> 12:55:06.318144:
>>>>>>>>>>>>>>>>>>>>>>>>  osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>  currently waiting for subops from 13,14
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>> 30.954212 secs
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:57:33.044003:
>>>>>>>>>>>>>>>>>>>>>>>>  osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>  currently waiting for subops from 16,17
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>> 30.704367 secs
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:57:33.055404:
>>>>>>>>>>>>>>>>>>>>>>>>  osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>  currently waiting for subops from 13,17
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Server   IP addr              OSD
>>>>>>>>>>>>>>>>>>>>>>>> nodev  - 192.168.55.11 - 12
>>>>>>>>>>>>>>>>>>>>>>>> nodew  - 192.168.55.12 - 13
>>>>>>>>>>>>>>>>>>>>>>>> nodex  - 192.168.55.13 - 16
>>>>>>>>>>>>>>>>>>>>>>>> nodey  - 192.168.55.14 - 17
>>>>>>>>>>>>>>>>>>>>>>>> nodez  - 192.168.55.15 - 14
>>>>>>>>>>>>>>>>>>>>>>>> nodezz - 192.168.55.16 - 15
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> fio job:
>>>>>>>>>>>>>>>>>>>>>>>> [rbd-test]
>>>>>>>>>>>>>>>>>>>>>>>> readwrite=write
>>>>>>>>>>>>>>>>>>>>>>>> blocksize=4M
>>>>>>>>>>>>>>>>>>>>>> ###runtime=60
>>>>>>>>>>>>>>>>>>>>>>>> name=rbd-test
>>>>>>>>>>>>>>>>>>>>>> ###readwrite=randwrite
>>>>>>>>>>>>>>>>>>>>>> ###bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>>>>>>>>>>>>>>>>>>> ###rwmixread=72
>>>>>>>>>>>>>>>>>>>>>> ###norandommap
>>>>>>>>>>>>>>>>>>>>>> ###size=1T
>>>>>>>>>>>>>>>>>>>>>> ###blocksize=4k
>>>>>>>>>>>>>>>>>>>>>>>> ioengine=rbd
>>>>>>>>>>>>>>>>>>>>>>>> rbdname=test2
>>>>>>>>>>>>>>>>>>>>>>>> pool=rbd
>>>>>>>>>>>>>>>>>>>>>>>> clientname=admin
>>>>>>>>>>>>>>>>>>>>>>>> iodepth=8
>>>>>>>>>>>>>>>>>>>>>> ###numjobs=4
>>>>>>>>>>>>>>>>>>>>>> ###thread
>>>>>>>>>>>>>>>>>>>>>> ###group_reporting
>>>>>>>>>>>>>>>>>>>>>> ###time_based
>>>>>>>>>>>>>>>>>>>>>> ###direct=1
>>>>>>>>>>>>>>>>>>>>>> ###ramp_time=60
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>>>>>>>>>>>>>>>>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>>>>>>>>>>>>>>>>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>>>>>>>>>>>>>>>>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>>>>>>>>>>>>>>>>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>>>>>>>>>>>>>>>>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>>>>>>>>>>>>>>>>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>>>>>>>>>>>>>>>>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>>>>>>>>>>>>>>>>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>>>>>>>>>>>>>>>>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>>>>>>>>>>>>>>>>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>>>>>>>>>>>>>>>>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>>>>>>>>>>>>>>>>>>>>>> J3hS
>>>>>>>>>>>>>>>>>>>>>>>> =0J7F
>>>>>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> You can search for the (mangled) name _split_collection
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I'm not
>>>>>>>>>>>>>>>>>>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>>>>>>>>>>>>>>>>>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>>>>>>>>>>>>>>>>>>>>> this, it was discussed not too long ago.
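
A sketch of what that looks like in practice; the log path and OSD id are examples and the threshold values are only illustrative:

  # with debug filestore turned up, collection splitting shows up in the OSD log
  grep _split_collection /var/log/ceph/ceph-osd.17.log
  # raising these (ceph.conf, [osd] section) postpones directory splitting
  #   filestore split multiple = 8
  #   filestore merge threshold = 40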
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>>>>>>>>>>>>>>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
>>>>>>>>>>>>>>>>>>>>>>>>>> having to create new file and therefore split collections. This is
>>>>>>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>>> my test cluster with no other load.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>>>>>>>>>>>>>>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>>>>>>>>>>>>>>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>>>>>>>>>>>>>>>>>>>>>> would be the most helpful for tracking this issue down?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>>>>>>>>>>>>>>>>>>>>>>> 20",
>>>>>>>>>>>>>>>>>>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>>>>>>>>>>>>>>>>>>>>>>> out
>>>>>>>>>>>>>>>>>>>>>>>>> everything you need to track exactly what each Op is doing.
>>>>>>>>>>>>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>>>>>>>>>>>>>>>>>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>>>>>>>>>>>>>>>>>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>>>>>>>>>>>>>>>>>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>>>>>>>>>>>>>>>>>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>>>>>>>>>>>>>>>>>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>>>>>>>>>>>>>>>>>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>>>>>>>>>>>>>>>>>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>>>>>>>>>>>>>>>>>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>>>>>>>>>>>>>>>>>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>>>>>>>>>>>>>>>>>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>>>>>>>>>>>>>>>>>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>>>>>>>>>>>>>>>>>>>> gcZm
>>>>>>>>>>>>>>>>>>>>> =CjwB
>>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>>>>>>>>>>>>>>>>>> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>>>>>>>>>>>>>>>>>> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>>>>>>>>>>>>>>>>>> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>>>>>>>>>>>>>>>>>> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>>>>>>>>>>>>>>>>>> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>>>>>>>>>>>>>>>>>> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>>>>>>>>>>>>>>>>>> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>>>>>>>>>>>>>>>>>> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>>>>>>>>>>>>>>>>>> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>>>>>>>>>>>>>>>>>> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>>>>>>>>>>>>>>>>>> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>>>>>>>>>>>>>>>>>> ae22
>>>>>>>>>>>>>>>>>> =AX+L
>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>> 
>>>>>>>>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
>>>>>>>>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
>>>>>>>>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
>>>>>>>>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
>>>>>>>>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
>>>>>>>>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
>>>>>>>>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
>>>>>>>>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
>>>>>>>>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
>>>>>>>>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
>>>>>>>>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
>>>>>>>>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
>>>>>>>>> 6Kfk
>>>>>>>>> =/gR6
>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>> Version: Mailvelope v1.2.0
>>>>>>> Comment: https://www.mailvelope.com
>>>>>>> 
>>>>>>> wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
>>>>>>> x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
>>>>>>> Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
>>>>>>> SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
>>>>>>> QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
>>>>>>> mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
>>>>>>> 7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
>>>>>>> QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
>>>>>>> MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
>>>>>>> rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
>>>>>>> gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
>>>>>>> RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
>>>>>>> JFPi
>>>>>>> =ofgq
>>>>>>> -----END PGP SIGNATURE-----
>>>>>> 
>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> Version: Mailvelope v1.2.0
>>>>>> Comment: https://www.mailvelope.com
>>>>>> 
>>>>>> wsFcBAEBCAAQBQJWFDDOCRDmVDuy+mK58QAA0kUP/1rfRQa5Us9b/VCvKrhk
>>>>>> BYrde1/FBybKBVXsuXVU8Dq124A1e4L682AhmQPUeVP8PQLoqS/VFSl0h7i6
>>>>>> 28AzydDaBTTjnrp6ZzVbtmKtm8WhmtSTFvWTlu/yJmRXAht9YozmFCByBfIY
>>>>>> GYvOhZzjvbxBKfwnwq97QkS7xfY2tss/BmaOvSVTX7naYaOF+HRwZMSt+BF4
>>>>>> 9vg9BLSL3Aic0BnvdM64TWkDaHp/3gwGSmyMn8Q2Sa9CqUTddKQx2HXN6doo
>>>>>> gIyxCj+dIw2Pt73u2NoiYv8ZhTuS3QYM4n0rRBxj8Wr/EeNwGAOwdDSgbOxf
>>>>>> OvDyozzmCpQyW3h/nkdQJW5mWsJmyDIiGxHDdUn7Vgemg+Bbod0ACdoJiwct
>>>>>> /BIRVQe2Ee1nZQFoKBOhvaWO6+ePJR7CVfLjMkZBTzKZBjt2tfkq17G5KTdS
>>>>>> EsehvG/+vfFJkANL5Xh6eo9ptlHbFW8I/44pvUtGi2JwsN487l56XR9DqEKM
>>>>>> 7Cmj9Ox205YxjqcBjhWIJQTok99lvrhDX9d7HHxIeTcmouvqPz4LTcCySRtC
>>>>>> xE/GcEGAAYWGPTwf9u8ULm9Rh2Z90OnKpqtCtuuWiwRRL9VU/tLlvqmHvEZM
>>>>>> 73qhiLQZka5I72B2SAEtJnDt2sX3NJ4unvH4zWKLRFTTm4M0qk6xUL1JfqNz
>>>>>> JYNo
>>>>>> =msX2
>>>>>> -----END PGP SIGNATURE-----
>>>> 
>>>> -----BEGIN PGP SIGNATURE-----
>>>> Version: Mailvelope v1.2.0
>>>> Comment: https://www.mailvelope.com
>>>> 
>>>> wsFcBAEBCAAQBQJWFXGPCRDmVDuy+mK58QAAx38P/1sn6TA8hH+F2kd1A2Pq
>>>> IU2cg1pFcH+kw21G8VO+BavfBaBoSETHEEuMXg5SszTIcL/HyziBLJos0C0j
>>>> Vu9I0/YtblQ15enzFqKFPosdc7qij9DPJxXRkx41sJZsxvSVky+URcPpcKk6
>>>> w8Lwuq9IupesQ19ZeJkCEWFVhKz/i2E9/VXfylBgFVlkICD+5pfx6/Aq7nCP
>>>> 4gboyha07zpPlDqoA7xgT+6v2zlYC80saGcA1m2XaAUdPF/17l6Mq9+Glv7E
>>>> 3KeUf7jmMTJQRGBZSInFgUpPwUQKvF5OSGb3YQlzofUy5Es+wH3ccqZ+mlIY
>>>> szuBLAtN6zhFFPCs6016hiragiUhLk97PItXaKdDJKecuyRdShlJrXJmtX+j
>>>> NdM14TkBPTiLtAd/IZEEhIIpdvQH8YSl3LnEZ5gywggaY4Pk3JLFIJPgLpEb
>>>> T8hJnuiaQaYxERQ0nRoBL4LAXARseSrOuVt2EAD50Yb/5JEwB9FQlN758rb1
>>>> AE/xhpK6d53+RlkPODKxXx816hXvDP6NADaC78XGmx+A4FfepdxBijGBsmOQ
>>>> 7SxAZe469K0E6EAfClc664VzwuvBEZjwTg1eK5Z6VS/FDTH/RxTKeFhlbUIT
>>>> XpezlP7XZ1/YRrJ/Eg7nb1Dv0MYQdu18tQ6QBv+C1ZsmxYLlHlcf6BZ3gNar
>>>> rZW5
>>>> =dKn9
>>>> -----END PGP SIGNATURE-----
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> -- 
>> WBR, Max A. Krasilnikov
>> ColoCall Data Center
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
WBR, Max A. Krasilnikov
ColoCall Data Center

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                                               ` <20151009112124.GM86022-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org>
@ 2015-10-09 11:45                                                                                                                 ` Jan Schermer
       [not found]                                                                                                                   ` <2FD6AADF-88BB-4DFF-B6C2-E103F16B55A8-SB6/BxVxTjHtwjQa/ONI9g@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Schermer @ 2015-10-09 11:45 UTC (permalink / raw)
  To: Max A. Krasilnikov
  Cc: Sage Weil, ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

Have you tried running iperf between the nodes? Capturing a pcap of the (failing) Ceph comms from both sides could help narrow it down.
Is there any SDN layer involved that could add overhead/padding to the frames?

What about some intermediate MTU like 8000 - does that work?
Oh, and if there's any bonding/trunking involved, beware that on certain kernels you need to set the same MTU and offloads on all interfaces - flags like MTU/offloads should propagate between the master and slave interfaces, but in reality they don't, and they can get reset even if you unplug/replug the ethernet cable.
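
For example, a rough sketch of those checks; the hostnames, interface name and OSD port range below are placeholders and may differ in your setup:

  # raw TCP throughput between two OSD nodes, server side first
  iperf -s                           # on nodeA
  iperf -c nodeA -t 60 -P 4          # on nodeB, 4 parallel streams for 60 seconds

  # capture the OSD traffic on both ends while slow requests are occurring
  tcpdump -i bond0 -s 0 -w /tmp/ceph-osd.pcap 'tcp portrange 6800-7300'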

Jan

> On 09 Oct 2015, at 13:21, Max A. Krasilnikov <pseudo-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org> wrote:
> 
> Hello!
> 
> On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:
> 
>> Are there any errors on the NICs? (ethtool -s ethX)
> 
> No errors. Neither on nodes, nor on switches.
> 
>> Also take a look at the switch and look for flow control statistics - do you have flow control enabled or disabled?
> 
> flow control disabled everywhere.
> 
>> We had to disable flow control as it would pause all IO on the port whenever any path got congested which you don't want to happen with a cluster like Ceph. It's better to let the frame drop/retransmit in this case (and you should size it so it doesn't happen in any case).
>> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't put my money on that...
> 
> I tried to completely disable all offloads and setting mtu back to 9000 after.
> No luck.
> I am speaking with my NOC about MTU in 10G network. If I have update, I will
> write here. I can hardly believe that it is the Ceph side, but nothing is
> impossible.
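
To double-check what actually stuck, something along these lines may help; the interface names are examples, ethtool -k shows the current offload state and -K changes it:

  ethtool -k bond0                                   # list which offloads are still enabled
  ethtool -K eth0 gro off gso off tso off lro off    # disable offloads per slave interface
  ethtool -K eth1 gro off gso off tso off lro off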
> 
>> Jan
> 
> 
>>> On 09 Oct 2015, at 10:48, Max A. Krasilnikov <pseudo-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org> wrote:
>>> 
>>> Hello!
>>> 
>>> On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
>>> 
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA256
>>> 
>>>> Sage,
>>> 
>>>> After trying to bisect this issue (all test moved the bisect towards
>>>> Infernalis) and eventually testing the Infernalis branch again, it
>>>> looks like the problem still exists although it is handled a tad
>>>> better in Infernalis. I'm going to test against Firefly/Giant next
>>>> week and then try and dive into the code to see if I can expose any
>>>> thing.
>>> 
>>>> If I can do anything to provide you with information, please let me know.
>>> 
>>> I have fixed my troubles by setting MTU back to 1500 from 9000 in 2x10G network
>>> between nodes (2x Cisco Nexus 5020, one link per switch, LACP, linux bounding
>>> driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel 82599ES
>>> Adapter, non-intel sfp+). When setting it to 9000 on nodes and 9216 on Nexus 5020
>>> switch with Jumbo frames enabled, I see a performance drop and slow requests. When
>>> setting 1500 on nodes and not touching Nexus all problems are fixed.
>>> 
>>> I have rebooted all my ceph services when changing MTU and changing things to
>>> 9000 and 1500 several times in order to be sure. It is reproducible in my
>>> environment.
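
A couple of quick checks along those lines, with peer and interface names as placeholders: a DF-flagged ping at the jumbo size shows whether the 9000-byte path is clean end to end, and ip can confirm the bond and its slaves agree on the MTU.

  ping -M do -s 8972 -c 10 <peer>   # 8972 = 9000 minus 28 bytes of IP/ICMP headers
  ping -M do -s 7972 -c 10 <peer>   # the intermediate ~8000 MTU case
  ip -d link show bond0             # the bond's MTU and bonding details
  ip link show eth0                 # each slave should report the same MTU as the bond
  ip link show eth1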
>>> 
>>>> Thanks,
>>>> -----BEGIN PGP SIGNATURE-----
>>>> Version: Mailvelope v1.2.0
>>>> Comment: https://www.mailvelope.com
>>> 
>>>> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
>>>> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
>>>> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
>>>> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
>>>> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
>>>> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
>>>> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
>>>> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
>>>> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
>>>> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
>>>> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
>>>> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
>>>> BCFo
>>>> =GJL4
>>>> -----END PGP SIGNATURE-----
>>>> ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> 
>>> 
>>>> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>>> 
>>>>> We forgot to upload the ceph.log yesterday. It is there now.
>>>>> - ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> 
>>>>> 
>>>>> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> Hash: SHA256
>>>>>> 
>>>>>> I upped the debug on about everything and ran the test for about 40
>>>>>> minutes. I took OSD.19 on ceph1 down and then brought it back in.
>>>>>> There was at least one op on osd.19 that was blocked for over 1,000
>>>>>> seconds. Hopefully this will have something that will cast a light on
>>>>>> what is going on.
>>>>>> 
>>>>>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>>>>>> the test to verify the results from the dev cluster. This cluster
>>>>>> matches the hardware of our production cluster but is not yet in
>>>>>> production so we can safely wipe it to downgrade back to Hammer.
>>>>>> 
>>>>>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>>>>> 
>>>>>> Let me know what else we can do to help.
>>>>>> 
>>>>>> Thanks,
>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> Version: Mailvelope v1.2.0
>>>>>> Comment: https://www.mailvelope.com
>>>>>> 
>>>>>> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
>>>>>> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
>>>>>> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
>>>>>> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
>>>>>> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
>>>>>> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
>>>>>> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
>>>>>> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
>>>>>> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
>>>>>> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
>>>>>> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
>>>>>> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
>>>>>> EDrG
>>>>>> =BZVw
>>>>>> -----END PGP SIGNATURE-----
>>>>>> ----------------
>>>>>> Robert LeBlanc
>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA256
>>>>>>> 
>>>>>>> On my second test (a much longer one), it took nearly an hour, but a
>>>>>>> few messages have popped up over a 20 window. Still far less than I
>>>>>>> have been seeing.
>>>>>>> - ----------------
>>>>>>> Robert LeBlanc
>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>> Hash: SHA256
>>>>>>>> 
>>>>>>>> I'll capture another set of logs. Is there any other debugging you
>>>>>>>> want turned up? I've seen the same thing where I see the message
>>>>>>>> dispatched to the secondary OSD, but the message just doesn't show up
>>>>>>>> for 30+ seconds in the secondary OSD logs.
>>>>>>>> - ----------------
>>>>>>>> Robert LeBlanc
>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>>>>>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>> Hash: SHA256
>>>>>>>>>> 
>>>>>>>>>> I can't think of anything. In my dev cluster the only thing that has
>>>>>>>>>> changed is the Ceph versions (no reboot). What I like is even though
>>>>>>>>>> the disks are 100% utilized, it is preforming as I expect now. Client
>>>>>>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>>>>>>>>> the OSD boots or during the recovery period. This is with
>>>>>>>>>> max_backfills set to 20, one backfill max in our production cluster is
>>>>>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>>>>>>>>> our dev cluster very easily and very quickly with these settings. So
>>>>>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>>>>>>>>> marked out. We would love to see that go away too, but this is far
>>>>>>>>>                                           (me too!)
>>>>>>>>>> better than what we have now. This dev cluster also has
>>>>>>>>>> osd_client_message_cap set to default (100).
>>>>>>>>>> 
>>>>>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>>>>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>>>>>>>>> you you prefer a bisect to find the introduction of the problem
>>>>>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>>>>>>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>>>>>>>>>> commit that prevents a clean build as that is my most limiting factor?
>>>>>>>>> 
>>>>>>>>> Nothing comes to mind.  I think the best way to find this is still to see
>>>>>>>>> it happen in the logs with hammer.  The frustrating thing with that log
>>>>>>>>> dump you sent is that although I see plenty of slow request warnings in
>>>>>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>>>>>>>>> turned up for long enough?
>>>>>>>>> 
>>>>>>>>> sage
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> - ----------------
>>>>>>>>>> Robert LeBlanc
>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>>>>>>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>> 
>>>>>>>>>>>> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>>>>>>>>>>> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>>>>>>>>>>> messages when the OSD was marked out:
>>>>>>>>>>>> 
>>>>>>>>>>>> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>>>>>>>>>>> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>>>>>>>>>>> 34.476006 secs
>>>>>>>>>>>> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>>>>>>>>>>> cluster [WRN] slow request 32.913474 seconds old, received at
>>>>>>>>>>>> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>>>>>>>>>>> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>>> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>>>>>>>>>>> cluster [WRN] slow request 32.697545 seconds old, received at
>>>>>>>>>>>> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>>>>>>>>>>> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>>> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>>>>>>>>>>> cluster [WRN] slow request 32.668006 seconds old, received at
>>>>>>>>>>>> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>>>>>>>>>>> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>>> 
>>>>>>>>>>>> But I'm not seeing the blocked messages when the OSD came back in. The
>>>>>>>>>>>> OSD spindles have been running at 100% during this test. I have seen
>>>>>>>>>>>> slowed I/O from the clients as expected from the extra load, but so
>>>>>>>>>>>> far no blocked messages. I'm going to run some more tests.
>>>>>>>>>>> 
>>>>>>>>>>> Good to hear.
>>>>>>>>>>> 
>>>>>>>>>>> FWIW I looked through the logs and all of the slow request no flag point
>>>>>>>>>>> messages came from osd.163... and the logs don't show when they arrived.
>>>>>>>>>>> My guess is this OSD has a slower disk than the others, or something else
>>>>>>>>>>> funny is going on?
>>>>>>>>>>> 
>>>>>>>>>>> I spot checked another OSD at random (60) where I saw a slow request.  It
>>>>>>>>>>> was stuck peering for 10s of seconds... waiting on a pg log message from
>>>>>>>>>>> osd.163.
>>>>>>>>>>> 
>>>>>>>>>>> sage
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>> 
>>>>>>>>>>>> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>>>>>>>>>>>> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>>>>>>>>>>>> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>>>>>>>>>>>> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>>>>>>>>>>>> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>>>>>>>>>>>> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>>>>>>>>>>>> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>>>>>>>>>>>> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>>>>>>>>>>>> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>>>>>>>>>>>> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>>>>>>>>>>>> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>>>>>>>>>>>> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>>>>>>>>>>>> fo5a
>>>>>>>>>>>> =ahEi
>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>> ----------------
>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>>>>>>>>>>>>> On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> With some off-list help, we have adjusted
>>>>>>>>>>>>>> osd_client_message_cap=10000. This seems to have helped a bit and we
>>>>>>>>>>>>>> have seen some OSDs have a value up to 4,000 for client messages. But
>>>>>>>>>>>>>> it does not solve the problem with the blocked I/O.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> One thing that I have noticed is that almost exactly 30 seconds elapse
>>>>>>>>>>>>>> between when an OSD boots and the first blocked I/O message. I don't know
>>>>>>>>>>>>>> if the OSD doesn't have time to get its brain right about a PG before
>>>>>>>>>>>>>> it starts servicing it or what exactly.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I'm downloading the logs from yesterday now; sorry it's taking so long.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On another note, I tried upgrading our CentOS dev cluster from Hammer
>>>>>>>>>>>>>> to master and things didn't go so well. The OSDs would not start
>>>>>>>>>>>>>> because /var/lib/ceph was not owned by ceph. I chowned the directory
>>>>>>>>>>>>>> and all OSDs and the OSD then started, but never became active in the
>>>>>>>>>>>>>> cluster. It just sat there after reading all the PGs. There were
>>>>>>>>>>>>>> sockets open to the monitor, but no OSD to OSD sockets. I tried
>>>>>>>>>>>>>> downgrading to the Infernalis branch and still no luck getting the
>>>>>>>>>>>>>> OSDs to come up. The OSD processes were idle after the initial boot.
>>>>>>>>>>>>>> All packages were installed from gitbuilder.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Did you chown -R ?
>>>>>>>>>>>>> 
>>>>>>>>>>>>>       https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>>>>>>>>>>>>> 
>>>>>>>>>>>>> My guess is you only chowned the root dir, and the OSD didn't throw
>>>>>>>>>>>>> an error when it encountered the other files?  If you can generate a debug
>>>>>>>>>>>>> osd = 20 log, that would be helpful.. thanks!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> sage
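
The recursive form from those release notes, run on each node with the Ceph daemons stopped, is simply:

  chown -R ceph:ceph /var/lib/ceph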
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>>>>>>>>>>>>>> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>>>>>>>>>>>>>> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>>>>>>>>>>>>>> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>>>>>>>>>>>>>> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>>>>>>>>>>>>>> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>>>>>>>>>>>>>> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>>>>>>>>>>>>>> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>>>>>>>>>>>>>> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>>>>>>>>>>>>>> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>>>>>>>>>>>>>> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>>>>>>>>>>>>>> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>>>>>>>>>>>>>> GdXC
>>>>>>>>>>>>>> =Aigq
>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I have eight nodes running the fio job rbd_test_real to different RBD
>>>>>>>>>>>>>>> volumes. I've included the CRUSH map in the tarball.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I stopped one OSD process and marked it out. I let it recover for a
>>>>>>>>>>>>>>> few minutes and then I started the process again and marked it in. I
>>>>>>>>>>>>>>> started getting block I/O messages during the recovery.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>>>>>>>>>>>>>>> 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>>>>>>>>>>>>>>> jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>>>>>>>>>>>>>>> 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>>>>>>>>>>>>>>> OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>>>>>>>>>>>>>>> ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>>>>>>>>>>>>>>> R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>>>>>>>>>>>>>>> boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>>>>>>>>>>>>>>> sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>>>>>>>>>>>>>>> GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>>>>>>>>>>>>>>> SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>>>>>>>>>>>>>>> PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>>>>>>>>>>>>>>> 3EPx
>>>>>>>>>>>>>>> =UDIV
>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>>>>>>>>>>>>>>>> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We are still struggling with this and have tried a lot of different
>>>>>>>>>>>>>>>>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>>>>>>>>>>>>>>>> consulting services for non-Red Hat systems. If there are some
>>>>>>>>>>>>>>>>> certified Ceph consultants in the US that we can do both remote and
>>>>>>>>>>>>>>>>> on-site engagements, please let us know.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> This certainly seems to be network related, but somewhere in the
>>>>>>>>>>>>>>>>> kernel. We have tried increasing the network and TCP buffers, number
>>>>>>>>>>>>>>>>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>>>>>>>>>>>>>>>>> on the boxes, the disks are busy, but not constantly at 100% (they
>>>>>>>>>>>>>>>>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>>>>>>>>>>>>>>>> at a time). There seems to be no reasonable explanation why I/O is
>>>>>>>>>>>>>>>>> blocked pretty frequently longer than 30 seconds. We have verified
>>>>>>>>>>>>>>>>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>>>>>>>>>>>>>>>> network admins have verified that packets are not being dropped in the
>>>>>>>>>>>>>>>>> switches for these nodes. We have tried different kernels including
>>>>>>>>>>>>>>>>> the recent Google patch to cubic. This is showing up on three cluster
>>>>>>>>>>>>>>>>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>>>>>>>>>>>>>>>>> (from CentOS 7.1) with similar results.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> The messages seem slightly different:
>>>>>>>>>>>>>>>>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>>>>>>>>>>>>>>>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>> 100.087155 secs
>>>>>>>>>>>>>>>>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.041999 seconds old, received at
>>>>>>>>>>>>>>>>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>>>>>>>>>>>>>>>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>>>>>>>>>>>>>>>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>>>>>>>>>>>>>>>> points reached
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I don't know what "no flag points reached" means.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Just that the op hasn't been marked as reaching any interesting points
>>>>>>>>>>>>>>>> (op->mark_*() calls).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Is it possible to gather a lot with debug ms = 20 and debug osd = 20?
>>>>>>>>>>>>>>>> It's extremely verbose but it'll let us see where the op is getting
>>>>>>>>>>>>>>>> blocked.  If you see the "slow request" message it means the op in
>>>>>>>>>>>>>>>> received by ceph (that's when the clock starts), so I suspect it's not
>>>>>>>>>>>>>>>> something we can blame on the network stack.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> The problem is most pronounced when we have to reboot an OSD node (1
>>>>>>>>>>>>>>>>> of 13), we will have hundreds of I/O blocked for some times up to 300
>>>>>>>>>>>>>>>>> seconds. It takes a good 15 minutes for things to settle down. The
>>>>>>>>>>>>>>>>> production cluster is very busy doing normally 8,000 I/O and peaking
>>>>>>>>>>>>>>>>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>>>>>>>>>>>>>>>>> are between 25-50% full. We are currently splitting PGs to distribute
>>>>>>>>>>>>>>>>> the load better across the disks, but we are having to do this 10 PGs
>>>>>>>>>>>>>>>>> at a time as we get blocked I/O. We have max_backfills and
>>>>>>>>>>>>>>>>> max_recovery set to 1, client op priority is set higher than recovery
>>>>>>>>>>>>>>>>> priority. We tried increasing the number of op threads but this didn't
>>>>>>>>>>>>>>>>> seem to help. It seems as soon as PGs are finished being checked, they
>>>>>>>>>>>>>>>>> become active and could be the cause for slow I/O while the other PGs
>>>>>>>>>>>>>>>>> are being checked.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> What I don't understand is that the messages are delayed. As soon as
>>>>>>>>>>>>>>>>> the message is received by Ceph OSD process, it is very quickly
>>>>>>>>>>>>>>>>> committed to the journal and a response is sent back to the primary
>>>>>>>>>>>>>>>>> OSD which is received very quickly as well. I've adjust
>>>>>>>>>>>>>>>>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>>>>>>>>>>>>>>>>> doesn't solve the main problem. We don't have swap and there is 64 GB
>>>>>>>>>>>>>>>>> of RAM per nodes for 10 OSDs.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Is there something that could cause the kernel to get a packet but not
>>>>>>>>>>>>>>>>> be able to dispatch it to Ceph such that it could be explaining why we
>>>>>>>>>>>>>>>>> are seeing these blocked I/O for 30+ seconds. Is there some pointers
>>>>>>>>>>>>>>>>> to tracing Ceph messages from the network buffer through the kernel to
>>>>>>>>>>>>>>>>> the Ceph process?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We can really use some pointers no matter how outrageous. We've have
>>>>>>>>>>>>>>>>> over 6 people looking into this for weeks now and just can't think of
>>>>>>>>>>>>>>>>> anything else.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>>>>>>>>>>>>>>>>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>>>>>>>>>>>>>>>>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>>>>>>>>>>>>>>>>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>>>>>>>>>>>>>>>>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>>>>>>>>>>>>>>>>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>>>>>>>>>>>>>>>>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>>>>>>>>>>>>>>>>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>>>>>>>>>>>>>>>>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>>>>>>>>>>>>>>>>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>>>>>>>>>>>>>>>>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>>>>>>>>>>>>>>>>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>>>>>>>>>>>>>>>>> l7OF
>>>>>>>>>>>>>>>>> =OI++
>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>> We dropped the replication on our cluster from 4 to 3 and it looks
>>>>>>>>>>>>>>>>>> like all the blocked I/O has stopped (no entries in the log for the
>>>>>>>>>>>>>>>>>> last 12 hours). This makes me believe that there is some issue with
>>>>>>>>>>>>>>>>>> the number of sockets or some other TCP issue. We have not messed with
>>>>>>>>>>>>>>>>>> Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>>>>>>>>>>>>>>>>>> hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>>>>>>>>>>>>>>>>>> processes and 16K system wide.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Does this seem like the right spot to be looking? What are some
>>>>>>>>>>>>>>>>>> configuration items we should be looking at?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>>>>>>>>>>>>>>>>>>> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>>>>>>>>>>>>>>>>>>> seems that there were some major reworks in the network handling in
>>>>>>>>>>>>>>>>>>> the kernel to efficiently handle that network rate. If I remember
>>>>>>>>>>>>>>>>>>> right we also saw a drop in CPU utilization. I'm starting to think
>>>>>>>>>>>>>>>>>>> that we did see packet loss while congesting our ISLs in our initial
>>>>>>>>>>>>>>>>>>> testing, but we could not tell where the dropping was happening. We
>>>>>>>>>>>>>>>>>>> saw some on the switches, but it didn't seem to be bad if we weren't
>>>>>>>>>>>>>>>>>>> trying to congest things. We probably already saw this issue, just
>>>>>>>>>>>>>>>>>>> didn't know it.
>>>>>>>>>>>>>>>>>>> - ----------------
>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>>>>>>>>>>>>>>>>>>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>>>>>>>>>>>>>>>>>>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>>>>>>>>>>>>>>>>>>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>>>>>>>>>>>>>>>>>>> drivers might cause problems though.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Here's ifconfig from one of the nodes:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> ens513f1: flags=4163  mtu 1500
>>>>>>>>>>>>>>>>>>>>       inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>>>>>>>>>>>>>>>>>>>       inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>>>>>>>>>>>>>>>>>>>       ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>>>>>>>>>>>>>>>>>>>       RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>>>>>>>>>>>>>>>>>>>       RX errors 0  dropped 0  overruns 0  frame 0
>>>>>>>>>>>>>>>>>>>>       TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>>>>>>>>>>>>>>>>>>>       TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> OK, here is the update on the saga...
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I traced some more of blocked I/Os and it seems that communication
>>>>>>>>>>>>>>>>>>>>> between two hosts seemed worse than others. I did a two way ping flood
>>>>>>>>>>>>>>>>>>>>> between the two hosts using max packet sizes (1500). After 1.5M
>>>>>>>>>>>>>>>>>>>>> packets, no lost pings. Then then had the ping flood running while I
>>>>>>>>>>>>>>>>>>>>> put Ceph load on the cluster and the dropped pings started increasing
>>>>>>>>>>>>>>>>>>>>> after stopping the Ceph workload the pings stopped dropping.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I then ran iperf between all the nodes with the same results, so that
>>>>>>>>>>>>>>>>>>>>> ruled out Ceph to a large degree. I then booted in the the
>>>>>>>>>>>>>>>>>>>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>>>>>>>>>>>>>>>>>>>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>>>>>>>>>>>>>>>>>>>> need the network enhancements in the 4.x series to work well.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>>>>>>>>>>>>>>>>>>>> kernel to see where this issue in introduced. Both of the clusters
>>>>>>>>>>>>>>>>>>>>> with this issue are running 4.x, other than that, they are pretty
>>>>>>>>>>>>>>>>>>>>> differing hardware and network configs.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>>>>>>>>>>>>>>>>>>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>>>>>>>>>>>>>>>>>>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>>>>>>>>>>>>>>>>>>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>>>>>>>>>>>>>>>>>>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>>>>>>>>>>>>>>>>>>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>>>>>>>>>>>>>>>>>>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>>>>>>>>>>>>>>>>>>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>>>>>>>>>>>>>>>>>>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>>>>>>>>>>>>>>>>>>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>>>>>>>>>>>>>>>>>>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>>>>>>>>>>>>>>>>>>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>>>>>>>>>>>>>>>>>>>> 4OEo
>>>>>>>>>>>>>>>>>>>>> =P33I
>>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> This is IPoIB and we have the MTU set to 64K. There was some issues
>>>>>>>>>>>>>>>>>>>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>>>>>>>>>>>>>>>>>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>>>>>>>>>>>>>>>>>>>> found that MTU under 32K worked reliable for ping, but still had the
>>>>>>>>>>>>>>>>>>>>>> blocked I/O.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>>>>>>>>>>>>>>>>>>>> the blocked I/O.
>>>>>>>>>>>>>>>>>>>>>> - ----------------
>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>>>>>>>>>>>>>>>>>>>>>> between when osd.17 started sending the osd_repop message and when
>>>>>>>>>>>>>>>>>>>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>>>>>>>>>>>>>>>>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>>>>>>>>>>>>>>>>>>>>> delayed for many 10s of seconds?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>>>>>>>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> What kernel are you running?
>>>>>>>>>>>>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>>>>>>>>>>>>>>>>>>>>>> extracted what I think are important entries from the logs for the
>>>>>>>>>>>>>>>>>>>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>>>>>>>>>>>>>>>>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>>>>>>>>>>>>>>>>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>>>>>>>>>>>>>>>>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>>>>>>>>>>>>>>>>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>>>>>>>>>>>>>>>>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>>>>>>>>>>>>>>>>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>>>>>>>>>>>>>>>>>>>>>> transfer).
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>>>>>>>>>>>>>>>>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>>>>>>>>>>>>>>>>>>>>>> later. To me it seems that the message is getting received but not
>>>>>>>>>>>>>>>>>>>>>>>>> passed to another thread right away or something. This test was done
>>>>>>>>>>>>>>>>>>>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>>>>>>>>>>>>>>>>>>>>>> thread.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>>>>>>>>>>>>>>>>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>>>>>>>>>>>>>>>>>>>>>> some help.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Single Test started about
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:52:36
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>> 30.439150 secs
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.487451:
>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,16
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>>>>>>>>>>>>>>>>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>> 30.379680 secs
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>>>>>>>>>>>>>>>>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>>>>>>>>>>>>>>>>>>>>>> 12:55:06.406303:
>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,17
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>>>>>>>>>>>>>>>>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>>>>>>>>>>>>>>>>>>>>>> 12:55:06.318144:
>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,14
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>> 30.954212 secs
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:57:33.044003:
>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 16,17
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>> 30.704367 secs
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:57:33.055404:
>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,17
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Server   IP addr              OSD
>>>>>>>>>>>>>>>>>>>>>>>>> nodev  - 192.168.55.11 - 12
>>>>>>>>>>>>>>>>>>>>>>>>> nodew  - 192.168.55.12 - 13
>>>>>>>>>>>>>>>>>>>>>>>>> nodex  - 192.168.55.13 - 16
>>>>>>>>>>>>>>>>>>>>>>>>> nodey  - 192.168.55.14 - 17
>>>>>>>>>>>>>>>>>>>>>>>>> nodez  - 192.168.55.15 - 14
>>>>>>>>>>>>>>>>>>>>>>>>> nodezz - 192.168.55.16 - 15
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> fio job:
>>>>>>>>>>>>>>>>>>>>>>>>> [rbd-test]
>>>>>>>>>>>>>>>>>>>>>>>>> readwrite=write
>>>>>>>>>>>>>>>>>>>>>>>>> blocksize=4M
>>>>>>>>>>>>>>>>>>>>>>> ###runtime=60
>>>>>>>>>>>>>>>>>>>>>>>>> name=rbd-test
>>>>>>>>>>>>>>>>>>>>>>> ###readwrite=randwrite
>>>>>>>>>>>>>>>>>>>>>>> ###bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>>>>>>>>>>>>>>>>>>>> ###rwmixread=72
>>>>>>>>>>>>>>>>>>>>>>> ###norandommap
>>>>>>>>>>>>>>>>>>>>>>> ###size=1T
>>>>>>>>>>>>>>>>>>>>>>> ###blocksize=4k
>>>>>>>>>>>>>>>>>>>>>>>>> ioengine=rbd
>>>>>>>>>>>>>>>>>>>>>>>>> rbdname=test2
>>>>>>>>>>>>>>>>>>>>>>>>> pool=rbd
>>>>>>>>>>>>>>>>>>>>>>>>> clientname=admin
>>>>>>>>>>>>>>>>>>>>>>>>> iodepth=8
>>>>>>>>>>>>>>>>>>>>>>> ###numjobs=4
>>>>>>>>>>>>>>>>>>>>>>> ###thread
>>>>>>>>>>>>>>>>>>>>>>> ###group_reporting
>>>>>>>>>>>>>>>>>>>>>>> ###time_based
>>>>>>>>>>>>>>>>>>>>>>> ###direct=1
>>>>>>>>>>>>>>>>>>>>>>> ###ramp_time=60
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>>>>>>>>>>>>>>>>>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>>>>>>>>>>>>>>>>>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>>>>>>>>>>>>>>>>>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>>>>>>>>>>>>>>>>>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>>>>>>>>>>>>>>>>>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>>>>>>>>>>>>>>>>>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>>>>>>>>>>>>>>>>>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>>>>>>>>>>>>>>>>>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>>>>>>>>>>>>>>>>>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>>>>>>>>>>>>>>>>>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>>>>>>>>>>>>>>>>>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>>>>>>>>>>>>>>>>>>>>>>> J3hS
>>>>>>>>>>>>>>>>>>>>>>>>> =0J7F
>>>>>>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> You can search for the (mangled) name _split_collection
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm not
>>>>>>>>>>>>>>>>>>>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>>>>>>>>>>>>>>>>>>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>>>>>>>>>>>>>>>>>>>>>> this, it was discussed not too long ago.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>>>>>>>>>>>>>>>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
>>>>>>>>>>>>>>>>>>>>>>>>>>> having to create new file and therefore split collections. This is
>>>>>>>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>>>> my test cluster with no other load.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>>>>>>>>>>>>>>>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>>>>>>>>>>>>>>>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>>>>>>>>>>>>>>>>>>>>>>> would be the most helpful for tracking this issue down?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>>>>>>>>>>>>>>>>>>>>>>>> 20",
>>>>>>>>>>>>>>>>>>>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>>>>>>>>>>>>>>>>>>>>>>>> out
>>>>>>>>>>>>>>>>>>>>>>>>>> everything you need to track exactly what each Op is doing.
>>>>>>>>>>>>>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>>>>>>>>>>>>>>>>>>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>>>>>>>>>>>>>>>>>>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>>>>>>>>>>>>>>>>>>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>>>>>>>>>>>>>>>>>>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>>>>>>>>>>>>>>>>>>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>>>>>>>>>>>>>>>>>>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>>>>>>>>>>>>>>>>>>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>>>>>>>>>>>>>>>>>>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>>>>>>>>>>>>>>>>>>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>>>>>>>>>>>>>>>>>>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>>>>>>>>>>>>>>>>>>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>>>>>>>>>>>>>>>>>>>>> gcZm
>>>>>>>>>>>>>>>>>>>>>> =CjwB
>>>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>>>>>>>>>>>>>>>>>>> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>>>>>>>>>>>>>>>>>>> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>>>>>>>>>>>>>>>>>>> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>>>>>>>>>>>>>>>>>>> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>>>>>>>>>>>>>>>>>>> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>>>>>>>>>>>>>>>>>>> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>>>>>>>>>>>>>>>>>>> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>>>>>>>>>>>>>>>>>>> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>>>>>>>>>>>>>>>>>>> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>>>>>>>>>>>>>>>>>>> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>>>>>>>>>>>>>>>>>>> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>>>>>>>>>>>>>>>>>>> ae22
>>>>>>>>>>>>>>>>>>> =AX+L
>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>> 
>>>>>>>>>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
>>>>>>>>>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
>>>>>>>>>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
>>>>>>>>>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
>>>>>>>>>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
>>>>>>>>>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
>>>>>>>>>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
>>>>>>>>>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
>>>>>>>>>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
>>>>>>>>>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
>>>>>>>>>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
>>>>>>>>>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
>>>>>>>>>> 6Kfk
>>>>>>>>>> =/gR6
>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>> 
>>>>>>>> wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
>>>>>>>> x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
>>>>>>>> Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
>>>>>>>> SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
>>>>>>>> QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
>>>>>>>> mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
>>>>>>>> 7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
>>>>>>>> QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
>>>>>>>> MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
>>>>>>>> rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
>>>>>>>> gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
>>>>>>>> RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
>>>>>>>> JFPi
>>>>>>>> =ofgq
>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>> 
>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>> Version: Mailvelope v1.2.0
>>>>>>> Comment: https://www.mailvelope.com
>>>>>>> 
>>>>>>> wsFcBAEBCAAQBQJWFDDOCRDmVDuy+mK58QAA0kUP/1rfRQa5Us9b/VCvKrhk
>>>>>>> BYrde1/FBybKBVXsuXVU8Dq124A1e4L682AhmQPUeVP8PQLoqS/VFSl0h7i6
>>>>>>> 28AzydDaBTTjnrp6ZzVbtmKtm8WhmtSTFvWTlu/yJmRXAht9YozmFCByBfIY
>>>>>>> GYvOhZzjvbxBKfwnwq97QkS7xfY2tss/BmaOvSVTX7naYaOF+HRwZMSt+BF4
>>>>>>> 9vg9BLSL3Aic0BnvdM64TWkDaHp/3gwGSmyMn8Q2Sa9CqUTddKQx2HXN6doo
>>>>>>> gIyxCj+dIw2Pt73u2NoiYv8ZhTuS3QYM4n0rRBxj8Wr/EeNwGAOwdDSgbOxf
>>>>>>> OvDyozzmCpQyW3h/nkdQJW5mWsJmyDIiGxHDdUn7Vgemg+Bbod0ACdoJiwct
>>>>>>> /BIRVQe2Ee1nZQFoKBOhvaWO6+ePJR7CVfLjMkZBTzKZBjt2tfkq17G5KTdS
>>>>>>> EsehvG/+vfFJkANL5Xh6eo9ptlHbFW8I/44pvUtGi2JwsN487l56XR9DqEKM
>>>>>>> 7Cmj9Ox205YxjqcBjhWIJQTok99lvrhDX9d7HHxIeTcmouvqPz4LTcCySRtC
>>>>>>> xE/GcEGAAYWGPTwf9u8ULm9Rh2Z90OnKpqtCtuuWiwRRL9VU/tLlvqmHvEZM
>>>>>>> 73qhiLQZka5I72B2SAEtJnDt2sX3NJ4unvH4zWKLRFTTm4M0qk6xUL1JfqNz
>>>>>>> JYNo
>>>>>>> =msX2
>>>>>>> -----END PGP SIGNATURE-----
>>>>> 
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: Mailvelope v1.2.0
>>>>> Comment: https://www.mailvelope.com
>>>>> 
>>>>> wsFcBAEBCAAQBQJWFXGPCRDmVDuy+mK58QAAx38P/1sn6TA8hH+F2kd1A2Pq
>>>>> IU2cg1pFcH+kw21G8VO+BavfBaBoSETHEEuMXg5SszTIcL/HyziBLJos0C0j
>>>>> Vu9I0/YtblQ15enzFqKFPosdc7qij9DPJxXRkx41sJZsxvSVky+URcPpcKk6
>>>>> w8Lwuq9IupesQ19ZeJkCEWFVhKz/i2E9/VXfylBgFVlkICD+5pfx6/Aq7nCP
>>>>> 4gboyha07zpPlDqoA7xgT+6v2zlYC80saGcA1m2XaAUdPF/17l6Mq9+Glv7E
>>>>> 3KeUf7jmMTJQRGBZSInFgUpPwUQKvF5OSGb3YQlzofUy5Es+wH3ccqZ+mlIY
>>>>> szuBLAtN6zhFFPCs6016hiragiUhLk97PItXaKdDJKecuyRdShlJrXJmtX+j
>>>>> NdM14TkBPTiLtAd/IZEEhIIpdvQH8YSl3LnEZ5gywggaY4Pk3JLFIJPgLpEb
>>>>> T8hJnuiaQaYxERQ0nRoBL4LAXARseSrOuVt2EAD50Yb/5JEwB9FQlN758rb1
>>>>> AE/xhpK6d53+RlkPODKxXx816hXvDP6NADaC78XGmx+A4FfepdxBijGBsmOQ
>>>>> 7SxAZe469K0E6EAfClc664VzwuvBEZjwTg1eK5Z6VS/FDTH/RxTKeFhlbUIT
>>>>> XpezlP7XZ1/YRrJ/Eg7nb1Dv0MYQdu18tQ6QBv+C1ZsmxYLlHlcf6BZ3gNar
>>>>> rZW5
>>>>> =dKn9
>>>>> -----END PGP SIGNATURE-----
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>>> -- 
>>> WBR, Max A. Krasilnikov
>>> ColoCall Data Center
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> -- 
> WBR, Max A. Krasilnikov
> ColoCall Data Center

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                                                   ` <2FD6AADF-88BB-4DFF-B6C2-E103F16B55A8-SB6/BxVxTjHtwjQa/ONI9g@public.gmane.org>
@ 2015-10-09 14:14                                                                                                                     ` Max A. Krasilnikov
  2015-10-16  8:21                                                                                                                     ` Max A. Krasilnikov
  1 sibling, 0 replies; 45+ messages in thread
From: Max A. Krasilnikov @ 2015-10-09 14:14 UTC (permalink / raw)
  To: Jan Schermer; +Cc: Sage Weil, ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

Hello!

On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote:

> Have you tried running iperf between the nodes? Capturing a pcap of the (failing) Ceph comms from both sides could help narrow it down.
> Is there any SDN layer involved that could add overhead/padding to the frames?

No other layers, only 2x Nexus 5020 with virtual port-channels. I will check
everything else on Monday.
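Roughly what I plan to run (the interface name and the peer address below are
placeholders for our bonded cluster network, not something from this thread):

# bandwidth sanity check between two storage nodes
iperf -s                        # on the first node
iperf -c 10.10.10.2 -t 60 -P 4  # on the second node: 4 parallel streams, 60 s

# capture the Ceph traffic on both sides while the load is running
# (6789 is the default mon port, 6800-7300 the default OSD range)
tcpdump -i bond0 -s 0 -w /tmp/ceph-$(hostname -s).pcap tcp portrange 6789-7300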

> What about some intermediate MTU like 8000 - does that work?

Not tested. I will.
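For the record, the sketch I have in mind for that test (the slave names are just
examples, ours differ; the Nexus stays at 9216, so it accepts anything up to that):

ip link set dev eth2 mtu 8000    # both bond slaves first
ip link set dev eth3 mtu 8000
ip link set dev bond0 mtu 8000   # then the bond itself
ip link show bond0 | grep -o 'mtu [0-9]*'   # verify the new value actually took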

> Oh and if there's any bonding/trunking involved, beware that you need to set the same MTU and offloads on all interfaces on certain kernels - flags like MTU/offloads should propagate between the master/slave interfaces, but in reality that's not the case and they get reset even if you unplug/replug the Ethernet cable.

Yes, I understand that :) I set the parameters on both interfaces and verified
them with "ip link".
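To be concrete, this is the kind of check I mean (slave names are again only examples):

# the MTU has to match on the bond and on every slave
ip link show bond0
ip link show eth2
ip link show eth3

# offloads can silently reset on a slave after a replug, so check them there as well
ethtool -k eth2 | egrep 'segmentation|scatter|offload'
ethtool -k eth3 | egrep 'segmentation|scatter|offload'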

> Jan

>> On 09 Oct 2015, at 13:21, Max A. Krasilnikov <pseudo@colocall.net> wrote:
>> 
>> Hello!
>> 
>> On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:
>> 
>>> Are there any errors on the NICs? (ethtool -s ethX)
>> 
>> No errors. Neither on nodes, nor on switches.
>> 
>>> Also take a look at the switch and look for flow control statistics - do you have flow control enabled or disabled?
>> 
>> flow control disabled everywhere.
>> 
>>> We had to disable flow control as it would pause all IO on the port whenever any path got congested which you don't want to happen with a cluster like Ceph. It's better to let the frame drop/retransmit in this case (and you should size it so it doesn't happen in any case).
>>> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't put my money on that...
>> 
>> I tried completely disabling all offloads and then setting the MTU back to 9000.
>> No luck.
>> I am speaking with my NOC about the MTU in the 10G network. If I have an update, I
>> will write here. I can hardly believe that it is on the Ceph side, but nothing is
>> impossible.
>> 
>>> Jan
>> 
>> 
>>>> On 09 Oct 2015, at 10:48, Max A. Krasilnikov <pseudo@colocall.net> wrote:
>>>> 
>>>> Hello!
>>>> 
>>>> On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
>>>> 
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>> 
>>>>> Sage,
>>>> 
>>>>> After trying to bisect this issue (all test moved the bisect towards
>>>>> Infernalis) and eventually testing the Infernalis branch again, it
>>>>> looks like the problem still exists although it is handled a tad
>>>>> better in Infernalis. I'm going to test against Firefly/Giant next
>>>>> week and then try and dive into the code to see if I can expose any
>>>>> thing.
>>>> 
>>>>> If I can do anything to provide you with information, please let me know.
>>>> 
>>>> I have fixed my troubles by setting the MTU back to 1500 from 9000 in the 2x10G
>>>> network between the nodes (2x Cisco Nexus 5020, one link per switch, LACP, Linux
>>>> bonding driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel
>>>> 82599ES adapter, non-Intel SFP+). When setting it to 9000 on the nodes and 9216 on
>>>> the Nexus 5020 switch with jumbo frames enabled, I get a performance drop and slow
>>>> requests. When setting 1500 on the nodes and not touching the Nexus, all problems are fixed.
>>>> 
>>>> I have rebooted all my Ceph services when changing the MTU, and switched between
>>>> 9000 and 1500 several times in order to be sure. It is reproducible in my
>>>> environment.
>>>> 
>>>>> Thanks,
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: Mailvelope v1.2.0
>>>>> Comment: https://www.mailvelope.com
>>>> 
>>>>> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
>>>>> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
>>>>> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
>>>>> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
>>>>> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
>>>>> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
>>>>> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
>>>>> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
>>>>> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
>>>>> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
>>>>> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
>>>>> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
>>>>> BCFo
>>>>> =GJL4
>>>>> -----END PGP SIGNATURE-----
>>>>> ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> 
>>>> 
>>>>> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> Hash: SHA256
>>>>>> 
>>>>>> We forgot to upload the ceph.log yesterday. It is there now.
>>>>>> - ----------------
>>>>>> Robert LeBlanc
>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA256
>>>>>>> 
>>>>>>> I upped the debug on about everything and ran the test for about 40
>>>>>>> minutes. I took OSD.19 on ceph1 doen and then brought it back in.
>>>>>>> There was at least one op on osd.19 that was blocked for over 1,000
>>>>>>> seconds. Hopefully this will have something that will cast a light on
>>>>>>> what is going on.
>>>>>>> 
>>>>>>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>>>>>>> the test to verify the results from the dev cluster. This cluster
>>>>>>> matches the hardware of our production cluster but is not yet in
>>>>>>> production so we can safely wipe it to downgrade back to Hammer.
>>>>>>> 
>>>>>>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>>>>>> 
>>>>>>> Let me know what else we can do to help.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>> Version: Mailvelope v1.2.0
>>>>>>> Comment: https://www.mailvelope.com
>>>>>>> 
>>>>>>> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
>>>>>>> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
>>>>>>> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
>>>>>>> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
>>>>>>> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
>>>>>>> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
>>>>>>> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
>>>>>>> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
>>>>>>> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
>>>>>>> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
>>>>>>> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
>>>>>>> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
>>>>>>> EDrG
>>>>>>> =BZVw
>>>>>>> -----END PGP SIGNATURE-----
>>>>>>> ----------------
>>>>>>> Robert LeBlanc
>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>> Hash: SHA256
>>>>>>>> 
>>>>>>>> On my second test (a much longer one), it took nearly an hour, but a
>>>>>>>> few messages have popped up over a 20 window. Still far less than I
>>>>>>>> have been seeing.
>>>>>>>> - ----------------
>>>>>>>> Robert LeBlanc
>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>> Hash: SHA256
>>>>>>>>> 
>>>>>>>>> I'll capture another set of logs. Is there any other debugging you
>>>>>>>>> want turned up? I've seen the same thing where I see the message
>>>>>>>>> dispatched to the secondary OSD, but the message just doesn't show up
>>>>>>>>> for 30+ seconds in the secondary OSD logs.
>>>>>>>>> - ----------------
>>>>>>>>> Robert LeBlanc
>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>>>>>>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>> 
>>>>>>>>>>> I can't think of anything. In my dev cluster the only thing that has
>>>>>>>>>>> changed is the Ceph versions (no reboot). What I like is even though
>>>>>>>>>>> the disks are 100% utilized, it is preforming as I expect now. Client
>>>>>>>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>>>>>>>>>> the OSD boots or during the recovery period. This is with
>>>>>>>>>>> max_backfills set to 20, one backfill max in our production cluster is
>>>>>>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>>>>>>>>>> our dev cluster very easily and very quickly with these settings. So
>>>>>>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>>>>>>>>>> marked out. We would love to see that go away too, but this is far
>>>>>>>>>>                                           (me too!)
>>>>>>>>>>> better than what we have now. This dev cluster also has
>>>>>>>>>>> osd_client_message_cap set to default (100).
>>>>>>>>>>> 
>>>>>>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>>>>>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>>>>>>>>>> you you prefer a bisect to find the introduction of the problem
>>>>>>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>>>>>>>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>>>>>>>>>>> commit that prevents a clean build as that is my most limiting factor?
>>>>>>>>>> 
>>>>>>>>>> Nothing comes to mind.  I think the best way to find this is still to see
>>>>>>>>>> it happen in the logs with hammer.  The frustrating thing with that log
>>>>>>>>>> dump you sent is that although I see plenty of slow request warnings in
>>>>>>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>>>>>>>>>> turned up for long enough?
>>>>>>>>>> 
>>>>>>>>>> sage
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> - ----------------
>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>>>>>>>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>> 
>>>>>>>>>>>>> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>>>>>>>>>>>> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>>>>>>>>>>>> messages when the OSD was marked out:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>>>>>>>>>>>> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>>>>>>>>>>>> 34.476006 secs
>>>>>>>>>>>>> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>>>>>>>>>>>> cluster [WRN] slow request 32.913474 seconds old, received at
>>>>>>>>>>>>> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>>>>>>>>>>>> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>>>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>>>> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>>>>>>>>>>>> cluster [WRN] slow request 32.697545 seconds old, received at
>>>>>>>>>>>>> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>>>>>>>>>>>> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>>>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>>>> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>>>>>>>>>>>> cluster [WRN] slow request 32.668006 seconds old, received at
>>>>>>>>>>>>> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>>>>>>>>>>>> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>>>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>>>> 
>>>>>>>>>>>>> But I'm not seeing the blocked messages when the OSD came back in. The
>>>>>>>>>>>>> OSD spindles have been running at 100% during this test. I have seen
>>>>>>>>>>>>> slowed I/O from the clients as expected from the extra load, but so
>>>>>>>>>>>>> far no blocked messages. I'm going to run some more tests.
>>>>>>>>>>>> 
>>>>>>>>>>>> Good to hear.
>>>>>>>>>>>> 
>>>>>>>>>>>> FWIW I looked through the logs and all of the slow request no flag point
>>>>>>>>>>>> messages came from osd.163... and the logs don't show when they arrived.
>>>>>>>>>>>> My guess is this OSD has a slower disk than the others, or something else
>>>>>>>>>>>> funny is going on?
>>>>>>>>>>>> 
>>>>>>>>>>>> I spot checked another OSD at random (60) where I saw a slow request.  It
>>>>>>>>>>>> was stuck peering for 10s of seconds... waiting on a pg log message from
>>>>>>>>>>>> osd.163.
>>>>>>>>>>>> 
>>>>>>>>>>>> sage
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>>>>>>>>>>>>> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>>>>>>>>>>>>> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>>>>>>>>>>>>> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>>>>>>>>>>>>> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>>>>>>>>>>>>> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>>>>>>>>>>>>> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>>>>>>>>>>>>> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>>>>>>>>>>>>> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>>>>>>>>>>>>> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>>>>>>>>>>>>> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>>>>>>>>>>>>> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>>>>>>>>>>>>> fo5a
>>>>>>>>>>>>> =ahEi
>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>>>>>>>>>>>>>> On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> With some off-list help, we have adjusted
>>>>>>>>>>>>>>> osd_client_message_cap=10000. This seems to have helped a bit and we
>>>>>>>>>>>>>>> have seen some OSDs have a value up to 4,000 for client messages. But
>>>>>>>>>>>>>>> it does not solve the problem with the blocked I/O.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> One thing that I have noticed is that almost exactly 30 seconds elapse
>>>>>>>>>>>>>>> between an OSD boots and the first blocked I/O message. I don't know
>>>>>>>>>>>>>>> if the OSD doesn't have time to get it's brain right about a PG before
>>>>>>>>>>>>>>> it starts servicing it or what exactly.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm downloading the logs from yesterday now; sorry it's taking so long.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On another note, I tried upgrading our CentOS dev cluster from Hammer
>>>>>>>>>>>>>>> to master and things didn't go so well. The OSDs would not start
>>>>>>>>>>>>>>> because /var/lib/ceph was not owned by ceph. I chowned the directory
>>>>>>>>>>>>>>> and all OSDs and the OSD then started, but never became active in the
>>>>>>>>>>>>>>> cluster. It just sat there after reading all the PGs. There were
>>>>>>>>>>>>>>> sockets open to the monitor, but no OSD to OSD sockets. I tried
>>>>>>>>>>>>>>> downgrading to the Infernalis branch and still no luck getting the
>>>>>>>>>>>>>>> OSDs to come up. The OSD processes were idle after the initial boot.
>>>>>>>>>>>>>>> All packages were installed from gitbuilder.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Did you chown -R ?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>       https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> My guess is you only chowned the root dir, and the OSD didn't throw
>>>>>>>>>>>>>> an error when it encountered the other files?  If you can generate a debug
>>>>>>>>>>>>>> osd = 20 log, that would be helpful.. thanks!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>>>>>>>>>>>>>>> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>>>>>>>>>>>>>>> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>>>>>>>>>>>>>>> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>>>>>>>>>>>>>>> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>>>>>>>>>>>>>>> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>>>>>>>>>>>>>>> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>>>>>>>>>>>>>>> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>>>>>>>>>>>>>>> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>>>>>>>>>>>>>>> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>>>>>>>>>>>>>>> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>>>>>>>>>>>>>>> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>>>>>>>>>>>>>>> GdXC
>>>>>>>>>>>>>>> =Aigq
>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I have eight nodes running the fio job rbd_test_real to different RBD
>>>>>>>>>>>>>>>> volumes. I've included the CRUSH map in the tarball.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I stopped one OSD process and marked it out. I let it recover for a
>>>>>>>>>>>>>>>> few minutes and then I started the process again and marked it in. I
>>>>>>>>>>>>>>>> started getting block I/O messages during the recovery.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>>>>>>>>>>>>>>>> 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>>>>>>>>>>>>>>>> jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>>>>>>>>>>>>>>>> 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>>>>>>>>>>>>>>>> OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>>>>>>>>>>>>>>>> ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>>>>>>>>>>>>>>>> R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>>>>>>>>>>>>>>>> boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>>>>>>>>>>>>>>>> sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>>>>>>>>>>>>>>>> GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>>>>>>>>>>>>>>>> SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>>>>>>>>>>>>>>>> PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>>>>>>>>>>>>>>>> 3EPx
>>>>>>>>>>>>>>>> =UDIV
>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>>>>>>>>>>>>>>>>> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We are still struggling with this and have tried a lot of different
>>>>>>>>>>>>>>>>>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>>>>>>>>>>>>>>>>> consulting services for non-Red Hat systems. If there are some
>>>>>>>>>>>>>>>>>> certified Ceph consultants in the US that we can do both remote and
>>>>>>>>>>>>>>>>>> on-site engagements, please let us know.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> This certainly seems to be network related, but somewhere in the
>>>>>>>>>>>>>>>>>> kernel. We have tried increasing the network and TCP buffers, number
>>>>>>>>>>>>>>>>>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>>>>>>>>>>>>>>>>>> on the boxes, the disks are busy, but not constantly at 100% (they
>>>>>>>>>>>>>>>>>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>>>>>>>>>>>>>>>>> at a time). There seems to be no reasonable explanation why I/O is
>>>>>>>>>>>>>>>>>> blocked pretty frequently longer than 30 seconds. We have verified
>>>>>>>>>>>>>>>>>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>>>>>>>>>>>>>>>>> network admins have verified that packets are not being dropped in the
>>>>>>>>>>>>>>>>>> switches for these nodes. We have tried different kernels including
>>>>>>>>>>>>>>>>>> the recent Google patch to cubic. This is showing up on three cluster
>>>>>>>>>>>>>>>>>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>>>>>>>>>>>>>>>>>> (from CentOS 7.1) with similar results.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The messages seem slightly different:
>>>>>>>>>>>>>>>>>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>>>>>>>>>>>>>>>>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>> 100.087155 secs
>>>>>>>>>>>>>>>>>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.041999 seconds old, received at
>>>>>>>>>>>>>>>>>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>>>>>>>>>>>>>>>>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>>>>>>>>>>>>>>>>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>>>>>>>>>>>>>>>>> points reached
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I don't know what "no flag points reached" means.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Just that the op hasn't been marked as reaching any interesting points
>>>>>>>>>>>>>>>>> (op->mark_*() calls).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Is it possible to gather a lot with debug ms = 20 and debug osd = 20?
>>>>>>>>>>>>>>>>> It's extremely verbose but it'll let us see where the op is getting
>>>>>>>>>>>>>>>>> blocked.  If you see the "slow request" message it means the op is
>>>>>>>>>>>>>>>>> received by ceph (that's when the clock starts), so I suspect it's not
>>>>>>>>>>>>>>>>> something we can blame on the network stack.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>> 
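For anyone gathering these logs, the verbosity can be raised at runtime and
dropped again afterwards (the logs grow very quickly at these levels); a
minimal sketch:

  # raise OSD and messenger debugging on all OSDs
  ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 20'
  # ...reproduce the slow request, then collect /var/log/ceph/ceph-osd.*.log...
  # and return to something close to the defaults
  ceph tell osd.* injectargs '--debug_osd 0/5 --debug_ms 0/5'
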
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The problem is most pronounced when we have to reboot an OSD node (1
>>>>>>>>>>>>>>>>>> of 13), we will have hundreds of I/Os blocked for sometimes up to 300
>>>>>>>>>>>>>>>>>> seconds. It takes a good 15 minutes for things to settle down. The
>>>>>>>>>>>>>>>>>> production cluster is very busy doing normally 8,000 I/O and peaking
>>>>>>>>>>>>>>>>>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>>>>>>>>>>>>>>>>>> are between 25-50% full. We are currently splitting PGs to distribute
>>>>>>>>>>>>>>>>>> the load better across the disks, but we are having to do this 10 PGs
>>>>>>>>>>>>>>>>>> at a time as we get blocked I/O. We have max_backfills and
>>>>>>>>>>>>>>>>>> max_recovery set to 1, client op priority is set higher than recovery
>>>>>>>>>>>>>>>>>> priority. We tried increasing the number of op threads but this didn't
>>>>>>>>>>>>>>>>>> seem to help. It seems as soon as PGs are finished being checked, they
>>>>>>>>>>>>>>>>>> become active and could be the cause for slow I/O while the other PGs
>>>>>>>>>>>>>>>>>> are being checked.
>>>>>>>>>>>>>>>>>> 
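For reference, the throttling described above corresponds to settings along
these lines in ceph.conf; the values shown are the ones described in this
message, not a recommendation:

  [osd]
      osd max backfills = 1
      osd recovery max active = 1
      osd recovery op priority = 1
      osd client op priority = 63
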
>>>>>>>>>>>>>>>>>> What I don't understand is that the messages are delayed. As soon as
>>>>>>>>>>>>>>>>>> the message is received by the Ceph OSD process, it is very quickly
>>>>>>>>>>>>>>>>>> committed to the journal and a response is sent back to the primary
>>>>>>>>>>>>>>>>>> OSD, which is received very quickly as well. I've adjusted
>>>>>>>>>>>>>>>>>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>>>>>>>>>>>>>>>>>> doesn't solve the main problem. We don't have swap and there is 64 GB
>>>>>>>>>>>>>>>>>> of RAM per node for 10 OSDs.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Is there something that could cause the kernel to receive a packet but not
>>>>>>>>>>>>>>>>>> be able to dispatch it to Ceph, such that it would explain why we
>>>>>>>>>>>>>>>>>> are seeing this blocked I/O for 30+ seconds? Are there any pointers
>>>>>>>>>>>>>>>>>> to tracing Ceph messages from the network buffer through the kernel to
>>>>>>>>>>>>>>>>>> the Ceph process?
>>>>>>>>>>>>>>>>>> 
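One way to approach that question is to check whether bytes are sitting in the
kernel socket buffer while the OSD is not reading them, and to capture the
inter-OSD traffic so the timestamps can be lined up against the OSD log. The
interface, host, and port below are placeholders for the stuck OSD's peer:

  # a growing Recv-Q on an established OSD connection points at the application;
  # an empty Recv-Q while data is missing points back at the network
  watch -n1 'ss -tni state established "( sport = :6804 )"'
  # capture headers only, for later correlation with the OSD log
  tcpdump -i eth2 -s 128 -w osd-peer.pcap host 192.168.55.12 and port 6804
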
>>>>>>>>>>>>>>>>>> We can really use some pointers, no matter how outrageous. We've had
>>>>>>>>>>>>>>>>>> over 6 people looking into this for weeks now and just can't think of
>>>>>>>>>>>>>>>>>> anything else.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>> We dropped the replication on our cluster from 4 to 3 and it looks
>>>>>>>>>>>>>>>>>>> like all the blocked I/O has stopped (no entries in the log for the
>>>>>>>>>>>>>>>>>>> last 12 hours). This makes me believe that there is some issue with
>>>>>>>>>>>>>>>>>>> the number of sockets or some other TCP issue. We have not messed with
>>>>>>>>>>>>>>>>>>> Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>>>>>>>>>>>>>>>>>>> hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>>>>>>>>>>>>>>>>>>> processes and 16K system wide.
>>>>>>>>>>>>>>>>>>> 
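Two quick checks that may help rule the socket theory in or out; the PID
lookup is only an example and assumes a single ceph-osd of interest:

  # effective limits of a running OSD, including 'Max open files'
  cat /proc/$(pidof -s ceph-osd)/limits
  # overall TCP socket usage, including the TIME-WAIT count
  ss -s
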
>>>>>>>>>>>>>>>>>>> Does this seem like the right spot to be looking? What are some
>>>>>>>>>>>>>>>>>>> configuration items we should be looking at?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>>>>>>>>>>>>>>>>>>>> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>>>>>>>>>>>>>>>>>>>> seems that there were some major reworks in the network handling in
>>>>>>>>>>>>>>>>>>>> the kernel to efficiently handle that network rate. If I remember
>>>>>>>>>>>>>>>>>>>> right we also saw a drop in CPU utilization. I'm starting to think
>>>>>>>>>>>>>>>>>>>> that we did see packet loss while congesting our ISLs in our initial
>>>>>>>>>>>>>>>>>>>> testing, but we could not tell where the dropping was happening. We
>>>>>>>>>>>>>>>>>>>> saw some on the switches, but it didn't seem to be bad if we weren't
>>>>>>>>>>>>>>>>>>>> trying to congest things. We probably already saw this issue, just
>>>>>>>>>>>>>>>>>>>> didn't know it.
>>>>>>>>>>>>>>>>>>>> - ----------------
>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>>>>>>>>>>>>>>>>>>>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>>>>>>>>>>>>>>>>>>>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>>>>>>>>>>>>>>>>>>>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>>>>>>>>>>>>>>>>>>>> drivers might cause problems though.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Here's ifconfig from one of the nodes:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> ens513f1: flags=4163  mtu 1500
>>>>>>>>>>>>>>>>>>>>>       inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>>>>>>>>>>>>>>>>>>>>       inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>>>>>>>>>>>>>>>>>>>>       ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>>>>>>>>>>>>>>>>>>>>       RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>>>>>>>>>>>>>>>>>>>>       RX errors 0  dropped 0  overruns 0  frame 0
>>>>>>>>>>>>>>>>>>>>>       TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>>>>>>>>>>>>>>>>>>>>       TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> OK, here is the update on the saga...
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I traced some more of the blocked I/Os and it seems that communication
>>>>>>>>>>>>>>>>>>>>>> between two hosts was worse than between others. I did a two-way ping flood
>>>>>>>>>>>>>>>>>>>>>> between the two hosts using max packet sizes (1500). After 1.5M
>>>>>>>>>>>>>>>>>>>>>> packets, no lost pings. I then had the ping flood running while I
>>>>>>>>>>>>>>>>>>>>>> put Ceph load on the cluster, and the dropped pings started increasing;
>>>>>>>>>>>>>>>>>>>>>> after stopping the Ceph workload the pings stopped dropping.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I then ran iperf between all the nodes with the same results, so that
>>>>>>>>>>>>>>>>>>>>>> ruled out Ceph to a large degree. I then booted into the
>>>>>>>>>>>>>>>>>>>>>> 3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far there
>>>>>>>>>>>>>>>>>>>>>> haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>>>>>>>>>>>>>>>>>>>>> need the network enhancements in the 4.x series to work well.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>>>>>>>>>>>>>>>>>>>>> kernel to see where this issue is introduced. Both of the clusters
>>>>>>>>>>>>>>>>>>>>>> with this issue are running 4.x; other than that, they have pretty
>>>>>>>>>>>>>>>>>>>>>> different hardware and network configs.
>>>>>>>>>>>>>>>>>>>>>> 
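In case it helps anyone attempting the same thing, a kernel bisect between a
known-good and a known-bad version looks roughly like this; the tags are
examples, substitute whatever versions bracket the behavior:

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
  cd linux-stable
  git bisect start
  git bisect bad v4.2      # a kernel that shows the blocked I/O
  git bisect good v3.10    # a kernel that behaves
  # build and boot the suggested commit, run the fio test, then mark it:
  git bisect good          # or: git bisect bad
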
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>>>>>>>>>>>>>>>>>>>>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>>>>>>>>>>>>>>>>>>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>>>>>>>>>>>>>>>>>>>>> found that an MTU under 32K worked reliably for ping, but still had the
>>>>>>>>>>>>>>>>>>>>>>> blocked I/O.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>>>>>>>>>>>>>>>>>>>>> the blocked I/O.
>>>>>>>>>>>>>>>>>>>>>>> - ----------------
>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>>>>>>>>>>>>>>>>>>>>>>>> between when osd.17 started sending the osd_repop message and when
>>>>>>>>>>>>>>>>>>>>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>>>>>>>>>>>>>>>>>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>>>>>>>>>>>>>>>>>>>>>> delayed for many 10s of seconds?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>>>>>>>>>>>>>>>>>>>>>> has
>>>>>>>>>>>>>>>>>>>>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> What kernel are you running?
>>>>>>>>>>>>>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>>>>>>>>>>>>>>>>>>>>>>> extracted what I think are important entries from the logs for the
>>>>>>>>>>>>>>>>>>>>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>>>>>>>>>>>>>>>>>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>>>>>>>>>>>>>>>>>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>>>>>>>>>>>>>>>>>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>>>>>>>>>>>>>>>>>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>>>>>>>>>>>>>>>>>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>>>>>>>>>>>>>>>>>>>>>>> transfer).
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>>>>>>>>>>>>>>>>>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>>>>>>>>>>>>>>>>>>>>>>> later. To me it seems that the message is getting received but not
>>>>>>>>>>>>>>>>>>>>>>>>>> passed to another thread right away or something. This test was done
>>>>>>>>>>>>>>>>>>>>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>>>>>>>>>>>>>>>>>>>>>>> thread.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>>>>>>>>>>>>>>>>>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>>>>>>>>>>>>>>>>>>>>>>> some help.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Single Test started about
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:52:36
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>>> 30.439150 secs
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.487451:
>>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,16
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>>>>>>>>>>>>>>>>>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>>> 30.379680 secs
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>>>>>>>>>>>>>>>>>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>>>>>>>>>>>>>>>>>>>>>>> 12:55:06.406303:
>>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,17
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>>>>>>>>>>>>>>>>>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>>>>>>>>>>>>>>>>>>>>>>> 12:55:06.318144:
>>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,14
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>>> 30.954212 secs
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:57:33.044003:
>>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 16,17
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>>> 30.704367 secs
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:57:33.055404:
>>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,17
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Server   IP addr              OSD
>>>>>>>>>>>>>>>>>>>>>>>>>> nodev  - 192.168.55.11 - 12
>>>>>>>>>>>>>>>>>>>>>>>>>> nodew  - 192.168.55.12 - 13
>>>>>>>>>>>>>>>>>>>>>>>>>> nodex  - 192.168.55.13 - 16
>>>>>>>>>>>>>>>>>>>>>>>>>> nodey  - 192.168.55.14 - 17
>>>>>>>>>>>>>>>>>>>>>>>>>> nodez  - 192.168.55.15 - 14
>>>>>>>>>>>>>>>>>>>>>>>>>> nodezz - 192.168.55.16 - 15
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> fio job:
>>>>>>>>>>>>>>>>>>>>>>>>>> [rbd-test]
>>>>>>>>>>>>>>>>>>>>>>>>>> readwrite=write
>>>>>>>>>>>>>>>>>>>>>>>>>> blocksize=4M
>>>>>>>>>>>>>>>>>>>>>>> ####runtime=60
>>>>>>>>>>>>>>>>>>>>>>>>>> name=rbd-test
>>>>>>>>>>>>>>>>>>>>>>> ####readwrite=randwrite
>>>>>>>>>>>>>>>>>>>>>>> ####bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>>>>>>>>>>>>>>>>>>>> ####rwmixread=72
>>>>>>>>>>>>>>>>>>>>>>> ####norandommap
>>>>>>>>>>>>>>>>>>>>>>> ####size=1T
>>>>>>>>>>>>>>>>>>>>>>> ####blocksize=4k
>>>>>>>>>>>>>>>>>>>>>>>>>> ioengine=rbd
>>>>>>>>>>>>>>>>>>>>>>>>>> rbdname=test2
>>>>>>>>>>>>>>>>>>>>>>>>>> pool=rbd
>>>>>>>>>>>>>>>>>>>>>>>>>> clientname=admin
>>>>>>>>>>>>>>>>>>>>>>>>>> iodepth=8
>>>>>>>>>>>>>>>>>>>>>>> ####numjobs=4
>>>>>>>>>>>>>>>>>>>>>>> ####thread
>>>>>>>>>>>>>>>>>>>>>>> ####group_reporting
>>>>>>>>>>>>>>>>>>>>>>> ####time_based
>>>>>>>>>>>>>>>>>>>>>>> ####direct=1
>>>>>>>>>>>>>>>>>>>>>>> ####ramp_time=60
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> You can search for the (mangled) name _split_collection
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm not
>>>>>>>>>>>>>>>>>>>>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>>>>>>>>>>>>>>>>>>>>>>> this, it was discussed not too long ago.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
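For anyone who finds this thread later, the two suggestions above translate to
roughly the following; the threshold values are only illustrative and changing
them requires an OSD restart:

  # check whether collection splits line up with the blocked I/O
  grep -i _split_collection /var/log/ceph/ceph-osd.*.log
  # then raise the split/merge thresholds in ceph.conf
  [osd]
      filestore merge threshold = 40
      filestore split multiple = 8
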
>>>>>>>>>>>>>>>>>>>>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>>>>>>>>>>>>>>>>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> beginning) is still causing the issue. I suspected that it is not
>>>>>>>>>>>>>>>>>>>>>>>>>>>> having to create new files and therefore not splitting collections. This is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>>>>> my test cluster with no other load.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>>>>>>>>>>>>>>>>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>>>>>>>>>>>>>>>>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>>>>>>>>>>>>>>>>>>>>>>>> would be the most helpful for tracking this issue down?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>>>>>>>>>>>>>>>>>>>>>>>>>> 20",
>>>>>>>>>>>>>>>>>>>>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>>>>>>>>>>>>>>>>>>>>>>>>>> out
>>>>>>>>>>>>>>>>>>>>>>>>>>> everything you need to track exactly what each Op is doing.
>>>>>>>>>>>>>>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>>> -- 
>>>> WBR, Max A. Krasilnikov
>>>> ColoCall Data Center
>> 
>> 
>> -- 
>> WBR, Max A. Krasilnikov
>> ColoCall Data Center


-- 
WBR, Max A. Krasilnikov
ColoCall Data Center

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [ceph-users] Potential OSD deadlock?
       [not found]                                                                                                   ` <CAANLjFquEvjDDT94ZL2mXQh5r_XWCxw3X=eFZ=c29gNHKt=2tw@mail.gmail.com>
@ 2015-10-13 17:03                                                                                                     ` Sage Weil
       [not found]                                                                                                       ` <alpine.DEB.2.00.1510130956130.6589-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Sage Weil @ 2015-10-13 17:03 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-users, ceph-devel

On Mon, 12 Oct 2015, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> After a weekend, I'm ready to hit this from a different direction.
> 
> I replicated the issue with Firefly, so it doesn't seem to be an issue that
> has been introduced or resolved in any nearby version. I think overall
> we may be seeing [1] to a great degree. From what I can extract from
> the logs, it looks like in situations where OSDs are going up and
> down, I see I/O blocked at the primary OSD waiting for peering and/or
> the PG to become clean before dispatching the I/O to the replicas.
> 
> In an effort to understand the flow of the logs, I've attached a small
> 2 minute segment of a log from which I've extracted what I believe to be
> the important entries in the life cycle of an I/O, along with my
> understanding. If someone would be kind enough to help my
> understanding, I would appreciate it.
> 
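A shortcut that can help with this kind of life-cycle question: the OSD admin
socket keeps a per-op event timeline (queued_for_pg, reached_pg, waiting for
subops, commit_sent, and so on), which is often easier to follow than the raw
log. A sketch, assuming the admin socket is in its default location:

  # ops currently in flight, with the events each one has hit so far
  ceph daemon osd.4 dump_ops_in_flight
  # the slowest recently completed ops and their per-event timestamps
  ceph daemon osd.4 dump_historic_ops
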
> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
> >> 192.168.55.12:0/2013622 pipe(0x26c90000 sd=47 :6800 s=2 pgs=2 cs=1
> l=1 c=0x32c85440).reader got message 19 0x2af81700
> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
> 
> - ->Messenger has received the message from the client (previous
> entries in the 7fb9d2c68700 thread are the individual segments that
> make up this message).
> 
> 2015-10-12 14:12:36.537963 7fb9d2c68700  1 -- 192.168.55.16:6800/11295
> <== client.6709 192.168.55.12:0/2013622 19 ====
> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
> ==== 235+0+4194304 (2317308138 0 2001296353) 0x2af81700 con 0x32c85440
> 
> - ->OSD process acknowledges that it has received the write.
> 
> 2015-10-12 14:12:36.538096 7fb9d2c68700 15 osd.4 44 enqueue_op
> 0x3052b300 prio 63 cost 4194304 latency 0.012371
> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
> 
> - ->Not sure exactly what is going on here; the op is being enqueued somewhere...
> 
> 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op
> 0x3052b300 prio 63 cost 4194304 latency 30.017094
> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v
> 5 pg pg[0.29( v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c
> 40/44 32/32/10) [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702
> active+clean]
> 
> - ->The op is dequeued from this mystery queue 30 seconds later in a
> different thread.

^^ This is the problem.  Everything after this looks reasonable.  Looking 
at the other dequeue_op calls over this period, it looks like we're just 
overwhelmed with higher priority requests.  New client ops are priority 63, while
osd_repop (a replicated write from another primary) is 127, and replies from
our own replicated ops are 196.  We do process a few other prio 63 items,
but you'll see that their latency is also climbing up to 30s over this 
period.
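
A rough way to see that pattern across the whole log is to bucket the
dequeue_op latencies by priority; the log file name is just an example of
wherever the OSD in question is logging:

  grep ' dequeue_op ' ceph-osd.4.log | \
    awk '{p=""; l=""; for (i=1;i<=NF;i++) {if ($i=="prio") p=$(i+1); if ($i=="latency") l=$(i+1)}
          if (p!="" && l!="") {n[p]++; s[p]+=l; if (l>m[p]) m[p]=l}}
         END {for (p in n) printf "prio %s: %d ops, avg wait %.3fs, max %.3fs\n", p, n[p], s[p]/n[p], m[p]}'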

The question is why we suddenly get a lot of them.. maybe the peering on 
other OSDs just completed so we get a bunch of these?  It's also not clear 
to me what makes osd.4 or this op special.  We expect a mix of primary and 
replica ops on all the OSDs, so why would we suddenly have more of them 
here....

sage


> 
> 2015-10-12 14:13:06.542912 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
> do_op osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
> may_write -> write-ordered flags ack+ondisk+write+known_if_redirected
> 
> - ->Not sure what this message is. Look up of secondary OSDs?
> 
> 2015-10-12 14:13:06.544999 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
> new_repop rep_tid 17815 on osd_op(client.6709.0:67
> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> ack+ondisk+write+known_if_redirected e44) v5
> 
> - ->Dispatch write to secondary OSDs?
> 
> 2015-10-12 14:13:06.545116 7fb9e2d3a700  1 -- 192.168.55.16:6801/11295
> --> 192.168.55.15:6801/32036 -- osd_repop(client.6709.0:67 0.29
> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> -- ?+4195078 0x238fd600 con 0x32bcb5a0
> 
> - ->OSD dispatch write to OSD.0.
> 
> 2015-10-12 14:13:06.545132 7fb9e2d3a700 20 -- 192.168.55.16:6801/11295
> submit_message osd_repop(client.6709.0:67 0.29
> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> remote, 192.168.55.15:6801/32036, have pipe.
> 
> - ->Message sent to OSD.0.
> 
> 2015-10-12 14:13:06.545195 7fb9e2d3a700  1 -- 192.168.55.16:6801/11295
> --> 192.168.55.11:6801/13185 -- osd_repop(client.6709.0:67 0.29
> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> -- ?+4195078 0x16edd200 con 0x3a37b20
> 
> - ->OSD dispatch write to OSD.5.
> 
> 2015-10-12 14:13:06.545210 7fb9e2d3a700 20 -- 192.168.55.16:6801/11295
> submit_message osd_repop(client.6709.0:67 0.29
> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> remote, 192.168.55.11:6801/13185, have pipe.
> 
> - ->Message sent to OSD.5.
> 
> 2015-10-12 14:13:06.545229 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
> append_log log((0'0,44'703], crt=44'700) [44'704 (44'691) modify
> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
> client.6709.0:67 2015-10-12 14:12:34.340082]
> 2015-10-12 14:13:06.545268 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'700 lcod 44'702 mlcod
> 44'702 active+clean] add_log_entry 44'704 (44'691) modify
> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
> client.6709.0:67 2015-10-12 14:12:34.340082
> 
> - ->These record the OP in the journal log?
> 
> 2015-10-12 14:13:06.563241 7fb9d326e700 20 -- 192.168.55.16:6801/11295
> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
> cs=3 l=0 c=0x3a37b20).writer encoding 17337 features 37154696925806591
> 0x16edd200 osd_repop(client.6709.0:67 0.29
> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> 
> - ->Writing the data to OSD.5?
> 
> 2015-10-12 14:13:06.573938 7fb9d3874700 10 -- 192.168.55.16:6801/11295
> >> 192.168.55.15:6801/32036 pipe(0x3f96000 sd=176 :6801 s=2 pgs=8 cs=3
> l=0 c=0x32bcb5a0).reader got ack seq 1206 >= 1206 on 0x238fd600
> osd_repop(client.6709.0:67 0.29
> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> 
> - ->Messenger gets ACK from OSD.0 that it received that last packet?
> 
> 2015-10-12 14:13:06.613425 7fb9d3874700 10 -- 192.168.55.16:6801/11295
> >> 192.168.55.15:6801/32036 pipe(0x3f96000 sd=176 :6801 s=2 pgs=8 cs=3
> l=0 c=0x32bcb5a0).reader got message 1146 0x3ffa480
> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
> 
> - ->Messenger receives ack on disk from OSD.0.
> 
> 2015-10-12 14:13:06.613447 7fb9d3874700  1 -- 192.168.55.16:6801/11295
> <== osd.0 192.168.55.15:6801/32036 1146 ====
> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 ====
> 83+0+0 (2772408781 0 0) 0x3ffa480 con 0x32bcb5a0
> 
> - ->OSD process gets on disk ACK from OSD.0.
> 
> 2015-10-12 14:13:06.613478 7fb9d3874700 10 osd.4 44 handle_replica_op
> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 epoch 44
> 
> - ->Primary OSD records the ACK (duplicate message?). Not sure how to
> correlate that to the previous message other than by time.
> 
> 2015-10-12 14:13:06.613504 7fb9d3874700 15 osd.4 44 enqueue_op
> 0x120f9b00 prio 196 cost 0 latency 0.000250
> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
> 
> - ->The reply is enqueued onto a mystery queue.
> 
> 2015-10-12 14:13:06.627793 7fb9d6afd700 10 -- 192.168.55.16:6801/11295
> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
> cs=3 l=0 c=0x3a37b20).reader got ack seq 17337 >= 17337 on 0x16edd200
> osd_repop(client.6709.0:67 0.29
> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> 
> - ->Messenger gets ACK from OSD.5 that it received that last packet?
> 
> 2015-10-12 14:13:06.628364 7fb9d6afd700 10 -- 192.168.55.16:6801/11295
> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
> cs=3 l=0 c=0x3a37b20).reader got message 16477 0x21cef3c0
> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
> 
> - ->Messenger receives ack on disk from OSD.5.
> 
> 2015-10-12 14:13:06.628382 7fb9d6afd700  1 -- 192.168.55.16:6801/11295
> <== osd.5 192.168.55.11:6801/13185 16477 ====
> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 ====
> 83+0+0 (2104182993 0 0) 0x21cef3c0 con 0x3a37b20
> 
> - ->OSD process gets on disk ACK from OSD.5.
> 
> 2015-10-12 14:13:06.628406 7fb9d6afd700 10 osd.4 44 handle_replica_op
> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 epoch 44
> 
> - ->Primary OSD records the ACK (duplicate message?). Not sure how to
> correlate that to the previous message other than by time.
> 
> 2015-10-12 14:13:06.628426 7fb9d6afd700 15 osd.4 44 enqueue_op
> 0x3e41600 prio 196 cost 0 latency 0.000180
> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
> 
> - ->The reply is enqueued onto a mystery queue.
> 
> 2015-10-12 14:13:07.124206 7fb9f4e9f700  0 log_channel(cluster) log
> [WRN] : slow request 30.598371 seconds old, received at 2015-10-12
> 14:12:36.525724: osd_op(client.6709.0:67
> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> ack+ondisk+write+known_if_redirected e44) currently waiting for subops
> from 0,5
> 
> - ->OP has not been dequeued to the client from the mystery queue yet.
> 
> 2015-10-12 14:13:07.278449 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod 44'702 mlcod
> 44'702 active+clean] eval_repop repgather(0x37ea3cc0 44'704
> rep_tid=17815 committed?=0 applied?=0 lock=0
> op=osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5)
> wants=ad
> 
> - ->Not sure what this means. The OP has been completed on all replicas?
> 
> 2015-10-12 14:13:07.278566 7fb9e0535700 10 osd.4 44 dequeue_op
> 0x120f9b00 prio 196 cost 0 latency 0.665312
> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 pg
> pg[0.29( v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44
> 32/32/10) [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod
> 44'702 mlcod 44'702 active+clean]
> 
> - ->One of the replica OPs is dequeued in a different thread
> 
> 2015-10-12 14:13:07.278809 7fb9e0535700 10 osd.4 44 dequeue_op
> 0x3e41600 prio 196 cost 0 latency 0.650563
> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 pg
> pg[0.29( v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44
> 32/32/10) [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod
> 44'702 mlcod 44'702 active+clean]
> 
> - ->The other replica OP is dequeued in the new thread
> 
> 2015-10-12 14:13:07.967469 7fb9efe95700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
> active+clean] eval_repop repgather(0x37ea3cc0 44'704 rep_tid=17815
> committed?=1 applied?=0 lock=0 op=osd_op(client.6709.0:67
> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> ack+ondisk+write+known_if_redirected e44) v5) wants=ad
> 
> - ->Not sure what this does. A thread that joins the replica OPs with
> the primary OP?
> 
> 2015-10-12 14:13:07.967515 7fb9efe95700 15 osd.4 pg_epoch: 44 pg[0.29(
> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
> active+clean] log_op_stats osd_op(client.6709.0:67
> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> ack+ondisk+write+known_if_redirected e44) v5 inb 4194304 outb 0 rlat
> 0.000000 lat 31.441789
> 
> - ->Logs that the write has been committed to all replicas in the
> primary journal?
> 
> Not sure what the rest of these do, nor do I understand where the
> client gets an ACK that the write is committed.
> 
> 2015-10-12 14:13:07.967583 7fb9efe95700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
> active+clean]  sending commit on repgather(0x37ea3cc0 44'704
> rep_tid=17815 committed?=1 applied?=0 lock=0
> op=osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5)
> 0x3a2f0840
> 
> 2015-10-12 14:13:10.351452 7fb9f0696700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'702 active+clean]
> eval_repop repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
> applied?=1 lock=0 op=osd_op(client.6709.0:67
> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> ack+ondisk+write+known_if_redirected e44) v5) wants=ad
> 
> 2015-10-12 14:13:10.354089 7fb9f0696700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
> removing repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
> applied?=1 lock=0 op=osd_op(client.6709.0:67
> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> ack+ondisk+write+known_if_redirected e44) v5)
> 
> 2015-10-12 14:13:10.354163 7fb9f0696700 20 osd.4 pg_epoch: 44 pg[0.29(
> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
>  q front is repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
> applied?=1 lock=0 op=osd_op(client.6709.0:67
> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> ack+ondisk+write+known_if_redirected e44) v5)
> 
> 2015-10-12 14:13:10.354199 7fb9f0696700 20 osd.4 pg_epoch: 44 pg[0.29(
> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
> remove_repop repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
> applied?=1 lock=0 op=osd_op(client.6709.0:67
> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> ack+ondisk+write+known_if_redirected e44) v5)
> 
> 2015-10-12 14:13:15.488448 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> v 44'707 (0'0,44'707] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> [4,5,0] r=0 lpr=32 luod=44'705 lua=44'705 crt=44'704 lcod 44'704 mlcod
> 44'704 active+clean] append_log: trimming to 44'704 entries 44'704
> (44'691) modify
> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
> client.6709.0:67 2015-10-12 14:12:34.340082
> 
> Thanks for hanging in there with me on this...
> 
> [1] http://www.spinics.net/lists/ceph-devel/msg26633.html
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Thu, Oct 8, 2015 at 11:44 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA256
> >
> > Sage,
> >
> > After trying to bisect this issue (all tests moved the bisect towards
> > Infernalis) and eventually testing the Infernalis branch again, it
> > looks like the problem still exists although it is handled a tad
> > better in Infernalis. I'm going to test against Firefly/Giant next
> > week and then try and dive into the code to see if I can expose any
> > thing.
> >
> > If I can do anything to provide you with information, please let me know.
> >
> > Thanks,
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> We forgot to upload the ceph.log yesterday. It is there now.
> >> - ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA256
> >>>
> >>> I upped the debug on about everything and ran the test for about 40
> >>> minutes. I took OSD.19 on ceph1 down and then brought it back in.
> >>> There was at least one op on osd.19 that was blocked for over 1,000
> >>> seconds. Hopefully this will have something that will cast a light on
> >>> what is going on.
> >>>
> >>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
> >>> the test to verify the results from the dev cluster. This cluster
> >>> matches the hardware of our production cluster but is not yet in
> >>> production so we can safely wipe it to downgrade back to Hammer.
> >>>
> >>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
> >>>
> >>> Let me know what else we can do to help.
> >>>
> >>> Thanks,
> >>> ----------------
> >>> Robert LeBlanc
> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>
> >>>
> >>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>> Hash: SHA256
> >>>>
> >>>> On my second test (a much longer one), it took nearly an hour, but a
> >>>> few messages have popped up over a 20 minute window. Still far less than I
> >>>> have been seeing.
> >>>> - ----------------
> >>>> Robert LeBlanc
> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>
> >>>>
> >>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>> Hash: SHA256
> >>>>>
> >>>>> I'll capture another set of logs. Is there any other debugging you
> >>>>> want turned up? I've seen the same thing where I see the message
> >>>>> dispatched to the secondary OSD, but the message just doesn't show up
> >>>>> for 30+ seconds in the secondary OSD logs.
> >>>>> - ----------------
> >>>>> Robert LeBlanc
> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>
> >>>>>
> >>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
> >>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> >>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>> Hash: SHA256
> >>>>>>>
> >>>>>>> I can't think of anything. In my dev cluster the only thing that has
> >>>>>>> changed is the Ceph versions (no reboot). What I like is even though
> >>>>>>> the disks are 100% utilized, it is performing as I expect now. Client
> >>>>>>> I/O is slightly degraded during the recovery, but there is no blocked I/O when
> >>>>>>> the OSD boots or during the recovery period. This is with
> >>>>>>> max_backfills set to 20; one backfill max in our production cluster is
> >>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
> >>>>>>> our dev cluster very easily and very quickly with these settings. So
> >>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
> >>>>>>> marked out. We would love to see that go away too, but this is far
> >>>>>>                                             (me too!)
> >>>>>>> better than what we have now. This dev cluster also has
> >>>>>>> osd_client_message_cap set to default (100).
> >>>>>>>
> >>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
> >>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
> >>>>>>> would you prefer a bisect to find the introduction of the problem
> >>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
> >>>>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
> >>>>>>> commit that prevents a clean build, as that is my most limiting factor?
> >>>>>>
> >>>>>> Nothing comes to mind.  I think the best way to find this is still to see
> >>>>>> it happen in the logs with hammer.  The frustrating thing with that log
> >>>>>> dump you sent is that although I see plenty of slow request warnings in
> >>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
> >>>>>> turned up for long enough?
> >>>>>>
> >>>>>> sage
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> Thanks,
> >>>>>>> - ----------------
> >>>>>>> Robert LeBlanc
> >>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
> >>>>>>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> >>>>>>> >> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>> >> Hash: SHA256
> >>>>>>> >>
> >>>>>>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
> >>>>>>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
> >>>>>>> >> messages when the OSD was marked out:
> >>>>>>> >>
> >>>>>>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
> >>>>>>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
> >>>>>>> >> 34.476006 secs
> >>>>>>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
> >>>>>>> >> cluster [WRN] slow request 32.913474 seconds old, received at
> >>>>>>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
> >>>>>>> >> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >>>>>>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
> >>>>>>> >> cluster [WRN] slow request 32.697545 seconds old, received at
> >>>>>>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
> >>>>>>> >> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >>>>>>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
> >>>>>>> >> cluster [WRN] slow request 32.668006 seconds old, received at
> >>>>>>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
> >>>>>>> >> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >>>>>>> >>
> >>>>>>> >> But I'm not seeing the blocked messages when the OSD came back in. The
> >>>>>>> >> OSD spindles have been running at 100% during this test. I have seen
> >>>>>>> >> slowed I/O from the clients as expected from the extra load, but so
> >>>>>>> >> far no blocked messages. I'm going to run some more tests.
> >>>>>>> >
> >>>>>>> > Good to hear.
> >>>>>>> >
> >>>>>>> > FWIW I looked through the logs and all of the slow request no flag point
> >>>>>>> > messages came from osd.163... and the logs don't show when they arrived.
> >>>>>>> > My guess is this OSD has a slower disk than the others, or something else
> >>>>>>> > funny is going on?
> >>>>>>> >
> >>>>>>> > I spot checked another OSD at random (60) where I saw a slow request.  It
> >>>>>>> > was stuck peering for 10s of seconds... waiting on a pg log message from
> >>>>>>> > osd.163.
> >>>>>>> >
> >>>>>>> > sage
> >>>>>>> >
> >>>>>>> >
> >>>>>>> >>
> >>>>>>> >> ----------------
> >>>>>>> >> Robert LeBlanc
> >>>>>>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>> >>
> >>>>>>> >>
> >>>>>>> >> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
> >>>>>>> >> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> >>>>>>> >> >> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>> >> >> Hash: SHA256
> >>>>>>> >> >>
> >>>>>>> >> >> With some off-list help, we have adjusted
> >>>>>>> >> >> osd_client_message_cap=10000. This seems to have helped a bit and we
> >>>>>>> >> >> have seen some OSDs have a value up to 4,000 for client messages. But
> >>>>>>> >> >> it does not solve the problem with the blocked I/O.
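
For reference, that throttle is an ordinary ceph.conf override; a hedged
illustration of the setting being described (the value is only an example):

    [osd]
        osd client message cap = 10000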
> >>>>>>> >> >>
> >>>>>>> >> >> One thing that I have noticed is that almost exactly 30 seconds elapse
> >>>>>>> >> >> between when an OSD boots and the first blocked I/O message. I don't know
> >>>>>>> >> >> if the OSD doesn't have time to get its brain right about a PG before
> >>>>>>> >> >> it starts servicing it or what exactly.
> >>>>>>> >> >
> >>>>>>> >> > I'm downloading the logs from yesterday now; sorry it's taking so long.
> >>>>>>> >> >
> >>>>>>> >> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
> >>>>>>> >> >> to master and things didn't go so well. The OSDs would not start
> >>>>>>> >> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
> >>>>>>> >> >> on all OSDs, and the OSDs then started, but never became active in the
> >>>>>>> >> >> cluster. It just sat there after reading all the PGs. There were
> >>>>>>> >> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
> >>>>>>> >> >> downgrading to the Infernalis branch and still no luck getting the
> >>>>>>> >> >> OSDs to come up. The OSD processes were idle after the initial boot.
> >>>>>>> >> >> All packages were installed from gitbuilder.
> >>>>>>> >> >
> >>>>>>> >> > Did you chown -R ?
> >>>>>>> >> >
> >>>>>>> >> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
> >>>>>>> >> >
> >>>>>>> >> > My guess is you only chowned the root dir, and the OSD didn't throw
> >>>>>>> >> > an error when it encountered the other files?  If you can generate a debug
> >>>>>>> >> > osd = 20 log, that would be helpful.. thanks!
> >>>>>>> >> >
> >>>>>>> >> > sage
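
For reference, a hedged sketch of the recursive ownership fix being pointed
at (ceph:ceph is the user/group the Infernalis packages create; adjust the
paths and the service commands to your own layout and init system):

    service ceph stop osd.19
    chown -R ceph:ceph /var/lib/ceph /var/log/ceph
    service ceph start osd.19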
> >>>>>>> >> >
> >>>>>>> >> >
> >>>>>>> >> >>
> >>>>>>> >> >> Thanks,
> >>>>>>> >> >> ----------------
> >>>>>>> >> >> Robert LeBlanc
> >>>>>>> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>> >> >>
> >>>>>>> >> >>
> >>>>>>> >> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
> >>>>>>> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>> >> >> > Hash: SHA256
> >>>>>>> >> >> >
> >>>>>>> >> >> > I have eight nodes running the fio job rbd_test_real to different RBD
> >>>>>>> >> >> > volumes. I've included the CRUSH map in the tarball.
> >>>>>>> >> >> >
> >>>>>>> >> >> > I stopped one OSD process and marked it out. I let it recover for a
> >>>>>>> >> >> > few minutes and then I started the process again and marked it in. I
> >>>>>>> >> >> > started getting blocked I/O messages during the recovery.
> >>>>>>> >> >> >
> >>>>>>> >> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
> >>>>>>> >> >> >
> >>>>>>> >> >> > Thanks,
> >>>>>>> >> >> >
> >>>>>>> >> >> > ----------------
> >>>>>>> >> >> > Robert LeBlanc
> >>>>>>> >> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>> >> >> >
> >>>>>>> >> >> >
> >>>>>>> >> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
> >>>>>>> >> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
> >>>>>>> >> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>> >> >> >>> Hash: SHA256
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>> We are still struggling with this and have tried a lot of different
> >>>>>>> >> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
> >>>>>>> >> >> >>> consulting services for non-Red Hat systems. If there are any
> >>>>>>> >> >> >>> certified Ceph consultants in the US with whom we can do both remote and
> >>>>>>> >> >> >>> on-site engagements, please let us know.
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>> This certainly seems to be network related, but somewhere in the
> >>>>>>> >> >> >>> kernel. We have tried increasing the network and TCP buffers and the number
> >>>>>>> >> >> >>> of TCP sockets, and reducing the FIN_WAIT2 timeout. There is about 25% idle
> >>>>>>> >> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
> >>>>>>> >> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
> >>>>>>> >> >> >>> at a time). There seems to be no reasonable explanation why I/O is
> >>>>>>> >> >> >>> blocked pretty frequently for longer than 30 seconds. We have verified
> >>>>>>> >> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> >>>>>>> >> >> >>> network admins have verified that packets are not being dropped in the
> >>>>>>> >> >> >>> switches for these nodes. We have tried different kernels including
> >>>>>>> >> >> >>> the recent Google patch to cubic. This is showing up on three clusters
> >>>>>>> >> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> >>>>>>> >> >> >>> (from CentOS 7.1) with similar results.
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>> The messages seem slightly different:
> >>>>>>> >> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> >>>>>>> >> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> >>>>>>> >> >> >>> 100.087155 secs
> >>>>>>> >> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> >>>>>>> >> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
> >>>>>>> >> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> >>>>>>> >> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
> >>>>>>> >> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> >>>>>>> >> >> >>> points reached
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>> I don't know what "no flag points reached" means.
> >>>>>>> >> >> >>
> >>>>>>> >> >> >> Just that the op hasn't been marked as reaching any interesting points
> >>>>>>> >> >> >> (op->mark_*() calls).
> >>>>>>> >> >> >>
> >>>>>>> >> >> >> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
> >>>>>>> >> >> >> It's extremely verbose but it'll let us see where the op is getting
> >>>>>>> >> >> >> blocked.  If you see the "slow request" message it means the op is
> >>>>>>> >> >> >> received by ceph (that's when the clock starts), so I suspect it's not
> >>>>>>> >> >> >> something we can blame on the network stack.
> >>>>>>> >> >> >>
> >>>>>>> >> >> >> sage
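
For reference, one way to raise those log levels at runtime (osd.13 is just
an example id; the same options can also be set persistently under [osd] in
ceph.conf):

    ceph tell osd.13 injectargs '--debug_ms 20 --debug_osd 20'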
> >>>>>>> >> >> >>
> >>>>>>> >> >> >>
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
> >>>>>>> >> >> >>> of 13): we will have hundreds of I/Os blocked, sometimes for up to 300
> >>>>>>> >> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
> >>>>>>> >> >> >>> production cluster is very busy, normally doing 8,000 I/O and peaking
> >>>>>>> >> >> >>> at 15,000. This is all 4TB spindles with SSD journals, and the disks
> >>>>>>> >> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
> >>>>>>> >> >> >>> the load better across the disks, but we are having to do this 10 PGs
> >>>>>>> >> >> >>> at a time as we get blocked I/O. We have max_backfills and
> >>>>>>> >> >> >>> max_recovery set to 1, client op priority is set higher than recovery
> >>>>>>> >> >> >>> priority. We tried increasing the number of op threads but this didn't
> >>>>>>> >> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
> >>>>>>> >> >> >>> become active and could be the cause for slow I/O while the other PGs
> >>>>>>> >> >> >>> are being checked.
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>> What I don't understand is why the messages are delayed. As soon as
> >>>>>>> >> >> >>> the message is received by the Ceph OSD process, it is very quickly
> >>>>>>> >> >> >>> committed to the journal and a response is sent back to the primary
> >>>>>>> >> >> >>> OSD, which is received very quickly as well. I've adjusted
> >>>>>>> >> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but it
> >>>>>>> >> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
> >>>>>>> >> >> >>> of RAM per node for 10 OSDs.
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>> Is there something that could cause the kernel to get a packet but not
> >>>>>>> >> >> >>> be able to dispatch it to Ceph, which would explain why we
> >>>>>>> >> >> >>> are seeing this blocked I/O for 30+ seconds? Are there some pointers
> >>>>>>> >> >> >>> to tracing Ceph messages from the network buffer through the kernel to
> >>>>>>> >> >> >>> the Ceph process?
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>> We can really use some pointers, no matter how outrageous. We've had
> >>>>>>> >> >> >>> over 6 people looking into this for weeks now and just can't think of
> >>>>>>> >> >> >>> anything else.
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>> Thanks,
> >>>>>>> >> >> >>> ----------------
> >>>>>>> >> >> >>> Robert LeBlanc
> >>>>>>> >> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
> >>>>>>> >> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
> >>>>>>> >> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
> >>>>>>> >> >> >>> > last 12 hours). This makes me believe that there is some issue with
> >>>>>>> >> >> >>> > the number of sockets or some other TCP issue. We have not messed with
> >>>>>>> >> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> >>>>>>> >> >> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> >>>>>>> >> >> >>> > processes and 16K system wide.
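
For reference, a few commands that show the limits being discussed (assuming
a single ceph-osd pid; purely illustrative):

    sysctl fs.file-max                                   # system-wide open file limit
    grep 'open files' /proc/$(pidof -s ceph-osd)/limits  # per-process limit
    ss -s                                                # socket / TIME_WAIT summary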
> >>>>>>> >> >> >>> >
> >>>>>>> >> >> >>> > Does this seem like the right spot to be looking? What are some
> >>>>>>> >> >> >>> > configuration items we should be looking at?
> >>>>>>> >> >> >>> >
> >>>>>>> >> >> >>> > Thanks,
> >>>>>>> >> >> >>> > ----------------
> >>>>>>> >> >> >>> > Robert LeBlanc
> >>>>>>> >> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>> >> >> >>> >
> >>>>>>> >> >> >>> >
> >>>>>>> >> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
> >>>>>>> >> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>> >> >> >>> >> Hash: SHA256
> >>>>>>> >> >> >>> >>
> >>>>>>> >> >> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
> >>>>>>> >> >> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
> >>>>>>> >> >> >>> >> seems that there were some major reworks in the network handling in
> >>>>>>> >> >> >>> >> the kernel to efficiently handle that network rate. If I remember
> >>>>>>> >> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
> >>>>>>> >> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
> >>>>>>> >> >> >>> >> testing, but we could not tell where the dropping was happening. We
> >>>>>>> >> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
> >>>>>>> >> >> >>> >> trying to congest things. We probably already saw this issue, just
> >>>>>>> >> >> >>> >> didn't know it.
> >>>>>>> >> >> >>> >> - ----------------
> >>>>>>> >> >> >>> >> Robert LeBlanc
> >>>>>>> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>> >> >> >>> >>
> >>>>>>> >> >> >>> >>
> >>>>>>> >> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> >>>>>>> >> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
> >>>>>>> >> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> >>>>>>> >> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
> >>>>>>> >> >> >>> >>> drivers might cause problems though.
> >>>>>>> >> >> >>> >>>
> >>>>>>> >> >> >>> >>> Here's ifconfig from one of the nodes:
> >>>>>>> >> >> >>> >>>
> >>>>>>> >> >> >>> >>> ens513f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >>>>>>> >> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
> >>>>>>> >> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20<link>
> >>>>>>> >> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> >>>>>>> >> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
> >>>>>>> >> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
> >>>>>>> >> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
> >>>>>>> >> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >>>>>>> >> >> >>> >>>
> >>>>>>> >> >> >>> >>> Mark
> >>>>>>> >> >> >>> >>>
> >>>>>>> >> >> >>> >>>
> >>>>>>> >> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >>>>>>> >> >> >>> >>>>
> >>>>>>> >> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>> >> >> >>> >>>> Hash: SHA256
> >>>>>>> >> >> >>> >>>>
> >>>>>>> >> >> >>> >>>> OK, here is the update on the saga...
> >>>>>>> >> >> >>> >>>>
> >>>>>>> >> >> >>> >>>> I traced some more of the blocked I/Os and it seems that communication
> >>>>>>> >> >> >>> >>>> between two hosts was worse than between the others. I did a two way ping flood
> >>>>>>> >> >> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
> >>>>>>> >> >> >>> >>>> packets, no lost pings. I then had the ping flood running while I
> >>>>>>> >> >> >>> >>>> put Ceph load on the cluster and the dropped pings started increasing;
> >>>>>>> >> >> >>> >>>> after stopping the Ceph workload the pings stopped dropping.
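
For reference, the kind of two-way flood test described above looks roughly
like this (run as root on each host toward the other; 1472 bytes of ICMP
payload makes a full 1500-byte frame once IP/ICMP headers are added):

    ping -f -s 1472 <peer-host>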
> >>>>>>> >> >> >>> >>>>
> >>>>>>> >> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
> >>>>>>> >> >> >>> >>>> ruled out Ceph to a large degree. I then booted into the
> >>>>>>> >> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far there
> >>>>>>> >> >> >>> >>>> haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
> >>>>>>> >> >> >>> >>>> need the network enhancements in the 4.x series to work well.
> >>>>>>> >> >> >>> >>>>
> >>>>>>> >> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
> >>>>>>> >> >> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
> >>>>>>> >> >> >>> >>>> with this issue are running 4.x; other than that, they have pretty
> >>>>>>> >> >> >>> >>>> different hardware and network configs.
> >>>>>>> >> >> >>> >>>>
> >>>>>>> >> >> >>> >>>> Thanks,
> >>>>>>> >> >> >>> >>>> ----------------
> >>>>>>> >> >> >>> >>>> Robert LeBlanc
> >>>>>>> >> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>> >> >> >>> >>>>
> >>>>>>> >> >> >>> >>>>
> >>>>>>> >> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> >>>>>>> >> >> >>> >>>> wrote:
> >>>>>>> >> >> >>> >>>>>
> >>>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>> >> >> >>> >>>>> Hash: SHA256
> >>>>>>> >> >> >>> >>>>>
> >>>>>>> >> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
> >>>>>>> >> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
> >>>>>>> >> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
> >>>>>>> >> >> >>> >>>>> found that an MTU under 32K worked reliably for ping, but still had the
> >>>>>>> >> >> >>> >>>>> blocked I/O.
> >>>>>>> >> >> >>> >>>>>
> >>>>>>> >> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
> >>>>>>> >> >> >>> >>>>> the blocked I/O.
> >>>>>>> >> >> >>> >>>>> - ----------------
> >>>>>>> >> >> >>> >>>>> Robert LeBlanc
> >>>>>>> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>> >> >> >>> >>>>>
> >>>>>>> >> >> >>> >>>>>
> >>>>>>> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> >>>>>>> >> >> >>> >>>>>>
> >>>>>>> >> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >>>>>>> >> >> >>> >>>>>>>
> >>>>>>> >> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
> >>>>>>> >> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
> >>>>>>> >> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> >>>>>>> >> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
> >>>>>>> >> >> >>> >>>>>>> delayed for many 10s of seconds?
> >>>>>>> >> >> >>> >>>>>>
> >>>>>>> >> >> >>> >>>>>>
> >>>>>>> >> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
> >>>>>>> >> >> >>> >>>>>> has
> >>>>>>> >> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
> >>>>>>> >> >> >>> >>>>>>
> >>>>>>> >> >> >>> >>>>>> sage
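
For reference, a common end-to-end jumbo-frame check is a don't-fragment ping
that only fits if every hop honours a 9000-byte MTU (8972 bytes of payload
plus 28 bytes of IP/ICMP headers):

    ping -M do -s 8972 -c 5 <peer-host>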
> >>>>>>> >> >> >>> >>>>>>
> >>>>>>> >> >> >>> >>>>>>
> >>>>>>> >> >> >>> >>>>>>>
> >>>>>>> >> >> >>> >>>>>>> What kernel are you running?
> >>>>>>> >> >> >>> >>>>>>> -Sam
> >>>>>>> >> >> >>> >>>>>>>
> >>>>>>> >> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>> >> >> >>> >>>>>>>> Hash: SHA256
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
> >>>>>>> >> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
> >>>>>>> >> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
> >>>>>>> >> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
> >>>>>>> >> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> >>>>>>> >> >> >>> >>>>>>>> osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
> >>>>>>> >> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
> >>>>>>> >> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
> >>>>>>> >> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
> >>>>>>> >> >> >>> >>>>>>>> transfer).
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
> >>>>>>> >> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
> >>>>>>> >> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
> >>>>>>> >> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
> >>>>>>> >> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
> >>>>>>> >> >> >>> >>>>>>>> thread.
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
> >>>>>>> >> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
> >>>>>>> >> >> >>> >>>>>>>> some help.
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> Single Test started about
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:52:36
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>> >> >> >>> >>>>>>>> 30.439150 secs
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,16
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> >>>>>>> >> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> >>>>>>> >> >> >>> >>>>>>>> 30.379680 secs
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> >>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> >>>>>>> >> >> >>> >>>>>>>> 12:55:06.406303:
> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> >>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> >>>>>>> >> >> >>> >>>>>>>> 12:55:06.318144:
> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,14
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>> >> >> >>> >>>>>>>> 30.954212 secs
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 16,17
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>> >> >> >>> >>>>>>>> 30.704367 secs
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> Server   IP addr              OSD
> >>>>>>> >> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
> >>>>>>> >> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
> >>>>>>> >> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
> >>>>>>> >> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
> >>>>>>> >> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
> >>>>>>> >> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> fio job:
> >>>>>>> >> >> >>> >>>>>>>> [rbd-test]
> >>>>>>> >> >> >>> >>>>>>>> readwrite=write
> >>>>>>> >> >> >>> >>>>>>>> blocksize=4M
> >>>>>>> >> >> >>> >>>>>>>> #runtime=60
> >>>>>>> >> >> >>> >>>>>>>> name=rbd-test
> >>>>>>> >> >> >>> >>>>>>>> #readwrite=randwrite
> >>>>>>> >> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> >>>>>>> >> >> >>> >>>>>>>> #rwmixread=72
> >>>>>>> >> >> >>> >>>>>>>> #norandommap
> >>>>>>> >> >> >>> >>>>>>>> #size=1T
> >>>>>>> >> >> >>> >>>>>>>> #blocksize=4k
> >>>>>>> >> >> >>> >>>>>>>> ioengine=rbd
> >>>>>>> >> >> >>> >>>>>>>> rbdname=test2
> >>>>>>> >> >> >>> >>>>>>>> pool=rbd
> >>>>>>> >> >> >>> >>>>>>>> clientname=admin
> >>>>>>> >> >> >>> >>>>>>>> iodepth=8
> >>>>>>> >> >> >>> >>>>>>>> #numjobs=4
> >>>>>>> >> >> >>> >>>>>>>> #thread
> >>>>>>> >> >> >>> >>>>>>>> #group_reporting
> >>>>>>> >> >> >>> >>>>>>>> #time_based
> >>>>>>> >> >> >>> >>>>>>>> #direct=1
> >>>>>>> >> >> >>> >>>>>>>> #ramp_time=60
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> Thanks,
> >>>>>>> >> >> >>> >>>>>>>> ----------------
> >>>>>>> >> >> >>> >>>>>>>> Robert LeBlanc
> >>>>>>> >> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
> >>>>>>> >> >> >>> >>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
> >>>>>>> >> >> >>> >>>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>> >> >> >>> >>>>>>>>>> Hash: SHA256
> >>>>>>> >> >> >>> >>>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
> >>>>>>> >> >> >>> >>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
> >>>>>>> >> >> >>> >>>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>>> I'm not
> >>>>>>> >> >> >>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
> >>>>>>> >> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
> >>>>>>> >> >> >>> >>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
> >>>>>>> >> >> >>> >>>>>>>>> this, it was discussed not too long ago.
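
For reference, "bump up the split and merge thresholds" refers to the
FileStore settings below; the values shown are only illustrative, not a
recommendation:

    [osd]
        filestore merge threshold = 40
        filestore split multiple = 8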
> >>>>>>> >> >> >>> >>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
> >>>>>>> >> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
> >>>>>>> >> >> >>> >>>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
> >>>>>>> >> >> >>> >>>>>>>>>> the
> >>>>>>> >> >> >>> >>>>>>>>>> beginning) is still causing the issue. I suspect that it is not
> >>>>>>> >> >> >>> >>>>>>>>>> having to create new files and therefore split collections. This is
> >>>>>>> >> >> >>> >>>>>>>>>> on
> >>>>>>> >> >> >>> >>>>>>>>>> my test cluster with no other load.
> >>>>>>> >> >> >>> >>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
> >>>>>>> >> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
> >>>>>>> >> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
> >>>>>>> >> >> >>> >>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
> >>>>>>> >> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
> >>>>>>> >> >> >>> >>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
> >>>>>>> >> >> >>> >>>>>>>>> 20",
> >>>>>>> >> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
> >>>>>>> >> >> >>> >>>>>>>>> out
> >>>>>>> >> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
> >>>>>>> >> >> >>> >>>>>>>>> -Greg
> >>>>>>> >> >> >>> >>>>>>>>
> >>>>>>> >> >> >>> >>>>>>>
> >>>>>>> >> >> >>> >>>>>>>
> >>>>>>> >> >> >>> >>>>>>>
> >>>>>>> >> >> >>> >>>>>
> >>>>>>> >> >> >>> >>>>
> >>>>>>> >> >> >>> >>>>
> >>>>>>> >> >> >>> >>>
> >>>>>>> >> >> >>> >>
> >>>>>>> >> >> >>>
> >>>>>>> >> >> >>>
> >>>>>>> >> >>
> >>>>>>> >> >>
> >>>>>>> >>
> >>>>>>> >>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                                       ` <alpine.DEB.2.00.1510130956130.6589-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-10-14  6:00                                                                                                         ` Haomai Wang
       [not found]                                                                                                           ` <CACJqLyaeognJ479tjv3S8u1ZpfRr2=qFbgmW1fMu2BcVPt_gNw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Haomai Wang @ 2015-10-14  6:00 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> After a weekend, I'm ready to hit this from a different direction.
>>
>> I replicated the issue with Firefly, so it doesn't seem to be an issue that
>> has been introduced or resolved in any nearby version. I think overall
>> we may be seeing [1] to a great degree. From what I can extract from
>> the logs, it looks like in situations where OSDs are going up and
>> down, I see I/O blocked at the primary OSD waiting for peering and/or
>> the PG to become clean before dispatching the I/O to the replicas.
>>
>> In an effort to understand the flow of the logs, I've attached a small
>> 2 minute segment of a log from which I've extracted what I believe to be
>> the important entries in the life cycle of an I/O, along with my
>> understanding. If someone would be kind enough to help my
>> understanding, I would appreciate it.
>>
>> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
>> >> 192.168.55.12:0/2013622 pipe(0x26c90000 sd=47 :6800 s=2 pgs=2 cs=1
>> l=1 c=0x32c85440).reader got message 19 0x2af81700
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>
>> - ->Messenger has received the message from the client (previous
>> entries in the 7fb9d2c68700 thread are the individual segments that
>> make up this message).
>>
>> 2015-10-12 14:12:36.537963 7fb9d2c68700  1 -- 192.168.55.16:6800/11295
>> <== client.6709 192.168.55.12:0/2013622 19 ====
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>> ==== 235+0+4194304 (2317308138 0 2001296353) 0x2af81700 con 0x32c85440
>>
>> - ->OSD process acknowledges that it has received the write.
>>
>> 2015-10-12 14:12:36.538096 7fb9d2c68700 15 osd.4 44 enqueue_op
>> 0x3052b300 prio 63 cost 4194304 latency 0.012371
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>
>> - ->Not sure exactly what is going on here; the op is being enqueued somewhere...
>>
>> 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op
>> 0x3052b300 prio 63 cost 4194304 latency 30.017094
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v
>> 5 pg pg[0.29( v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c
>> 40/44 32/32/10) [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702
>> active+clean]
>>
>> - ->The op is dequeued from this mystery queue 30 seconds later in a
>> different thread.
>
> ^^ This is the problem.  Everything after this looks reasonable.  Looking
> at the other dequeue_op calls over this period, it looks like we're just
> overwhelmed with higher priority requests.  New clients are 63, while
> osd_repop (replicated write from another primary) are 127 and replies from
> our own replicated ops are 196.  We do process a few other prio 63 items,
> but you'll see that their latency is also climbing up to 30s over this
> period.
>
> The question is why we suddenly get a lot of them.. maybe the peering on
> other OSDs just completed so we get a bunch of these?  It's also not clear
> to me what makes osd.4 or this op special.  We expect a mix of primary and
> replica ops on all the OSDs, so why would we suddenly have more of them
> here....

I guess the bug tracker issue (http://tracker.ceph.com/issues/13482) is
related to this thread.

So does this mean there is a live lock between client ops and repops?
We let clients issue so many client ops that they bottleneck some OSDs,
while other OSDs may still be idle enough to accept even more client
ops. Eventually all OSDs end up stuck behind the bottlenecked OSD. That
seems reasonable, but why would it last so long?
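
To make the starvation effect concrete, here is a minimal sketch of a
strict-priority op queue (an illustration only, not the actual OSD sharded
work queue; the 63/127/196 priorities are the ones quoted above). As long
as one higher-priority repop or repop reply arrives per dequeue, the
prio-63 client op just sits in the queue and its latency climbs past 30s:

import heapq
import itertools

counter = itertools.count()              # FIFO tie-breaker within a priority
queue = []                               # heap ordered by descending priority

def enqueue(prio, name, t):
    heapq.heappush(queue, (-prio, next(counter), name, t))

def dequeue(now):
    prio, _, name, t = heapq.heappop(queue)
    print("dequeue %-12s prio %3d latency %5.1fs" % (name, -prio, now - t))

enqueue(63, "client_op", 0.0)            # e.g. client.6709.0:67
for t in range(1, 31):                   # 30 seconds of replica traffic
    if t % 2:
        enqueue(127, "osd_repop", float(t))
    else:
        enqueue(196, "repop_reply", float(t))
    dequeue(float(t))                    # newly arrived high-prio work wins
dequeue(31.0)                            # client_op finally drains, ~31s old

If the real queue behaves anything like this during a burst of repops (say,
right after peering completes on other OSDs), a 30s dequeue latency on a
prio-63 op does not require the OSD to be busy on CPU or disk at all.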

>
> sage
>
>
>>
>> 2015-10-12 14:13:06.542912 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>> do_op osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>> may_write -> write-ordered flags ack+ondisk+write+known_if_redirected
>>
>> - ->Not sure what this message is. A lookup of the secondary OSDs?
>>
>> 2015-10-12 14:13:06.544999 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>> new_repop rep_tid 17815 on osd_op(client.6709.0:67
>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> ack+ondisk+write+known_if_redirected e44) v5
>>
>> - ->Dispatch write to secondary OSDs?
>>
>> 2015-10-12 14:13:06.545116 7fb9e2d3a700  1 -- 192.168.55.16:6801/11295
>> --> 192.168.55.15:6801/32036 -- osd_repop(client.6709.0:67 0.29
>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>> -- ?+4195078 0x238fd600 con 0x32bcb5a0
>>
>> - ->OSD dispatch write to OSD.0.
>>
>> 2015-10-12 14:13:06.545132 7fb9e2d3a700 20 -- 192.168.55.16:6801/11295
>> submit_message osd_repop(client.6709.0:67 0.29
>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>> remote, 192.168.55.15:6801/32036, have pipe.
>>
>> - ->Message sent to OSD.0.
>>
>> 2015-10-12 14:13:06.545195 7fb9e2d3a700  1 -- 192.168.55.16:6801/11295
>> --> 192.168.55.11:6801/13185 -- osd_repop(client.6709.0:67 0.29
>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>> -- ?+4195078 0x16edd200 con 0x3a37b20
>>
>> - ->OSD dispatch write to OSD.5.
>>
>> 2015-10-12 14:13:06.545210 7fb9e2d3a700 20 -- 192.168.55.16:6801/11295
>> submit_message osd_repop(client.6709.0:67 0.29
>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>> remote, 192.168.55.11:6801/13185, have pipe.
>>
>> - ->Message sent to OSD.5.
>>
>> 2015-10-12 14:13:06.545229 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>> append_log log((0'0,44'703], crt=44'700) [44'704 (44'691) modify
>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
>> client.6709.0:67 2015-10-12 14:12:34.340082]
>> 2015-10-12 14:13:06.545268 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'700 lcod 44'702 mlcod
>> 44'702 active+clean] add_log_entry 44'704 (44'691) modify
>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
>> client.6709.0:67 2015-10-12 14:12:34.340082
>>
>> - ->These record the OP in the journal log?
>>
>> 2015-10-12 14:13:06.563241 7fb9d326e700 20 -- 192.168.55.16:6801/11295
>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
>> cs=3 l=0 c=0x3a37b20).writer encoding 17337 features 37154696925806591
>> 0x16edd200 osd_repop(client.6709.0:67 0.29
>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>>
>> - ->Writing the data to OSD.5?
>>
>> 2015-10-12 14:13:06.573938 7fb9d3874700 10 -- 192.168.55.16:6801/11295
>> >> 192.168.55.15:6801/32036 pipe(0x3f96000 sd=176 :6801 s=2 pgs=8 cs=3
>> l=0 c=0x32bcb5a0).reader got ack seq 1206 >= 1206 on 0x238fd600
>> osd_repop(client.6709.0:67 0.29
>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>>
>> - ->Messenger gets ACK from OSD.0 that it received that last packet?
>>
>> 2015-10-12 14:13:06.613425 7fb9d3874700 10 -- 192.168.55.16:6801/11295
>> >> 192.168.55.15:6801/32036 pipe(0x3f96000 sd=176 :6801 s=2 pgs=8 cs=3
>> l=0 c=0x32bcb5a0).reader got message 1146 0x3ffa480
>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>>
>> - ->Messenger receives ack on disk from OSD.0.
>>
>> 2015-10-12 14:13:06.613447 7fb9d3874700  1 -- 192.168.55.16:6801/11295
>> <== osd.0 192.168.55.15:6801/32036 1146 ====
>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 ====
>> 83+0+0 (2772408781 0 0) 0x3ffa480 con 0x32bcb5a0
>>
>> - ->OSD process gets on disk ACK from OSD.0.
>>
>> 2015-10-12 14:13:06.613478 7fb9d3874700 10 osd.4 44 handle_replica_op
>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 epoch 44
>>
>> - ->Primary OSD records the ACK (duplicate message?). Not sure how to
>> correlate that to the previous message other than by time.
>>
>> 2015-10-12 14:13:06.613504 7fb9d3874700 15 osd.4 44 enqueue_op
>> 0x120f9b00 prio 196 cost 0 latency 0.000250
>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>>
>> - ->The reply is enqueued onto a mystery queue.
>>
>> 2015-10-12 14:13:06.627793 7fb9d6afd700 10 -- 192.168.55.16:6801/11295
>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
>> cs=3 l=0 c=0x3a37b20).reader got ack seq 17337 >= 17337 on 0x16edd200
>> osd_repop(client.6709.0:67 0.29
>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>>
>> - ->Messenger gets ACK from OSD.5 that it received that last packet?
>>
>> 2015-10-12 14:13:06.628364 7fb9d6afd700 10 -- 192.168.55.16:6801/11295
>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
>> cs=3 l=0 c=0x3a37b20).reader got message 16477 0x21cef3c0
>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>>
>> - ->Messenger receives ack on disk from OSD.5.
>>
>> 2015-10-12 14:13:06.628382 7fb9d6afd700  1 -- 192.168.55.16:6801/11295
>> <== osd.5 192.168.55.11:6801/13185 16477 ====
>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 ====
>> 83+0+0 (2104182993 0 0) 0x21cef3c0 con 0x3a37b20
>>
>> - ->OSD process gets on disk ACK from OSD.5.
>>
>> 2015-10-12 14:13:06.628406 7fb9d6afd700 10 osd.4 44 handle_replica_op
>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 epoch 44
>>
>> - ->Primary OSD records the ACK (duplicate message?). Not sure how to
>> correlate that to the previous message other than by time.
>>
>> 2015-10-12 14:13:06.628426 7fb9d6afd700 15 osd.4 44 enqueue_op
>> 0x3e41600 prio 196 cost 0 latency 0.000180
>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>>
>> - ->The reply is enqueued onto a mystery queue.
>>
>> 2015-10-12 14:13:07.124206 7fb9f4e9f700  0 log_channel(cluster) log
>> [WRN] : slow request 30.598371 seconds old, received at 2015-10-12
>> 14:12:36.525724: osd_op(client.6709.0:67
>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> ack+ondisk+write+known_if_redirected e44) currently waiting for subops
>> from 0,5
>>
>> - ->OP has not been dequeued to the client from the mystery queue yet.
>>
>> 2015-10-12 14:13:07.278449 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod 44'702 mlcod
>> 44'702 active+clean] eval_repop repgather(0x37ea3cc0 44'704
>> rep_tid=17815 committed?=0 applied?=0 lock=0
>> op=osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5)
>> wants=ad
>>
>> - ->Not sure what this means. The OP has been completed on all replicas?
>>
>> 2015-10-12 14:13:07.278566 7fb9e0535700 10 osd.4 44 dequeue_op
>> 0x120f9b00 prio 196 cost 0 latency 0.665312
>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 pg
>> pg[0.29( v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44
>> 32/32/10) [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod
>> 44'702 mlcod 44'702 active+clean]
>>
>> - ->One of the replica OPs is dequeued in a different thread
>>
>> 2015-10-12 14:13:07.278809 7fb9e0535700 10 osd.4 44 dequeue_op
>> 0x3e41600 prio 196 cost 0 latency 0.650563
>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 pg
>> pg[0.29( v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44
>> 32/32/10) [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod
>> 44'702 mlcod 44'702 active+clean]
>>
>> - ->The other replica OP is dequeued in the new thread
>>
>> 2015-10-12 14:13:07.967469 7fb9efe95700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
>> active+clean] eval_repop repgather(0x37ea3cc0 44'704 rep_tid=17815
>> committed?=1 applied?=0 lock=0 op=osd_op(client.6709.0:67
>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> ack+ondisk+write+known_if_redirected e44) v5) wants=ad
>>
>> - ->Not sure what this does. A thread that joins the replica OPs with
>> the primary OP?
>>
>> 2015-10-12 14:13:07.967515 7fb9efe95700 15 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
>> active+clean] log_op_stats osd_op(client.6709.0:67
>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> ack+ondisk+write+known_if_redirected e44) v5 inb 4194304 outb 0 rlat
>> 0.000000 lat 31.441789
>>
>> - ->Logs that the write has been committed to all replicas in the
>> primary journal?
>>
>> Not sure what the rest of these do, nor do I understand where the
>> client gets an ACK that the write is committed.
>>
>> 2015-10-12 14:13:07.967583 7fb9efe95700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
>> active+clean]  sending commit on repgather(0x37ea3cc0 44'704
>> rep_tid=17815 committed?=1 applied?=0 lock=0
>> op=osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5)
>> 0x3a2f0840
>>
>> 2015-10-12 14:13:10.351452 7fb9f0696700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'702 active+clean]
>> eval_repop repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>> applied?=1 lock=0 op=osd_op(client.6709.0:67
>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> ack+ondisk+write+known_if_redirected e44) v5) wants=ad
>>
>> 2015-10-12 14:13:10.354089 7fb9f0696700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
>> removing repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>> applied?=1 lock=0 op=osd_op(client.6709.0:67
>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> ack+ondisk+write+known_if_redirected e44) v5)
>>
>> 2015-10-12 14:13:10.354163 7fb9f0696700 20 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
>>  q front is repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>> applied?=1 lock=0 op=osd_op(client.6709.0:67
>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> ack+ondisk+write+known_if_redirected e44) v5)
>>
>> 2015-10-12 14:13:10.354199 7fb9f0696700 20 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
>> remove_repop repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>> applied?=1 lock=0 op=osd_op(client.6709.0:67
>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> ack+ondisk+write+known_if_redirected e44) v5)
>>
>> 2015-10-12 14:13:15.488448 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'707 (0'0,44'707] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 luod=44'705 lua=44'705 crt=44'704 lcod 44'704 mlcod
>> 44'704 active+clean] append_log: trimming to 44'704 entries 44'704
>> (44'691) modify
>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
>> client.6709.0:67 2015-10-12 14:12:34.340082
>>
>> Thanks for hanging in there with me on this...
>>
>> [1] http://www.spinics.net/lists/ceph-devel/msg26633.html
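
For anyone trying to follow the same trail in their own logs, here is a
rough sketch of pulling the queue latency out of the dequeue_op lines (it
only matches the debug osd = 20 wording shown above; the regex and the one
second threshold are illustrative, not part of any Ceph tooling):

import re
import sys

# Matches lines like:
# 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op 0x3052b300 prio 63 cost 4194304 latency 30.017094 ...
DEQUEUE = re.compile(
    r'^(?P<stamp>\S+ \S+) \S+ +\d+ osd\.\d+ \d+ dequeue_op '
    r'(?P<op>0x[0-9a-f]+) prio (?P<prio>\d+) cost \d+ latency (?P<lat>[\d.]+)')

def slow_dequeues(path, threshold=1.0):
    """Yield (timestamp, op pointer, priority, latency) for slow dequeues."""
    with open(path) as log:
        for line in log:
            m = DEQUEUE.match(line)
            if m and float(m.group('lat')) >= threshold:
                yield (m.group('stamp'), m.group('op'),
                       int(m.group('prio')), float(m.group('lat')))

if __name__ == '__main__':
    for stamp, op, prio, lat in slow_dequeues(sys.argv[1]):
        print('%s op %s prio %3d queued for %9.3fs' % (stamp, op, prio, lat))

Run against the osd.4 log, this should flag the 30-second dequeue of
0x3052b300 above, and grouping the output by priority makes the prio 63
versus 127/196 pattern easy to see.
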
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWHCx0CRDmVDuy+mK58QAAXf8P/j6MD52r2DLqOP9hKFAP
>> MJUktg8uqK1i8awtuIQhJHAPDZQF8EACOXg6RBuOz75iryCFKAJXk5exLXrE
>> pIZqY/0/JCsUEPuQGaMY9GVQNrTeB82F5VIu572i2xeFir4fUEcvllXSeR9O
>> CxSgaAncxUYGSXwsiCJ28QhwPCFXtCLACg1eTpghhAcOwY0t+z6ZB3vh+WxB
>> B8kRCdee78TVZOgeTnd66aBJUrr21Ir9aPqSm73uY561dyDmyxc4zPq+FDsJ
>> kuac+Ky9Lc6rqhxwRptbdx5i/EDzxj96EKEz2v4SFBmvzU8jtZlA8THJ6WlF
>> 6lZRpRIMfEqVu4neFcdUIct8+Brf7fuxOI7hbhUL5xq2I6yDSY8E2T8ImRoS
>> w8bSrjFV3wmnXSCHnFJPROqdhtlQlH1PkKPBRJeJrkrB1MloX0ybU4hNIr7Q
>> 4ZyzeLpD9sgL1vEfUVuCksgiVJhzlFOyqeRHcfpPEnLxyGL/+mLUa5lQ5m5l
>> m286ZnsMZGMzAdSA/tsqnTFzL0HbjkiWD/OMU5zThSKW2tZBNWg3xZE5Yia9
>> zAbhxpvxqhKQ7nfmv3xeVJ1GKb9CuzfN9ZIGPltHvpA3rZf3I4+XVlWbbhDZ
>> z8Xp8Pw8f7neh89Tv3AT+krM1jrE1ZxOF5A2K4CxBcS3OEMc5UIZ2fy4dHSo
>> 0iTE
>> =t7nL
>> -----END PGP SIGNATURE-----
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Thu, Oct 8, 2015 at 11:44 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>> > -----BEGIN PGP SIGNED MESSAGE-----
>> > Hash: SHA256
>> >
>> > Sage,
>> >
>> > After trying to bisect this issue (all tests moved the bisect towards
>> > Infernalis) and eventually testing the Infernalis branch again, it
>> > looks like the problem still exists although it is handled a tad
>> > better in Infernalis. I'm going to test against Firefly/Giant next
>> > week and then try to dive into the code to see if I can expose
>> > anything.
>> >
>> > If I can do anything to provide you with information, please let me know.
>> >
>> > Thanks,
>> > -----BEGIN PGP SIGNATURE-----
>> > Version: Mailvelope v1.2.0
>> > Comment: https://www.mailvelope.com
>> >
>> > wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
>> > YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
>> > BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
>> > qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
>> > ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
>> > V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
>> > jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
>> > 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
>> > VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
>> > VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
>> > Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
>> > 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
>> > BCFo
>> > =GJL4
>> > -----END PGP SIGNATURE-----
>> > ----------------
>> > Robert LeBlanc
>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >
>> >
>> > On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >> Hash: SHA256
>> >>
>> >> We forgot to upload the ceph.log yesterday. It is there now.
>> >> - ----------------
>> >> Robert LeBlanc
>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>
>> >>
>> >> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> Hash: SHA256
>> >>>
>> >>> I upped the debug on about everything and ran the test for about 40
>> >>> minutes. I took OSD.19 on ceph1 down and then brought it back in.
>> >>> There was at least one op on osd.19 that was blocked for over 1,000
>> >>> seconds. Hopefully this will have something that will cast a light on
>> >>> what is going on.
>> >>>
>> >>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>> >>> the test to verify the results from the dev cluster. This cluster
>> >>> matches the hardware of our production cluster but is not yet in
>> >>> production so we can safely wipe it to downgrade back to Hammer.
>> >>>
>> >>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>> >>>
>> >>> Let me know what else we can do to help.
>> >>>
>> >>> Thanks,
>> >>> -----BEGIN PGP SIGNATURE-----
>> >>> Version: Mailvelope v1.2.0
>> >>> Comment: https://www.mailvelope.com
>> >>>
>> >>> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
>> >>> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
>> >>> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
>> >>> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
>> >>> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
>> >>> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
>> >>> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
>> >>> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
>> >>> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
>> >>> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
>> >>> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
>> >>> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
>> >>> EDrG
>> >>> =BZVw
>> >>> -----END PGP SIGNATURE-----
>> >>> ----------------
>> >>> Robert LeBlanc
>> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>
>> >>>
>> >>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>> Hash: SHA256
>> >>>>
>> >>>> On my second test (a much longer one), it took nearly an hour, but a
>> >>>> few messages have popped up over a 20 minute window. Still far less than I
>> >>>> have been seeing.
>> >>>> - ----------------
>> >>>> Robert LeBlanc
>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>
>> >>>>
>> >>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>> Hash: SHA256
>> >>>>>
>> >>>>> I'll capture another set of logs. Is there any other debugging you
>> >>>>> want turned up? I've seen the same thing where I see the message
>> >>>>> dispatched to the secondary OSD, but the message just doesn't show up
>> >>>>> for 30+ seconds in the secondary OSD logs.
>> >>>>> - ----------------
>> >>>>> Robert LeBlanc
>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>> >>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> >>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>> Hash: SHA256
>> >>>>>>>
>> >>>>>>> I can't think of anything. In my dev cluster the only thing that has
>> >>>>>>> changed is the Ceph versions (no reboot). What I like is even though
>> >>>>>>> the disks are 100% utilized, it is performing as I expect now. Client
>> >>>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>> >>>>>>> the OSD boots or during the recovery period. This is with
>> >>>>>>> max_backfills set to 20; a max of one backfill in our production cluster is
>> >>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>> >>>>>>> our dev cluster very easily and very quickly with these settings. So
>> >>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>> >>>>>>> marked out. We would love to see that go away too, but this is far
>> >>>>>>                                             (me too!)
>> >>>>>>> better than what we have now. This dev cluster also has
>> >>>>>>> osd_client_message_cap set to default (100).
>> >>>>>>>
>> >>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>> >>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>> >>>>>>> you you prefer a bisect to find the introduction of the problem
>> >>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>> >>>>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>> >>>>>>> commit that prevents a clean build as that is my most limiting factor?
>> >>>>>>
>> >>>>>> Nothing comes to mind.  I think the best way to find this is still to see
>> >>>>>> it happen in the logs with hammer.  The frustrating thing with that log
>> >>>>>> dump you sent is that although I see plenty of slow request warnings in
>> >>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>> >>>>>> turned up for long enough?
>> >>>>>>
>> >>>>>> sage
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> - ----------------
>> >>>>>>> Robert LeBlanc
>> >>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>> >>>>>>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> >>>>>>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>> >> Hash: SHA256
>> >>>>>>> >>
>> >>>>>>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>> >>>>>>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>> >>>>>>> >> messages when the OSD was marked out:
>> >>>>>>> >>
>> >>>>>>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>> >>>>>>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>> >>>>>>> >> 34.476006 secs
>> >>>>>>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>> >>>>>>> >> cluster [WRN] slow request 32.913474 seconds old, received at
>> >>>>>>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>> >>>>>>> >> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >>>>>>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>> >>>>>>> >> cluster [WRN] slow request 32.697545 seconds old, received at
>> >>>>>>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>> >>>>>>> >> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >>>>>>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>> >>>>>>> >> cluster [WRN] slow request 32.668006 seconds old, received at
>> >>>>>>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>> >>>>>>> >> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >>>>>>> >>
>> >>>>>>> >> But I'm not seeing the blocked messages when the OSD came back in. The
>> >>>>>>> >> OSD spindles have been running at 100% during this test. I have seen
>> >>>>>>> >> slowed I/O from the clients as expected from the extra load, but so
>> >>>>>>> >> far no blocked messages. I'm going to run some more tests.
>> >>>>>>> >
>> >>>>>>> > Good to hear.
>> >>>>>>> >
>> >>>>>>> > FWIW I looked through the logs and all of the slow request no flag point
>> >>>>>>> > messages came from osd.163... and the logs don't show when they arrived.
>> >>>>>>> > My guess is this OSD has a slower disk than the others, or something else
>> >>>>>>> > funny is going on?
>> >>>>>>> >
>> >>>>>>> > I spot checked another OSD at random (60) where I saw a slow request.  It
>> >>>>>>> > was stuck peering for 10s of seconds... waiting on a pg log message from
>> >>>>>>> > osd.163.
>> >>>>>>> >
>> >>>>>>> > sage
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> >>
>> >>>>>>> >> -----BEGIN PGP SIGNATURE-----
>> >>>>>>> >> Version: Mailvelope v1.2.0
>> >>>>>>> >> Comment: https://www.mailvelope.com
>> >>>>>>> >>
>> >>>>>>> >> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>> >>>>>>> >> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>> >>>>>>> >> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>> >>>>>>> >> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>> >>>>>>> >> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>> >>>>>>> >> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>> >>>>>>> >> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>> >>>>>>> >> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>> >>>>>>> >> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>> >>>>>>> >> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>> >>>>>>> >> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>> >>>>>>> >> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>> >>>>>>> >> fo5a
>> >>>>>>> >> =ahEi
>> >>>>>>> >> -----END PGP SIGNATURE-----
>> >>>>>>> >> ----------------
>> >>>>>>> >> Robert LeBlanc
>> >>>>>>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>>> >>
>> >>>>>>> >>
>> >>>>>>> >> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>> >>>>>>> >> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>> >>>>>>> >> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>> >> >> Hash: SHA256
>> >>>>>>> >> >>
>> >>>>>>> >> >> With some off-list help, we have adjusted
>> >>>>>>> >> >> osd_client_message_cap=10000. This seems to have helped a bit and we
>> >>>>>>> >> >> have seen some OSDs have a value up to 4,000 for client messages. But
>> >>>>>>> >> >> it does not solve the problem with the blocked I/O.
>> >>>>>>> >> >>
>> >>>>>>> >> >> One thing that I have noticed is that almost exactly 30 seconds elapse
>> >>>>>>> >> >> between when an OSD boots and the first blocked I/O message. I don't know
>> >>>>>>> >> >> if the OSD doesn't have time to get its brain right about a PG before
>> >>>>>>> >> >> it starts servicing it, or what exactly.
>> >>>>>>> >> >
>> >>>>>>> >> > I'm downloading the logs from yesterday now; sorry it's taking so long.
>> >>>>>>> >> >
>> >>>>>>> >> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
>> >>>>>>> >> >> to master and things didn't go so well. The OSDs would not start
>> >>>>>>> >> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
>> >>>>>>> >> >> and all OSDs, and the OSDs then started, but never became active in the
>> >>>>>>> >> >> cluster. It just sat there after reading all the PGs. There were
>> >>>>>>> >> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
>> >>>>>>> >> >> downgrading to the Infernalis branch and still no luck getting the
>> >>>>>>> >> >> OSDs to come up. The OSD processes were idle after the initial boot.
>> >>>>>>> >> >> All packages were installed from gitbuilder.
>> >>>>>>> >> >
>> >>>>>>> >> > Did you chown -R ?
>> >>>>>>> >> >
>> >>>>>>> >> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>> >>>>>>> >> >
>> >>>>>>> >> > My guess is you only chowned the root dir, and the OSD didn't throw
>> >>>>>>> >> > an error when it encountered the other files?  If you can generate a debug
>> >>>>>>> >> > osd = 20 log, that would be helpful.. thanks!
>> >>>>>>> >> >
>> >>>>>>> >> > sage
>> >>>>>>> >> >
>> >>>>>>> >> >
>> >>>>>>> >> >>
>> >>>>>>> >> >> Thanks,
>> >>>>>>> >> >> -----BEGIN PGP SIGNATURE-----
>> >>>>>>> >> >> Version: Mailvelope v1.2.0
>> >>>>>>> >> >> Comment: https://www.mailvelope.com
>> >>>>>>> >> >>
>> >>>>>>> >> >> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>> >>>>>>> >> >> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>> >>>>>>> >> >> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>> >>>>>>> >> >> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>> >>>>>>> >> >> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>> >>>>>>> >> >> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>> >>>>>>> >> >> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>> >>>>>>> >> >> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>> >>>>>>> >> >> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>> >>>>>>> >> >> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>> >>>>>>> >> >> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>> >>>>>>> >> >> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>> >>>>>>> >> >> GdXC
>> >>>>>>> >> >> =Aigq
>> >>>>>>> >> >> -----END PGP SIGNATURE-----
>> >>>>>>> >> >> ----------------
>> >>>>>>> >> >> Robert LeBlanc
>> >>>>>>> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>>> >> >>
>> >>>>>>> >> >>
>> >>>>>>> >> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>> >>>>>>> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>> >> >> > Hash: SHA256
>> >>>>>>> >> >> >
>> >>>>>>> >> >> > I have eight nodes running the fio job rbd_test_real to different RBD
>> >>>>>>> >> >> > volumes. I've included the CRUSH map in the tarball.
>> >>>>>>> >> >> >
>> >>>>>>> >> >> > I stopped one OSD process and marked it out. I let it recover for a
>> >>>>>>> >> >> > few minutes and then I started the process again and marked it in. I
>> >>>>>>> >> >> > started getting blocked I/O messages during the recovery.
>> >>>>>>> >> >> >
>> >>>>>>> >> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>> >>>>>>> >> >> >
>> >>>>>>> >> >> > Thanks,
>> >>>>>>> >> >> > -----BEGIN PGP SIGNATURE-----
>> >>>>>>> >> >> > Version: Mailvelope v1.2.0
>> >>>>>>> >> >> > Comment: https://www.mailvelope.com
>> >>>>>>> >> >> >
>> >>>>>>> >> >> > wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>> >>>>>>> >> >> > 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>> >>>>>>> >> >> > jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>> >>>>>>> >> >> > 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>> >>>>>>> >> >> > OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>> >>>>>>> >> >> > ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>> >>>>>>> >> >> > R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>> >>>>>>> >> >> > boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>> >>>>>>> >> >> > sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>> >>>>>>> >> >> > GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>> >>>>>>> >> >> > SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>> >>>>>>> >> >> > PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>> >>>>>>> >> >> > 3EPx
>> >>>>>>> >> >> > =UDIV
>> >>>>>>> >> >> > -----END PGP SIGNATURE-----
>> >>>>>>> >> >> >
>> >>>>>>> >> >> > ----------------
>> >>>>>>> >> >> > Robert LeBlanc
>> >>>>>>> >> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>>> >> >> >
>> >>>>>>> >> >> >
>> >>>>>>> >> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>> >>>>>>> >> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>> >>>>>>> >> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>> >> >> >>> Hash: SHA256
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> We are still struggling with this and have tried a lot of different
>> >>>>>>> >> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>> >>>>>>> >> >> >>> consulting services for non-Red Hat systems. If there are some
>> >>>>>>> >> >> >>> certified Ceph consultants in the US with whom we can do both remote and
>> >>>>>>> >> >> >>> on-site engagements, please let us know.
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> This certainly seems to be network related, but somewhere in the
>> >>>>>>> >> >> >>> kernel. We have tried increasing the network and TCP buffers, number
>> >>>>>>> >> >> >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>> >>>>>>> >> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
>> >>>>>>> >> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>> >>>>>>> >> >> >>> at a time). There seems to be no reasonable explanation why I/O is
>> >>>>>>> >> >> >>> blocked pretty frequently for longer than 30 seconds. We have verified
>> >>>>>>> >> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>> >>>>>>> >> >> >>> network admins have verified that packets are not being dropped in the
>> >>>>>>> >> >> >>> switches for these nodes. We have tried different kernels including
>> >>>>>>> >> >> >>> the recent Google patch to cubic. This is showing up on three cluster
>> >>>>>>> >> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>> >>>>>>> >> >> >>> (from CentOS 7.1) with similar results.
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> The messages seem slightly different:
>> >>>>>>> >> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>> >>>>>>> >> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>> >>>>>>> >> >> >>> 100.087155 secs
>> >>>>>>> >> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>> >>>>>>> >> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
>> >>>>>>> >> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>> >>>>>>> >> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>> >>>>>>> >> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>> >>>>>>> >> >> >>> points reached
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> I don't know what "no flag points reached" means.
>> >>>>>>> >> >> >>
>> >>>>>>> >> >> >> Just that the op hasn't been marked as reaching any interesting points
>> >>>>>>> >> >> >> (op->mark_*() calls).
>> >>>>>>> >> >> >>
>> >>>>>>> >> >> >> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
>> >>>>>>> >> >> >> It's extremely verbose but it'll let us see where the op is getting
>> >>>>>>> >> >> >> blocked.  If you see the "slow request" message it means the op is
>> >>>>>>> >> >> >> received by ceph (that's when the clock starts), so I suspect it's not
>> >>>>>>> >> >> >> something we can blame on the network stack.
>> >>>>>>> >> >> >>
>> >>>>>>> >> >> >> sage
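
Once that logging is in hand, the first thing worth checking is the
send/receive gap for a single request, along these lines (a sketch only;
it keys on the debug ms = 20 wording seen earlier in this thread,
"submit_message" on the sending OSD and "reader got message" on the
receiving OSD, and the script and file names are just examples):

import sys

def first_seen(path, marker, needle):
    """Return the timestamp of the first log line containing both strings."""
    with open(path) as log:
        for line in log:
            if marker in line and needle in line:
                parts = line.split()
                return parts[0] + ' ' + parts[1]     # date and time fields
    return 'not found'

if __name__ == '__main__':
    # e.g.: python msg_gap.py client.6709.0:67 osd.primary.log osd.replica.log
    needle, sender_log, receiver_log = sys.argv[1], sys.argv[2], sys.argv[3]
    print('sender submit_message :', first_seen(sender_log, 'submit_message', needle))
    print('receiver got message  :', first_seen(receiver_log, 'reader got message', needle))

If the submit on the primary and the "reader got message" on the replica
are tens of seconds apart, the delay is in the messenger/network path; if
they are close together but dequeue_op lags, it is the op queue, as in the
dequeue_op walk-through earlier in this message.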
>> >>>>>>> >> >> >>
>> >>>>>>> >> >> >>
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
>> >>>>>>> >> >> >>> of 13); we will have hundreds of I/Os blocked, sometimes for up to 300
>> >>>>>>> >> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
>> >>>>>>> >> >> >>> production cluster is very busy doing normally 8,000 I/O and peaking
>> >>>>>>> >> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>> >>>>>>> >> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
>> >>>>>>> >> >> >>> the load better across the disks, but we are having to do this 10 PGs
>> >>>>>>> >> >> >>> at a time as we get blocked I/O. We have max_backfills and
>> >>>>>>> >> >> >>> max_recovery set to 1, client op priority is set higher than recovery
>> >>>>>>> >> >> >>> priority. We tried increasing the number of op threads but this didn't
>> >>>>>>> >> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
>> >>>>>>> >> >> >>> become active and could be the cause for slow I/O while the other PGs
>> >>>>>>> >> >> >>> are being checked.
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> What I don't understand is that the messages are delayed. As soon as
>> >>>>>>> >> >> >>> the message is received by the Ceph OSD process, it is very quickly
>> >>>>>>> >> >> >>> committed to the journal and a response is sent back to the primary
>> >>>>>>> >> >> >>> OSD, which is received very quickly as well. I've adjusted
>> >>>>>>> >> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>> >>>>>>> >> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
>> >>>>>>> >> >> >>> of RAM per nodes for 10 OSDs.
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> Is there something that could cause the kernel to get a packet but not
>> >>>>>>> >> >> >>> be able to dispatch it to Ceph, such that it could explain why we
>> >>>>>>> >> >> >>> are seeing this blocked I/O for 30+ seconds? Are there any pointers
>> >>>>>>> >> >> >>> for tracing Ceph messages from the network buffer through the kernel to
>> >>>>>>> >> >> >>> the Ceph process?
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> We could really use some pointers, no matter how outrageous. We've had
>> >>>>>>> >> >> >>> over 6 people looking into this for weeks now and just can't think of
>> >>>>>>> >> >> >>> anything else.
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> Thanks,
>> >>>>>>> >> >> >>> -----BEGIN PGP SIGNATURE-----
>> >>>>>>> >> >> >>> Version: Mailvelope v1.1.0
>> >>>>>>> >> >> >>> Comment: https://www.mailvelope.com
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>> >>>>>>> >> >> >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>> >>>>>>> >> >> >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>> >>>>>>> >> >> >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>> >>>>>>> >> >> >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>> >>>>>>> >> >> >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>> >>>>>>> >> >> >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>> >>>>>>> >> >> >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>> >>>>>>> >> >> >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>> >>>>>>> >> >> >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>> >>>>>>> >> >> >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>> >>>>>>> >> >> >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>> >>>>>>> >> >> >>> l7OF
>> >>>>>>> >> >> >>> =OI++
>> >>>>>>> >> >> >>> -----END PGP SIGNATURE-----
>> >>>>>>> >> >> >>> ----------------
>> >>>>>>> >> >> >>> Robert LeBlanc
>> >>>>>>> >> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>> >>>>>>> >> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
>> >>>>>>> >> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
>> >>>>>>> >> >> >>> > last 12 hours). This makes me believe that there is some issue with
>> >>>>>>> >> >> >>> > the number of sockets or some other TCP issue. We have not messed with
>> >>>>>>> >> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>> >>>>>>> >> >> >>> > hosts hosting about 150 VMs. The open files limit is set at 32K for the OSD
>> >>>>>>> >> >> >>> > processes and 16K system wide.
>> >>>>>>> >> >> >>> >
>> >>>>>>> >> >> >>> > Does this seem like the right spot to be looking? What are some
>> >>>>>>> >> >> >>> > configuration items we should be looking at?
>> >>>>>>> >> >> >>> >
>> >>>>>>> >> >> >>> > Thanks,
>> >>>>>>> >> >> >>> > ----------------
>> >>>>>>> >> >> >>> > Robert LeBlanc
>> >>>>>>> >> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>>> >> >> >>> >
>> >>>>>>> >> >> >>> >
>> >>>>>>> >> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>> >>>>>>> >> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>> >> >> >>> >> Hash: SHA256
>> >>>>>>> >> >> >>> >>
>> >>>>>>> >> >> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>> >>>>>>> >> >> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>> >>>>>>> >> >> >>> >> seems that there were some major reworks in the network handling in
>> >>>>>>> >> >> >>> >> the kernel to efficiently handle that network rate. If I remember
>> >>>>>>> >> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
>> >>>>>>> >> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
>> >>>>>>> >> >> >>> >> testing, but we could not tell where the dropping was happening. We
>> >>>>>>> >> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>> >>>>>>> >> >> >>> >> trying to congest things. We probably already saw this issue, just
>> >>>>>>> >> >> >>> >> didn't know it.
>> >>>>>>> >> >> >>> >> - ----------------
>> >>>>>>> >> >> >>> >> Robert LeBlanc
>> >>>>>>> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>>> >> >> >>> >>
>> >>>>>>> >> >> >>> >>
>> >>>>>>> >> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>> >>>>>>> >> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>> >>>>>>> >> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>> >>>>>>> >> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>> >>>>>>> >> >> >>> >>> drivers might cause problems though.
>> >>>>>>> >> >> >>> >>>
>> >>>>>>> >> >> >>> >>> Here's ifconfig from one of the nodes:
>> >>>>>>> >> >> >>> >>>
>> >>>>>>> >> >> >>> >>> ens513f1: flags=4163  mtu 1500
>> >>>>>>> >> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>> >>>>>>> >> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>> >>>>>>> >> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>> >>>>>>> >> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>> >>>>>>> >> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>> >>>>>>> >> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>> >>>>>>> >> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>> >>>>>>> >> >> >>> >>>
>> >>>>>>> >> >> >>> >>> Mark
>> >>>>>>> >> >> >>> >>>
>> >>>>>>> >> >> >>> >>>
>> >>>>>>> >> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>> >>>>>>> >> >> >>> >>>>
>> >>>>>>> >> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>> >> >> >>> >>>> Hash: SHA256
>> >>>>>>> >> >> >>> >>>>
>> >>>>>>> >> >> >>> >>>> OK, here is the update on the saga...
>> >>>>>>> >> >> >>> >>>>
>> >>>>>>> >> >> >>> >>>> I traced some more of blocked I/Os and it seems that communication
>> >>>>>>> >> >> >>> >>>> between two hosts seemed worse than others. I did a two way ping flood
>> >>>>>>> >> >> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>> >>>>>>> >> >> >>> >>>> packets, no lost pings. I then had the ping flood running while I
>> >>>>>>> >> >> >>> >>>> put Ceph load on the cluster, and the dropped pings started increasing;
>> >>>>>>> >> >> >>> >>>> after stopping the Ceph workload, the pings stopped dropping.
>> >>>>>>> >> >> >>> >>>>
>> >>>>>>> >> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
>> >>>>>>> >> >> >>> >>>> ruled out Ceph to a large degree. I then booted into the
>> >>>>>>> >> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and with an hour test so far there
>> >>>>>>> >> >> >>> >>>> hasn't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>> >>>>>>> >> >> >>> >>>> need the network enhancements in the 4.x series to work well.
>> >>>>>>> >> >> >>> >>>>
>> >>>>>>> >> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>> >>>>>>> >> >> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
>> >>>>>>> >> >> >>> >>>> with this issue are running 4.x; other than that, they have pretty
>> >>>>>>> >> >> >>> >>>> different hardware and network configs.
>> >>>>>>> >> >> >>> >>>>
>> >>>>>>> >> >> >>> >>>> Thanks,
>> >>>>>>> >> >> >>> >>>> -----BEGIN PGP SIGNATURE-----
>> >>>>>>> >> >> >>> >>>> Version: Mailvelope v1.1.0
>> >>>>>>> >> >> >>> >>>> Comment: https://www.mailvelope.com
>> >>>>>>> >> >> >>> >>>>
>> >>>>>>> >> >> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>> >>>>>>> >> >> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>> >>>>>>> >> >> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>> >>>>>>> >> >> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>> >>>>>>> >> >> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>> >>>>>>> >> >> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>> >>>>>>> >> >> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>> >>>>>>> >> >> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>> >>>>>>> >> >> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>> >>>>>>> >> >> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>> >>>>>>> >> >> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>> >>>>>>> >> >> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>> >>>>>>> >> >> >>> >>>> 4OEo
>> >>>>>>> >> >> >>> >>>> =P33I
>> >>>>>>> >> >> >>> >>>> -----END PGP SIGNATURE-----
>> >>>>>>> >> >> >>> >>>> ----------------
>> >>>>>>> >> >> >>> >>>> Robert LeBlanc
>> >>>>>>> >> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>>> >> >> >>> >>>>
>> >>>>>>> >> >> >>> >>>>
>> >>>>>>> >> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>> >>>>>>> >> >> >>> >>>> wrote:
>> >>>>>>> >> >> >>> >>>>>
>> >>>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>> >> >> >>> >>>>> Hash: SHA256
>> >>>>>>> >> >> >>> >>>>>
>> >>>>>>> >> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>> >>>>>>> >> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>> >>>>>>> >> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>> >>>>>>> >> >> >>> >>>>> found that an MTU under 32K worked reliably for ping, but still had the
>> >>>>>>> >> >> >>> >>>>> blocked I/O.
>> >>>>>>> >> >> >>> >>>>>
>> >>>>>>> >> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>> >>>>>>> >> >> >>> >>>>> the blocked I/O.
>> >>>>>>> >> >> >>> >>>>> - ----------------
>> >>>>>>> >> >> >>> >>>>> Robert LeBlanc
>> >>>>>>> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>>> >> >> >>> >>>>>
>> >>>>>>> >> >> >>> >>>>>
>> >>>>>>> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>> >>>>>>> >> >> >>> >>>>>>
>> >>>>>>> >> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>> >>>>>>> >> >> >>> >>>>>>>
>> >>>>>>> >> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>> >>>>>>> >> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>> >>>>>>> >> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>> >>>>>>> >> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>> >>>>>>> >> >> >>> >>>>>>> delayed for many 10s of seconds?
>> >>>>>>> >> >> >>> >>>>>>
>> >>>>>>> >> >> >>> >>>>>>
>> >>>>>>> >> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>> >>>>>>> >> >> >>> >>>>>> has
>> >>>>>>> >> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>> >>>>>>> >> >> >>> >>>>>>
>> >>>>>>> >> >> >>> >>>>>> sage
>> >>>>>>> >> >> >>> >>>>>>
>> >>>>>>> >> >> >>> >>>>>>
>> >>>>>>> >> >> >>> >>>>>>>
>> >>>>>>> >> >> >>> >>>>>>> What kernel are you running?
>> >>>>>>> >> >> >>> >>>>>>> -Sam
>> >>>>>>> >> >> >>> >>>>>>>
>> >>>>>>> >> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>> >> >> >>> >>>>>>>> Hash: SHA256
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>> >>>>>>> >> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
>> >>>>>>> >> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
>> >>>>>>> >> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>> >>>>>>> >> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>> >>>>>>> >> >> >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>> >>>>>>> >> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>> >>>>>>> >> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>> >>>>>>> >> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>> >>>>>>> >> >> >>> >>>>>>>> transfer).
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>> >>>>>>> >> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>> >>>>>>> >> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
>> >>>>>>> >> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
>> >>>>>>> >> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>> >>>>>>> >> >> >>> >>>>>>>> thread.
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>> >>>>>>> >> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>> >>>>>>> >> >> >>> >>>>>>>> some help.
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> Single Test started about
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:52:36
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>>>>>> >> >> >>> >>>>>>>> 30.439150 secs
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,16
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>> >>>>>>> >> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>> >>>>>>> >> >> >>> >>>>>>>> 30.379680 secs
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>> >>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>> >>>>>>> >> >> >>> >>>>>>>> 12:55:06.406303:
>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>> >>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>> >>>>>>> >> >> >>> >>>>>>>> 12:55:06.318144:
>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,14
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>>>>>> >> >> >>> >>>>>>>> 30.954212 secs
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 16,17
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>>>>>> >> >> >>> >>>>>>>> 30.704367 secs
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> Server   IP addr              OSD
>> >>>>>>> >> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>> >>>>>>> >> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>> >>>>>>> >> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>> >>>>>>> >> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>> >>>>>>> >> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>> >>>>>>> >> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> fio job:
>> >>>>>>> >> >> >>> >>>>>>>> [rbd-test]
>> >>>>>>> >> >> >>> >>>>>>>> readwrite=write
>> >>>>>>> >> >> >>> >>>>>>>> blocksize=4M
>> >>>>>>> >> >> >>> >>>>>>>> #runtime=60
>> >>>>>>> >> >> >>> >>>>>>>> name=rbd-test
>> >>>>>>> >> >> >>> >>>>>>>> #readwrite=randwrite
>> >>>>>>> >> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>> >>>>>>> >> >> >>> >>>>>>>> #rwmixread=72
>> >>>>>>> >> >> >>> >>>>>>>> #norandommap
>> >>>>>>> >> >> >>> >>>>>>>> #size=1T
>> >>>>>>> >> >> >>> >>>>>>>> #blocksize=4k
>> >>>>>>> >> >> >>> >>>>>>>> ioengine=rbd
>> >>>>>>> >> >> >>> >>>>>>>> rbdname=test2
>> >>>>>>> >> >> >>> >>>>>>>> pool=rbd
>> >>>>>>> >> >> >>> >>>>>>>> clientname=admin
>> >>>>>>> >> >> >>> >>>>>>>> iodepth=8
>> >>>>>>> >> >> >>> >>>>>>>> #numjobs=4
>> >>>>>>> >> >> >>> >>>>>>>> #thread
>> >>>>>>> >> >> >>> >>>>>>>> #group_reporting
>> >>>>>>> >> >> >>> >>>>>>>> #time_based
>> >>>>>>> >> >> >>> >>>>>>>> #direct=1
>> >>>>>>> >> >> >>> >>>>>>>> #ramp_time=60
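For anyone trying to reproduce this: assuming the job above is saved as rbd-test.fio, it is run directly against the pool with fio's rbd engine, e.g.

    fio rbd-test.fio

with ceph.conf and the admin keyring in their default locations.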
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> Thanks,
>> >>>>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>> >>>>>>> >> >> >>> >>>>>>>> Version: Mailvelope v1.1.0
>> >>>>>>> >> >> >>> >>>>>>>> Comment: https://www.mailvelope.com
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>> >>>>>>> >> >> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>> >>>>>>> >> >> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>> >>>>>>> >> >> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>> >>>>>>> >> >> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>> >>>>>>> >> >> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>> >>>>>>> >> >> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>> >>>>>>> >> >> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>> >>>>>>> >> >> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>> >>>>>>> >> >> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>> >>>>>>> >> >> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>> >>>>>>> >> >> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>> >>>>>>> >> >> >>> >>>>>>>> J3hS
>> >>>>>>> >> >> >>> >>>>>>>> =0J7F
>> >>>>>>> >> >> >>> >>>>>>>> -----END PGP SIGNATURE-----
>> >>>>>>> >> >> >>> >>>>>>>> ----------------
>> >>>>>>> >> >> >>> >>>>>>>> Robert LeBlanc
>> >>>>>>> >> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>> >>>>>>> >> >> >>> >>>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>>>>>> >> >> >>> >>>>>>>>>> Hash: SHA256
>> >>>>>>> >> >> >>> >>>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
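On a running OSD that search is just a grep over the logs, e.g. (assuming the default log location and filestore debugging turned up enough to record the splits):

    grep -n '_split_collection' /var/log/ceph/ceph-osd.*.log

Split entries that line up with the blocked-request timestamps would point at collection splitting as the culprit.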
>> >>>>>>> >> >> >>> >>>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>>> I'm not
>> >>>>>>> >> >> >>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>> >>>>>>> >> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>> >>>>>>> >> >> >>> >>>>>>>>> this, it was discussed not too long ago.
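For reference, those thresholds are the filestore settings below; the values are only an example of "bumped up" numbers, not a recommendation, and as far as I understand they only change when future splits and merges happen (existing directory trees are not reorganized):

    [osd]
        filestore merge threshold = 40
        filestore split multiple = 8

Larger values let each PG directory hold more objects before it is split into subdirectories.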
>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>> >>>>>>> >> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>> >>>>>>> >> >> >>> >>>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>> >>>>>>> >> >> >>> >>>>>>>>>> the
>> >>>>>>> >> >> >>> >>>>>>>>>> beginning) is still causing the issue. My suspicion was that it is not
>> >>>>>>> >> >> >>> >>>>>>>>>> having to create new files and therefore not splitting collections. This is
>> >>>>>>> >> >> >>> >>>>>>>>>> on
>> >>>>>>> >> >> >>> >>>>>>>>>> my test cluster with no other load.
>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>> >>>>>>> >> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>> >>>>>>> >> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>> >>>>>>> >> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>> >>>>>>> >> >> >>> >>>>>>>>> 20",
>> >>>>>>> >> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>> >>>>>>> >> >> >>> >>>>>>>>> out
>> >>>>>>> >> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>> >>>>>>> >> >> >>> >>>>>>>>> -Greg
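Those can be set on the running daemons without a restart, e.g.:

    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'

and reverted the same way afterwards; the logs grow very quickly at these levels, so it is worth limiting how long they stay turned up.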
>> >>>>>>> >> >> >>> >>>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>
>> >>>>>>> >> >> >>> >>>>>>>
>> >>>>>>> >> >> >>> >>>>>
>> >>>>>>> >> >> >>> >>>>
>> >>>>>>> >> >> >>> >>>>
>> >>>>>>> >> >> >>> >>>
>> >>>>>>> >> >> >>> >>
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >> >>>
>> >>>>>>> >> >>
>> >>>>>>> >> >>
>> >>>>>>> >>
>> >>>>>>> >>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>
>> >>
>>



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                                           ` <CACJqLyaeognJ479tjv3S8u1ZpfRr2=qFbgmW1fMu2BcVPt_gNw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-14 15:41                                                                                                             ` Robert LeBlanc
       [not found]                                                                                                               ` <CAANLjFrBNMeGkawcBUYqjWSjoWyQHCxjpEM291TmOp40HhCoSA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-14 15:41 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

It seems that in our situation the cluster is simply busy, usually with
really small RBD I/O. We have gotten things to where it doesn't happen
as much in a steady state, but when an OSD fails (mostly from an XFS
log bug we hit at least once a week), it is very painful as the OSD
exits and re-enters the cluster. We are working to split the PGs a
couple-fold, but this is a painful process for the reasons mentioned
in the tracker. Matt Benjamin and Sam Just had a discussion on IRC
about getting the other primaries to throttle back when such a
situation occurs, so that each primary OSD has some time to service
client I/O and to push back on the clients to slow down in these
situations.
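
The splitting itself is nothing exotic; it is just raising pg_num and
pgp_num on the pool in small steps (pool name and numbers are
placeholders only):

    ceph osd pool set <pool> pg_num 4096     # raise in small increments
    ceph osd pool set <pool> pgp_num 4096    # then let the data actually move

Every increment kicks off PG creation and backfill, though, which is
exactly when the blocked requests show up.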

In our case a single OSD can lock up a VM for a very long time while
the others happily go about their business. Instead of looking like
the cluster is out of I/O capacity, it looks like something is broken.
If the pressure were pushed back to the clients, it would show up as
all of the clients slowing down a little instead of one or two hanging,
sometimes for over 1,000 seconds.

My thought is that each OSD should have some percentage of its time
dedicated to servicing client I/O, whereas now it seems that replica
I/O can completely starve client I/O. I understand why replica traffic
needs a higher priority, but I think some balance needs to be struck.
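
The closest knobs I know of today only weight client work against
recovery work, something like this in ceph.conf (the values shown are,
I believe, the defaults):

    [osd]
        osd client op priority = 63
        osd recovery op priority = 10

but as far as I can tell there is nothing equivalent for replica writes
(osd_repop), which are queued at a strictly higher priority than client
ops, so no amount of tuning gives clients a guaranteed share.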

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Oct 14, 2015 at 12:00 AM, Haomai Wang <haomaiwang-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> After a weekend, I'm ready to hit this from a different direction.
>>>
>>> I replicated the issue with Firefly, so it doesn't seem to be an issue
>>> that was introduced or resolved in any nearby version. I think overall
>>> we may be seeing [1] to a great degree. From what I can extract from
>>> the logs, it looks like in situations where OSDs are going up and
>>> down, I see I/O blocked at the primary OSD waiting for peering and/or
>>> the PG to become clean before dispatching the I/O to the replicas.
>>>
>>> In an effort to understand the flow of the logs, I've attached a small
>>> two-minute segment of a log from which I've extracted what I believe to
>>> be the important entries in the life cycle of an I/O, along with my
>>> understanding. If someone would be kind enough to help my
>>> understanding, I would appreciate it.
>>>
>>> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
>>> >> 192.168.55.12:0/2013622 pipe(0x26c90000 sd=47 :6800 s=2 pgs=2 cs=1
>>> l=1 c=0x32c85440).reader got message 19 0x2af81700
>>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>>
>>> - ->Messenger has received the message from the client (previous
>>> entries in the 7fb9d2c68700 thread are the individual segments that
>>> make up this message).
>>>
>>> 2015-10-12 14:12:36.537963 7fb9d2c68700  1 -- 192.168.55.16:6800/11295
>>> <== client.6709 192.168.55.12:0/2013622 19 ====
>>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>> ==== 235+0+4194304 (2317308138 0 2001296353) 0x2af81700 con 0x32c85440
>>>
>>> - ->OSD process acknowledges that it has received the write.
>>>
>>> 2015-10-12 14:12:36.538096 7fb9d2c68700 15 osd.4 44 enqueue_op
>>> 0x3052b300 prio 63 cost 4194304 latency 0.012371
>>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>>
>>> - ->Not sure exactly what is going on here; the op is being enqueued somewhere.
>>>
>>> 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op
>>> 0x3052b300 prio 63 cost 4194304 latency 30.017094
>>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v
>>> 5 pg pg[0.29( v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c
>>> 40/44 32/32/10) [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702
>>> active+clean]
>>>
>>> - ->The op is dequeued from this mystery queue 30 seconds later in a
>>> different thread.
>>
>> ^^ This is the problem.  Everything after this looks reasonable.  Looking
>> at the other dequeue_op calls over this period, it looks like we're just
>> overwhelmed with higher priority requests.  New clients are 63, while
>> osd_repop (replicated write from another primary) are 127 and replies from
>> our own replicated ops are 196.  We do process a few other prio 63 items,
>> but you'll see that their latency is also climbing up to 30s over this
>> period.
>>
>> The question is why we suddenly get a lot of them.. maybe the peering on
>> other OSDs just completed so we get a bunch of these?  It's also not clear
>> to me what makes osd.4 or this op special.  We expect a mix of primary and
>> replica ops on all the OSDs, so why would we suddenly have more of them
>> here....
>
> I guess the bug tracker (http://tracker.ceph.com/issues/13482) is
> related to this thread.
>
> So does that mean there is a livelock between client ops and repops?
> We let all clients issue too many client ops, which bottlenecks some
> OSDs, while other OSDs may still be idle enough to accept more client
> ops. Eventually all OSDs get stuck behind the bottlenecked OSD. That
> seems reasonable, but why would it last so long?
>
>>
>> sage
>>
>>
>>>
>>> 2015-10-12 14:13:06.542912 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>>> do_op osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>> may_write -> write-ordered flags ack+ondisk+write+known_if_redirected
>>>
>>> - ->Not sure what this message is. Look up of secondary OSDs?
>>>
>>> 2015-10-12 14:13:06.544999 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>>> new_repop rep_tid 17815 on osd_op(client.6709.0:67
>>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>>> ack+ondisk+write+known_if_redirected e44) v5
>>>
>>> - ->Dispatch write to secondary OSDs?
>>>
>>> 2015-10-12 14:13:06.545116 7fb9e2d3a700  1 -- 192.168.55.16:6801/11295
>>> --> 192.168.55.15:6801/32036 -- osd_repop(client.6709.0:67 0.29
>>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>>> -- ?+4195078 0x238fd600 con 0x32bcb5a0
>>>
>>> - ->OSD dispatch write to OSD.0.
>>>
>>> 2015-10-12 14:13:06.545132 7fb9e2d3a700 20 -- 192.168.55.16:6801/11295
>>> submit_message osd_repop(client.6709.0:67 0.29
>>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>>> remote, 192.168.55.15:6801/32036, have pipe.
>>>
>>> - ->Message sent to OSD.0.
>>>
>>> 2015-10-12 14:13:06.545195 7fb9e2d3a700  1 -- 192.168.55.16:6801/11295
>>> --> 192.168.55.11:6801/13185 -- osd_repop(client.6709.0:67 0.29
>>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>>> -- ?+4195078 0x16edd200 con 0x3a37b20
>>>
>>> - ->OSD dispatch write to OSD.5.
>>>
>>> 2015-10-12 14:13:06.545210 7fb9e2d3a700 20 -- 192.168.55.16:6801/11295
>>> submit_message osd_repop(client.6709.0:67 0.29
>>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>>> remote, 192.168.55.11:6801/13185, have pipe.
>>>
>>> - ->Message sent to OSD.5.
>>>
>>> 2015-10-12 14:13:06.545229 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>>> append_log log((0'0,44'703], crt=44'700) [44'704 (44'691) modify
>>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
>>> client.6709.0:67 2015-10-12 14:12:34.340082]
>>> 2015-10-12 14:13:06.545268 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'700 lcod 44'702 mlcod
>>> 44'702 active+clean] add_log_entry 44'704 (44'691) modify
>>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
>>> client.6709.0:67 2015-10-12 14:12:34.340082
>>>
>>> - ->These record the OP in the journal log?
>>>
>>> 2015-10-12 14:13:06.563241 7fb9d326e700 20 -- 192.168.55.16:6801/11295
>>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
>>> cs=3 l=0 c=0x3a37b20).writer encoding 17337 features 37154696925806591
>>> 0x16edd200 osd_repop(client.6709.0:67 0.29
>>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>>>
>>> - ->Writing the data to OSD.5?
>>>
>>> 2015-10-12 14:13:06.573938 7fb9d3874700 10 -- 192.168.55.16:6801/11295
>>> >> 192.168.55.15:6801/32036 pipe(0x3f96000 sd=176 :6801 s=2 pgs=8 cs=3
>>> l=0 c=0x32bcb5a0).reader got ack seq 1206 >= 1206 on 0x238fd600
>>> osd_repop(client.6709.0:67 0.29
>>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>>>
>>> - ->Messenger gets ACK from OSD.0 that it received the last packet?
>>>
>>> 2015-10-12 14:13:06.613425 7fb9d3874700 10 -- 192.168.55.16:6801/11295
>>> >> 192.168.55.15:6801/32036 pipe(0x3f96000 sd=176 :6801 s=2 pgs=8 cs=3
>>> l=0 c=0x32bcb5a0).reader got message 1146 0x3ffa480
>>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>>>
>>> - ->Messenger receives ack on disk from OSD.0.
>>>
>>> 2015-10-12 14:13:06.613447 7fb9d3874700  1 -- 192.168.55.16:6801/11295
>>> <== osd.0 192.168.55.15:6801/32036 1146 ====
>>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 ====
>>> 83+0+0 (2772408781 0 0) 0x3ffa480 con 0x32bcb5a0
>>>
>>> - ->OSD process gets on disk ACK from OSD.0.
>>>
>>> 2015-10-12 14:13:06.613478 7fb9d3874700 10 osd.4 44 handle_replica_op
>>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 epoch 44
>>>
>>> - ->Primary OSD records the ACK (duplicate message?). Not sure how to
>>> correlate that to the previous message other than by time.
>>>
>>> 2015-10-12 14:13:06.613504 7fb9d3874700 15 osd.4 44 enqueue_op
>>> 0x120f9b00 prio 196 cost 0 latency 0.000250
>>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>>>
>>> - ->The reply is enqueued onto a mystery queue.
>>>
>>> 2015-10-12 14:13:06.627793 7fb9d6afd700 10 -- 192.168.55.16:6801/11295
>>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
>>> cs=3 l=0 c=0x3a37b20).reader got ack seq 17337 >= 17337 on 0x16edd200
>>> osd_repop(client.6709.0:67 0.29
>>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>>>
>>> - ->Messenger gets ACK from OSD.5 that it received the last packet?
>>>
>>> 2015-10-12 14:13:06.628364 7fb9d6afd700 10 -- 192.168.55.16:6801/11295
>>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
>>> cs=3 l=0 c=0x3a37b20).reader got message 16477 0x21cef3c0
>>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>>>
>>> - ->Messenger receives ack on disk from OSD.5.
>>>
>>> 2015-10-12 14:13:06.628382 7fb9d6afd700  1 -- 192.168.55.16:6801/11295
>>> <== osd.5 192.168.55.11:6801/13185 16477 ====
>>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 ====
>>> 83+0+0 (2104182993 0 0) 0x21cef3c0 con 0x3a37b20
>>>
>>> - ->OSD process gets on disk ACK from OSD.5.
>>>
>>> 2015-10-12 14:13:06.628406 7fb9d6afd700 10 osd.4 44 handle_replica_op
>>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 epoch 44
>>>
>>> - ->Primary OSD records the ACK (duplicate message?). Not sure how to
>>> correlate that to the previous message other than by time.
>>>
>>> 2015-10-12 14:13:06.628426 7fb9d6afd700 15 osd.4 44 enqueue_op
>>> 0x3e41600 prio 196 cost 0 latency 0.000180
>>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>>>
>>> - ->The reply is enqueued onto a mystery queue.
>>>
>>> 2015-10-12 14:13:07.124206 7fb9f4e9f700  0 log_channel(cluster) log
>>> [WRN] : slow request 30.598371 seconds old, received at 2015-10-12
>>> 14:12:36.525724: osd_op(client.6709.0:67
>>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>>> ack+ondisk+write+known_if_redirected e44) currently waiting for subops
>>> from 0,5
>>>
>>> - ->OP has not been dequeued to the client from the mystery queue yet.
>>>
>>> 2015-10-12 14:13:07.278449 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod 44'702 mlcod
>>> 44'702 active+clean] eval_repop repgather(0x37ea3cc0 44'704
>>> rep_tid=17815 committed?=0 applied?=0 lock=0
>>> op=osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5)
>>> wants=ad
>>>
>>> - ->Not sure what this means. The OP has been completed on all replicas?
>>>
>>> 2015-10-12 14:13:07.278566 7fb9e0535700 10 osd.4 44 dequeue_op
>>> 0x120f9b00 prio 196 cost 0 latency 0.665312
>>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 pg
>>> pg[0.29( v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44
>>> 32/32/10) [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod
>>> 44'702 mlcod 44'702 active+clean]
>>>
>>> - ->One of the replica OPs is dequeued in a different thread
>>>
>>> 2015-10-12 14:13:07.278809 7fb9e0535700 10 osd.4 44 dequeue_op
>>> 0x3e41600 prio 196 cost 0 latency 0.650563
>>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 pg
>>> pg[0.29( v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44
>>> 32/32/10) [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod
>>> 44'702 mlcod 44'702 active+clean]
>>>
>>> - ->The other replica OP is dequeued in the new thread
>>>
>>> 2015-10-12 14:13:07.967469 7fb9efe95700 10 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
>>> active+clean] eval_repop repgather(0x37ea3cc0 44'704 rep_tid=17815
>>> committed?=1 applied?=0 lock=0 op=osd_op(client.6709.0:67
>>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>>> ack+ondisk+write+known_if_redirected e44) v5) wants=ad
>>>
>>> - ->Not sure what this does. A thread that joins the replica OPs with
>>> the primary OP?
>>>
>>> 2015-10-12 14:13:07.967515 7fb9efe95700 15 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
>>> active+clean] log_op_stats osd_op(client.6709.0:67
>>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>>> ack+ondisk+write+known_if_redirected e44) v5 inb 4194304 outb 0 rlat
>>> 0.000000 lat 31.441789
>>>
>>> - ->Logs that the write has been committed to all replicas in the
>>> primary journal?
>>>
>>> Not sure what the rest of these do, nor do I understand where the
>>> client gets an ACK that the write is committed.
>>>
>>> 2015-10-12 14:13:07.967583 7fb9efe95700 10 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
>>> active+clean]  sending commit on repgather(0x37ea3cc0 44'704
>>> rep_tid=17815 committed?=1 applied?=0 lock=0
>>> op=osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5)
>>> 0x3a2f0840
>>>
>>> 2015-10-12 14:13:10.351452 7fb9f0696700 10 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'702 active+clean]
>>> eval_repop repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>>> applied?=1 lock=0 op[0/1943]client.6709.0:67
>>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>>> ack+ondisk+write+known_if_redirected e44) v5) wants=ad
>>>
>>> 2015-10-12 14:13:10.354089 7fb9f0696700 10 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
>>> removing repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>>> applied?=1 lock=0 op=osd_op(client.6709.0:67
>>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>>> ack+ondisk+write+known_if_redirected e44) v5)
>>>
>>> 2015-10-12 14:13:10.354163 7fb9f0696700 20 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
>>>  q front is repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>>> applied?=1 lock=0 op=osd_op(client.6709.0:67
>>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>>> ack+ondisk+write+known_if_redirected e44) v5)
>>>
>>> 2015-10-12 14:13:10.354199 7fb9f0696700 20 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
>>> remove_repop repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>>> applied?=1 lock=0 op=osd_op(client.6709.0:67
>>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>>> ack+ondisk+write+known_if_redirected e44) v5)
>>>
>>> 2015-10-12 14:13:15.488448 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>>> v 44'707 (0'0,44'707] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>>> [4,5,0] r=0 lpr=32 luod=44'705 lua=44'705 crt=44'704 lcod 44'704 mlcod
>>> 44'704 active+clean] append_log: trimming to 44'704 entries 44'704
>>> (44'691) modify
>>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
>>> client.6709.0:67 2015-10-12 14:12:34.340082
>>>
>>> Thanks for hanging in there with me on this...
>>>
>>> [1] http://www.spinics.net/lists/ceph-devel/msg26633.html
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Thu, Oct 8, 2015 at 11:44 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>>> > -----BEGIN PGP SIGNED MESSAGE-----
>>> > Hash: SHA256
>>> >
>>> > Sage,
>>> >
>>> > After trying to bisect this issue (all tests moved the bisect towards
>>> > Infernalis) and eventually testing the Infernalis branch again, it
>>> > looks like the problem still exists although it is handled a tad
>>> > better in Infernalis. I'm going to test against Firefly/Giant next
>>> > week and then try and dive into the code to see if I can expose any
>>> > thing.
>>> >
>>> > If I can do anything to provide you with information, please let me know.
>>> >
>>> > Thanks,
>>> > ----------------
>>> > Robert LeBlanc
>>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >
>>> >
>>> > On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>> >> Hash: SHA256
>>> >>
>>> >> We forgot to upload the ceph.log yesterday. It is there now.
>>> >> - ----------------
>>> >> Robert LeBlanc
>>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>
>>> >>
>>> >> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>> Hash: SHA256
>>> >>>
>>> >>> I upped the debug on about everything and ran the test for about 40
>>> >>> minutes. I took OSD.19 on ceph1 down and then brought it back in.
>>> >>> There was at least one op on osd.19 that was blocked for over 1,000
>>> >>> seconds. Hopefully this will have something that will cast a light on
>>> >>> what is going on.
>>> >>>
>>> >>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>>> >>> the test to verify the results from the dev cluster. This cluster
>>> >>> matches the hardware of our production cluster but is not yet in
>>> >>> production so we can safely wipe it to downgrade back to Hammer.
>>> >>>
>>> >>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>> >>>
>>> >>> Let me know what else we can do to help.
>>> >>>
>>> >>> Thanks,
>>> >>> ----------------
>>> >>> Robert LeBlanc
>>> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>
>>> >>>
>>> >>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>> Hash: SHA256
>>> >>>>
>>> >>>> On my second test (a much longer one), it took nearly an hour, but a
>>> >>>> few messages have popped up over a 20 window. Still far less than I
>>> >>>> have been seeing.
>>> >>>> - ----------------
>>> >>>> Robert LeBlanc
>>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>> Hash: SHA256
>>> >>>>>
>>> >>>>> I'll capture another set of logs. Is there any other debugging you
>>> >>>>> want turned up? I've seen the same thing where I see the message
>>> >>>>> dispatched to the secondary OSD, but the message just doesn't show up
>>> >>>>> for 30+ seconds in the secondary OSD logs.
>>> >>>>> - ----------------
>>> >>>>> Robert LeBlanc
>>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>
>>> >>>>>
>>> >>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>>> >>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>> >>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>> Hash: SHA256
>>> >>>>>>>
>>> >>>>>>> I can't think of anything. In my dev cluster the only thing that has
>>> >>>>>>> changed is the Ceph versions (no reboot). What I like is even though
>>> >>>>>>> the disks are 100% utilized, it is performing as I expect now. Client
>>> >>>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>> >>>>>>> the OSD boots or during the recovery period. This is with
>>> >>>>>>> max_backfills set to 20; one backfill max in our production cluster is
>>> >>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>> >>>>>>> our dev cluster very easily and very quickly with these settings. So
>>> >>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>> >>>>>>> marked out. We would love to see that go away too, but this is far
>>> >>>>>>                                             (me too!)
>>> >>>>>>> better than what we have now. This dev cluster also has
>>> >>>>>>> osd_client_message_cap set to default (100).
>>> >>>>>>>
>>> >>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>> >>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>> >>>>>>> would you prefer a bisect to find the introduction of the problem
>>> >>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>> >>>>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>>> >>>>>>> commit that prevents a clean build as that is my most limiting factor?
>>> >>>>>>
>>> >>>>>> Nothing comes to mind.  I think the best way to find this is still to see
>>> >>>>>> it happen in the logs with hammer.  The frustrating thing with that log
>>> >>>>>> dump you sent is that although I see plenty of slow request warnings in
>>> >>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>>> >>>>>> turned up for long enough?
>>> >>>>>>
>>> >>>>>> sage
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>> Thanks,
>>> >>>>>>> - ----------------
>>> >>>>>>> Robert LeBlanc
>>> >>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>> >>>>>>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>> >>>>>>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>> >> Hash: SHA256
>>> >>>>>>> >>
>>> >>>>>>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>> >>>>>>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>> >>>>>>> >> messages when the OSD was marked out:
>>> >>>>>>> >>
>>> >>>>>>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>> >>>>>>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>> >>>>>>> >> 34.476006 secs
>>> >>>>>>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>> >>>>>>> >> cluster [WRN] slow request 32.913474 seconds old, received at
>>> >>>>>>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>> >>>>>>> >> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>> >>>>>>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>> >>>>>>> >> cluster [WRN] slow request 32.697545 seconds old, received at
>>> >>>>>>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>> >>>>>>> >> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>> >>>>>>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>> >>>>>>> >> cluster [WRN] slow request 32.668006 seconds old, received at
>>> >>>>>>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>> >>>>>>> >> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>>> >>>>>>> >>
>>> >>>>>>> >> But I'm not seeing the blocked messages when the OSD came back in. The
>>> >>>>>>> >> OSD spindles have been running at 100% during this test. I have seen
>>> >>>>>>> >> slowed I/O from the clients as expected from the extra load, but so
>>> >>>>>>> >> far no blocked messages. I'm going to run some more tests.
>>> >>>>>>> >
>>> >>>>>>> > Good to hear.
>>> >>>>>>> >
>>> >>>>>>> > FWIW I looked through the logs and all of the slow request no flag point
>>> >>>>>>> > messages came from osd.163... and the logs don't show when they arrived.
>>> >>>>>>> > My guess is this OSD has a slower disk than the others, or something else
>>> >>>>>>> > funny is going on?
>>> >>>>>>> >
>>> >>>>>>> > I spot checked another OSD at random (60) where I saw a slow request.  It
>>> >>>>>>> > was stuck peering for 10s of seconds... waiting on a pg log message from
>>> >>>>>>> > osd.163.
>>> >>>>>>> >
>>> >>>>>>> > sage
>>> >>>>>>> >
>>> >>>>>>> >
>>> >>>>>>> >>
>>> >>>>>>> >> ----------------
>>> >>>>>>> >> Robert LeBlanc
>>> >>>>>>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>>> >>
>>> >>>>>>> >>
>>> >>>>>>> >> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>>> >>>>>>> >> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>> >>>>>>> >> >> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>> >> >> Hash: SHA256
>>> >>>>>>> >> >>
>>> >>>>>>> >> >> With some off-list help, we have adjusted
>>> >>>>>>> >> >> osd_client_message_cap=10000. This seems to have helped a bit and we
>>> >>>>>>> >> >> have seen some OSDs have a value up to 4,000 for client messages. But
>>> >>>>>>> >> >> it does not solve the problem with the blocked I/O.
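For reference, that cap is just this in ceph.conf on the OSD hosts (10000 being the value mentioned above):

    [osd]
        osd client message cap = 10000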
>>> >>>>>>> >> >>
>>> >>>>>>> >> >> One thing that I have noticed is that almost exactly 30 seconds elapse
>>> >>>>>>> >> >> between when an OSD boots and the first blocked I/O message. I don't know
>>> >>>>>>> >> >> if the OSD doesn't have time to get its brain right about a PG before
>>> >>>>>>> >> >> it starts servicing it or what exactly.
>>> >>>>>>> >> >
>>> >>>>>>> >> > I'm downloading the logs from yesterday now; sorry it's taking so long.
>>> >>>>>>> >> >
>>> >>>>>>> >> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
>>> >>>>>>> >> >> to master and things didn't go so well. The OSDs would not start
>>> >>>>>>> >> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
>>> >>>>>>> >> >> and all OSDs and the OSD then started, but never became active in the
>>> >>>>>>> >> >> cluster. It just sat there after reading all the PGs. There were
>>> >>>>>>> >> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
>>> >>>>>>> >> >> downgrading to the Infernalis branch and still no luck getting the
>>> >>>>>>> >> >> OSDs to come up. The OSD processes were idle after the initial boot.
>>> >>>>>>> >> >> All packages were installed from gitbuilder.
>>> >>>>>>> >> >
>>> >>>>>>> >> > Did you chown -R ?
>>> >>>>>>> >> >
>>> >>>>>>> >> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>>> >>>>>>> >> >
>>> >>>>>>> >> > My guess is you only chowned the root dir, and the OSD didn't throw
>>> >>>>>>> >> > an error when it encountered the other files?  If you can generate a debug
>>> >>>>>>> >> > osd = 20 log, that would be helpful.. thanks!
>>> >>>>>>> >> >
>>> >>>>>>> >> > sage
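Concretely, the ownership fix from those release notes amounts to something like this on each host, with the daemons stopped (unit name assumes a systemd-based install; <id> is each local OSD):

    systemctl stop ceph-osd@<id>
    chown -R ceph:ceph /var/lib/ceph/osd/ceph-<id>
    systemctl start ceph-osd@<id>

or simply chown -R ceph:ceph /var/lib/ceph with everything stopped, which is what the release notes describe.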
>>> >>>>>>> >> >
>>> >>>>>>> >> >
>>> >>>>>>> >> >>
>>> >>>>>>> >> >> Thanks,
>>> >>>>>>> >> >> ----------------
>>> >>>>>>> >> >> Robert LeBlanc
>>> >>>>>>> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>>> >> >>
>>> >>>>>>> >> >>
>>> >>>>>>> >> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>>> >>>>>>> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>> >> >> > Hash: SHA256
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> > I have eight nodes running the fio job rbd_test_real to different RBD
>>> >>>>>>> >> >> > volumes. I've included the CRUSH map in the tarball.
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> > I stopped one OSD process and marked it out. I let it recover for a
>>> >>>>>>> >> >> > few minutes and then I started the process again and marked it in. I
>>> >>>>>>> >> >> > started getting blocked I/O messages during the recovery.
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> > Thanks,
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> > ----------------
>>> >>>>>>> >> >> > Robert LeBlanc
>>> >>>>>>> >> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> >
>>> >>>>>>> >> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>>> >>>>>>> >> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>> >>>>>>> >> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>> >> >> >>> Hash: SHA256
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> We are still struggling with this and have tried a lot of different
>>> >>>>>>> >> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>> >>>>>>> >> >> >>> consulting services for non-Red Hat systems. If there are any
>>> >>>>>>> >> >> >>> certified Ceph consultants in the US with whom we can do both remote and
>>> >>>>>>> >> >> >>> on-site engagements, please let us know.
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> This certainly seems to be network related, but somewhere in the
>>> >>>>>>> >> >> >>> kernel. We have tried increasing the network and TCP buffers, number
>>> >>>>>>> >> >> >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
>>> >>>>>>> >> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
>>> >>>>>>> >> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>>> >>>>>>> >> >> >>> at a time). There seems to be no reasonable explanation why I/O is
>>> >>>>>>> >> >> >>> blocked pretty frequently for longer than 30 seconds. We have verified
>>> >>>>>>> >> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>>> >>>>>>> >> >> >>> network admins have verified that packets are not being dropped in the
>>> >>>>>>> >> >> >>> switches for these nodes. We have tried different kernels including
>>> >>>>>>> >> >> >>> the recent Google patch to cubic. This is showing up on three cluster
>>> >>>>>>> >> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>>> >>>>>>> >> >> >>> (from CentOS 7.1) with similar results.
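The jumbo-frame check mentioned above amounts to a don't-fragment ping with a full-size frame from every node to every other node, e.g. (peer address is a placeholder):

    ping -M do -s 8972 <peer node IP>    # 8972 bytes of payload + 28 bytes of headers = 9000

If that succeeds everywhere, the 9000-byte MTU is at least consistent end to end.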
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> The messages seem slightly different:
>>> >>>>>>> >> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>> >>>>>>> >> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>> >>>>>>> >> >> >>> 100.087155 secs
>>> >>>>>>> >> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>> >>>>>>> >> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
>>> >>>>>>> >> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>> >>>>>>> >> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>> >>>>>>> >> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>> >>>>>>> >> >> >>> points reached
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> I don't know what "no flag points reached" means.
>>> >>>>>>> >> >> >>
>>> >>>>>>> >> >> >> Just that the op hasn't been marked as reaching any interesting points
>>> >>>>>>> >> >> >> (op->mark_*() calls).
>>> >>>>>>> >> >> >>
>>> >>>>>>> >> >> >> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
>>> >>>>>>> >> >> >> It's extremely verbose but it'll let us see where the op is getting
>>> >>>>>>> >> >> >> blocked.  If you see the "slow request" message it means the op was
>>> >>>>>>> >> >> >> received by ceph (that's when the clock starts), so I suspect it's not
>>> >>>>>>> >> >> >> something we can blame on the network stack.
>>> >>>>>>> >> >> >>
>>> >>>>>>> >> >> >> sage
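
For reference, a minimal ceph.conf fragment with the logging levels Sage asks
for above; putting them under [osd] is only an assumption about where one
would normally set them, not something stated in the thread:

    [osd]
        debug osd = 20
        debug ms = 20
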
>>> >>>>>>> >> >> >>
>>> >>>>>>> >> >> >>
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
>>> >>>>>>> >> >> >>> of 13), we will have hundreds of I/O blocked for some times up to 300
>>> >>>>>>> >> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
>>> >>>>>>> >> >> >>> production cluster is very busy, normally doing 8,000 IOPS and peaking
>>> >>>>>>> >> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>>> >>>>>>> >> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
>>> >>>>>>> >> >> >>> the load better across the disks, but we are having to do this 10 PGs
>>> >>>>>>> >> >> >>> at a time as we get blocked I/O. We have max_backfills and
>>> >>>>>>> >> >> >>> max_recovery set to 1, client op priority is set higher than recovery
>>> >>>>>>> >> >> >>> priority. We tried increasing the number of op threads but this didn't
>>> >>>>>>> >> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
>>> >>>>>>> >> >> >>> become active and could be the cause for slow I/O while the other PGs
>>> >>>>>>> >> >> >>> are being checked.
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> What I don't understand is that the messages are delayed. As soon as
>>> >>>>>>> >> >> >>> the message is received by Ceph OSD process, it is very quickly
>>> >>>>>>> >> >> >>> committed to the journal and a response is sent back to the primary
>>> >>>>>>> >> >> >>> OSD which is received very quickly as well. I've adjusted
>>> >>>>>>> >> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>>> >>>>>>> >> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
>>> >>>>>>> >> >> >>> of RAM per node for 10 OSDs.
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> Is there something that could cause the kernel to get a packet but not
>>> >>>>>>> >> >> >>> be able to dispatch it to Ceph, which could explain why we
>>> >>>>>>> >> >> >>> are seeing this blocked I/O for 30+ seconds? Are there any pointers
>>> >>>>>>> >> >> >>> to tracing Ceph messages from the network buffer through the kernel to
>>> >>>>>>> >> >> >>> the Ceph process?
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> We can really use some pointers no matter how outrageous. We've had
>>> >>>>>>> >> >> >>> over 6 people looking into this for weeks now and just can't think of
>>> >>>>>>> >> >> >>> anything else.
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> Thanks,
>>> >>>>>>> >> >> >>> -----BEGIN PGP SIGNATURE-----
>>> >>>>>>> >> >> >>> Version: Mailvelope v1.1.0
>>> >>>>>>> >> >> >>> Comment: https://www.mailvelope.com
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>>> >>>>>>> >> >> >>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>>> >>>>>>> >> >> >>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>>> >>>>>>> >> >> >>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>>> >>>>>>> >> >> >>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>>> >>>>>>> >> >> >>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>>> >>>>>>> >> >> >>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>>> >>>>>>> >> >> >>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>>> >>>>>>> >> >> >>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>>> >>>>>>> >> >> >>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>>> >>>>>>> >> >> >>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>>> >>>>>>> >> >> >>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>>> >>>>>>> >> >> >>> l7OF
>>> >>>>>>> >> >> >>> =OI++
>>> >>>>>>> >> >> >>> -----END PGP SIGNATURE-----
>>> >>>>>>> >> >> >>> ----------------
>>> >>>>>>> >> >> >>> Robert LeBlanc
>>> >>>>>>> >> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>>> >>>>>>> >> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
>>> >>>>>>> >> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
>>> >>>>>>> >> >> >>> > last 12 hours). This makes me believe that there is some issue with
>>> >>>>>>> >> >> >>> > the number of sockets or some other TCP issue. We have not messed with
>>> >>>>>>> >> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
>>> >>>>>>> >> >> >>> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
>>> >>>>>>> >> >> >>> > processes and 16K system wide.
>>> >>>>>>> >> >> >>> >
>>> >>>>>>> >> >> >>> > Does this seem like the right spot to be looking? What are some
>>> >>>>>>> >> >> >>> > configuration items we should be looking at?
>>> >>>>>>> >> >> >>> >
>>> >>>>>>> >> >> >>> > Thanks,
>>> >>>>>>> >> >> >>> > ----------------
>>> >>>>>>> >> >> >>> > Robert LeBlanc
>>> >>>>>>> >> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>>> >> >> >>> >
>>> >>>>>>> >> >> >>> >
>>> >>>>>>> >> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>>> >>>>>>> >> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>> >> >> >>> >> Hash: SHA256
>>> >>>>>>> >> >> >>> >>
>>> >>>>>>> >> >> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
>>> >>>>>>> >> >> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
>>> >>>>>>> >> >> >>> >> seems that there were some major reworks in the network handling in
>>> >>>>>>> >> >> >>> >> the kernel to efficiently handle that network rate. If I remember
>>> >>>>>>> >> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
>>> >>>>>>> >> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
>>> >>>>>>> >> >> >>> >> testing, but we could not tell where the dropping was happening. We
>>> >>>>>>> >> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>>> >>>>>>> >> >> >>> >> trying to congest things. We probably already saw this issue, just
>>> >>>>>>> >> >> >>> >> didn't know it.
>>> >>>>>>> >> >> >>> >> - ----------------
>>> >>>>>>> >> >> >>> >> Robert LeBlanc
>>> >>>>>>> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>>> >> >> >>> >>
>>> >>>>>>> >> >> >>> >>
>>> >>>>>>> >> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>> >>>>>>> >> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>> >>>>>>> >> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>> >>>>>>> >> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>> >>>>>>> >> >> >>> >>> drivers might cause problems though.
>>> >>>>>>> >> >> >>> >>>
>>> >>>>>>> >> >> >>> >>> Here's ifconfig from one of the nodes:
>>> >>>>>>> >> >> >>> >>>
>>> >>>>>>> >> >> >>> >>> ens513f1: flags=4163  mtu 1500
>>> >>>>>>> >> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>> >>>>>>> >> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>> >>>>>>> >> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>> >>>>>>> >> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>> >>>>>>> >> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>>> >>>>>>> >> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>> >>>>>>> >> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>> >>>>>>> >> >> >>> >>>
>>> >>>>>>> >> >> >>> >>> Mark
>>> >>>>>>> >> >> >>> >>>
>>> >>>>>>> >> >> >>> >>>
>>> >>>>>>> >> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>> >>>>>>> >> >> >>> >>>>
>>> >>>>>>> >> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>> >> >> >>> >>>> Hash: SHA256
>>> >>>>>>> >> >> >>> >>>>
>>> >>>>>>> >> >> >>> >>>> OK, here is the update on the saga...
>>> >>>>>>> >> >> >>> >>>>
>>> >>>>>>> >> >> >>> >>>> I traced some more of blocked I/Os and it seems that communication
>>> >>>>>>> >> >> >>> >>>> between two hosts seemed worse than others. I did a two way ping flood
>>> >>>>>>> >> >> >>> >>>> between the two hosts using max packet sizes (1500). After 1.5M
>>> >>>>>>> >> >> >>> >>>> packets, no lost pings. I then had the ping flood running while I
>>> >>>>>>> >> >> >>> >>>> put Ceph load on the cluster, and the dropped pings started increasing;
>>> >>>>>>> >> >> >>> >>>> after stopping the Ceph workload, the pings stopped dropping.
>>> >>>>>>> >> >> >>> >>>>
>>> >>>>>>> >> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
>>> >>>>>>> >> >> >>> >>>> ruled out Ceph to a large degree. I then booted into the
>>> >>>>>>> >> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and, with an hour of testing so far, there
>>> >>>>>>> >> >> >>> >>>> haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>>> >>>>>>> >> >> >>> >>>> need the network enhancements in the 4.x series to work well.
>>> >>>>>>> >> >> >>> >>>>
>>> >>>>>>> >> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>> >>>>>>> >> >> >>> >>>> kernel to see where this issue was introduced. Both of the clusters
>>> >>>>>>> >> >> >>> >>>> with this issue are running 4.x; other than that, they have pretty
>>> >>>>>>> >> >> >>> >>>> different hardware and network configs.
>>> >>>>>>> >> >> >>> >>>>
>>> >>>>>>> >> >> >>> >>>> Thanks,
>>> >>>>>>> >> >> >>> >>>> -----BEGIN PGP SIGNATURE-----
>>> >>>>>>> >> >> >>> >>>> Version: Mailvelope v1.1.0
>>> >>>>>>> >> >> >>> >>>> Comment: https://www.mailvelope.com
>>> >>>>>>> >> >> >>> >>>>
>>> >>>>>>> >> >> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>> >>>>>>> >> >> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>> >>>>>>> >> >> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>> >>>>>>> >> >> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>> >>>>>>> >> >> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>> >>>>>>> >> >> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>> >>>>>>> >> >> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>> >>>>>>> >> >> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>> >>>>>>> >> >> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>> >>>>>>> >> >> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>> >>>>>>> >> >> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>> >>>>>>> >> >> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>> >>>>>>> >> >> >>> >>>> 4OEo
>>> >>>>>>> >> >> >>> >>>> =P33I
>>> >>>>>>> >> >> >>> >>>> -----END PGP SIGNATURE-----
>>> >>>>>>> >> >> >>> >>>> ----------------
>>> >>>>>>> >> >> >>> >>>> Robert LeBlanc
>>> >>>>>>> >> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>>> >> >> >>> >>>>
>>> >>>>>>> >> >> >>> >>>>
>>> >>>>>>> >> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>> >>>>>>> >> >> >>> >>>> wrote:
>>> >>>>>>> >> >> >>> >>>>>
>>> >>>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>> >> >> >>> >>>>> Hash: SHA256
>>> >>>>>>> >> >> >>> >>>>>
>>> >>>>>>> >> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>>> >>>>>>> >> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>>> >>>>>>> >> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>>> >>>>>>> >> >> >>> >>>>> found that MTU under 32K worked reliably for ping, but still had the
>>> >>>>>>> >> >> >>> >>>>> blocked I/O.
>>> >>>>>>> >> >> >>> >>>>>
>>> >>>>>>> >> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>> >>>>>>> >> >> >>> >>>>> the blocked I/O.
>>> >>>>>>> >> >> >>> >>>>> - ----------------
>>> >>>>>>> >> >> >>> >>>>> Robert LeBlanc
>>> >>>>>>> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>>> >> >> >>> >>>>>
>>> >>>>>>> >> >> >>> >>>>>
>>> >>>>>>> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>> >>>>>>> >> >> >>> >>>>>>
>>> >>>>>>> >> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>> >>>>>>> >> >> >>> >>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>>> >>>>>>> >> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>>> >>>>>>> >> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>> >>>>>>> >> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>> >>>>>>> >> >> >>> >>>>>>> delayed for many 10s of seconds?
>>> >>>>>>> >> >> >>> >>>>>>
>>> >>>>>>> >> >> >>> >>>>>>
>>> >>>>>>> >> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>> >>>>>>> >> >> >>> >>>>>> has
>>> >>>>>>> >> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>>> >>>>>>> >> >> >>> >>>>>>
>>> >>>>>>> >> >> >>> >>>>>> sage
>>> >>>>>>> >> >> >>> >>>>>>
>>> >>>>>>> >> >> >>> >>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>> What kernel are you running?
>>> >>>>>>> >> >> >>> >>>>>>> -Sam
>>> >>>>>>> >> >> >>> >>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>> >> >> >>> >>>>>>>> Hash: SHA256
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>> >>>>>>> >> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
>>> >>>>>>> >> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
>>> >>>>>>> >> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>> >>>>>>> >> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>> >>>>>>> >> >> >>> >>>>>>>> osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
>>> >>>>>>> >> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>> >>>>>>> >> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>> >>>>>>> >> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>> >>>>>>> >> >> >>> >>>>>>>> transfer).
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>> >>>>>>> >> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>> >>>>>>> >> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
>>> >>>>>>> >> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
>>> >>>>>>> >> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>> >>>>>>> >> >> >>> >>>>>>>> thread.
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>> >>>>>>> >> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>> >>>>>>> >> >> >>> >>>>>>>> some help.
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> Single Test started about
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:52:36
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>> >>>>>>> >> >> >>> >>>>>>>> 30.439150 secs
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,16
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>> >>>>>>> >> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>> >>>>>>> >> >> >>> >>>>>>>> 30.379680 secs
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>> >>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>> >>>>>>> >> >> >>> >>>>>>>> 12:55:06.406303:
>>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>> >>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>> >>>>>>> >> >> >>> >>>>>>>> 12:55:06.318144:
>>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,14
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>> >>>>>>> >> >> >>> >>>>>>>> 30.954212 secs
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 16,17
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>> >>>>>>> >> >> >>> >>>>>>>> 30.704367 secs
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> Server   IP addr              OSD
>>> >>>>>>> >> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>>> >>>>>>> >> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>>> >>>>>>> >> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>>> >>>>>>> >> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>>> >>>>>>> >> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>>> >>>>>>> >> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> fio job:
>>> >>>>>>> >> >> >>> >>>>>>>> [rbd-test]
>>> >>>>>>> >> >> >>> >>>>>>>> readwrite=write
>>> >>>>>>> >> >> >>> >>>>>>>> blocksize=4M
>>> >>>>>>> >> >> >>> >>>>>>>> #runtime=60
>>> >>>>>>> >> >> >>> >>>>>>>> name=rbd-test
>>> >>>>>>> >> >> >>> >>>>>>>> #readwrite=randwrite
>>> >>>>>>> >> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>> >>>>>>> >> >> >>> >>>>>>>> #rwmixread=72
>>> >>>>>>> >> >> >>> >>>>>>>> #norandommap
>>> >>>>>>> >> >> >>> >>>>>>>> #size=1T
>>> >>>>>>> >> >> >>> >>>>>>>> #blocksize=4k
>>> >>>>>>> >> >> >>> >>>>>>>> ioengine=rbd
>>> >>>>>>> >> >> >>> >>>>>>>> rbdname=test2
>>> >>>>>>> >> >> >>> >>>>>>>> pool=rbd
>>> >>>>>>> >> >> >>> >>>>>>>> clientname=admin
>>> >>>>>>> >> >> >>> >>>>>>>> iodepth=8
>>> >>>>>>> >> >> >>> >>>>>>>> #numjobs=4
>>> >>>>>>> >> >> >>> >>>>>>>> #thread
>>> >>>>>>> >> >> >>> >>>>>>>> #group_reporting
>>> >>>>>>> >> >> >>> >>>>>>>> #time_based
>>> >>>>>>> >> >> >>> >>>>>>>> #direct=1
>>> >>>>>>> >> >> >>> >>>>>>>> #ramp_time=60
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> Thanks,
>>> >>>>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>>> >>>>>>> >> >> >>> >>>>>>>> Version: Mailvelope v1.1.0
>>> >>>>>>> >> >> >>> >>>>>>>> Comment: https://www.mailvelope.com
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>> >>>>>>> >> >> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>> >>>>>>> >> >> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>> >>>>>>> >> >> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>> >>>>>>> >> >> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>> >>>>>>> >> >> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>> >>>>>>> >> >> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>> >>>>>>> >> >> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>> >>>>>>> >> >> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>> >>>>>>> >> >> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>> >>>>>>> >> >> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>> >>>>>>> >> >> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>> >>>>>>> >> >> >>> >>>>>>>> J3hS
>>> >>>>>>> >> >> >>> >>>>>>>> =0J7F
>>> >>>>>>> >> >> >>> >>>>>>>> -----END PGP SIGNATURE-----
>>> >>>>>>> >> >> >>> >>>>>>>> ----------------
>>> >>>>>>> >> >> >>> >>>>>>>> Robert LeBlanc
>>> >>>>>>> >> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>> >>>>>>> >> >> >>> >>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>> >>>>>>> >> >> >>> >>>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>>>>>> >> >> >>> >>>>>>>>>> Hash: SHA256
>>> >>>>>>> >> >> >>> >>>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>> >>>>>>> >> >> >>> >>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
>>> >>>>>>> >> >> >>> >>>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>>> I'm not
>>> >>>>>>> >> >> >>> >>>>>>>>>> seeing much I/O, CPU usage during these times. Is there some way to
>>> >>>>>>> >> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>> >>>>>>> >> >> >>> >>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>> >>>>>>> >> >> >>> >>>>>>>>> this, it was discussed not too long ago.
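
A sketch of what bumping those thresholds up might look like in ceph.conf,
assuming the FileStore split/merge options are the ones meant here; the
option names and values are illustrative, not a recommendation from the
thread:

    [osd]
        filestore merge threshold = 40
        filestore split multiple = 8
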
>>> >>>>>>> >> >> >>> >>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>> >>>>>>> >> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>> >>>>>>> >> >> >>> >>>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>>> The fio test is just a seq write, starting it over (rewriting from
>>> >>>>>>> >> >> >>> >>>>>>>>>> the
>>> >>>>>>> >> >> >>> >>>>>>>>>> beginning) is still causing the issue. I was suspect that it is not
>>> >>>>>>> >> >> >>> >>>>>>>>>> having to create new file and therefore split collections. This is
>>> >>>>>>> >> >> >>> >>>>>>>>>> on
>>> >>>>>>> >> >> >>> >>>>>>>>>> my test cluster with no other load.
>>> >>>>>>> >> >> >>> >>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>> >>>>>>> >> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>> >>>>>>> >> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>> >>>>>>> >> >> >>> >>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>> >>>>>>> >> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>>> >>>>>>> >> >> >>> >>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>>> >>>>>>> >> >> >>> >>>>>>>>> 20",
>>> >>>>>>> >> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>>> >>>>>>> >> >> >>> >>>>>>>>> out
>>> >>>>>>> >> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>>> >>>>>>> >> >> >>> >>>>>>>>> -Greg
>>> >>>>>>> >> >> >>> >>>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>
>>> >>>>>>> >> >> >>> >>>>>>>
>>> >>>>>>> >> >> >>> >>>>>
>>> >>>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNATURE-----
>>> >>>>>>> >> >> >>> >>>>> Version: Mailvelope v1.1.0
>>> >>>>>>> >> >> >>> >>>>> Comment: https://www.mailvelope.com
>>> >>>>>>> >> >> >>> >>>>>
>>> >>>>>>> >> >> >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>> >>>>>>> >> >> >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>> >>>>>>> >> >> >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>> >>>>>>> >> >> >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>> >>>>>>> >> >> >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>> >>>>>>> >> >> >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>> >>>>>>> >> >> >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>> >>>>>>> >> >> >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>> >>>>>>> >> >> >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>> >>>>>>> >> >> >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>> >>>>>>> >> >> >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>> >>>>>>> >> >> >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>> >>>>>>> >> >> >>> >>>>> gcZm
>>> >>>>>>> >> >> >>> >>>>> =CjwB
>>> >>>>>>> >> >> >>> >>>>> -----END PGP SIGNATURE-----
>>> >>>>>>> >> >> >>> >>>>
>>> >>>>>>> >> >> >>> >>>>
>>> >>>>>>> >> >> >>> >>>
>>> >>>>>>> >> >> >>> >>
>>> >>>>>>> >> >> >>> >> -----BEGIN PGP SIGNATURE-----
>>> >>>>>>> >> >> >>> >> Version: Mailvelope v1.1.0
>>> >>>>>>> >> >> >>> >> Comment: https://www.mailvelope.com
>>> >>>>>>> >> >> >>> >>
>>> >>>>>>> >> >> >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>>> >>>>>>> >> >> >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>>> >>>>>>> >> >> >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>>> >>>>>>> >> >> >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>>> >>>>>>> >> >> >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>>> >>>>>>> >> >> >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>>> >>>>>>> >> >> >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>>> >>>>>>> >> >> >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>>> >>>>>>> >> >> >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>>> >>>>>>> >> >> >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>>> >>>>>>> >> >> >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>>> >>>>>>> >> >> >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>>> >>>>>>> >> >> >>> >> ae22
>>> >>>>>>> >> >> >>> >> =AX+L
>>> >>>>>>> >> >> >>> >> -----END PGP SIGNATURE-----
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >> >>>
>>> >>>>>>> >> >>
>>> >>>>>>> >> >>
>>> >>>>>>> >>
>>> >>>>>>> >>
>>> >>>>>>>
>>> >>>>>>> -----BEGIN PGP SIGNATURE-----
>>> >>>>>>> Version: Mailvelope v1.2.0
>>> >>>>>>> Comment: https://www.mailvelope.com
>>> >>>>>>>
>>> >>>>>>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
>>> >>>>>>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
>>> >>>>>>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
>>> >>>>>>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
>>> >>>>>>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
>>> >>>>>>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
>>> >>>>>>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
>>> >>>>>>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
>>> >>>>>>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
>>> >>>>>>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
>>> >>>>>>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
>>> >>>>>>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
>>> >>>>>>> 6Kfk
>>> >>>>>>> =/gR6
>>> >>>>>>> -----END PGP SIGNATURE-----
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>
>>> >>>>> -----BEGIN PGP SIGNATURE-----
>>> >>>>> Version: Mailvelope v1.2.0
>>> >>>>> Comment: https://www.mailvelope.com
>>> >>>>>
>>> >>>>> wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
>>> >>>>> x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
>>> >>>>> Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
>>> >>>>> SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
>>> >>>>> QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
>>> >>>>> mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
>>> >>>>> 7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
>>> >>>>> QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
>>> >>>>> MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
>>> >>>>> rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
>>> >>>>> gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
>>> >>>>> RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
>>> >>>>> JFPi
>>> >>>>> =ofgq
>>> >>>>> -----END PGP SIGNATURE-----
>>> >>>>
>>> >>>> -----BEGIN PGP SIGNATURE-----
>>> >>>> Version: Mailvelope v1.2.0
>>> >>>> Comment: https://www.mailvelope.com
>>> >>>>
>>> >>>> wsFcBAEBCAAQBQJWFDDOCRDmVDuy+mK58QAA0kUP/1rfRQa5Us9b/VCvKrhk
>>> >>>> BYrde1/FBybKBVXsuXVU8Dq124A1e4L682AhmQPUeVP8PQLoqS/VFSl0h7i6
>>> >>>> 28AzydDaBTTjnrp6ZzVbtmKtm8WhmtSTFvWTlu/yJmRXAht9YozmFCByBfIY
>>> >>>> GYvOhZzjvbxBKfwnwq97QkS7xfY2tss/BmaOvSVTX7naYaOF+HRwZMSt+BF4
>>> >>>> 9vg9BLSL3Aic0BnvdM64TWkDaHp/3gwGSmyMn8Q2Sa9CqUTddKQx2HXN6doo
>>> >>>> gIyxCj+dIw2Pt73u2NoiYv8ZhTuS3QYM4n0rRBxj8Wr/EeNwGAOwdDSgbOxf
>>> >>>> OvDyozzmCpQyW3h/nkdQJW5mWsJmyDIiGxHDdUn7Vgemg+Bbod0ACdoJiwct
>>> >>>> /BIRVQe2Ee1nZQFoKBOhvaWO6+ePJR7CVfLjMkZBTzKZBjt2tfkq17G5KTdS
>>> >>>> EsehvG/+vfFJkANL5Xh6eo9ptlHbFW8I/44pvUtGi2JwsN487l56XR9DqEKM
>>> >>>> 7Cmj9Ox205YxjqcBjhWIJQTok99lvrhDX9d7HHxIeTcmouvqPz4LTcCySRtC
>>> >>>> xE/GcEGAAYWGPTwf9u8ULm9Rh2Z90OnKpqtCtuuWiwRRL9VU/tLlvqmHvEZM
>>> >>>> 73qhiLQZka5I72B2SAEtJnDt2sX3NJ4unvH4zWKLRFTTm4M0qk6xUL1JfqNz
>>> >>>> JYNo
>>> >>>> =msX2
>>> >>>> -----END PGP SIGNATURE-----
>>> >>
>>> >> -----BEGIN PGP SIGNATURE-----
>>> >> Version: Mailvelope v1.2.0
>>> >> Comment: https://www.mailvelope.com
>>> >>
>>> >> wsFcBAEBCAAQBQJWFXGPCRDmVDuy+mK58QAAx38P/1sn6TA8hH+F2kd1A2Pq
>>> >> IU2cg1pFcH+kw21G8VO+BavfBaBoSETHEEuMXg5SszTIcL/HyziBLJos0C0j
>>> >> Vu9I0/YtblQ15enzFqKFPosdc7qij9DPJxXRkx41sJZsxvSVky+URcPpcKk6
>>> >> w8Lwuq9IupesQ19ZeJkCEWFVhKz/i2E9/VXfylBgFVlkICD+5pfx6/Aq7nCP
>>> >> 4gboyha07zpPlDqoA7xgT+6v2zlYC80saGcA1m2XaAUdPF/17l6Mq9+Glv7E
>>> >> 3KeUf7jmMTJQRGBZSInFgUpPwUQKvF5OSGb3YQlzofUy5Es+wH3ccqZ+mlIY
>>> >> szuBLAtN6zhFFPCs6016hiragiUhLk97PItXaKdDJKecuyRdShlJrXJmtX+j
>>> >> NdM14TkBPTiLtAd/IZEEhIIpdvQH8YSl3LnEZ5gywggaY4Pk3JLFIJPgLpEb
>>> >> T8hJnuiaQaYxERQ0nRoBL4LAXARseSrOuVt2EAD50Yb/5JEwB9FQlN758rb1
>>> >> AE/xhpK6d53+RlkPODKxXx816hXvDP6NADaC78XGmx+A4FfepdxBijGBsmOQ
>>> >> 7SxAZe469K0E6EAfClc664VzwuvBEZjwTg1eK5Z6VS/FDTH/RxTKeFhlbUIT
>>> >> XpezlP7XZ1/YRrJ/Eg7nb1Dv0MYQdu18tQ6QBv+C1ZsmxYLlHlcf6BZ3gNar
>>> >> rZW5
>>> >> =dKn9
>>> >> -----END PGP SIGNATURE-----
>>>
>
>
>
> --
> Best Regards,
>
> Wheat

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                                               ` <CAANLjFrBNMeGkawcBUYqjWSjoWyQHCxjpEM291TmOp40HhCoSA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-14 17:08                                                                                                                 ` Sage Weil
       [not found]                                                                                                                   ` <alpine.DEB.2.00.1510140955240.6589-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Sage Weil @ 2015-10-14 17:08 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

On Wed, 14 Oct 2015, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> It seems in our situation the cluster is just busy, usually with
> really small RBD I/O. We have gotten things to where it doesn't happen
> as much in a steady state, but when we have an OSD fail (mostly from
> an XFS log bug we hit at least once a week), it is very painful as the
> OSD exits and re-enters the cluster. We are working to split the PGs a
> couple-fold, but this is a painful process for the reasons
> mentioned in the tracker. Matt Benjamin and Sam Just had a discussion
> on IRC about getting the other primaries to throttle back when such a
> situation occurs so that each primary OSD has some time to service
> client I/O and to push back on the clients to slow down in these
> situations.
> 
> In our case a single OSD can lock up a VM for a very long time while
> others are happily going about their business. Instead of looking like
> the cluster is out of I/O, it looks like there is an error. If
> pressure is pushed back to clients, it would show up as all of the
> clients slowing down a little instead of one or two just hanging for
> even over 1,000 seconds.

This 1000 seconds figure is very troubling.  Do you have logs?  I suspect 
this is a different issue than the prioritization one in the log from the 
other day (which only waited about 30s for higher-priority replica 
requests).

> My thought is that each OSD should have some percentage of time given
> to servicing client I/O whereas now it seems that replica I/O can
> completely starve client I/O. I understand why replica traffic needs a
> higher priority, but I think some balance needs to be attained.

We currently do 'fair' prioritized queueing with a token bucket filter 
only for requests with priorities <= 63.  Simply increasing this threshold 
so that it covers replica requests might be enough.  But... we'll be 
starting client requests locally at the expense of in-progress client 
writes elsewhere.  Given that the amount of (our) client-related work we 
do is always bounded by the msgr throttle, I think this is okay since we 
only make the situation worse by a fixed factor.  (We still don't address 
the possibility that we are a replica for every other osd in the system and
could be flooded by N*(max client ops per osd).)

It's this line:

	https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L8334

sage
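
To make the cutoff concrete, here is a minimal, illustrative C++ sketch of
the enqueue/dequeue behaviour described above. It is not Ceph's actual
implementation; every name in it is made up, and only the priority values
(63 for client ops, 127 for osd_repop, 196 for repop replies) and the <= 63
fairness threshold come from this thread.

// Illustrative sketch only -- not the actual Ceph op queue.
#include <cstdint>
#include <iostream>
#include <queue>

struct QueuedOp {
  unsigned priority;   // 63 = client op, 127 = osd_repop, 196 = repop reply
  uint64_t cost;       // e.g. bytes, what a token bucket would charge
};

class OpQueueSketch {
  unsigned cutoff_;                 // ops with priority <= cutoff_ are "fair"
  std::queue<QueuedOp> strict_;     // higher priorities: served whenever present
  std::queue<QueuedOp> fair_;       // token-bucket / shared bandwidth (simplified)
public:
  explicit OpQueueSketch(unsigned cutoff) : cutoff_(cutoff) {}

  void enqueue(const QueuedOp& op) {
    if (op.priority <= cutoff_)
      fair_.push(op);               // today: only client ops (63) land here
    else
      strict_.push(op);             // osd_repop (127) and replies (196) bypass fairness
  }

  bool dequeue(QueuedOp* out) {
    // Strict ops always win; fair ops only run when no strict work is queued,
    // which is how replica traffic can starve client I/O under load.
    if (!strict_.empty()) { *out = strict_.front(); strict_.pop(); return true; }
    if (!fair_.empty())   { *out = fair_.front();   fair_.pop();   return true; }
    return false;
  }
};

int main() {
  OpQueueSketch q(63);              // today's threshold
  q.enqueue({63, 4194304});         // client write
  q.enqueue({127, 4194304});        // replica write (osd_repop)
  QueuedOp op;
  while (q.dequeue(&op))
    std::cout << "dequeued prio " << op.priority << "\n";  // 127 first, then 63
}

// Raising cutoff_ from 63 to cover 127 would subject replica writes to the
// same fair sharing as client ops -- the change being weighed above.
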



> 
> Thanks,
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.0
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWHne4CRDmVDuy+mK58QAAwYUP/RzTrmsYV7Vi6e64Yikh
> YMMI4Cxt4mBWbTIOsb8iRY98EkqhUWd/kz45OoFQgwE4hS3O5Lksf3u0pcmS
> I+Gz6jQ4/K0B6Mc3Rt19ofD1cA9s6BLnHSqTFZEUVapiHftj84ewIRLts9dg
> YCJJeaaOV8fu07oZvnumRTAKOzWPyQizQKBGx7nujIg13Us0st83C8uANzoX
> hKvlA2qVMXO4rLgR7nZMcgj+X+/79v7MDycM3WP/Q21ValsNfETQVhN+XxC8
> D/IUfX4/AKUEuF4WBEck4Z/Wx9YD+EvpLtQVLy21daazRApWES/iy089F63O
> k9RHp189c4WCduFBaTvZj2cdekAq/Wl50O1AdafYFptWqYhw+aKpihI+yMrX
> +LhWgoYALD6wyXr0KVDZZszIRZbO/PSjct8z13aXBJoJm9r0Vyazfhi9jNW9
> Z/1GD7gv5oHymf7eR9u7T8INdjNzn6Qllj7XCyZfQv5TYxsRWMZxf5vEkpMB
> nAYANoZcNs4ZSIy+OdFOb6nM66ujrytWL1DqWusJUEM/GauBw0fxnQ/i+pMy
> XU8gYbG1um5YY8jrtvvkhnbHdeO/k24/cH7MGslxeezBPnMNzmqj3qVdiX1H
> EBbyBBtp8OF+pKExrmZc2w01W/Nxl6GbVoG+IKJ61FgwKOXEiMwb0wv5mu30
> eP3D
> =R0O9
> -----END PGP SIGNATURE-----
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Wed, Oct 14, 2015 at 12:00 AM, Haomai Wang <haomaiwang-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA256
> >>>
> >>> After a weekend, I'm ready to hit this from a different direction.
> >>>
> >>> I replicated the issue with Firefly, so it doesn't seem to be an issue that
> >>> has been introduced or resolved in any nearby version. I think overall
> >>> we may be seeing [1] to a great degree. From what I can extract from
> >>> the logs, it looks like in situations where OSDs are going up and
> >>> down, I see I/O blocked at the primary OSD waiting for peering and/or
> >>> the PG to become clean before dispatching the I/O to the replicas.
> >>>
> >>> In an effort to understand the flow of the logs, I've attached a small
> >>> 2 minute segment of a log I've extracted what I believe to be
> >>> important entries in the life cycle of an I/O along with my
> >>> understanding. If someone would be kind enough to help my
> >>> understanding, I would appreciate it.
> >>>
> >>> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
> >>> >> 192.168.55.12:0/2013622 pipe(0x26c90000 sd=47 :6800 s=2 pgs=2 cs=1
> >>> l=1 c=0x32c85440).reader got message 19 0x2af81700
> >>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
> >>>
> >>> - ->Messenger has received the message from the client (previous
> >>> entries in the 7fb9d2c68700 thread are the individual segments that
> >>> make up this message).
> >>>
> >>> 2015-10-12 14:12:36.537963 7fb9d2c68700  1 -- 192.168.55.16:6800/11295
> >>> <== client.6709 192.168.55.12:0/2013622 19 ====
> >>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
> >>> ==== 235+0+4194304 (2317308138 0 2001296353) 0x2af81700 con 0x32c85440
> >>>
> >>> - ->OSD process acknowledges that it has received the write.
> >>>
> >>> 2015-10-12 14:12:36.538096 7fb9d2c68700 15 osd.4 44 enqueue_op
> >>> 0x3052b300 prio 63 cost 4194304 latency 0.012371
> >>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
> >>>
> >>> - ->Not sure exactly what is going on here; the op is being enqueued somewhere...
> >>>
> >>> 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op
> >>> 0x3052b300 prio 63 cost 4194304 latency 30.017094
> >>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v
> >>> 5 pg pg[0.29( v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c
> >>> 40/44 32/32/10) [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702
> >>> active+clean]
> >>>
> >>> - ->The op is dequeued from this mystery queue 30 seconds later in a
> >>> different thread.
> >>
> >> ^^ This is the problem.  Everything after this looks reasonable.  Looking
> >> at the other dequeue_op calls over this period, it looks like we're just
> >> overwhelmed with higher priority requests.  New clients are 63, while
> >> osd_repop (replicated write from another primary) are 127 and replies from
> >> our own replicated ops are 196.  We do process a few other prio 63 items,
> >> but you'll see that their latency is also climbing up to 30s over this
> >> period.
> >>
> >> The question is why we suddenly get a lot of them.. maybe the peering on
> >> other OSDs just completed so we get a bunch of these?  It's also not clear
> >> to me what makes osd.4 or this op special.  We expect a mix of primary and
> >> replica ops on all the OSDs, so why would we suddenly have more of them
> >> here....
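
As a rough aid for sifting through logs like the one above, the following
small C++ helper (purely illustrative, not something from the thread) scans
an OSD debug log for dequeue_op lines and prints those whose queue latency
exceeds a threshold; it assumes the "latency <seconds>" token format visible
in the excerpts.

// Illustrative helper, not part of Ceph: report dequeue_op lines whose
// "latency <seconds>" field exceeds a threshold (default 1.0s).
#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " <osd-log> [min-seconds]\n";
    return 1;
  }
  std::ifstream in(argv[1]);
  const double min_secs = (argc > 2) ? std::stod(argv[2]) : 1.0;
  std::string line;
  while (std::getline(in, line)) {
    if (line.find("dequeue_op") == std::string::npos)
      continue;                                      // only care about dequeues
    const std::size_t pos = line.find("latency ");
    if (pos == std::string::npos)
      continue;
    std::istringstream field(line.substr(pos + 8));  // text after "latency "
    double secs = 0.0;
    if ((field >> secs) && secs >= min_secs)
      std::cout << secs << "s queued: " << line << "\n";
  }
  return 0;
}
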
> >
> > I guess the bug tracker issue (http://tracker.ceph.com/issues/13482) is
> > related to this thread.
> >
> > So does this mean there is a live lock between client ops and repops?
> > We let all clients issue too many client ops, which bottlenecks some
> > OSDs, while other OSDs may be idle enough to accept more client
> > ops. Eventually, all OSDs get stuck behind the bottlenecked OSD. That
> > seems reasonable, but why would it last so long?
> >
> >>
> >> sage
> >>
> >>
> >>>
> >>> 2015-10-12 14:13:06.542912 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
> >>> do_op osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
> >>> may_write -> write-ordered flags ack+ondisk+write+known_if_redirected
> >>>
> >>> - ->Not sure what this message is. A lookup of the secondary OSDs?
> >>>
> >>> 2015-10-12 14:13:06.544999 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
> >>> new_repop rep_tid 17815 on osd_op(client.6709.0:67
> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> >>> ack+ondisk+write+known_if_redirected e44) v5
> >>>
> >>> - ->Dispatch write to secondary OSDs?
> >>>
> >>> 2015-10-12 14:13:06.545116 7fb9e2d3a700  1 -- 192.168.55.16:6801/11295
> >>> --> 192.168.55.15:6801/32036 -- osd_repop(client.6709.0:67 0.29
> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> >>> -- ?+4195078 0x238fd600 con 0x32bcb5a0
> >>>
> >>> - ->OSD dispatches the write to OSD.0.
> >>>
> >>> 2015-10-12 14:13:06.545132 7fb9e2d3a700 20 -- 192.168.55.16:6801/11295
> >>> submit_message osd_repop(client.6709.0:67 0.29
> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> >>> remote, 192.168.55.15:6801/32036, have pipe.
> >>>
> >>> - ->Message sent to OSD.0.
> >>>
> >>> 2015-10-12 14:13:06.545195 7fb9e2d3a700  1 -- 192.168.55.16:6801/11295
> >>> --> 192.168.55.11:6801/13185 -- osd_repop(client.6709.0:67 0.29
> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> >>> -- ?+4195078 0x16edd200 con 0x3a37b20
> >>>
> >>> - ->OSD dispatches the write to OSD.5.
> >>>
> >>> 2015-10-12 14:13:06.545210 7fb9e2d3a700 20 -- 192.168.55.16:6801/11295
> >>> submit_message osd_repop(client.6709.0:67 0.29
> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> >>> remote, 192.168.55.11:6801/13185, have pipe.
> >>>
> >>> - ->Message sent to OSD.5.
> >>>
> >>> 2015-10-12 14:13:06.545229 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
> >>> append_log log((0'0,44'703], crt=44'700) [44'704 (44'691) modify
> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
> >>> client.6709.0:67 2015-10-12 14:12:34.340082]
> >>> 2015-10-12 14:13:06.545268 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'700 lcod 44'702 mlcod
> >>> 44'702 active+clean] add_log_entry 44'704 (44'691) modify
> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
> >>> client.6709.0:67 2015-10-12 14:12:34.340082
> >>>
> >>> - ->These record the OP in the journal log?
> >>>
> >>> 2015-10-12 14:13:06.563241 7fb9d326e700 20 -- 192.168.55.16:6801/11295
> >>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
> >>> cs=3 l=0 c=0x3a37b20).writer encoding 17337 features 37154696925806591
> >>> 0x16edd200 osd_repop(client.6709.0:67 0.29
> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> >>>
> >>> - ->Writing the data to OSD.5?
> >>>
> >>> 2015-10-12 14:13:06.573938 7fb9d3874700 10 -- 192.168.55.16:6801/11295
> >>> >> 192.168.55.15:6801/32036 pipe(0x3f96000 sd=176 :6801 s=2 pgs=8 cs=3
> >>> l=0 c=0x32bcb5a0).reader got ack seq 1206 >= 1206 on 0x238fd600
> >>> osd_repop(client.6709.0:67 0.29
> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> >>>
> >>> - ->Messenger gets ACK from OSD.0 that it received the last packet?
> >>>
> >>> 2015-10-12 14:13:06.613425 7fb9d3874700 10 -- 192.168.55.16:6801/11295
> >>> >> 192.168.55.15:6801/32036 pipe(0x3f96000 sd=176 :6801 s=2 pgs=8 cs=3
> >>> l=0 c=0x32bcb5a0).reader got message 1146 0x3ffa480
> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
> >>>
> >>> - ->Messenger receives ack on disk from OSD.0.
> >>>
> >>> 2015-10-12 14:13:06.613447 7fb9d3874700  1 -- 192.168.55.16:6801/11295
> >>> <== osd.0 192.168.55.15:6801/32036 1146 ====
> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 ====
> >>> 83+0+0 (2772408781 0 0) 0x3ffa480 con 0x32bcb5a0
> >>>
> >>> - ->OSD process gets on disk ACK from OSD.0.
> >>>
> >>> 2015-10-12 14:13:06.613478 7fb9d3874700 10 osd.4 44 handle_replica_op
> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 epoch 44
> >>>
> >>> - ->Primary OSD records the ACK (duplicate message?). Not sure how to
> >>> correlate that to the previous message other than by time.
> >>>
> >>> 2015-10-12 14:13:06.613504 7fb9d3874700 15 osd.4 44 enqueue_op
> >>> 0x120f9b00 prio 196 cost 0 latency 0.000250
> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
> >>>
> >>> - ->The reply is enqueued onto a mystery queue.
> >>>
> >>> 2015-10-12 14:13:06.627793 7fb9d6afd700 10 -- 192.168.55.16:6801/11295
> >>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
> >>> cs=3 l=0 c=0x3a37b20).reader got ack seq 17337 >= 17337 on 0x16edd200
> >>> osd_repop(client.6709.0:67 0.29
> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
> >>>
> >>> - ->Messenger gets ACK from OSD.5 that it received the last packet?
> >>>
> >>> 2015-10-12 14:13:06.628364 7fb9d6afd700 10 -- 192.168.55.16:6801/11295
> >>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
> >>> cs=3 l=0 c=0x3a37b20).reader got message 16477 0x21cef3c0
> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
> >>>
> >>> - ->Messenger receives ack on disk from OSD.5.
> >>>
> >>> 2015-10-12 14:13:06.628382 7fb9d6afd700  1 -- 192.168.55.16:6801/11295
> >>> <== osd.5 192.168.55.11:6801/13185 16477 ====
> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 ====
> >>> 83+0+0 (2104182993 0 0) 0x21cef3c0 con 0x3a37b20
> >>>
> >>> - ->OSD process gets on disk ACK from OSD.5.
> >>>
> >>> 2015-10-12 14:13:06.628406 7fb9d6afd700 10 osd.4 44 handle_replica_op
> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 epoch 44
> >>>
> >>> - ->Primary OSD records the ACK (duplicate message?). Not sure how to
> >>> correlate that to the previous message other than by time.
> >>>
> >>> 2015-10-12 14:13:06.628426 7fb9d6afd700 15 osd.4 44 enqueue_op
> >>> 0x3e41600 prio 196 cost 0 latency 0.000180
> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
> >>>
> >>> - ->The reply is enqueued onto a mystery queue.
> >>>
> >>> 2015-10-12 14:13:07.124206 7fb9f4e9f700  0 log_channel(cluster) log
> >>> [WRN] : slow request 30.598371 seconds old, received at 2015-10-12
> >>> 14:12:36.525724: osd_op(client.6709.0:67
> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> >>> ack+ondisk+write+known_if_redirected e44) currently waiting for subops
> >>> from 0,5
> >>>
> >>> - ->OP has not been dequeued to the client from the mystery queue yet.
> >>>
> >>> 2015-10-12 14:13:07.278449 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod 44'702 mlcod
> >>> 44'702 active+clean] eval_repop repgather(0x37ea3cc0 44'704
> >>> rep_tid=17815 committed?=0 applied?=0 lock=0
> >>> op=osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5)
> >>> wants=ad
> >>>
> >>> - ->Not sure what this means. The OP has been completed on all replicas?
> >>>
> >>> 2015-10-12 14:13:07.278566 7fb9e0535700 10 osd.4 44 dequeue_op
> >>> 0x120f9b00 prio 196 cost 0 latency 0.665312
> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 pg
> >>> pg[0.29( v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44
> >>> 32/32/10) [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod
> >>> 44'702 mlcod 44'702 active+clean]
> >>>
> >>> - ->One of the replica OPs is dequeued in a different thread
> >>>
> >>> 2015-10-12 14:13:07.278809 7fb9e0535700 10 osd.4 44 dequeue_op
> >>> 0x3e41600 prio 196 cost 0 latency 0.650563
> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 pg
> >>> pg[0.29( v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44
> >>> 32/32/10) [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod
> >>> 44'702 mlcod 44'702 active+clean]
> >>>
> >>> - ->The other replica OP is dequeued in the new thread
> >>>
> >>> 2015-10-12 14:13:07.967469 7fb9efe95700 10 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
> >>> active+clean] eval_repop repgather(0x37ea3cc0 44'704 rep_tid=17815
> >>> committed?=1 applied?=0 lock=0 op=osd_op(client.6709.0:67
> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> >>> ack+ondisk+write+known_if_redirected e44) v5) wants=ad
> >>>
> >>> - ->Not sure what this does. A thread that joins the replica OPs with
> >>> the primary OP?
> >>>
> >>> 2015-10-12 14:13:07.967515 7fb9efe95700 15 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
> >>> active+clean] log_op_stats osd_op(client.6709.0:67
> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> >>> ack+ondisk+write+known_if_redirected e44) v5 inb 4194304 outb 0 rlat
> >>> 0.000000 lat 31.441789
> >>>
> >>> - ->Logs that the write has been committed to all replicas in the
> >>> primary journal?
> >>>
> >>> Not sure what the rest of these do, nor do I understand where the
> >>> client gets an ACK that the write is committed.
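
On that last question, a highly simplified, illustrative sketch (not Ceph
source; every name here is invented) of the bookkeeping visible in the
repgather lines below: the primary tracks which replica OSDs still owe an
on-disk ack and sends the commit to the client once none remain, which is
what the "sending commit on repgather(...)" entry appears to correspond to.

// Illustrative only -- not actual Ceph code.
#include <iostream>
#include <set>

struct RepGatherSketch {
  unsigned rep_tid;
  std::set<int> waiting_on_commit;   // e.g. {0, 5} for replica osd.0 and osd.5
  bool commit_sent_to_client;
};

void handle_repop_reply(RepGatherSketch& rg, int from_osd) {
  rg.waiting_on_commit.erase(from_osd);            // "osd_repop_reply ... ondisk"
  if (rg.waiting_on_commit.empty() && !rg.commit_sent_to_client) {
    std::cout << "sending commit to client, rep_tid " << rg.rep_tid << "\n";
    rg.commit_sent_to_client = true;               // client finally gets its ack here
  }
}

int main() {
  RepGatherSketch rg{17815, {0, 5}, false};
  handle_repop_reply(rg, 0);   // first replica commits -> still waiting on osd.5
  handle_repop_reply(rg, 5);   // second replica commits -> commit goes to the client
}
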
> >>>
> >>> 2015-10-12 14:13:07.967583 7fb9efe95700 10 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
> >>> active+clean]  sending commit on repgather(0x37ea3cc0 44'704
> >>> rep_tid=17815 committed?=1 applied?=0 lock=0
> >>> op=osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5)
> >>> 0x3a2f0840
> >>>
> >>> 2015-10-12 14:13:10.351452 7fb9f0696700 10 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'702 active+clean]
> >>> eval_repop repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
> >>> applied?=1 lock=0 op=osd_op(client.6709.0:67
> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> >>> ack+ondisk+write+known_if_redirected e44) v5) wants=ad
> >>>
> >>> 2015-10-12 14:13:10.354089 7fb9f0696700 10 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
> >>> removing repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
> >>> applied?=1 lock=0 op=osd_op(client.6709.0:67
> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> >>> ack+ondisk+write+known_if_redirected e44) v5)
> >>>
> >>> 2015-10-12 14:13:10.354163 7fb9f0696700 20 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
> >>>  q front is repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
> >>> applied?=1 lock=0 op=osd_op(client.6709.0:67
> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> >>> ack+ondisk+write+known_if_redirected e44) v5)
> >>>
> >>> 2015-10-12 14:13:10.354199 7fb9f0696700 20 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
> >>> remove_repop repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
> >>> applied?=1 lock=0 op=osd_op(client.6709.0:67
> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
> >>> ack+ondisk+write+known_if_redirected e44) v5)
> >>>
> >>> 2015-10-12 14:13:15.488448 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
> >>> v 44'707 (0'0,44'707] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
> >>> [4,5,0] r=0 lpr=32 luod=44'705 lua=44'705 crt=44'704 lcod 44'704 mlcod
> >>> 44'704 active+clean] append_log: trimming to 44'704 entries 44'704
> >>> (44'691) modify
> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
> >>> client.6709.0:67 2015-10-12 14:12:34.340082
> >>>
> >>> Thanks for hanging in there with me on this...
> >>>
> >>> [1] http://www.spinics.net/lists/ceph-devel/msg26633.html
> >>> ----------------
> >>> Robert LeBlanc
> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>
> >>>
> >>> On Thu, Oct 8, 2015 at 11:44 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >>> > -----BEGIN PGP SIGNED MESSAGE-----
> >>> > Hash: SHA256
> >>> >
> >>> > Sage,
> >>> >
> >>> > After trying to bisect this issue (every test moved the bisect towards
> >>> > Infernalis) and eventually testing the Infernalis branch again, it
> >>> > looks like the problem still exists, although it is handled a tad
> >>> > better in Infernalis. I'm going to test against Firefly/Giant next
> >>> > week and then try to dive into the code to see if I can expose
> >>> > anything.
> >>> >
> >>> > If I can do anything to provide you with information, please let me know.
> >>> >
> >>> > Thanks,
> >>> > ----------------
> >>> > Robert LeBlanc
> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >
> >>> >
> >>> > On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >> Hash: SHA256
> >>> >>
> >>> >> We forgot to upload the ceph.log yesterday. It is there now.
> >>> >> - ----------------
> >>> >> Robert LeBlanc
> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>
> >>> >>
> >>> >> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
> >>> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>> Hash: SHA256
> >>> >>>
> >>> >>> I upped the debug on about everything and ran the test for about 40
> >>> >>> minutes. I took OSD.19 on ceph1 down and then brought it back in.
> >>> >>> There was at least one op on osd.19 that was blocked for over 1,000
> >>> >>> seconds. Hopefully this will have something that will cast a light on
> >>> >>> what is going on.
> >>> >>>
> >>> >>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
> >>> >>> the test to verify the results from the dev cluster. This cluster
> >>> >>> matches the hardware of our production cluster but is not yet in
> >>> >>> production so we can safely wipe it to downgrade back to Hammer.
> >>> >>>
> >>> >>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
> >>> >>>
> >>> >>> Let me know what else we can do to help.
> >>> >>>
> >>> >>> Thanks,
> >>> >>> ----------------
> >>> >>> Robert LeBlanc
> >>> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>
> >>> >>>
> >>> >>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>> Hash: SHA256
> >>> >>>>
> >>> >>>> On my second test (a much longer one), it took nearly an hour, but a
> >>> >>>> few messages have popped up over a 20-minute window. Still far fewer
> >>> >>>> than I have been seeing.
> >>> >>>> - ----------------
> >>> >>>> Robert LeBlanc
> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>
> >>> >>>>
> >>> >>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>> Hash: SHA256
> >>> >>>>>
> >>> >>>>> I'll capture another set of logs. Is there any other debugging you
> >>> >>>>> want turned up? I've seen the same thing where I see the message
> >>> >>>>> dispatched to the secondary OSD, but the message just doesn't show up
> >>> >>>>> for 30+ seconds in the secondary OSD logs.
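> >>> >>>>>
> >>> >>>>> If it helps, here is a rough sketch of one way to line up the primary
> >>> >>>>> and replica logs for a single op (just an illustration, not part of
> >>> >>>>> any Ceph tooling; the op id and file names below are only examples):
> >>> >>>>>
> >>> >>>>> #!/usr/bin/env python
> >>> >>>>> # Print every debug log line that mentions one client op, from any
> >>> >>>>> # number of OSD logs, sorted by timestamp, so the gap between the
> >>> >>>>> # primary dispatching the repop and the replica reading it stands out.
> >>> >>>>> import sys
> >>> >>>>>
> >>> >>>>> op_id = sys.argv[1]        # e.g. "client.250874.0:1388"
> >>> >>>>> logs = sys.argv[2:]        # e.g. ceph-osd.17.log ceph-osd.13.log
> >>> >>>>>
> >>> >>>>> events = []
> >>> >>>>> for path in logs:
> >>> >>>>>     with open(path) as f:
> >>> >>>>>         for line in f:
> >>> >>>>>             if op_id in line:
> >>> >>>>>                 # debug lines start with "YYYY-MM-DD HH:MM:SS.ffffff"
> >>> >>>>>                 stamp = " ".join(line.split()[:2])
> >>> >>>>>                 events.append((stamp, path, line.rstrip()))
> >>> >>>>>
> >>> >>>>> for stamp, path, line in sorted(events):
> >>> >>>>>     print("%s %s" % (path, line))
> >>> >>>>>
> >>> >>>>> Feeding it the op id from a slow request warning plus the primary and
> >>> >>>>> replica logs puts the dispatch and dequeue lines side by side.
> >>> >>>>>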
> >>> >>>>> - ----------------
> >>> >>>>> Robert LeBlanc
> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
> >>> >>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> >>> >>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>> Hash: SHA256
> >>> >>>>>>>
> >>> >>>>>>> I can't think of anything. In my dev cluster the only thing that has
> >>> >>>>>>> changed is the Ceph versions (no reboot). What I like is even though
> >>> >>>>>>> the disks are 100% utilized, it is performing as I expect now. Client
> >>> >>>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
> >>> >>>>>>> the OSD boots or during the recovery period. This is with
> >>> >>>>>>> max_backfills set to 20; even one backfill max in our production cluster is
> >>> >>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
> >>> >>>>>>> our dev cluster very easily and very quickly with these settings. So
> >>> >>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
> >>> >>>>>>> marked out. We would love to see that go away too, but this is far
> >>> >>>>>>                                             (me too!)
> >>> >>>>>>> better than what we have now. This dev cluster also has
> >>> >>>>>>> osd_client_message_cap set to default (100).
> >>> >>>>>>>
> >>> >>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
> >>> >>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
> >>> >>>>>>> would you prefer a bisect to find the introduction of the problem
> >>> >>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
> >>> >>>>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
> >>> >>>>>>> commit that prevents a clean build as that is my most limiting factor?
> >>> >>>>>>
> >>> >>>>>> Nothing comes to mind.  I think the best way to find this is still to see
> >>> >>>>>> it happen in the logs with hammer.  The frustrating thing with that log
> >>> >>>>>> dump you sent is that although I see plenty of slow request warnings in
> >>> >>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
> >>> >>>>>> turned up for long enough?
> >>> >>>>>>
> >>> >>>>>> sage
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>> Thanks,
> >>> >>>>>>> - ----------------
> >>> >>>>>>> Robert LeBlanc
> >>> >>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
> >>> >>>>>>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> >>> >>>>>>> >> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>> >> Hash: SHA256
> >>> >>>>>>> >>
> >>> >>>>>>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
> >>> >>>>>>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
> >>> >>>>>>> >> messages when the OSD was marked out:
> >>> >>>>>>> >>
> >>> >>>>>>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
> >>> >>>>>>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
> >>> >>>>>>> >> 34.476006 secs
> >>> >>>>>>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
> >>> >>>>>>> >> cluster [WRN] slow request 32.913474 seconds old, received at
> >>> >>>>>>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
> >>> >>>>>>> >> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
> >>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >>> >>>>>>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
> >>> >>>>>>> >> cluster [WRN] slow request 32.697545 seconds old, received at
> >>> >>>>>>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
> >>> >>>>>>> >> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
> >>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >>> >>>>>>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
> >>> >>>>>>> >> cluster [WRN] slow request 32.668006 seconds old, received at
> >>> >>>>>>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
> >>> >>>>>>> >> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
> >>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
> >>> >>>>>>> >>
> >>> >>>>>>> >> But I'm not seeing the blocked messages when the OSD came back in. The
> >>> >>>>>>> >> OSD spindles have been running at 100% during this test. I have seen
> >>> >>>>>>> >> slowed I/O from the clients as expected from the extra load, but so
> >>> >>>>>>> >> far no blocked messages. I'm going to run some more tests.
> >>> >>>>>>> >
> >>> >>>>>>> > Good to hear.
> >>> >>>>>>> >
> >>> >>>>>>> > FWIW I looked through the logs and all of the slow request no flag point
> >>> >>>>>>> > messages came from osd.163... and the logs don't show when they arrived.
> >>> >>>>>>> > My guess is this OSD has a slower disk than the others, or something else
> >>> >>>>>>> > funny is going on?
> >>> >>>>>>> >
> >>> >>>>>>> > I spot checked another OSD at random (60) where I saw a slow request.  It
> >>> >>>>>>> > was stuck peering for 10s of seconds... waiting on a pg log message from
> >>> >>>>>>> > osd.163.
> >>> >>>>>>> >
> >>> >>>>>>> > sage
> >>> >>>>>>> >
> >>> >>>>>>> >
> >>> >>>>>>> >>
> >>> >>>>>>> >> ----------------
> >>> >>>>>>> >> Robert LeBlanc
> >>> >>>>>>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>>> >>
> >>> >>>>>>> >>
> >>> >>>>>>> >> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
> >>> >>>>>>> >> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> >>> >>>>>>> >> >> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>> >> >> Hash: SHA256
> >>> >>>>>>> >> >>
> >>> >>>>>>> >> >> With some off-list help, we have adjusted
> >>> >>>>>>> >> >> osd_client_message_cap=10000. This seems to have helped a bit and we
> >>> >>>>>>> >> >> have seen some OSDs have a value up to 4,000 for client messages. But
> >>> >>>>>>> >> >> it does not solve the problem with the blocked I/O.
> >>> >>>>>>> >> >>
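> >>> >>>>>>> >> >> For reference, in ceph.conf form that change is simply the following
> >>> >>>>>>> >> >> (assuming the usual [osd] section; shown only as a sketch):
> >>> >>>>>>> >> >>
> >>> >>>>>>> >> >> [osd]
> >>> >>>>>>> >> >>     osd client message cap = 10000
> >>> >>>>>>> >> >>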
> >>> >>>>>>> >> >> One thing that I have noticed is that almost exactly 30 seconds elapse
> >>> >>>>>>> >> >> between when an OSD boots and the first blocked I/O message. I don't know
> >>> >>>>>>> >> >> if the OSD doesn't have time to get its brain right about a PG before
> >>> >>>>>>> >> >> it starts servicing it or what exactly.
> >>> >>>>>>> >> >
> >>> >>>>>>> >> > I'm downloading the logs from yesterday now; sorry it's taking so long.
> >>> >>>>>>> >> >
> >>> >>>>>>> >> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
> >>> >>>>>>> >> >> to master and things didn't go so well. The OSDs would not start
> >>> >>>>>>> >> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
> >>> >>>>>>> >> >> and all OSDs, and the OSDs then started, but never became active in the
> >>> >>>>>>> >> >> cluster. They just sat there after reading all the PGs. There were
> >>> >>>>>>> >> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
> >>> >>>>>>> >> >> downgrading to the Infernalis branch and still no luck getting the
> >>> >>>>>>> >> >> OSDs to come up. The OSD processes were idle after the initial boot.
> >>> >>>>>>> >> >> All packages were installed from gitbuilder.
> >>> >>>>>>> >> >
> >>> >>>>>>> >> > Did you chown -R ?
> >>> >>>>>>> >> >
> >>> >>>>>>> >> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
> >>> >>>>>>> >> >
> >>> >>>>>>> >> > My guess is you only chowned the root dir, and the OSD didn't throw
> >>> >>>>>>> >> > an error when it encountered the other files?  If you can generate a debug
> >>> >>>>>>> >> > osd = 20 log, that would be helpful.. thanks!
> >>> >>>>>>> >> >
> >>> >>>>>>> >> > sage
> >>> >>>>>>> >> >
> >>> >>>>>>> >> >
> >>> >>>>>>> >> >>
> >>> >>>>>>> >> >> Thanks,
> >>> >>>>>>> >> >> ----------------
> >>> >>>>>>> >> >> Robert LeBlanc
> >>> >>>>>>> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>>> >> >>
> >>> >>>>>>> >> >>
> >>> >>>>>>> >> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
> >>> >>>>>>> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>> >> >> > Hash: SHA256
> >>> >>>>>>> >> >> >
> >>> >>>>>>> >> >> > I have eight nodes running the fio job rbd_test_real to different RBD
> >>> >>>>>>> >> >> > volumes. I've included the CRUSH map in the tarball.
> >>> >>>>>>> >> >> >
> >>> >>>>>>> >> >> > I stopped one OSD process and marked it out. I let it recover for a
> >>> >>>>>>> >> >> > few minutes and then I started the process again and marked it in. I
> >>> >>>>>>> >> >> > started getting blocked I/O messages during the recovery.
> >>> >>>>>>> >> >> >
> >>> >>>>>>> >> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
> >>> >>>>>>> >> >> >
> >>> >>>>>>> >> >> > Thanks,
> >>> >>>>>>> >> >> >
> >>> >>>>>>> >> >> > ----------------
> >>> >>>>>>> >> >> > Robert LeBlanc
> >>> >>>>>>> >> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>>> >> >> >
> >>> >>>>>>> >> >> >
> >>> >>>>>>> >> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
> >>> >>>>>>> >> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
> >>> >>>>>>> >> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>> >> >> >>> Hash: SHA256
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>> We are still struggling with this and have tried a lot of different
> >>> >>>>>>> >> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
> >>> >>>>>>> >> >> >>> consulting services for non-Red Hat systems. If there are any
> >>> >>>>>>> >> >> >>> certified Ceph consultants in the US with whom we can do both remote and
> >>> >>>>>>> >> >> >>> on-site engagements, please let us know.
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>> This certainly seems to be network related, but somewhere in the
> >>> >>>>>>> >> >> >>> kernel. We have tried increasing the network and TCP buffers, number
> >>> >>>>>>> >> >> >>> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
> >>> >>>>>>> >> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
> >>> >>>>>>> >> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
> >>> >>>>>>> >> >> >>> at a time). There seems to be no reasonable explanation why I/O is
> >>> >>>>>>> >> >> >>> blocked pretty frequently for longer than 30 seconds. We have verified
> >>> >>>>>>> >> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> >>> >>>>>>> >> >> >>> network admins have verified that packets are not being dropped in the
> >>> >>>>>>> >> >> >>> switches for these nodes. We have tried different kernels including
> >>> >>>>>>> >> >> >>> the recent Google patch to cubic. This is showing up on three clusters
> >>> >>>>>>> >> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> >>> >>>>>>> >> >> >>> (from CentOS 7.1) with similar results.
> >>> >>>>>>> >> >> >>>
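> >>> >>>>>>> >> >> >>> The jumbo frame check was just the usual Linux ping with fragmentation
> >>> >>>>>>> >> >> >>> disabled, something along the lines of:
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>>     ping -M do -s 8972 <peer>
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>> where 8972 is 9000 minus the 28 bytes of IP/ICMP headers.
> >>> >>>>>>> >> >> >>>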
> >>> >>>>>>> >> >> >>> The messages seem slightly different:
> >>> >>>>>>> >> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> >>> >>>>>>> >> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> >>> >>>>>>> >> >> >>> 100.087155 secs
> >>> >>>>>>> >> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> >>> >>>>>>> >> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
> >>> >>>>>>> >> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> >>> >>>>>>> >> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
> >>> >>>>>>> >> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> >>> >>>>>>> >> >> >>> points reached
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>> I don't know what "no flag points reached" means.
> >>> >>>>>>> >> >> >>
> >>> >>>>>>> >> >> >> Just that the op hasn't been marked as reaching any interesting points
> >>> >>>>>>> >> >> >> (op->mark_*() calls).
> >>> >>>>>>> >> >> >>
> >>> >>>>>>> >> >> >> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
> >>> >>>>>>> >> >> >> It's extremely verbose but it'll let us see where the op is getting
> >>> >>>>>>> >> >> >> blocked.  If you see the "slow request" message it means the op was
> >>> >>>>>>> >> >> >> received by ceph (that's when the clock starts), so I suspect it's not
> >>> >>>>>>> >> >> >> something we can blame on the network stack.
> >>> >>>>>>> >> >> >>
> >>> >>>>>>> >> >> >> sage
> >>> >>>>>>> >> >> >>
> >>> >>>>>>> >> >> >>
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
> >>> >>>>>>> >> >> >>> of 13): we will have hundreds of I/Os blocked, sometimes for up to 300
> >>> >>>>>>> >> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
> >>> >>>>>>> >> >> >>> production cluster is very busy, normally doing 8,000 I/O and peaking
> >>> >>>>>>> >> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
> >>> >>>>>>> >> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
> >>> >>>>>>> >> >> >>> the load better across the disks, but we are having to do this 10 PGs
> >>> >>>>>>> >> >> >>> at a time as we get blocked I/O. We have max_backfills and
> >>> >>>>>>> >> >> >>> max_recovery set to 1, client op priority is set higher than recovery
> >>> >>>>>>> >> >> >>> priority. We tried increasing the number of op threads but this didn't
> >>> >>>>>>> >> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
> >>> >>>>>>> >> >> >>> become active and could be the cause for slow I/O while the other PGs
> >>> >>>>>>> >> >> >>> are being checked.
> >>> >>>>>>> >> >> >>>
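> >>> >>>>>>> >> >> >>> For anyone comparing notes, the throttling described above corresponds
> >>> >>>>>>> >> >> >>> to roughly these ceph.conf options (a sketch only; the priority values
> >>> >>>>>>> >> >> >>> are just illustrative of "client higher than recovery"):
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>> [osd]
> >>> >>>>>>> >> >> >>>     osd max backfills = 1
> >>> >>>>>>> >> >> >>>     osd recovery max active = 1
> >>> >>>>>>> >> >> >>>     osd client op priority = 63
> >>> >>>>>>> >> >> >>>     osd recovery op priority = 1
> >>> >>>>>>> >> >> >>>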
> >>> >>>>>>> >> >> >>> What I don't understand is that the messages are delayed. As soon as
> >>> >>>>>>> >> >> >>> the message is received by the Ceph OSD process, it is very quickly
> >>> >>>>>>> >> >> >>> committed to the journal and a response is sent back to the primary
> >>> >>>>>>> >> >> >>> OSD, which is received very quickly as well. I've adjusted
> >>> >>>>>>> >> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but it
> >>> >>>>>>> >> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
> >>> >>>>>>> >> >> >>> of RAM per node for 10 OSDs.
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>> Is there something that could cause the kernel to get a packet but not
> >>> >>>>>>> >> >> >>> be able to dispatch it to Ceph, which could explain why we
> >>> >>>>>>> >> >> >>> are seeing this blocked I/O for 30+ seconds? Are there any pointers
> >>> >>>>>>> >> >> >>> to tracing Ceph messages from the network buffer through the kernel to
> >>> >>>>>>> >> >> >>> the Ceph process?
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>> We could really use some pointers, no matter how outrageous. We have
> >>> >>>>>>> >> >> >>> over 6 people looking into this for weeks now and just can't think of
> >>> >>>>>>> >> >> >>> anything else.
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>> Thanks,
> >>> >>>>>>> >> >> >>> ----------------
> >>> >>>>>>> >> >> >>> Robert LeBlanc
> >>> >>>>>>> >> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
> >>> >>>>>>> >> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
> >>> >>>>>>> >> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
> >>> >>>>>>> >> >> >>> > last 12 hours). This makes me believe that there is some issue with
> >>> >>>>>>> >> >> >>> > the number of sockets or some other TCP issue. We have not messed with
> >>> >>>>>>> >> >> >>> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> >>> >>>>>>> >> >> >>> > hosts hosting about 150 VMs. The open files limit is set at 32K for the OSD
> >>> >>>>>>> >> >> >>> > processes and 16K system wide.
> >>> >>>>>>> >> >> >>> >
> >>> >>>>>>> >> >> >>> > Does this seem like the right spot to be looking? What are some
> >>> >>>>>>> >> >> >>> > configuration items we should be looking at?
> >>> >>>>>>> >> >> >>> >
> >>> >>>>>>> >> >> >>> > Thanks,
> >>> >>>>>>> >> >> >>> > ----------------
> >>> >>>>>>> >> >> >>> > Robert LeBlanc
> >>> >>>>>>> >> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>>> >> >> >>> >
> >>> >>>>>>> >> >> >>> >
> >>> >>>>>>> >> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
> >>> >>>>>>> >> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>> >> >> >>> >> Hash: SHA256
> >>> >>>>>>> >> >> >>> >>
> >>> >>>>>>> >> >> >>> >> We were able to only get ~17Gb out of the XL710 (heavily tweaked)
> >>> >>>>>>> >> >> >>> >> until we went to the 4.x kernel where we got ~36Gb (no tweaking). It
> >>> >>>>>>> >> >> >>> >> seems that there were some major reworks in the network handling in
> >>> >>>>>>> >> >> >>> >> the kernel to efficiently handle that network rate. If I remember
> >>> >>>>>>> >> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
> >>> >>>>>>> >> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
> >>> >>>>>>> >> >> >>> >> testing, but we could not tell where the dropping was happening. We
> >>> >>>>>>> >> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
> >>> >>>>>>> >> >> >>> >> trying to congest things. We probably already saw this issue, just
> >>> >>>>>>> >> >> >>> >> didn't know it.
> >>> >>>>>>> >> >> >>> >> - ----------------
> >>> >>>>>>> >> >> >>> >> Robert LeBlanc
> >>> >>>>>>> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>>> >> >> >>> >>
> >>> >>>>>>> >> >> >>> >>
> >>> >>>>>>> >> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> >>> >>>>>>> >> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
> >>> >>>>>>> >> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> >>> >>>>>>> >> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
> >>> >>>>>>> >> >> >>> >>> drivers might cause problems though.
> >>> >>>>>>> >> >> >>> >>>
> >>> >>>>>>> >> >> >>> >>> Here's ifconfig from one of the nodes:
> >>> >>>>>>> >> >> >>> >>>
> >>> >>>>>>> >> >> >>> >>> ens513f1: flags=4163  mtu 1500
> >>> >>>>>>> >> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
> >>> >>>>>>> >> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
> >>> >>>>>>> >> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> >>> >>>>>>> >> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
> >>> >>>>>>> >> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
> >>> >>>>>>> >> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
> >>> >>>>>>> >> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >>> >>>>>>> >> >> >>> >>>
> >>> >>>>>>> >> >> >>> >>> Mark
> >>> >>>>>>> >> >> >>> >>>
> >>> >>>>>>> >> >> >>> >>>
> >>> >>>>>>> >> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >>> >>>>>>> >> >> >>> >>>>
> >>> >>>>>>> >> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>> >> >> >>> >>>> Hash: SHA256
> >>> >>>>>>> >> >> >>> >>>>
> >>> >>>>>>> >> >> >>> >>>> OK, here is the update on the saga...
> >>> >>>>>>> >> >> >>> >>>>
> >>> >>>>>>> >> >> >>> >>>> I traced some more of the blocked I/Os and it seems that communication
> >>> >>>>>>> >> >> >>> >>>> between two hosts was worse than between the others. I did a two-way
> >>> >>>>>>> >> >> >>> >>>> ping flood between the two hosts using the max packet size (1500).
> >>> >>>>>>> >> >> >>> >>>> After 1.5M packets, no lost pings. I then had the ping flood running
> >>> >>>>>>> >> >> >>> >>>> while I put Ceph load on the cluster and the dropped pings started
> >>> >>>>>>> >> >> >>> >>>> increasing; after stopping the Ceph workload the pings stopped dropping.
> >>> >>>>>>> >> >> >>> >>>>
> >>> >>>>>>> >> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
> >>> >>>>>>> >> >> >>> >>>> ruled out Ceph to a large degree. I then booted into the
> >>> >>>>>>> >> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel and an hour into the test so far there
> >>> >>>>>>> >> >> >>> >>>> haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
> >>> >>>>>>> >> >> >>> >>>> need the network enhancements in the 4.x series to work well.
> >>> >>>>>>> >> >> >>> >>>>
> >>> >>>>>>> >> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
> >>> >>>>>>> >> >> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
> >>> >>>>>>> >> >> >>> >>>> with this issue are running 4.x; other than that, they have pretty
> >>> >>>>>>> >> >> >>> >>>> different hardware and network configs.
> >>> >>>>>>> >> >> >>> >>>>
> >>> >>>>>>> >> >> >>> >>>> Thanks,
> >>> >>>>>>> >> >> >>> >>>> ----------------
> >>> >>>>>>> >> >> >>> >>>> Robert LeBlanc
> >>> >>>>>>> >> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>>> >> >> >>> >>>>
> >>> >>>>>>> >> >> >>> >>>>
> >>> >>>>>>> >> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> >>> >>>>>>> >> >> >>> >>>> wrote:
> >>> >>>>>>> >> >> >>> >>>>>
> >>> >>>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>> >> >> >>> >>>>> Hash: SHA256
> >>> >>>>>>> >> >> >>> >>>>>
> >>> >>>>>>> >> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
> >>> >>>>>>> >> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
> >>> >>>>>>> >> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
> >>> >>>>>>> >> >> >>> >>>>> found that an MTU under 32K worked reliably for ping, but we still had the
> >>> >>>>>>> >> >> >>> >>>>> blocked I/O.
> >>> >>>>>>> >> >> >>> >>>>>
> >>> >>>>>>> >> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
> >>> >>>>>>> >> >> >>> >>>>> the blocked I/O.
> >>> >>>>>>> >> >> >>> >>>>> - ----------------
> >>> >>>>>>> >> >> >>> >>>>> Robert LeBlanc
> >>> >>>>>>> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>>> >> >> >>> >>>>>
> >>> >>>>>>> >> >> >>> >>>>>
> >>> >>>>>>> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> >>> >>>>>>> >> >> >>> >>>>>>
> >>> >>>>>>> >> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >>> >>>>>>> >> >> >>> >>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>> I looked at the logs; it looks like there was a 53 second delay
> >>> >>>>>>> >> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
> >>> >>>>>>> >> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> >>> >>>>>>> >> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
> >>> >>>>>>> >> >> >>> >>>>>>> delayed for many 10s of seconds?
> >>> >>>>>>> >> >> >>> >>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>
> >>> >>>>>>> >> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
> >>> >>>>>>> >> >> >>> >>>>>> has
> >>> >>>>>>> >> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
> >>> >>>>>>> >> >> >>> >>>>>>
> >>> >>>>>>> >> >> >>> >>>>>> sage
> >>> >>>>>>> >> >> >>> >>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>> What kernel are you running?
> >>> >>>>>>> >> >> >>> >>>>>>> -Sam
> >>> >>>>>>> >> >> >>> >>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>> >> >> >>> >>>>>>>> Hash: SHA256
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
> >>> >>>>>>> >> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
> >>> >>>>>>> >> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
> >>> >>>>>>> >> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
> >>> >>>>>>> >> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> >>> >>>>>>> >> >> >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
> >>> >>>>>>> >> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
> >>> >>>>>>> >> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
> >>> >>>>>>> >> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
> >>> >>>>>>> >> >> >>> >>>>>>>> transfer).
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
> >>> >>>>>>> >> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
> >>> >>>>>>> >> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
> >>> >>>>>>> >> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
> >>> >>>>>>> >> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
> >>> >>>>>>> >> >> >>> >>>>>>>> thread.
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
> >>> >>>>>>> >> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
> >>> >>>>>>> >> >> >>> >>>>>>>> some help.
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> Single Test started about
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:52:36
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>> >>>>>>> >> >> >>> >>>>>>>> 30.439150 secs
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
> >>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
> >>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,16
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> >>> >>>>>>> >> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> >>> >>>>>>> >> >> >>> >>>>>>>> 30.379680 secs
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> >>> >>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> >>> >>>>>>> >> >> >>> >>>>>>>> 12:55:06.406303:
> >>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
> >>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> >>> >>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> >>> >>>>>>> >> >> >>> >>>>>>>> 12:55:06.318144:
> >>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
> >>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,14
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>> >>>>>>> >> >> >>> >>>>>>>> 30.954212 secs
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
> >>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
> >>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 16,17
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>> >>>>>>> >> >> >>> >>>>>>>> 30.704367 secs
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
> >>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
> >>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
> >>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> Server   IP addr              OSD
> >>> >>>>>>> >> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
> >>> >>>>>>> >> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
> >>> >>>>>>> >> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
> >>> >>>>>>> >> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
> >>> >>>>>>> >> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
> >>> >>>>>>> >> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> fio job:
> >>> >>>>>>> >> >> >>> >>>>>>>> [rbd-test]
> >>> >>>>>>> >> >> >>> >>>>>>>> readwrite=write
> >>> >>>>>>> >> >> >>> >>>>>>>> blocksize=4M
> >>> >>>>>>> >> >> >>> >>>>>>>> #runtime=60
> >>> >>>>>>> >> >> >>> >>>>>>>> name=rbd-test
> >>> >>>>>>> >> >> >>> >>>>>>>> #readwrite=randwrite
> >>> >>>>>>> >> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> >>> >>>>>>> >> >> >>> >>>>>>>> #rwmixread=72
> >>> >>>>>>> >> >> >>> >>>>>>>> #norandommap
> >>> >>>>>>> >> >> >>> >>>>>>>> #size=1T
> >>> >>>>>>> >> >> >>> >>>>>>>> #blocksize=4k
> >>> >>>>>>> >> >> >>> >>>>>>>> ioengine=rbd
> >>> >>>>>>> >> >> >>> >>>>>>>> rbdname=test2
> >>> >>>>>>> >> >> >>> >>>>>>>> pool=rbd
> >>> >>>>>>> >> >> >>> >>>>>>>> clientname=admin
> >>> >>>>>>> >> >> >>> >>>>>>>> iodepth=8
> >>> >>>>>>> >> >> >>> >>>>>>>> #numjobs=4
> >>> >>>>>>> >> >> >>> >>>>>>>> #thread
> >>> >>>>>>> >> >> >>> >>>>>>>> #group_reporting
> >>> >>>>>>> >> >> >>> >>>>>>>> #time_based
> >>> >>>>>>> >> >> >>> >>>>>>>> #direct=1
> >>> >>>>>>> >> >> >>> >>>>>>>> #ramp_time=60
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> Thanks,
> >>> >>>>>>> >> >> >>> >>>>>>>> ----------------
> >>> >>>>>>> >> >> >>> >>>>>>>> Robert LeBlanc
> >>> >>>>>>> >> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
> >>> >>>>>>> >> >> >>> >>>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> >>>>>>> >> >> >>> >>>>>>>>>> Hash: SHA256
> >>> >>>>>>> >> >> >>> >>>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
> >>> >>>>>>> >> >> >>> >>>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>>> I'm not
> >>> >>>>>>> >> >> >>> >>>>>>>>>> seeing much I/O or CPU usage during these times. Is there some way to
> >>> >>>>>>> >> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
> >>> >>>>>>> >> >> >>> >>>>>>>>> this, it was discussed not too long ago.
> >>> >>>>>>> >> >> >>> >>>>>>>>>
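> >>> >>>>>>> >> >> >>> >>>>>>>>> As a sketch only (the exact values here are just an example), raising
> >>> >>>>>>> >> >> >>> >>>>>>>>> both thresholds in ceph.conf looks something like:
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>> [osd]
> >>> >>>>>>> >> >> >>> >>>>>>>>>     filestore merge threshold = 40
> >>> >>>>>>> >> >> >>> >>>>>>>>>     filestore split multiple = 8
> >>> >>>>>>> >> >> >>> >>>>>>>>>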
> >>> >>>>>>> >> >> >>> >>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
> >>> >>>>>>> >> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
> >>> >>>>>>> >> >> >>> >>>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>>> The fio test is just a seq write; starting it over (rewriting from the
> >>> >>>>>>> >> >> >>> >>>>>>>>>> beginning) is still causing the issue. I suspected that it is not
> >>> >>>>>>> >> >> >>> >>>>>>>>>> having to create new files and therefore split collections. This is on
> >>> >>>>>>> >> >> >>> >>>>>>>>>> my test cluster with no other load.
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
> >>> >>>>>>> >> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
> >>> >>>>>>> >> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
> >>> >>>>>>> >> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>> If you want to go log diving, "debug osd = 20", "debug filestore = 20",
> >>> >>>>>>> >> >> >>> >>>>>>>>> and "debug ms = 1" are what the OSD guys like to see. That should spit
> >>> >>>>>>> >> >> >>> >>>>>>>>> out everything you need to track exactly what each Op is doing.
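> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>> In ceph.conf form that is simply:
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>> [osd]
> >>> >>>>>>> >> >> >>> >>>>>>>>>     debug osd = 20
> >>> >>>>>>> >> >> >>> >>>>>>>>>     debug filestore = 20
> >>> >>>>>>> >> >> >>> >>>>>>>>>     debug ms = 1
> >>> >>>>>>> >> >> >>> >>>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>>> or, to avoid restarting, something like
> >>> >>>>>>> >> >> >>> >>>>>>>>> "ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20
> >>> >>>>>>> >> >> >>> >>>>>>>>> --debug-ms 1'" should do it on the running daemons.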
> >>> >>>>>>> >> >> >>> >>>>>>>>> -Greg
> >>> >>>>>>> >> >> >>> >>>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>>>
> >>> >>>>>>> >> >> >>> >>>>>
> >>> >>>>>>> >> >> >>> >>>>
> >>> >>>>>>> >> >> >>> >>>>
> >>> >>>>>>> >> >> >>> >>>
> >>> >>>>>>> >> >> >>> >>
> >>> >>>>>>> >> >> >>> _______________________________________________
> >>> >>>>>>> >> >> >>> ceph-users mailing list
> >>> >>>>>>> >> >> >>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>> >>>>>>> >> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> >>>
> >>> >>>>>>> >> >> _______________________________________________
> >>> >>>>>>> >> >> ceph-users mailing list
> >>> >>>>>>> >> >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>> >>>>>>> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> >>>>>>> >> >>
> >>> >>>>>>> >> >>
> >>> >>>>>>> >> --
> >>> >>>>>>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> >>>>>>> >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>> >>>>>>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> >>>>>>> >>
> >>> >>>>>>> >>
> >>> >>>>>>>
> >>> >>>>>>> -----BEGIN PGP SIGNATURE-----
> >>> >>>>>>> Version: Mailvelope v1.2.0
> >>> >>>>>>> Comment: https://www.mailvelope.com
> >>> >>>>>>>
> >>> >>>>>>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
> >>> >>>>>>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
> >>> >>>>>>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
> >>> >>>>>>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
> >>> >>>>>>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
> >>> >>>>>>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
> >>> >>>>>>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
> >>> >>>>>>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
> >>> >>>>>>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
> >>> >>>>>>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
> >>> >>>>>>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
> >>> >>>>>>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
> >>> >>>>>>> 6Kfk
> >>> >>>>>>> =/gR6
> >>> >>>>>>> -----END PGP SIGNATURE-----
> >>> >>>>>>> --
> >>> >>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> >>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >>> >>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>
> >>> >>>>> -----BEGIN PGP SIGNATURE-----
> >>> >>>>> Version: Mailvelope v1.2.0
> >>> >>>>> Comment: https://www.mailvelope.com
> >>> >>>>>
> >>> >>>>> wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
> >>> >>>>> x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
> >>> >>>>> Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
> >>> >>>>> SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
> >>> >>>>> QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
> >>> >>>>> mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
> >>> >>>>> 7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
> >>> >>>>> QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
> >>> >>>>> MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
> >>> >>>>> rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
> >>> >>>>> gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
> >>> >>>>> RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
> >>> >>>>> JFPi
> >>> >>>>> =ofgq
> >>> >>>>> -----END PGP SIGNATURE-----
> >>> >>>>
> >>> >>>> -----BEGIN PGP SIGNATURE-----
> >>> >>>> Version: Mailvelope v1.2.0
> >>> >>>> Comment: https://www.mailvelope.com
> >>> >>>>
> >>> >>>> wsFcBAEBCAAQBQJWFDDOCRDmVDuy+mK58QAA0kUP/1rfRQa5Us9b/VCvKrhk
> >>> >>>> BYrde1/FBybKBVXsuXVU8Dq124A1e4L682AhmQPUeVP8PQLoqS/VFSl0h7i6
> >>> >>>> 28AzydDaBTTjnrp6ZzVbtmKtm8WhmtSTFvWTlu/yJmRXAht9YozmFCByBfIY
> >>> >>>> GYvOhZzjvbxBKfwnwq97QkS7xfY2tss/BmaOvSVTX7naYaOF+HRwZMSt+BF4
> >>> >>>> 9vg9BLSL3Aic0BnvdM64TWkDaHp/3gwGSmyMn8Q2Sa9CqUTddKQx2HXN6doo
> >>> >>>> gIyxCj+dIw2Pt73u2NoiYv8ZhTuS3QYM4n0rRBxj8Wr/EeNwGAOwdDSgbOxf
> >>> >>>> OvDyozzmCpQyW3h/nkdQJW5mWsJmyDIiGxHDdUn7Vgemg+Bbod0ACdoJiwct
> >>> >>>> /BIRVQe2Ee1nZQFoKBOhvaWO6+ePJR7CVfLjMkZBTzKZBjt2tfkq17G5KTdS
> >>> >>>> EsehvG/+vfFJkANL5Xh6eo9ptlHbFW8I/44pvUtGi2JwsN487l56XR9DqEKM
> >>> >>>> 7Cmj9Ox205YxjqcBjhWIJQTok99lvrhDX9d7HHxIeTcmouvqPz4LTcCySRtC
> >>> >>>> xE/GcEGAAYWGPTwf9u8ULm9Rh2Z90OnKpqtCtuuWiwRRL9VU/tLlvqmHvEZM
> >>> >>>> 73qhiLQZka5I72B2SAEtJnDt2sX3NJ4unvH4zWKLRFTTm4M0qk6xUL1JfqNz
> >>> >>>> JYNo
> >>> >>>> =msX2
> >>> >>>> -----END PGP SIGNATURE-----
> >>> >>
> >>> >> -----BEGIN PGP SIGNATURE-----
> >>> >> Version: Mailvelope v1.2.0
> >>> >> Comment: https://www.mailvelope.com
> >>> >>
> >>> >> wsFcBAEBCAAQBQJWFXGPCRDmVDuy+mK58QAAx38P/1sn6TA8hH+F2kd1A2Pq
> >>> >> IU2cg1pFcH+kw21G8VO+BavfBaBoSETHEEuMXg5SszTIcL/HyziBLJos0C0j
> >>> >> Vu9I0/YtblQ15enzFqKFPosdc7qij9DPJxXRkx41sJZsxvSVky+URcPpcKk6
> >>> >> w8Lwuq9IupesQ19ZeJkCEWFVhKz/i2E9/VXfylBgFVlkICD+5pfx6/Aq7nCP
> >>> >> 4gboyha07zpPlDqoA7xgT+6v2zlYC80saGcA1m2XaAUdPF/17l6Mq9+Glv7E
> >>> >> 3KeUf7jmMTJQRGBZSInFgUpPwUQKvF5OSGb3YQlzofUy5Es+wH3ccqZ+mlIY
> >>> >> szuBLAtN6zhFFPCs6016hiragiUhLk97PItXaKdDJKecuyRdShlJrXJmtX+j
> >>> >> NdM14TkBPTiLtAd/IZEEhIIpdvQH8YSl3LnEZ5gywggaY4Pk3JLFIJPgLpEb
> >>> >> T8hJnuiaQaYxERQ0nRoBL4LAXARseSrOuVt2EAD50Yb/5JEwB9FQlN758rb1
> >>> >> AE/xhpK6d53+RlkPODKxXx816hXvDP6NADaC78XGmx+A4FfepdxBijGBsmOQ
> >>> >> 7SxAZe469K0E6EAfClc664VzwuvBEZjwTg1eK5Z6VS/FDTH/RxTKeFhlbUIT
> >>> >> XpezlP7XZ1/YRrJ/Eg7nb1Dv0MYQdu18tQ6QBv+C1ZsmxYLlHlcf6BZ3gNar
> >>> >> rZW5
> >>> >> =dKn9
> >>> >> -----END PGP SIGNATURE-----
> >>>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> > --
> > Best Regards,
> >
> > Wheat
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                                                   ` <alpine.DEB.2.00.1510140955240.6589-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2015-10-14 17:58                                                                                                                     ` Robert LeBlanc
  0 siblings, 0 replies; 45+ messages in thread
From: Robert LeBlanc @ 2015-10-14 17:58 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I'm sure I have a log of a 1,000-second block somewhere; I'll have to
look around for it.

I'll try turning that knob and see what happens. I'll come back with
the results.
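
Just so I'm sure I understand the behavior before I change anything, here is
the queueing model as I read it. This is a rough illustrative C++ sketch with
made-up names, not the real queue code, and the token-bucket sharing on the
fair side is left out:

#include <cstdint>
#include <deque>
#include <map>

struct Op {
  unsigned prio;   // in these logs: client ops 63, osd_repop 127, repop replies 196
};

// Ops above the cutoff are dequeued in strict priority order; ops at or
// below it are meant to share bandwidth fairly (token bucket omitted here).
class CutoffQueueSketch {
  unsigned cutoff_;                             // effectively 63 today
  std::map<unsigned, std::deque<Op>> strict_;   // prio > cutoff
  std::map<unsigned, std::deque<Op>> fair_;     // prio <= cutoff
public:
  explicit CutoffQueueSketch(unsigned cutoff) : cutoff_(cutoff) {}

  void enqueue(const Op& op) {
    (op.prio > cutoff_ ? strict_ : fair_)[op.prio].push_back(op);
  }

  bool dequeue(Op* out) {
    // A steady stream of 127/196 traffic keeps winning here, so a prio-63
    // client op can sit in fair_ for the 30+ seconds we are seeing.
    for (auto q = strict_.rbegin(); q != strict_.rend(); ++q)
      if (!q->second.empty()) { *out = q->second.front(); q->second.pop_front(); return true; }
    for (auto q = fair_.rbegin(); q != fair_.rend(); ++q)
      if (!q->second.empty()) { *out = q->second.front(); q->second.pop_front(); return true; }
    return false;
  }
};

If the cutoff is raised above 127 (or 196), the repops and their replies land
on the fair side too and share time with client ops instead of always
preempting them, which is how I read the one-line change you pointed at.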

Thanks,

- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Oct 14, 2015 at 11:08 AM, Sage Weil  wrote:
> On Wed, 14 Oct 2015, Robert LeBlanc wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> It seems in our situation the cluster is just busy, usually with
>> really small RBD I/O. We have gotten things to where it doesn't happen
>> as much in a steady state, but when we have an OSD fail (mostly from
>> an XFS log bug we hit at least once a week), it is very painful as the
>> OSD exits and enters the cluster. We are working to split the PGs a
>> couple of fold, but this is a painful process for the reasons
>> mentioned in the tracker. Matt Benjamin and Sam Just had a discussion
>> on IRC about getting the other primaries to throttle back when such a
>> situation occurs so that each primary OSD has some time to service
>> client I/O and to push back on the clients to slow down in these
>> situations.
>>
>> In our case a single OSD can lock up a VM for a very long time while
>> others are happily going about their business. Instead of looking like
>> the cluster is out of I/O, it looks like there is an error. If
>> pressure is pushed back to clients, it would show up as all of the
>> clients slowing down a little instead of one or two just hanging for
>> even over 1,000 seconds.
>
> This 1000 seconds figure is very troubling.  Do you have logs?  I suspect
> this is a different issue than the prioritization one in the log from the
> other day (which only waited about 30s for higher-priority replica
> requests).
>
>> My thought is that each OSD should have some percentage of time given
>> to servicing client I/O, whereas now it seems that replica I/O can
>> completely starve client I/O. I understand why replica traffic needs a
>> higher priority, but I think some balance needs to be attained.
>
> We currently do 'fair' prioritized queueing with a token bucket filter
> only for requests with priorities <= 63.  Simply increasing this threshold
> so that it covers replica requests might be enough.  But... we'll be
> starting client requests locally at the expense of in-progress client
> writes elsewhere.  Given that the amount of (our) client-related work we
> do is always bounded by the msgr throttle, I think this is okay since we
> only make the situation worse by a fixed factor.  (We still don't address
> the possibility that we are a replica for every other OSD in the system and
> could be flooded by N*(max client ops per osd).)
>
> It's this line:
>
>         https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L8334
>
> sage
>
>
>
>>
>> Thanks,
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Wed, Oct 14, 2015 at 12:00 AM, Haomai Wang  wrote:
>> > On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil  wrote:
>> >> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> Hash: SHA256
>> >>>
>> >>> After a weekend, I'm ready to hit this from a different direction.
>> >>>
>> >>> I replicated the issue with Firefly so it doesn't seem an issue that
>> >>> has been introduced or resolved in any nearby version. I think overall
>> >>> we may be seeing [1] to a great degree. From what I can extract from
>> >>> the logs, it looks like in situations where OSDs are going up and
>> >>> down, I see I/O blocked at the primary OSD waiting for peering and/or
>> >>> the PG to become clean before dispatching the I/O to the replicas.
>> >>>
>> >>> In an effort to understand the flow of the logs, I've attached a small
>> >>> 2 minute segment of a log from which I've extracted what I believe to be
>> >>> the important entries in the life cycle of an I/O, along with my
>> >>> understanding. If someone would be kind enough to help my
>> >>> understanding, I would appreciate it.
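>> >>>
>> >>> For context, the client side of each op below is roughly the following.
>> >>> librbd issues full-object writes through librados, so this is only an
>> >>> illustration: the pool name is a stand-in, error handling is dropped, and
>> >>> the set-alloc-hint that librbd prepends is not shown.
>> >>>
>> >>> #include <rados/librados.hpp>
>> >>> #include <string>
>> >>>
>> >>> int main() {
>> >>>   librados::Rados cluster;
>> >>>   cluster.init("admin");             // client.admin, like the fio rbd jobs
>> >>>   cluster.conf_read_file(nullptr);   // default ceph.conf search path
>> >>>   cluster.connect();
>> >>>
>> >>>   librados::IoCtx io;
>> >>>   cluster.ioctx_create("rbd", io);   // stand-in for pool 0 in these logs
>> >>>
>> >>>   librados::bufferlist bl;
>> >>>   bl.append(std::string(4194304, 'x'));                // 4 MiB payload
>> >>>   io.write("rbd_data.103c74b0dc51.000000000000003a",   // -> "write 0~4194304"
>> >>>            bl, bl.length(), 0);
>> >>>
>> >>>   io.close();
>> >>>   cluster.shutdown();
>> >>>   return 0;
>> >>> }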
>> >>>
>> >>> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
>> >>> >> 192.168.55.12:0/2013622 pipe(0x26c90000 sd=47 :6800 s=2 pgs=2 cs=1
>> >>> l=1 c=0x32c85440).reader got message 19 0x2af81700
>> >>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>> >>>
>> >>> - ->Messenger has received the message from the client (previous
>> >>> entries in the 7fb9d2c68700 thread are the individual segments that
>> >>> make up this message).
>> >>>
>> >>> 2015-10-12 14:12:36.537963 7fb9d2c68700  1 -- 192.168.55.16:6800/11295
>> >>> <== client.6709 192.168.55.12:0/2013622 19 ====
>> >>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>> >>> ==== 235+0+4194304 (2317308138 0 2001296353) 0x2af81700 con 0x32c85440
>> >>>
>> >>> - ->OSD process acknowledges that it has received the write.
>> >>>
>> >>> 2015-10-12 14:12:36.538096 7fb9d2c68700 15 osd.4 44 enqueue_op
>> >>> 0x3052b300 prio 63 cost 4194304 latency 0.012371
>> >>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>> >>>
>> >>> - ->Not sure exactly what is going on here; the op is being enqueued somewhere.
>> >>>
>> >>> 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op
>> >>> 0x3052b300 prio 63 cost 4194304 latency 30.017094
>> >>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v
>> >>> 5 pg pg[0.29( v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c
>> >>> 40/44 32/32/10) [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702
>> >>> active+clean]
>> >>>
>> >>> - ->The op is dequeued from this mystery queue 30 seconds later in a
>> >>> different thread.
>> >>
>> >> ^^ This is the problem.  Everything after this looks reasonable.  Looking
>> >> at the other dequeue_op calls over this period, it looks like we're just
>> >> overwhelmed with higher priority requests.  New clients are 63, while
>> >> osd_repop (replicated write from another primary) are 127 and replies from
>> >> our own replicated ops are 196.  We do process a few other prio 63 items,
>> >> but you'll see that their latency is also climbing up to 30s over this
>> >> period.
>> >>
>> >> The question is why we suddenly get a lot of them.. maybe the peering on
>> >> other OSDs just completed so we get a bunch of these?  It's also not clear
>> >> to me what makes osd.4 or this op special.  We expect a mix of primary and
>> >> replica ops on all the OSDs, so why would we suddenly have more of them
>> >> here....
>> >
>> > I guess the bug tracker(http://tracker.ceph.com/issues/13482) is
>> > related to this thread.
>> >
>> > So does this mean there is a livelock between client ops and repops?
>> > We let all clients issue too many client ops, which causes some OSDs to
>> > become a bottleneck, while other OSDs may be idle enough to accept more
>> > client ops. Eventually, all OSDs end up stuck behind the bottlenecked OSD.
>> > That seems reasonable, but why would it last so long?
>> >
>> >>
>> >> sage
>> >>
>> >>
>> >>>
>> >>> 2015-10-12 14:13:06.542912 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>> >>> do_op osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>> >>> may_write -> write-ordered flags ack+ondisk+write+known_if_redirected
>> >>>
>> >>> - ->Not sure what this message is. Look up of secondary OSDs?
>> >>>
>> >>> 2015-10-12 14:13:06.544999 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>> >>> new_repop rep_tid 17815 on osd_op(client.6709.0:67
>> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> >>> ack+ondisk+write+known_if_redirected e44) v5
>> >>>
>> >>> - ->Dispatch write to secondary OSDs?
>> >>>
>> >>> 2015-10-12 14:13:06.545116 7fb9e2d3a700  1 -- 192.168.55.16:6801/11295
>> >>> --> 192.168.55.15:6801/32036 -- osd_repop(client.6709.0:67 0.29
>> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>> >>> -- ?+4195078 0x238fd600 con 0x32bcb5a0
>> >>>
>> >>> - ->OSD dispatch write to OSD.0.
>> >>>
>> >>> 2015-10-12 14:13:06.545132 7fb9e2d3a700 20 -- 192.168.55.16:6801/11295
>> >>> submit_message osd_repop(client.6709.0:67 0.29
>> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>> >>> remote, 192.168.55.15:6801/32036, have pipe.
>> >>>
>> >>> - ->Message sent to OSD.0.
>> >>>
>> >>> 2015-10-12 14:13:06.545195 7fb9e2d3a700  1 -- 192.168.55.16:6801/11295
>> >>> --> 192.168.55.11:6801/13185 -- osd_repop(client.6709.0:67 0.29
>> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>> >>> -- ?+4195078 0x16edd200 con 0x3a37b20
>> >>>
>> >>> - ->OSD dispatch write to OSD.5.
>> >>>
>> >>> 2015-10-12 14:13:06.545210 7fb9e2d3a700 20 -- 192.168.55.16:6801/11295
>> >>> submit_message osd_repop(client.6709.0:67 0.29
>> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>> >>> remote, 192.168.55.11:6801/13185, have pipe.
>> >>>
>> >>> - ->Message sent to OSD.5.
>> >>>
>> >>> 2015-10-12 14:13:06.545229 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>> >>> append_log log((0'0,44'703], crt=44'700) [44'704 (44'691) modify
>> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
>> >>> client.6709.0:67 2015-10-12 14:12:34.340082]
>> >>> 2015-10-12 14:13:06.545268 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'700 lcod 44'702 mlcod
>> >>> 44'702 active+clean] add_log_entry 44'704 (44'691) modify
>> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
>> >>> client.6709.0:67 2015-10-12 14:12:34.340082
>> >>>
>> >>> - ->These record the OP in the journal log?
>> >>>
>> >>> 2015-10-12 14:13:06.563241 7fb9d326e700 20 -- 192.168.55.16:6801/11295
>> >>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
>> >>> cs=3 l=0 c=0x3a37b20).writer encoding 17337 features 37154696925806591
>> >>> 0x16edd200 osd_repop(client.6709.0:67 0.29
>> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>> >>>
>> >>> - ->Writing the data to OSD.5?
>> >>>
>> >>> 2015-10-12 14:13:06.573938 7fb9d3874700 10 -- 192.168.55.16:6801/11295
>> >>> >> 192.168.55.15:6801/32036 pipe(0x3f96000 sd=176 :6801 s=2 pgs=8 cs=3
>> >>> l=0 c=0x32bcb5a0).reader got ack seq 1206 >= 1206 on 0x238fd600
>> >>> osd_repop(client.6709.0:67 0.29
>> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>> >>>
>> >>> - ->Messenger gets ACK from OSD.0 that it received the last packet?
>> >>>
>> >>> 2015-10-12 14:13:06.613425 7fb9d3874700 10 -- 192.168.55.16:6801/11295
>> >>> >> 192.168.55.15:6801/32036 pipe(0x3f96000 sd=176 :6801 s=2 pgs=8 cs=3
>> >>> l=0 c=0x32bcb5a0).reader got message 1146 0x3ffa480
>> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>> >>>
>> >>> - ->Messenger receives ack on disk from OSD.0.
>> >>>
>> >>> 2015-10-12 14:13:06.613447 7fb9d3874700  1 -- 192.168.55.16:6801/11295
>> >>> <== osd.0 192.168.55.15:6801/32036 1146 ====
>> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 ====
>> >>> 83+0+0 (2772408781 0 0) 0x3ffa480 con 0x32bcb5a0
>> >>>
>> >>> - ->OSD process gets on disk ACK from OSD.0.
>> >>>
>> >>> 2015-10-12 14:13:06.613478 7fb9d3874700 10 osd.4 44 handle_replica_op
>> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 epoch 44
>> >>>
>> >>> - ->Primary OSD records the ACK (duplicate message?). Not sure how to
>> >>> correlate that to the previous message other than by time.
>> >>>
>> >>> 2015-10-12 14:13:06.613504 7fb9d3874700 15 osd.4 44 enqueue_op
>> >>> 0x120f9b00 prio 196 cost 0 latency 0.000250
>> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>> >>>
>> >>> - ->The reply is enqueued onto a mystery queue.
>> >>>
>> >>> 2015-10-12 14:13:06.627793 7fb9d6afd700 10 -- 192.168.55.16:6801/11295
>> >>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
>> >>> cs=3 l=0 c=0x3a37b20).reader got ack seq 17337 >= 17337 on 0x16edd200
>> >>> osd_repop(client.6709.0:67 0.29
>> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 v 44'704) v1
>> >>>
>> >>> - ->Messenger gets ACK from OSD.5 that it received the last packet?
>> >>>
>> >>> 2015-10-12 14:13:06.628364 7fb9d6afd700 10 -- 192.168.55.16:6801/11295
>> >>> >> 192.168.55.11:6801/13185 pipe(0x2d355000 sd=98 :6801 s=2 pgs=12
>> >>> cs=3 l=0 c=0x3a37b20).reader got message 16477 0x21cef3c0
>> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>> >>>
>> >>> - ->Messenger receives ack on disk from OSD.5.
>> >>>
>> >>> 2015-10-12 14:13:06.628382 7fb9d6afd700  1 -- 192.168.55.16:6801/11295
>> >>> <== osd.5 192.168.55.11:6801/13185 16477 ====
>> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 ====
>> >>> 83+0+0 (2104182993 0 0) 0x21cef3c0 con 0x3a37b20
>> >>>
>> >>> - ->OSD process gets on disk ACK from OSD.5.
>> >>>
>> >>> 2015-10-12 14:13:06.628406 7fb9d6afd700 10 osd.4 44 handle_replica_op
>> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 epoch 44
>> >>>
>> >>> - ->Primary OSD records the ACK (duplicate message?). Not sure how to
>> >>> correlate that to the previous message other than by time.
>> >>>
>> >>> 2015-10-12 14:13:06.628426 7fb9d6afd700 15 osd.4 44 enqueue_op
>> >>> 0x3e41600 prio 196 cost 0 latency 0.000180
>> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1
>> >>>
>> >>> - ->The reply is enqueued onto a mystery queue.
>> >>>
>> >>> 2015-10-12 14:13:07.124206 7fb9f4e9f700  0 log_channel(cluster) log
>> >>> [WRN] : slow request 30.598371 seconds old, received at 2015-10-12
>> >>> 14:12:36.525724: osd_op(client.6709.0:67
>> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> >>> ack+ondisk+write+known_if_redirected e44) currently waiting for subops
>> >>> from 0,5
>> >>>
>> >>> - ->OP has not been dequeued to the client from the mystery queue yet.
>> >>>
>> >>> 2015-10-12 14:13:07.278449 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod 44'702 mlcod
>> >>> 44'702 active+clean] eval_repop repgather(0x37ea3cc0 44'704
>> >>> rep_tid=17815 committed?=0 applied?=0 lock=0
>> >>> op=osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5)
>> >>> wants=ad
>> >>>
>> >>> - ->Not sure what this means. The OP has been completed on all replicas?
>> >>>
>> >>> 2015-10-12 14:13:07.278566 7fb9e0535700 10 osd.4 44 dequeue_op
>> >>> 0x120f9b00 prio 196 cost 0 latency 0.665312
>> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 pg
>> >>> pg[0.29( v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44
>> >>> 32/32/10) [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod
>> >>> 44'702 mlcod 44'702 active+clean]
>> >>>
>> >>> - ->One of the replica OPs is dequeued in a different thread
>> >>>
>> >>> 2015-10-12 14:13:07.278809 7fb9e0535700 10 osd.4 44 dequeue_op
>> >>> 0x3e41600 prio 196 cost 0 latency 0.650563
>> >>> osd_repop_reply(client.6709.0:67 0.29 ondisk, result = 0) v1 pg
>> >>> pg[0.29( v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44
>> >>> 32/32/10) [4,5,0] r=0 lpr=32 luod=44'703 lua=44'703 crt=44'702 lcod
>> >>> 44'702 mlcod 44'702 active+clean]
>> >>>
>> >>> - ->The other replica OP is dequeued in the new thread
>> >>>
>> >>> 2015-10-12 14:13:07.967469 7fb9efe95700 10 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
>> >>> active+clean] eval_repop repgather(0x37ea3cc0 44'704 rep_tid=17815
>> >>> committed?=1 applied?=0 lock=0 op=osd_op(client.6709.0:67
>> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> >>> ack+ondisk+write+known_if_redirected e44) v5) wants=ad
>> >>>
>> >>> - ->Not sure what this does. A thread that joins the replica OPs with
>> >>> the primary OP?
>> >>>
>> >>> 2015-10-12 14:13:07.967515 7fb9efe95700 15 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
>> >>> active+clean] log_op_stats osd_op(client.6709.0:67
>> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> >>> ack+ondisk+write+known_if_redirected e44) v5 inb 4194304 outb 0 rlat
>> >>> 0.000000 lat 31.441789
>> >>>
>> >>> - ->Logs that the write has been committed to all replicas in the
>> >>> primary journal?
>> >>>
>> >>> Not sure what the rest of these do, nor do I understand where the
>> >>> client gets an ACK that the write is committed.
>> >>>
>> >>> 2015-10-12 14:13:07.967583 7fb9efe95700 10 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 lua=44'703 crt=44'702 lcod 44'703 mlcod 44'702
>> >>> active+clean]  sending commit on repgather(0x37ea3cc0 44'704
>> >>> rep_tid=17815 committed?=1 applied?=0 lock=0
>> >>> op=osd_op(client.6709.0:67 rbd_data.103c74b0dc51.000000000000003a
>> >>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5)
>> >>> 0x3a2f0840
>> >>>
>> >>> 2015-10-12 14:13:10.351452 7fb9f0696700 10 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'702 active+clean]
>> >>> eval_repop repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>> >>> applied?=1 lock=0 op=osd_op(client.6709.0:67
>> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> >>> ack+ondisk+write+known_if_redirected e44) v5) wants=ad
>> >>>
>> >>> 2015-10-12 14:13:10.354089 7fb9f0696700 10 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
>> >>> removing repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>> >>> applied?=1 lock=0 op=osd_op(client.6709.0:67
>> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> >>> ack+ondisk+write+known_if_redirected e44) v5)
>> >>>
>> >>> 2015-10-12 14:13:10.354163 7fb9f0696700 20 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
>> >>>  q front is repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>> >>> applied?=1 lock=0 op=osd_op(client.6709.0:67
>> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> >>> ack+ondisk+write+known_if_redirected e44) v5)
>> >>>
>> >>> 2015-10-12 14:13:10.354199 7fb9f0696700 20 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'704 (0'0,44'704] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 crt=44'702 lcod 44'703 mlcod 44'703 active+clean]
>> >>> remove_repop repgather(0x37ea3cc0 44'704 rep_tid=17815 committed?=1
>> >>> applied?=1 lock=0 op=osd_op(client.6709.0:67
>> >>> rbd_data.103c74b0dc51.000000000000003a [set-alloc-hint object_size
>> >>> 4194304 write_size 4194304,write 0~4194304] 0.474a01a9
>> >>> ack+ondisk+write+known_if_redirected e44) v5)
>> >>>
>> >>> 2015-10-12 14:13:15.488448 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> >>> v 44'707 (0'0,44'707] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> >>> [4,5,0] r=0 lpr=32 luod=44'705 lua=44'705 crt=44'704 lcod 44'704 mlcod
>> >>> 44'704 active+clean] append_log: trimming to 44'704 entries 44'704
>> >>> (44'691) modify
>> >>> 474a01a9/rbd_data.103c74b0dc51.000000000000003a/head//0 by
>> >>> client.6709.0:67 2015-10-12 14:12:34.340082
>> >>>
>> >>> Thanks for hanging in there with me on this...
>> >>>
>> >>> [1] http://www.spinics.net/lists/ceph-devel/msg26633.html
>> >>> ----------------
>> >>> Robert LeBlanc
>> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>>
>> >>>
>> >>> On Thu, Oct 8, 2015 at 11:44 PM, Robert LeBlanc  wrote:
>> >>> > -----BEGIN PGP SIGNED MESSAGE-----
>> >>> > Hash: SHA256
>> >>> >
>> >>> > Sage,
>> >>> >
>> >>> > After trying to bisect this issue (all tests moved the bisect towards
>> >>> > Infernalis) and eventually testing the Infernalis branch again, it
>> >>> > looks like the problem still exists although it is handled a tad
>> >>> > better in Infernalis. I'm going to test against Firefly/Giant next
>> >>> > week and then try and dive into the code to see if I can expose any
>> >>> > thing.
>> >>> >
>> >>> > If I can do anything to provide you with information, please let me know.
>> >>> >
>> >>> > Thanks,
>> >>> > ----------------
>> >>> > Robert LeBlanc
>> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >
>> >>> >
>> >>> > On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc  wrote:
>> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >> Hash: SHA256
>> >>> >>
>> >>> >> We forgot to upload the ceph.log yesterday. It is there now.
>> >>> >> - ----------------
>> >>> >> Robert LeBlanc
>> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>
>> >>> >>
>> >>> >> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>> >>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>> Hash: SHA256
>> >>> >>>
>> >>> >>> I upped the debug on about everything and ran the test for about 40
>> >>> >>> minutes. I took OSD.19 on ceph1 down and then brought it back in.
>> >>> >>> There was at least one op on osd.19 that was blocked for over 1,000
>> >>> >>> seconds. Hopefully this will have something that will cast a light on
>> >>> >>> what is going on.
>> >>> >>>
>> >>> >>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>> >>> >>> the test to verify the results from the dev cluster. This cluster
>> >>> >>> matches the hardware of our production cluster but is not yet in
>> >>> >>> production so we can safely wipe it to downgrade back to Hammer.
>> >>> >>>
>> >>> >>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>> >>> >>>
>> >>> >>> Let me know what else we can do to help.
>> >>> >>>
>> >>> >>> Thanks,
>> >>> >>> ----------------
>> >>> >>> Robert LeBlanc
>> >>> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>
>> >>> >>>
>> >>> >>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>> Hash: SHA256
>> >>> >>>>
>> >>> >>>> On my second test (a much longer one), it took nearly an hour, but a
>> >>> >>>> few messages have popped up over a 20 minute window. Still far less than I
>> >>> >>>> have been seeing.
>> >>> >>>> - ----------------
>> >>> >>>> Robert LeBlanc
>> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>> Hash: SHA256
>> >>> >>>>>
>> >>> >>>>> I'll capture another set of logs. Is there any other debugging you
>> >>> >>>>> want turned up? I've seen the same thing where I see the message
>> >>> >>>>> dispatched to the secondary OSD, but the message just doesn't show up
>> >>> >>>>> for 30+ seconds in the secondary OSD logs.
>> >>> >>>>> - ----------------
>> >>> >>>>> Robert LeBlanc
>> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>> >>> >>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> >>> >>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>> Hash: SHA256
>> >>> >>>>>>>
>> >>> >>>>>>> I can't think of anything. In my dev cluster the only thing that has
>> >>> >>>>>>> changed is the Ceph versions (no reboot). What I like is even though
>> >>> >>>>>>> the disks are 100% utilized, it is performing as I expect now. Client
>> >>> >>>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>> >>> >>>>>>> the OSD boots or during the recovery period. This is with
>> >>> >>>>>>> max_backfills set to 20, one backfill max in our production cluster is
>> >>> >>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>> >>> >>>>>>> our dev cluster very easily and very quickly with these settings. So
>> >>> >>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>> >>> >>>>>>> marked out. We would love to see that go away too, but this is far
>> >>> >>>>>>                                             (me too!)
>> >>> >>>>>>> better than what we have now. This dev cluster also has
>> >>> >>>>>>> osd_client_message_cap set to default (100).
>> >>> >>>>>>>
>> >>> >>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>> >>> >>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>> >>> >>>>>>> would you prefer a bisect to find the introduction of the problem
>> >>> >>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>> >>> >>>>>>> (Hammer -> Infernalis)? Do you have some hints to reduce hitting a
>> >>> >>>>>>> commit that prevents a clean build as that is my most limiting factor?
>> >>> >>>>>>
>> >>> >>>>>> Nothing comes to mind.  I think the best way to find this is still to see
>> >>> >>>>>> it happen in the logs with hammer.  The frustrating thing with that log
>> >>> >>>>>> dump you sent is that although I see plenty of slow request warnings in
>> >>> >>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>> >>> >>>>>> turned up for long enough?
>> >>> >>>>>>
>> >>> >>>>>> sage
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>>> Thanks,
>> >>> >>>>>>> - ----------------
>> >>> >>>>>>> Robert LeBlanc
>> >>> >>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>> >>> >>>>>>> > On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>> >>> >>>>>>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>> >> Hash: SHA256
>> >>> >>>>>>> >>
>> >>> >>>>>>> >> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>> >>> >>>>>>> >> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>> >>> >>>>>>> >> messages when the OSD was marked out:
>> >>> >>>>>>> >>
>> >>> >>>>>>> >> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>> >>> >>>>>>> >> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>> >>> >>>>>>> >> 34.476006 secs
>> >>> >>>>>>> >> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>> >>> >>>>>>> >> cluster [WRN] slow request 32.913474 seconds old, received at
>> >>> >>>>>>> >> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>> >>> >>>>>>> >> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>> >>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >>> >>>>>>> >> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>> >>> >>>>>>> >> cluster [WRN] slow request 32.697545 seconds old, received at
>> >>> >>>>>>> >> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>> >>> >>>>>>> >> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>> >>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >>> >>>>>>> >> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>> >>> >>>>>>> >> cluster [WRN] slow request 32.668006 seconds old, received at
>> >>> >>>>>>> >> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>> >>> >>>>>>> >> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>> >>> >>>>>>> >> ack+read+known_if_redirected e58744) currently waiting for peered
>> >>> >>>>>>> >>
>> >>> >>>>>>> >> But I'm not seeing the blocked messages when the OSD came back in. The
>> >>> >>>>>>> >> OSD spindles have been running at 100% during this test. I have seen
>> >>> >>>>>>> >> slowed I/O from the clients as expected from the extra load, but so
>> >>> >>>>>>> >> far no blocked messages. I'm going to run some more tests.
>> >>> >>>>>>> >
>> >>> >>>>>>> > Good to hear.
>> >>> >>>>>>> >
>> >>> >>>>>>> > FWIW I looked through the logs and all of the slow request no flag point
>> >>> >>>>>>> > messages came from osd.163... and the logs don't show when they arrived.
>> >>> >>>>>>> > My guess is this OSD has a slower disk than the others, or something else
>> >>> >>>>>>> > funny is going on?
>> >>> >>>>>>> >
>> >>> >>>>>>> > I spot checked another OSD at random (60) where I saw a slow request.  It
>> >>> >>>>>>> > was stuck peering for 10s of seconds... waiting on a pg log message from
>> >>> >>>>>>> > osd.163.
>> >>> >>>>>>> >
>> >>> >>>>>>> > sage
>> >>> >>>>>>> >
>> >>> >>>>>>> >
>> >>> >>>>>>> >>
>> >>> >>>>>>> >> ----------------
>> >>> >>>>>>> >> Robert LeBlanc
>> >>> >>>>>>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>>> >>
>> >>> >>>>>>> >>
>> >>> >>>>>>> >> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>> >>> >>>>>>> >> > On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>> >>> >>>>>>> >> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>> >> >> Hash: SHA256
>> >>> >>>>>>> >> >>
>> >>> >>>>>>> >> >> With some off-list help, we have adjusted
>> >>> >>>>>>> >> >> osd_client_message_cap=10000. This seems to have helped a bit and we
>> >>> >>>>>>> >> >> have seen some OSDs have a value up to 4,000 for client messages. But
>> >>> >>>>>>> >> >> it does not solve the problem with the blocked I/O.
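>> >>> >>>>>>> >> >> For anyone following along, that throttle is just a ceph.conf setting, e.g.:
>> >>> >>>>>>> >> >>
>> >>> >>>>>>> >> >>   [osd]
>> >>> >>>>>>> >> >>     osd client message cap = 10000
>> >>> >>>>>>> >> >>
>> >>> >>>>>>> >> >> (osd_client_message_size_cap is the byte-count counterpart.)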
>> >>> >>>>>>> >> >>
>> >>> >>>>>>> >> >> One thing that I have noticed is that almost exactly 30 seconds elapse
>> >>> >>>>>>> >> >> between an OSD boots and the first blocked I/O message. I don't know
>> >>> >>>>>>> >> >> if the OSD doesn't have time to get its brain right about a PG before
>> >>> >>>>>>> >> >> it starts servicing it or what exactly.
>> >>> >>>>>>> >> >
>> >>> >>>>>>> >> > I'm downloading the logs from yesterday now; sorry it's taking so long.
>> >>> >>>>>>> >> >
>> >>> >>>>>>> >> >> On another note, I tried upgrading our CentOS dev cluster from Hammer
>> >>> >>>>>>> >> >> to master and things didn't go so well. The OSDs would not start
>> >>> >>>>>>> >> >> because /var/lib/ceph was not owned by ceph. I chowned the directory
>> >>> >>>>>>> >> >> for all OSDs and the OSDs then started, but never became active in the
>> >>> >>>>>>> >> >> cluster. It just sat there after reading all the PGs. There were
>> >>> >>>>>>> >> >> sockets open to the monitor, but no OSD to OSD sockets. I tried
>> >>> >>>>>>> >> >> downgrading to the Infernalis branch and still no luck getting the
>> >>> >>>>>>> >> >> OSDs to come up. The OSD processes were idle after the initial boot.
>> >>> >>>>>>> >> >> All packages were installed from gitbuilder.
>> >>> >>>>>>> >> >
>> >>> >>>>>>> >> > Did you chown -R ?
>> >>> >>>>>>> >> >
>> >>> >>>>>>> >> >         https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
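>> >>> >>>>>>> >> >
>> >>> >>>>>>> >> > That is, something like this (plus the journal devices or symlink targets,
>> >>> >>>>>>> >> > if they live outside this tree):
>> >>> >>>>>>> >> >
>> >>> >>>>>>> >> >         chown -R ceph:ceph /var/lib/ceph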
>> >>> >>>>>>> >> >
>> >>> >>>>>>> >> > My guess is you only chowned the root dir, and the OSD didn't throw
>> >>> >>>>>>> >> > an error when it encountered the other files?  If you can generate a debug
>> >>> >>>>>>> >> > osd = 20 log, that would be helpful.. thanks!
>> >>> >>>>>>> >> >
>> >>> >>>>>>> >> > sage
>> >>> >>>>>>> >> >
>> >>> >>>>>>> >> >
>> >>> >>>>>>> >> >>
>> >>> >>>>>>> >> >> Thanks,
>> >>> >>>>>>> >> >> ----------------
>> >>> >>>>>>> >> >> Robert LeBlanc
>> >>> >>>>>>> >> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>>> >> >>
>> >>> >>>>>>> >> >>
>> >>> >>>>>>> >> >> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>> >>> >>>>>>> >> >> > -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>> >> >> > Hash: SHA256
>> >>> >>>>>>> >> >> >
>> >>> >>>>>>> >> >> > I have eight nodes running the fio job rbd_test_real to different RBD
>> >>> >>>>>>> >> >> > volumes. I've included the CRUSH map in the tarball.
>> >>> >>>>>>> >> >> >
>> >>> >>>>>>> >> >> > I stopped one OSD process and marked it out. I let it recover for a
>> >>> >>>>>>> >> >> > few minutes and then I started the process again and marked it in. I
>> >>> >>>>>>> >> >> > started getting block I/O messages during the recovery.
>> >>> >>>>>>> >> >> >
>> >>> >>>>>>> >> >> > The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>> >>> >>>>>>> >> >> >
>> >>> >>>>>>> >> >> > Thanks,
>> >>> >>>>>>> >> >> >
>> >>> >>>>>>> >> >> > ----------------
>> >>> >>>>>>> >> >> > Robert LeBlanc
>> >>> >>>>>>> >> >> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>>> >> >> >
>> >>> >>>>>>> >> >> >
>> >>> >>>>>>> >> >> > On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>> >>> >>>>>>> >> >> >> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>> >>> >>>>>>> >> >> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>> >> >> >>> Hash: SHA256
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> We are still struggling with this and have tried a lot of different
>> >>> >>>>>>> >> >> >>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>> >>> >>>>>>> >> >> >>> consulting services for non-Red Hat systems. If there are any
>> >>> >>>>>>> >> >> >>> certified Ceph consultants in the US with whom we can do both remote and
>> >>> >>>>>>> >> >> >>> on-site engagements, please let us know.
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> This certainly seems to be network related, but somewhere in the
>> >>> >>>>>>> >> >> >>> kernel. We have tried increasing the network and TCP buffers, number
>> >>> >>>>>>> >> >> >>> of TCP sockets, and reduced the FIN_WAIT2 timeout. There is about 25% idle
>> >>> >>>>>>> >> >> >>> on the boxes, the disks are busy, but not constantly at 100% (they
>> >>> >>>>>>> >> >> >>> cycle from <10% up to 100%, but not 100% for more than a few seconds
>> >>> >>>>>>> >> >> >>> at a time). There seems to be no reasonable explanation why I/O is
>> >>> >>>>>>> >> >> >>> blocked pretty frequently longer than 30 seconds. We have verified
>> >>> >>>>>>> >> >> >>> Jumbo frames by pinging from/to each node with 9000 byte packets. The
>> >>> >>>>>>> >> >> >>> network admins have verified that packets are not being dropped in the
>> >>> >>>>>>> >> >> >>> switches for these nodes. We have tried different kernels including
>> >>> >>>>>>> >> >> >>> the recent Google patch to cubic. This is showing up on three cluster
>> >>> >>>>>>> >> >> >>> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
>> >>> >>>>>>> >> >> >>> (from CentOS 7.1) with similar results.
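>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> The sort of sysctls I mean, for concreteness (values illustrative, not a
>> >>> >>>>>>> >> >> >>> recommendation):
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>>   net.core.rmem_max = 16777216
>> >>> >>>>>>> >> >> >>>   net.core.wmem_max = 16777216
>> >>> >>>>>>> >> >> >>>   net.ipv4.tcp_rmem = 4096 87380 16777216
>> >>> >>>>>>> >> >> >>>   net.ipv4.tcp_wmem = 4096 65536 16777216
>> >>> >>>>>>> >> >> >>>   net.ipv4.tcp_fin_timeout = 10
>> >>> >>>>>>> >> >> >>>   net.ipv4.tcp_max_syn_backlog = 8192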
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> The messages seem slightly different:
>> >>> >>>>>>> >> >> >>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>> >>> >>>>>>> >> >> >>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>> >>> >>>>>>> >> >> >>> 100.087155 secs
>> >>> >>>>>>> >> >> >>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>> >>> >>>>>>> >> >> >>> cluster [WRN] slow request 30.041999 seconds old, received at
>> >>> >>>>>>> >> >> >>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>> >>> >>>>>>> >> >> >>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>> >>> >>>>>>> >> >> >>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>> >>> >>>>>>> >> >> >>> points reached
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> I don't know what "no flag points reached" means.
>> >>> >>>>>>> >> >> >>
>> >>> >>>>>>> >> >> >> Just that the op hasn't been marked as reaching any interesting points
>> >>> >>>>>>> >> >> >> (op->mark_*() calls).
>> >>> >>>>>>> >> >> >>
>> >>> >>>>>>> >> >> >> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
>> >>> >>>>>>> >> >> >> It's extremely verbose but it'll let us see where the op is getting
>> >>> >>>>>>> >> >> >> blocked.  If you see the "slow request" message it means the op is
>> >>> >>>>>>> >> >> >> received by ceph (that's when the clock starts), so I suspect it's not
>> >>> >>>>>>> >> >> >> something we can blame on the network stack.
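>> >>> >>>>>>> >> >> >>
>> >>> >>>>>>> >> >> >> For example, either in ceph.conf on the OSD hosts before restarting them:
>> >>> >>>>>>> >> >> >>
>> >>> >>>>>>> >> >> >>   [osd]
>> >>> >>>>>>> >> >> >>     debug ms = 20
>> >>> >>>>>>> >> >> >>     debug osd = 20
>> >>> >>>>>>> >> >> >>
>> >>> >>>>>>> >> >> >> or injected at runtime, e.g. ceph tell osd.* injectargs '--debug-ms 20
>> >>> >>>>>>> >> >> >> --debug-osd 20' (and dialed back down afterwards, since the logs get huge).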
>> >>> >>>>>>> >> >> >>
>> >>> >>>>>>> >> >> >> sage
>> >>> >>>>>>> >> >> >>
>> >>> >>>>>>> >> >> >>
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> The problem is most pronounced when we have to reboot an OSD node (1
>> >>> >>>>>>> >> >> >>> of 13); we will have hundreds of I/Os blocked, sometimes for up to 300
>> >>> >>>>>>> >> >> >>> seconds. It takes a good 15 minutes for things to settle down. The
>> >>> >>>>>>> >> >> >>> production cluster is very busy doing normally 8,000 I/O and peaking
>> >>> >>>>>>> >> >> >>> at 15,000. This is all 4TB spindles with SSD journals and the disks
>> >>> >>>>>>> >> >> >>> are between 25-50% full. We are currently splitting PGs to distribute
>> >>> >>>>>>> >> >> >>> the load better across the disks, but we are having to do this 10 PGs
>> >>> >>>>>>> >> >> >>> at a time as we get blocked I/O. We have max_backfills and
>> >>> >>>>>>> >> >> >>> max_recovery set to 1, client op priority is set higher than recovery
>> >>> >>>>>>> >> >> >>> priority. We tried increasing the number of op threads but this didn't
>> >>> >>>>>>> >> >> >>> seem to help. It seems as soon as PGs are finished being checked, they
>> >>> >>>>>>> >> >> >>> become active and could be the cause for slow I/O while the other PGs
>> >>> >>>>>>> >> >> >>> are being checked.
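>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> Concretely, the knobs I mean, in their usual ceph.conf spellings (the two
>> >>> >>>>>>> >> >> >>> priority values shown are just the defaults as I remember them):
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>>   [osd]
>> >>> >>>>>>> >> >> >>>     osd max backfills = 1
>> >>> >>>>>>> >> >> >>>     osd recovery max active = 1
>> >>> >>>>>>> >> >> >>>     osd client op priority = 63
>> >>> >>>>>>> >> >> >>>     osd recovery op priority = 10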
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> What I don't understand is that the messages are delayed. As soon as
>> >>> >>>>>>> >> >> >>> the message is received by the Ceph OSD process, it is very quickly
>> >>> >>>>>>> >> >> >>> committed to the journal and a response is sent back to the primary
>> >>> >>>>>>> >> >> >>> OSD, which is received very quickly as well. I've adjusted
>> >>> >>>>>>> >> >> >>> min_free_kbytes and it seems to keep the OSDs from crashing, but
>> >>> >>>>>>> >> >> >>> doesn't solve the main problem. We don't have swap and there is 64 GB
>> >>> >>>>>>> >> >> >>> of RAM per node for 10 OSDs.
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> Is there something that could cause the kernel to get a packet but not
>> >>> >>>>>>> >> >> >>> be able to dispatch it to Ceph, which could explain why we
>> >>> >>>>>>> >> >> >>> are seeing these blocked I/Os for 30+ seconds? Are there any pointers
>> >>> >>>>>>> >> >> >>> to tracing Ceph messages from the network buffer through the kernel to
>> >>> >>>>>>> >> >> >>> the Ceph process?
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> We can really use some pointers, no matter how outrageous. We have had
>> >>> >>>>>>> >> >> >>> over 6 people looking into this for weeks now and just can't think of
>> >>> >>>>>>> >> >> >>> anything else.
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> Thanks,
>> >>> >>>>>>> >> >> >>> ----------------
>> >>> >>>>>>> >> >> >>> Robert LeBlanc
>> >>> >>>>>>> >> >> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>> >>> >>>>>>> >> >> >>> > We dropped the replication on our cluster from 4 to 3 and it looks
>> >>> >>>>>>> >> >> >>> > like all the blocked I/O has stopped (no entries in the log for the
>> >>> >>>>>>> >> >> >>> > last 12 hours). This makes me believe that there is some issue with
>> >>> >>>>>>> >> >> >>> > the number of sockets or some other TCP issue. We have not messed with
>> >>> >>>>>>> >> >> >>> > ephemeral ports or TIME_WAIT at this point. There are 130 OSDs and 8 KVM
>> >>> >>>>>>> >> >> >>> > hosts hosting about 150 VMs. The open files limit is set at 32K for the
>> >>> >>>>>>> >> >> >>> > OSD processes and 16K system-wide.
>> >>> >>>>>>> >> >> >>> >
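A quick way to check whether descriptor or socket pressure is actually in
play (the pidof call and paths are illustrative; with 10 OSDs per node you
would loop over all of the ceph-osd processes):

  # descriptors held by one ceph-osd process versus its limit
  ls /proc/$(pidof -s ceph-osd)/fd | wc -l
  grep 'open files' /proc/$(pidof -s ceph-osd)/limits
  # summary of socket states, including TIME_WAIT and FIN_WAIT counts
  ss -s
  # ephemeral port range available for outbound OSD connections
  cat /proc/sys/net/ipv4/ip_local_port_range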
>> >>> >>>>>>> >> >> >>> > Does this seem like the right spot to be looking? What are some
>> >>> >>>>>>> >> >> >>> > configuration items we should be looking at?
>> >>> >>>>>>> >> >> >>> >
>> >>> >>>>>>> >> >> >>> > Thanks,
>> >>> >>>>>>> >> >> >>> > ----------------
>> >>> >>>>>>> >> >> >>> > Robert LeBlanc
>> >>> >>>>>>> >> >> >>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>>> >> >> >>> >
>> >>> >>>>>>> >> >> >>> >
>> >>> >>>>>>> >> >> >>> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>> >>> >>>>>>> >> >> >>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>> >> >> >>> >> Hash: SHA256
>> >>> >>>>>>> >> >> >>> >>
>> >>> >>>>>>> >> >> >>> >> We were only able to get ~17 Gb/s out of the XL710 (heavily tweaked)
>> >>> >>>>>>> >> >> >>> >> until we went to the 4.x kernel, where we got ~36 Gb/s (no tweaking). It
>> >>> >>>>>>> >> >> >>> >> seems that there were some major reworks in the network handling in
>> >>> >>>>>>> >> >> >>> >> the kernel to efficiently handle that network rate. If I remember
>> >>> >>>>>>> >> >> >>> >> right we also saw a drop in CPU utilization. I'm starting to think
>> >>> >>>>>>> >> >> >>> >> that we did see packet loss while congesting our ISLs in our initial
>> >>> >>>>>>> >> >> >>> >> testing, but we could not tell where the dropping was happening. We
>> >>> >>>>>>> >> >> >>> >> saw some on the switches, but it didn't seem to be bad if we weren't
>> >>> >>>>>>> >> >> >>> >> trying to congest things. We probably already saw this issue, just
>> >>> >>>>>>> >> >> >>> >> didn't know it.
>> >>> >>>>>>> >> >> >>> >> - ----------------
>> >>> >>>>>>> >> >> >>> >> Robert LeBlanc
>> >>> >>>>>>> >> >> >>> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>>> >> >> >>> >>
>> >>> >>>>>>> >> >> >>> >>
>> >>> >>>>>>> >> >> >>> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>> >>> >>>>>>> >> >> >>> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>> >>> >>>>>>> >> >> >>> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>> >>> >>>>>>> >> >> >>> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>> >>> >>>>>>> >> >> >>> >>> drivers might cause problems though.
>> >>> >>>>>>> >> >> >>> >>>
>> >>> >>>>>>> >> >> >>> >>> Here's ifconfig from one of the nodes:
>> >>> >>>>>>> >> >> >>> >>>
>> >>> >>>>>>> >> >> >>> >>> ens513f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>> >>> >>>>>>> >> >> >>> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>> >>> >>>>>>> >> >> >>> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20<link>
>> >>> >>>>>>> >> >> >>> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>> >>> >>>>>>> >> >> >>> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>> >>> >>>>>>> >> >> >>> >>>         RX errors 0  dropped 0  overruns 0  frame 0
>> >>> >>>>>>> >> >> >>> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>> >>> >>>>>>> >> >> >>> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>> >>> >>>>>>> >> >> >>> >>>
>> >>> >>>>>>> >> >> >>> >>> Mark
>> >>> >>>>>>> >> >> >>> >>>
>> >>> >>>>>>> >> >> >>> >>>
>> >>> >>>>>>> >> >> >>> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>> >>> >>>>>>> >> >> >>> >>>>
>> >>> >>>>>>> >> >> >>> >>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>> >> >> >>> >>>> Hash: SHA256
>> >>> >>>>>>> >> >> >>> >>>>
>> >>> >>>>>>> >> >> >>> >>>> OK, here is the update on the saga...
>> >>> >>>>>>> >> >> >>> >>>>
>> >>> >>>>>>> >> >> >>> >>>> I traced some more of the blocked I/Os and it seems that communication
>> >>> >>>>>>> >> >> >>> >>>> between two hosts was worse than between the others. I did a two-way
>> >>> >>>>>>> >> >> >>> >>>> ping flood between the two hosts using max packet sizes (1500). After
>> >>> >>>>>>> >> >> >>> >>>> 1.5M packets, no lost pings. I then had the ping flood running while I
>> >>> >>>>>>> >> >> >>> >>>> put Ceph load on the cluster and the dropped pings started increasing;
>> >>> >>>>>>> >> >> >>> >>>> after stopping the Ceph workload the pings stopped dropping.
>> >>> >>>>>>> >> >> >>> >>>>
>> >>> >>>>>>> >> >> >>> >>>> I then ran iperf between all the nodes with the same results, so that
>> >>> >>>>>>> >> >> >>> >>>> ruled out Ceph to a large degree. I then booted into the
>> >>> >>>>>>> >> >> >>> >>>> 3.10.0-229.14.1.el7.x86_64 kernel, and with an hour-long test so far there
>> >>> >>>>>>> >> >> >>> >>>> haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
>> >>> >>>>>>> >> >> >>> >>>> need the network enhancements in the 4.x series to work well.
>> >>> >>>>>>> >> >> >>> >>>>
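For reference, a minimal version of that kind of test, assuming iperf3 and
root for the flood ping (addresses are placeholders; run it in both
directions):

  # flood ping at a full 1500-byte frame (1472 bytes payload + 28 bytes headers)
  ping -f -s 1472 -c 1500000 192.168.55.12
  # throughput while the Ceph workload is running, to see whether loss appears under load
  iperf3 -s                              # on the receiving node
  iperf3 -c 192.168.55.12 -t 300 -P 4    # on the sending node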
>> >>> >>>>>>> >> >> >>> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
>> >>> >>>>>>> >> >> >>> >>>> kernel to see where this issue is introduced. Both of the clusters
>> >>> >>>>>>> >> >> >>> >>>> with this issue are running 4.x; other than that, they have pretty
>> >>> >>>>>>> >> >> >>> >>>> different hardware and network configs.
>> >>> >>>>>>> >> >> >>> >>>>
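A rough outline of such a bisect, with the good/bad versions below only as
examples (the 3.10 el7 kernel behaved, the 4.x kernels did not):

  git bisect start
  git bisect bad v4.2         # a kernel that shows the dropped pings / blocked I/O
  git bisect good v3.10       # a kernel that does not
  # build and boot the suggested commit, rerun the ping-flood + Ceph load test, then
  git bisect good             # or 'git bisect bad', until the offending commit is found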
>> >>> >>>>>>> >> >> >>> >>>> Thanks,
>> >>> >>>>>>> >> >> >>> >>>> -----BEGIN PGP SIGNATURE-----
>> >>> >>>>>>> >> >> >>> >>>> Version: Mailvelope v1.1.0
>> >>> >>>>>>> >> >> >>> >>>> Comment: https://www.mailvelope.com
>> >>> >>>>>>> >> >> >>> >>>>
>> >>> >>>>>>> >> >> >>> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>> >>> >>>>>>> >> >> >>> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>> >>> >>>>>>> >> >> >>> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>> >>> >>>>>>> >> >> >>> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>> >>> >>>>>>> >> >> >>> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>> >>> >>>>>>> >> >> >>> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>> >>> >>>>>>> >> >> >>> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>> >>> >>>>>>> >> >> >>> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>> >>> >>>>>>> >> >> >>> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>> >>> >>>>>>> >> >> >>> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>> >>> >>>>>>> >> >> >>> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>> >>> >>>>>>> >> >> >>> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>> >>> >>>>>>> >> >> >>> >>>> 4OEo
>> >>> >>>>>>> >> >> >>> >>>> =P33I
>> >>> >>>>>>> >> >> >>> >>>> -----END PGP SIGNATURE-----
>> >>> >>>>>>> >> >> >>> >>>> ----------------
>> >>> >>>>>>> >> >> >>> >>>> Robert LeBlanc
>> >>> >>>>>>> >> >> >>> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>>> >> >> >>> >>>>
>> >>> >>>>>>> >> >> >>> >>>>
>> >>> >>>>>>> >> >> >>> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>> >>> >>>>>>> >> >> >>> >>>> wrote:
>> >>> >>>>>>> >> >> >>> >>>>>
>> >>> >>>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>> >> >> >>> >>>>> Hash: SHA256
>> >>> >>>>>>> >> >> >>> >>>>>
>> >>> >>>>>>> >> >> >>> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>> >>> >>>>>>> >> >> >>> >>>>> pinging hosts with "No buffer space available" (hosts are currently
>> >>> >>>>>>> >> >> >>> >>>>> configured for 4GB to test SSD caching rather than page cache). I
>> >>> >>>>>>> >> >> >>> >>>>> found that an MTU under 32K worked reliably for ping, but we still had
>> >>> >>>>>>> >> >> >>> >>>>> the blocked I/O.
>> >>> >>>>>>> >> >> >>> >>>>>
>> >>> >>>>>>> >> >> >>> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>> >>> >>>>>>> >> >> >>> >>>>> the blocked I/O.
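A quick way to confirm what a path will actually carry without fragmentation
(peer address and sizes are placeholders; 8972 bytes of payload corresponds
to a 9000-byte MTU, 1472 to 1500):

  ping -M do -s 8972 192.168.55.13    # jumbo-sized probe with DF set
  ping -M do -s 1472 192.168.55.13    # standard frame for comparison
  ip link show | grep -i mtu          # what each interface is actually set to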
>> >>> >>>>>>> >> >> >>> >>>>> - ----------------
>> >>> >>>>>>> >> >> >>> >>>>> Robert LeBlanc
>> >>> >>>>>>> >> >> >>> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>>> >> >> >>> >>>>>
>> >>> >>>>>>> >> >> >>> >>>>>
>> >>> >>>>>>> >> >> >>> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>> >>> >>>>>>> >> >> >>> >>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>> >>> >>>>>>> >> >> >>> >>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
>> >>> >>>>>>> >> >> >>> >>>>>>> between when osd.17 started sending the osd_repop message and when
>> >>> >>>>>>> >> >> >>> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>> >>> >>>>>>> >> >> >>> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
>> >>> >>>>>>> >> >> >>> >>>>>>> delayed for many 10s of seconds?
>> >>> >>>>>>> >> >> >>> >>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>> >>> >>>>>>> >> >> >>> >>>>>> has
>> >>> >>>>>>> >> >> >>> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
>> >>> >>>>>>> >> >> >>> >>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>> sage
>> >>> >>>>>>> >> >> >>> >>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>> What kernel are you running?
>> >>> >>>>>>> >> >> >>> >>>>>>> -Sam
>> >>> >>>>>>> >> >> >>> >>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>> >> >> >>> >>>>>>>> Hash: SHA256
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>> >>> >>>>>>> >> >> >>> >>>>>>>> extracted what I think are important entries from the logs for the
>> >>> >>>>>>> >> >> >>> >>>>>>>> first blocked request. NTP is running all the servers so the logs
>> >>> >>>>>>> >> >> >>> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>> >>> >>>>>>> >> >> >>> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>> >>> >>>>>>> >> >> >>> >>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>> >>> >>>>>>> >> >> >>> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>> >>> >>>>>>> >> >> >>> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>> >>> >>>>>>> >> >> >>> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>> >>> >>>>>>> >> >> >>> >>>>>>>> transfer).
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
>> >>> >>>>>>> >> >> >>> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>> >>> >>>>>>> >> >> >>> >>>>>>>> later. To me it seems that the message is getting received but not
>> >>> >>>>>>> >> >> >>> >>>>>>>> passed to another thread right away or something. This test was done
>> >>> >>>>>>> >> >> >>> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>> >>> >>>>>>> >> >> >>> >>>>>>>> thread.
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>> >>> >>>>>>> >> >> >>> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>> >>> >>>>>>> >> >> >>> >>>>>>>> some help.
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> Single Test started about
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:52:36
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>> >>>>>>> >> >> >>> >>>>>>>> 30.439150 secs
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:06.487451:
>> >>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>> >>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>> >>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,16
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>> >>> >>>>>>> >> >> >>> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>> >>> >>>>>>> >> >> >>> >>>>>>>> 30.379680 secs
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>> >>> >>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>> >>> >>>>>>> >> >> >>> >>>>>>>> 12:55:06.406303:
>> >>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>> >>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>> >>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>> >>> >>>>>>> >> >> >>> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>> >>> >>>>>>> >> >> >>> >>>>>>>> 12:55:06.318144:
>> >>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>> >>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>> >>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,14
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>> >>>>>>> >> >> >>> >>>>>>>> 30.954212 secs
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.044003:
>> >>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>> >>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>> >>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 16,17
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>> >>> >>>>>>> >> >> >>> >>>>>>>> 30.704367 secs
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>> >>> >>>>>>> >> >> >>> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>> >>> >>>>>>> >> >> >>> >>>>>>>> 2015-09-22 12:57:33.055404:
>> >>> >>>>>>> >> >> >>> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>> >>> >>>>>>> >> >> >>> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> >>> >>>>>>> >> >> >>> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>> >>> >>>>>>> >> >> >>> >>>>>>>>   currently waiting for subops from 13,17
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> Server   IP addr              OSD
>> >>> >>>>>>> >> >> >>> >>>>>>>> nodev  - 192.168.55.11 - 12
>> >>> >>>>>>> >> >> >>> >>>>>>>> nodew  - 192.168.55.12 - 13
>> >>> >>>>>>> >> >> >>> >>>>>>>> nodex  - 192.168.55.13 - 16
>> >>> >>>>>>> >> >> >>> >>>>>>>> nodey  - 192.168.55.14 - 17
>> >>> >>>>>>> >> >> >>> >>>>>>>> nodez  - 192.168.55.15 - 14
>> >>> >>>>>>> >> >> >>> >>>>>>>> nodezz - 192.168.55.16 - 15
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> fio job:
>> >>> >>>>>>> >> >> >>> >>>>>>>> [rbd-test]
>> >>> >>>>>>> >> >> >>> >>>>>>>> readwrite=write
>> >>> >>>>>>> >> >> >>> >>>>>>>> blocksize=4M
>> >>> >>>>>>> >> >> >>> >>>>>>>> #runtime=60
>> >>> >>>>>>> >> >> >>> >>>>>>>> name=rbd-test
>> >>> >>>>>>> >> >> >>> >>>>>>>> #readwrite=randwrite
>> >>> >>>>>>> >> >> >>> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>> >>> >>>>>>> >> >> >>> >>>>>>>> #rwmixread=72
>> >>> >>>>>>> >> >> >>> >>>>>>>> #norandommap
>> >>> >>>>>>> >> >> >>> >>>>>>>> #size=1T
>> >>> >>>>>>> >> >> >>> >>>>>>>> #blocksize=4k
>> >>> >>>>>>> >> >> >>> >>>>>>>> ioengine=rbd
>> >>> >>>>>>> >> >> >>> >>>>>>>> rbdname=test2
>> >>> >>>>>>> >> >> >>> >>>>>>>> pool=rbd
>> >>> >>>>>>> >> >> >>> >>>>>>>> clientname=admin
>> >>> >>>>>>> >> >> >>> >>>>>>>> iodepth=8
>> >>> >>>>>>> >> >> >>> >>>>>>>> #numjobs=4
>> >>> >>>>>>> >> >> >>> >>>>>>>> #thread
>> >>> >>>>>>> >> >> >>> >>>>>>>> #group_reporting
>> >>> >>>>>>> >> >> >>> >>>>>>>> #time_based
>> >>> >>>>>>> >> >> >>> >>>>>>>> #direct=1
>> >>> >>>>>>> >> >> >>> >>>>>>>> #ramp_time=60
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> Thanks,
>> >>> >>>>>>> >> >> >>> >>>>>>>> -----BEGIN PGP SIGNATURE-----
>> >>> >>>>>>> >> >> >>> >>>>>>>> Version: Mailvelope v1.1.0
>> >>> >>>>>>> >> >> >>> >>>>>>>> Comment: https://www.mailvelope.com
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>> >>> >>>>>>> >> >> >>> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>> >>> >>>>>>> >> >> >>> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>> >>> >>>>>>> >> >> >>> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>> >>> >>>>>>> >> >> >>> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>> >>> >>>>>>> >> >> >>> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>> >>> >>>>>>> >> >> >>> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>> >>> >>>>>>> >> >> >>> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>> >>> >>>>>>> >> >> >>> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>> >>> >>>>>>> >> >> >>> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>> >>> >>>>>>> >> >> >>> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>> >>> >>>>>>> >> >> >>> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>> >>> >>>>>>> >> >> >>> >>>>>>>> J3hS
>> >>> >>>>>>> >> >> >>> >>>>>>>> =0J7F
>> >>> >>>>>>> >> >> >>> >>>>>>>> -----END PGP SIGNATURE-----
>> >>> >>>>>>> >> >> >>> >>>>>>>> ----------------
>> >>> >>>>>>> >> >> >>> >>>>>>>> Robert LeBlanc
>> >>> >>>>>>> >> >> >>> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>> >>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>> >>> >>>>>>> >> >> >>> >>>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> Hash: SHA256
>> >>> >>>>>>> >> >> >>> >>>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
>> >>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>> You can search for the (mangled) name _split_collection
>> >>> >>>>>>> >> >> >>> >>>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> I'm not
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> seeing much I/O or CPU usage during these times. Is there some way to
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>> >>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>> >>> >>>>>>> >> >> >>> >>>>>>>>> this, it was discussed not too long ago.
>> >>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> We've had I/O blocked for over 900 seconds, and as soon as the sessions
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
>> >>> >>>>>>> >> >> >>> >>>>>>>>>>
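One way to see exactly where such an op is stuck while it is blocked,
assuming the admin socket is available on the OSD host (the OSD id is a
placeholder):

  # flag points reached so far by every in-flight op on this OSD
  ceph daemon osd.17 dump_ops_in_flight
  # recently completed slow ops, with their full event timelines
  ceph daemon osd.17 dump_historic_ops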
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> The fio test is just a sequential write; starting it over (rewriting
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> from the beginning) still causes the issue. My suspicion was that it
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> would not have to create new files and therefore would not split
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> collections. This is on my test cluster with no other load.
>> >>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>> >>> >>>>>>> >> >> >>> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
>> >>> >>>>>>> >> >> >>> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>> >>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>> >>> >>>>>>> >> >> >>> >>>>>>>>>> would be the most helpful for tracking this issue down?
>> >>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>>> If you want to go log diving "debug osd = 20", "debug filestore =
>> >>> >>>>>>> >> >> >>> >>>>>>>>> 20",
>> >>> >>>>>>> >> >> >>> >>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit
>> >>> >>>>>>> >> >> >>> >>>>>>>>> out
>> >>> >>>>>>> >> >> >>> >>>>>>>>> everything you need to track exactly what each Op is doing.
>> >>> >>>>>>> >> >> >>> >>>>>>>>> -Greg
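Those levels can be raised at runtime without restarting anything, for
example (remember to dial them back down afterwards, the logs grow very
quickly at these settings):

  ceph tell osd.* injectargs '--debug_osd 20 --debug_filestore 20 --debug_ms 1'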
>> >>> >>>>>>> >> >> >>> >>>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>> --
>> >>> >>>>>>> >> >> >>> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> >>> >>>>>>> >> >> >>> >>>>>>>> in
>> >>> >>>>>>> >> >> >>> >>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >>> >>>>>>> >> >> >>> >>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>> >>>>>>> >> >> >>> >>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>>>
>> >>> >>>>>>> >> >> >>> >>>>>
>> >>> >>>>>>> >> >> >>> >>>>> -----BEGIN PGP SIGNATURE-----
>> >>> >>>>>>> >> >> >>> >>>>> Version: Mailvelope v1.1.0
>> >>> >>>>>>> >> >> >>> >>>>> Comment: https://www.mailvelope.com
>> >>> >>>>>>> >> >> >>> >>>>>
>> >>> >>>>>>> >> >> >>> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>> >>> >>>>>>> >> >> >>> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>> >>> >>>>>>> >> >> >>> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>> >>> >>>>>>> >> >> >>> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>> >>> >>>>>>> >> >> >>> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>> >>> >>>>>>> >> >> >>> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>> >>> >>>>>>> >> >> >>> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>> >>> >>>>>>> >> >> >>> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>> >>> >>>>>>> >> >> >>> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>> >>> >>>>>>> >> >> >>> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>> >>> >>>>>>> >> >> >>> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>> >>> >>>>>>> >> >> >>> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>> >>> >>>>>>> >> >> >>> >>>>> gcZm
>> >>> >>>>>>> >> >> >>> >>>>> =CjwB
>> >>> >>>>>>> >> >> >>> >>>>> -----END PGP SIGNATURE-----
>> >>> >>>>>>> >> >> >>> >>>>
>> >>> >>>>>>> >> >> >>> >>>> --
>> >>> >>>>>>> >> >> >>> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >>> >>>>>>> >> >> >>> >>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >>> >>>>>>> >> >> >>> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>> >>>>>>> >> >> >>> >>>>
>> >>> >>>>>>> >> >> >>> >>>
>> >>> >>>>>>> >> >> >>> >>
>> >>> >>>>>>> >> >> >>> >> -----BEGIN PGP SIGNATURE-----
>> >>> >>>>>>> >> >> >>> >> Version: Mailvelope v1.1.0
>> >>> >>>>>>> >> >> >>> >> Comment: https://www.mailvelope.com
>> >>> >>>>>>> >> >> >>> >>
>> >>> >>>>>>> >> >> >>> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>> >>> >>>>>>> >> >> >>> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>> >>> >>>>>>> >> >> >>> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>> >>> >>>>>>> >> >> >>> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>> >>> >>>>>>> >> >> >>> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>> >>> >>>>>>> >> >> >>> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>> >>> >>>>>>> >> >> >>> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>> >>> >>>>>>> >> >> >>> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>> >>> >>>>>>> >> >> >>> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>> >>> >>>>>>> >> >> >>> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>> >>> >>>>>>> >> >> >>> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>> >>> >>>>>>> >> >> >>> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>> >>> >>>>>>> >> >> >>> >> ae22
>> >>> >>>>>>> >> >> >>> >> =AX+L
>> >>> >>>>>>> >> >> >>> >> -----END PGP SIGNATURE-----
>> >>> >>>>>>> >> >> >>> _______________________________________________
>> >>> >>>>>>> >> >> >>> ceph-users mailing list
>> >>> >>>>>>> >> >> >>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> >>> >>>>>>> >> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> >>>
>> >>> >>>>>>> >> >> _______________________________________________
>> >>> >>>>>>> >> >> ceph-users mailing list
>> >>> >>>>>>> >> >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> >>> >>>>>>> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>> >>>>>>> >> >>
>> >>> >>>>>>> >> >>
>> >>> >>>>>>> >> --
>> >>> >>>>>>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >>> >>>>>>> >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >>> >>>>>>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>> >>>>>>> >>
>> >>> >>>>>>> >>
>> >>> >>>>>>>
>> >>> >>>>>>> -----BEGIN PGP SIGNATURE-----
>> >>> >>>>>>> Version: Mailvelope v1.2.0
>> >>> >>>>>>> Comment: https://www.mailvelope.com
>> >>> >>>>>>>
>> >>> >>>>>>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
>> >>> >>>>>>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
>> >>> >>>>>>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
>> >>> >>>>>>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
>> >>> >>>>>>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
>> >>> >>>>>>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
>> >>> >>>>>>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
>> >>> >>>>>>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
>> >>> >>>>>>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
>> >>> >>>>>>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
>> >>> >>>>>>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
>> >>> >>>>>>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
>> >>> >>>>>>> 6Kfk
>> >>> >>>>>>> =/gR6
>> >>> >>>>>>> -----END PGP SIGNATURE-----
>> >>> >>>>>>> --
>> >>> >>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >>> >>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> >>> >>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>
>> >>> >>>>> -----BEGIN PGP SIGNATURE-----
>> >>> >>>>> Version: Mailvelope v1.2.0
>> >>> >>>>> Comment: https://www.mailvelope.com
>> >>> >>>>>
>> >>> >>>>> wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
>> >>> >>>>> x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
>> >>> >>>>> Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
>> >>> >>>>> SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
>> >>> >>>>> QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
>> >>> >>>>> mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
>> >>> >>>>> 7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
>> >>> >>>>> QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
>> >>> >>>>> MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
>> >>> >>>>> rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
>> >>> >>>>> gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
>> >>> >>>>> RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
>> >>> >>>>> JFPi
>> >>> >>>>> =ofgq
>> >>> >>>>> -----END PGP SIGNATURE-----
>> >>> >>>>
>> >>> >>>> -----BEGIN PGP SIGNATURE-----
>> >>> >>>> Version: Mailvelope v1.2.0
>> >>> >>>> Comment: https://www.mailvelope.com
>> >>> >>>>
>> >>> >>>> wsFcBAEBCAAQBQJWFDDOCRDmVDuy+mK58QAA0kUP/1rfRQa5Us9b/VCvKrhk
>> >>> >>>> BYrde1/FBybKBVXsuXVU8Dq124A1e4L682AhmQPUeVP8PQLoqS/VFSl0h7i6
>> >>> >>>> 28AzydDaBTTjnrp6ZzVbtmKtm8WhmtSTFvWTlu/yJmRXAht9YozmFCByBfIY
>> >>> >>>> GYvOhZzjvbxBKfwnwq97QkS7xfY2tss/BmaOvSVTX7naYaOF+HRwZMSt+BF4
>> >>> >>>> 9vg9BLSL3Aic0BnvdM64TWkDaHp/3gwGSmyMn8Q2Sa9CqUTddKQx2HXN6doo
>> >>> >>>> gIyxCj+dIw2Pt73u2NoiYv8ZhTuS3QYM4n0rRBxj8Wr/EeNwGAOwdDSgbOxf
>> >>> >>>> OvDyozzmCpQyW3h/nkdQJW5mWsJmyDIiGxHDdUn7Vgemg+Bbod0ACdoJiwct
>> >>> >>>> /BIRVQe2Ee1nZQFoKBOhvaWO6+ePJR7CVfLjMkZBTzKZBjt2tfkq17G5KTdS
>> >>> >>>> EsehvG/+vfFJkANL5Xh6eo9ptlHbFW8I/44pvUtGi2JwsN487l56XR9DqEKM
>> >>> >>>> 7Cmj9Ox205YxjqcBjhWIJQTok99lvrhDX9d7HHxIeTcmouvqPz4LTcCySRtC
>> >>> >>>> xE/GcEGAAYWGPTwf9u8ULm9Rh2Z90OnKpqtCtuuWiwRRL9VU/tLlvqmHvEZM
>> >>> >>>> 73qhiLQZka5I72B2SAEtJnDt2sX3NJ4unvH4zWKLRFTTm4M0qk6xUL1JfqNz
>> >>> >>>> JYNo
>> >>> >>>> =msX2
>> >>> >>>> -----END PGP SIGNATURE-----
>> >>> >>
>> >>> >> -----BEGIN PGP SIGNATURE-----
>> >>> >> Version: Mailvelope v1.2.0
>> >>> >> Comment: https://www.mailvelope.com
>> >>> >>
>> >>> >> wsFcBAEBCAAQBQJWFXGPCRDmVDuy+mK58QAAx38P/1sn6TA8hH+F2kd1A2Pq
>> >>> >> IU2cg1pFcH+kw21G8VO+BavfBaBoSETHEEuMXg5SszTIcL/HyziBLJos0C0j
>> >>> >> Vu9I0/YtblQ15enzFqKFPosdc7qij9DPJxXRkx41sJZsxvSVky+URcPpcKk6
>> >>> >> w8Lwuq9IupesQ19ZeJkCEWFVhKz/i2E9/VXfylBgFVlkICD+5pfx6/Aq7nCP
>> >>> >> 4gboyha07zpPlDqoA7xgT+6v2zlYC80saGcA1m2XaAUdPF/17l6Mq9+Glv7E
>> >>> >> 3KeUf7jmMTJQRGBZSInFgUpPwUQKvF5OSGb3YQlzofUy5Es+wH3ccqZ+mlIY
>> >>> >> szuBLAtN6zhFFPCs6016hiragiUhLk97PItXaKdDJKecuyRdShlJrXJmtX+j
>> >>> >> NdM14TkBPTiLtAd/IZEEhIIpdvQH8YSl3LnEZ5gywggaY4Pk3JLFIJPgLpEb
>> >>> >> T8hJnuiaQaYxERQ0nRoBL4LAXARseSrOuVt2EAD50Yb/5JEwB9FQlN758rb1
>> >>> >> AE/xhpK6d53+RlkPODKxXx816hXvDP6NADaC78XGmx+A4FfepdxBijGBsmOQ
>> >>> >> 7SxAZe469K0E6EAfClc664VzwuvBEZjwTg1eK5Z6VS/FDTH/RxTKeFhlbUIT
>> >>> >> XpezlP7XZ1/YRrJ/Eg7nb1Dv0MYQdu18tQ6QBv+C1ZsmxYLlHlcf6BZ3gNar
>> >>> >> rZW5
>> >>> >> =dKn9
>> >>> >> -----END PGP SIGNATURE-----
>> >>>
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>> >
>> > --
>> > Best Regards,
>> >
>> > Wheat
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWHpfaCRDmVDuy+mK58QAAkxEQAJ8Wh0o0rkrhzyHqxUbB
7N3H4ydA9qH9CFcuIUGG/vQTjHJthMdaxMKTH95nvRqoBtL6ynC9aDJd+vMo
ntlPdWq2JaDe2fGTxXu97nIur77RAz5fngXlcfVc9+gZtzne6UJSP0kdXC0u
aOtU0jaT72mZmL1DYtrPMV2WeQlxe6eatKE45JNuJtYOWIxY/Ne8L+WtEzs5
jUCuTBo3gzB/hEwY16jaVu3/UFgEISDuPMe9yPaMkJyQIdZd2D8mTLvncw5Y
dgFZ9//7/vl+8EpveQVfaZWxllW5BxWmz7S0CSMXzVVwNH5DCOS7ZXx/oi6o
64oi5jv24DOvkcUpJ0IFktcT8gb2iLwoaHi21SgE5pdfFPI6Ef+FApRZ6Dd0
VGUhWS/5B6mqAUszdiQgTDFqy+l3WmDXPX2+6thb77umO1q1KpctxFjWvhSz
Vk8xrrqkwlxl3CxugNeVys1PI2ymM5TAk/YttqXp5xpyj9w6ipKnoFsSIy7R
oGZeV5x4NdbMDnPzV9ydewgJA1UEkwvEjCB7fiE25OYjY3yqLm7ElgJVtX5e
Njx7a/yrXjJkB3ogYyZzCuOmGldALzbyK8vv6J2X2hWVYDDI4choPYova5Oz
IwkI/E/8dmO4KAP2d71Uhoc4fxP6BV5fpvKQ0k1Kf/DAx3sDGCw2bgwK59fM
r5t1
=TwBO
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Potential OSD deadlock?
       [not found]                                                                                                                   ` <2FD6AADF-88BB-4DFF-B6C2-E103F16B55A8-SB6/BxVxTjHtwjQa/ONI9g@public.gmane.org>
  2015-10-09 14:14                                                                                                                     ` Max A. Krasilnikov
@ 2015-10-16  8:21                                                                                                                     ` Max A. Krasilnikov
  1 sibling, 0 replies; 45+ messages in thread
From: Max A. Krasilnikov @ 2015-10-16  8:21 UTC (permalink / raw)
  To: Jan Schermer; +Cc: Sage Weil, ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel

Hello!

On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote:

> Have you tried running iperf between the nodes? Capturing a pcap of the (failing) Ceph comms from both sides could help narrow it down.
> Is there any SDN layer involved that could add overhead/padding to the frames?

> What about some intermediate MTU like 8000 - does that work?
> Oh, and if there's any bonding/trunking involved, beware that you need to set the same MTU and offloads on all interfaces on certain kernels - flags like MTU/offloads should propagate between the master/slave interfaces, but in reality that's not the case and they get reset even if you unplug/replug the Ethernet cable.
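A way to verify that by hand, with bond0/eth0/eth1 as placeholder interface
names:

  # MTU must match on the bond and on every slave, not just the master
  ip link show bond0; ip link show eth0; ip link show eth1
  # offload flags can silently diverge between master and slaves
  ethtool -k bond0 | egrep 'segmentation|scatter-gather|checksumming'
  ethtool -k eth0  | egrep 'segmentation|scatter-gather|checksumming'
  ethtool -k eth1  | egrep 'segmentation|scatter-gather|checksumming'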

I'm sorry for taking so long to answer, but I have fixed the problem with
jumbo frames with sysctl:
#####
net.ipv4.tcp_moderate_rcvbuf = 0
#####
net.ipv4.tcp_rmem= 1024000 8738000 1677721600
net.ipv4.tcp_wmem= 1024000 8738000 1677721600
net.ipv4.tcp_mem= 1024000 8738000 1677721600
net.core.rmem_max=1677721600
net.core.rmem_default=167772160
net.core.wmem_max=1677721600
net.core.wmem_default=167772160

And now I can load my cluster without any slow requests. The essential setting
is net.ipv4.tcp_moderate_rcvbuf = 0; all the others are just tunings.
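To make that survive a reboot, one option (the file name is just a choice) is
a drop-in under /etc/sysctl.d; note that net.ipv4.tcp_mem is counted in pages
rather than bytes, so it normally wants much smaller values than tcp_rmem and
tcp_wmem:

  # apply immediately
  sysctl -w net.ipv4.tcp_moderate_rcvbuf=0
  # persist across reboots
  echo 'net.ipv4.tcp_moderate_rcvbuf = 0' > /etc/sysctl.d/90-ceph-net.conf
  sysctl --system

With receive-buffer autotuning disabled, the default (middle) tcp_rmem value
is what sockets actually get, so it needs to be large enough for the
bandwidth-delay product of the links.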

> Jan

>> On 09 Oct 2015, at 13:21, Max A. Krasilnikov <pseudo-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org> wrote:
>> 
>> Hello!
>> 
>> On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:
>> 
>>> Are there any errors on the NICs? (ethtool -s ethX)
>> 
>> No errors. Neither on nodes, nor on switches.
>> 
>>> Also take a look at the switch and look for flow control statistics - do you have flow control enabled or disabled?
>> 
>> flow control disabled everywhere.
>> 
>>> We had to disable flow control as it would pause all IO on the port whenever any path got congested which you don't want to happen with a cluster like Ceph. It's better to let the frame drop/retransmit in this case (and you should size it so it doesn't happen in any case).
>>> And how about NIC offloads? Do they play nice with jumbo frames? I wouldn't put my money on that...
>> 
>> I tried completely disabling all offloads and setting the MTU back to 9000
>> afterwards. No luck.
>> I am speaking with my NOC about the MTU in the 10G network. If I have an
>> update, I will write here. I can hardly believe that it is on the Ceph side,
>> but nothing is impossible.
>> 
>>> Jan
>> 
>> 
>>>> On 09 Oct 2015, at 10:48, Max A. Krasilnikov <pseudo-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org> wrote:
>>>> 
>>>> Hello!
>>>> 
>>>> On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
>>>> 
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>> 
>>>>> Sage,
>>>> 
>>>>> After trying to bisect this issue (all tests moved the bisect towards
>>>>> Infernalis) and eventually testing the Infernalis branch again, it
>>>>> looks like the problem still exists, although it is handled a tad
>>>>> better in Infernalis. I'm going to test against Firefly/Giant next
>>>>> week and then try to dive into the code to see if I can expose
>>>>> anything.
>>>> 
>>>>> If I can do anything to provide you with information, please let me know.
>>>> 
>>>> I have fixed my troubles by setting the MTU back to 1500 from 9000 in the 2x10G
>>>> network between nodes (2x Cisco Nexus 5020, one link per switch, LACP, Linux
>>>> bonding driver: bonding mode=4 lacp_rate=1 xmit_hash_policy=1 miimon=100, Intel
>>>> 82599ES adapter, non-Intel SFP+). When setting it to 9000 on the nodes and 9216
>>>> on the Nexus 5020 switches with jumbo frames enabled I get a performance drop
>>>> and slow requests. When setting 1500 on the nodes and not touching the Nexus,
>>>> all problems are fixed.
>>>> 
>>>> I rebooted all my Ceph services when changing the MTU, and changed things
>>>> between 9000 and 1500 several times in order to be sure. It is reproducible
>>>> in my environment.
>>>> 
>>>>> Thanks,
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: Mailvelope v1.2.0
>>>>> Comment: https://www.mailvelope.com
>>>> 
>>>>> wsFcBAEBCAAQBQJWF1QlCRDmVDuy+mK58QAAWLgP/2l+TkcpeKihDxF8h/kw
>>>>> YFffNWODNfOMq8FVDQkQceo2mFCFc29JnBYiAeqW+XPelwuU5S86LG998aUB
>>>>> BvIU4EHaJNJ31X1NCIA7nwi8rXlFYfSG2qQn58+IzqZoWCQM5vD/THISV1rP
>>>>> qQKtoOAEuRxz+vOAJGI1A1xJSOiFwTRjs4LjE1zYjSP26LdEF61D/lb+AVzV
>>>>> ufxi/ci6mAla/4VTAH4VqEviDgC8AbAZnWFGfUPcTUxJQS99kFrfjJnWvgyF
>>>>> V9EmWtQCvhRO74hQLBqspOwdAxEJesPfGcJT1LjR0eEAMWvbGPtaqbSFAEWa
>>>>> jjyy5wP9+4NnGLdhba6UBtLphjqTcl0e2vVwRj0zLhI14moAOlbhIKmZ1Dt+
>>>>> 1P6vfgOUGvO76xgDMwrVKRoQgWJO/0Tup9+oqInnNYgf4W+ZWsLgLgo7ETAF
>>>>> VcI7LP1wkwAI3lz5YphY/TnKNGs6i+wVjKBamOt3R1yz9WeylaG0T6xgGHrs
>>>>> VugrRSUuO+ND9+mE5EsUgITCZoaavXJESJMb30XkK6hYGB+T/q+hBafc6Wle
>>>>> Jgs+aT2m1erdSyZn0ZC9a6CjWmwJXY6FCSGhE53BbefBxmCFxn+8tVav+Q8W
>>>>> 7s14TntP6ex4ca7eTwGuSXC9FU5fAVa+3+3aXDAC1QPAkeVkXyB716W1XG6b
>>>>> BCFo
>>>>> =GJL4
>>>>> -----END PGP SIGNATURE-----
>>>>> ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> 
>>>> 
>>>>> On Wed, Oct 7, 2015 at 1:25 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> Hash: SHA256
>>>>>> 
>>>>>> We forgot to upload the ceph.log yesterday. It is there now.
>>>>>> - ----------------
>>>>>> Robert LeBlanc
>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc  wrote:
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA256
>>>>>>> 
>>>>>>> I upped the debug on about everything and ran the test for about 40
>>>>>>> minutes. I took osd.19 on ceph1 down and then brought it back in.
>>>>>>> There was at least one op on osd.19 that was blocked for over 1,000
>>>>>>> seconds. Hopefully this will have something that will cast a light on
>>>>>>> what is going on.
>>>>>>> 
>>>>>>> We are going to upgrade this cluster to Infernalis tomorrow and rerun
>>>>>>> the test to verify the results from the dev cluster. This cluster
>>>>>>> matches the hardware of our production cluster but is not yet in
>>>>>>> production so we can safely wipe it to downgrade back to Hammer.
>>>>>>> 
>>>>>>> Logs are located at http://dev.v3trae.net/~jlavoy/ceph/logs/
>>>>>>> 
>>>>>>> Let me know what else we can do to help.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>> Version: Mailvelope v1.2.0
>>>>>>> Comment: https://www.mailvelope.com
>>>>>>> 
>>>>>>> wsFcBAEBCAAQBQJWFFwACRDmVDuy+mK58QAAs/UP/1L+y7DEfHqD/5OpkiNQ
>>>>>>> xuEEDm7fNJK58tLRmKsCrDrsFUvWCjiqUwboPg/E40e2GN7Lt+VkhMUEUWoo
>>>>>>> e3L20ig04c8Zu6fE/SXX3lnvayxsWTPcMnYI+HsmIV9E/efDLVLEf6T4fvXg
>>>>>>> 5dKLiqQ8Apu+UMVfd1+aKKDdLdnYlgBCZcIV9AQe1GB8X2VJJhmNWh6TQ3Xr
>>>>>>> gNXDexBdYjFBLu84FXOITd3ZtyUkgx/exCUMmwsJSc90jduzipS5hArvf7LN
>>>>>>> HD6m1gBkZNbfWfc/4nzqOQnKdY1pd9jyoiQM70jn0R5b2BlZT0wLjiAJm+07
>>>>>>> eCCQ99TZHFyeu1LyovakrYncXcnPtP5TfBFZW952FWQugupvxPCcaduz+GJV
>>>>>>> OhPAJ9dv90qbbGCO+8kpTMAD1aHgt/7+0/hKZTg8WMHhua68SFCXmdGAmqje
>>>>>>> IkIKswIAX4/uIoo5mK4TYB5HdEMJf9DzBFd+1RzzfRrrRalVkBfsu5ChFTx3
>>>>>>> mu5LAMwKTslvILMxAct0JwnwkOX5Gd+OFvmBRdm16UpDaDTQT2DfykylcmJd
>>>>>>> Cf9rPZxUv0ZHtZyTTyP2e6vgrc7UM/Ie5KonABxQ11mGtT8ysra3c9kMhYpw
>>>>>>> D6hcAZGtdvpiBRXBC5gORfiFWFxwu5kQ+daUhgUIe/O/EWyeD0rirZoqlLnZ
>>>>>>> EDrG
>>>>>>> =BZVw
>>>>>>> -----END PGP SIGNATURE-----
>>>>>>> ----------------
>>>>>>> Robert LeBlanc
>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Oct 6, 2015 at 2:36 PM, Robert LeBlanc  wrote:
>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>> Hash: SHA256
>>>>>>>> 
>>>>>>>> On my second test (a much longer one), it took nearly an hour, but a
>>>>>>>> few messages have popped up over a 20 window. Still far less than I
>>>>>>>> have been seeing.
>>>>>>>> - ----------------
>>>>>>>> Robert LeBlanc
>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Oct 6, 2015 at 2:00 PM, Robert LeBlanc  wrote:
>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>> Hash: SHA256
>>>>>>>>> 
>>>>>>>>> I'll capture another set of logs. Is there any other debugging you
>>>>>>>>> want turned up? I've seen the same thing where I see the message
>>>>>>>>> dispatched to the secondary OSD, but the message just doesn't show up
>>>>>>>>> for 30+ seconds in the secondary OSD logs.
>>>>>>>>> - ----------------
>>>>>>>>> Robert LeBlanc
>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Oct 6, 2015 at 1:34 PM, Sage Weil  wrote:
>>>>>>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>> 
>>>>>>>>>>> I can't think of anything. In my dev cluster the only thing that has
>>>>>>>>>>> changed is the Ceph versions (no reboot). What I like is even though
>>>>>>>>>>> the disks are 100% utilized, it is performing as I expect now. Client
>>>>>>>>>>> I/O is slightly degraded during the recovery, but no blocked I/O when
>>>>>>>>>>> the OSD boots or during the recovery period. This is with
>>>>>>>>>>> max_backfills set to 20, one backfill max in our production cluster is
>>>>>>>>>>> painful on OSD boot/recovery. I was able to reproduce this issue on
>>>>>>>>>>> our dev cluster very easily and very quickly with these settings. So
>>>>>>>>>>> far two tests and an hour later, only the blocked I/O when the OSD is
>>>>>>>>>>> marked out. We would love to see that go away too, but this is far
>>>>>>>>>>                                           (me too!)
>>>>>>>>>>> better than what we have now. This dev cluster also has
>>>>>>>>>>> osd_client_message_cap set to default (100).
>>>>>>>>>>> 
>>>>>>>>>>> We need to stay on the Hammer version of Ceph and I'm willing to take
>>>>>>>>>>> the time to bisect this. If this is not a problem in Firefly/Giant,
>>>>>>>>>>> would you prefer a bisect to find the introduction of the problem
>>>>>>>>>>> (Firefly/Giant -> Hammer) or the introduction of the resolution
>>>>>>>>>>> (Hammer -> Infernalis)? Do you have any hints for reducing the chance
>>>>>>>>>>> of hitting a commit that prevents a clean build, as that is my most
>>>>>>>>>>> limiting factor?
>>>>>>>>>> 
>>>>>>>>>> Nothing comes to mind.  I think the best way to find this is still to see
>>>>>>>>>> it happen in the logs with hammer.  The frustrating thing with that log
>>>>>>>>>> dump you sent is that although I see plenty of slow request warnings in
>>>>>>>>>> the osd logs, I don't see the requests arriving.  Maybe the logs weren't
>>>>>>>>>> turned up for long enough?
>>>>>>>>>> 
>>>>>>>>>> sage
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> - ----------------
>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Oct 6, 2015 at 12:32 PM, Sage Weil  wrote:
>>>>>>>>>>>> On Tue, 6 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>> 
>>>>>>>>>>>>> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
>>>>>>>>>>>>> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
>>>>>>>>>>>>> messages when the OSD was marked out:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
>>>>>>>>>>>>> cluster [WRN] 17 slow requests, 3 included below; oldest blocked for >
>>>>>>>>>>>>> 34.476006 secs
>>>>>>>>>>>>> 2015-10-06 11:52:46.961056 osd.13 192.168.55.12:6800/20870 82 :
>>>>>>>>>>>>> cluster [WRN] slow request 32.913474 seconds old, received at
>>>>>>>>>>>>> 2015-10-06 11:52:14.047475: osd_op(client.600962.0:474
>>>>>>>>>>>>> rbd_data.338102ae8944a.0000000000005270 [read 3302912~4096] 8.c74a4538
>>>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>>>> 2015-10-06 11:52:46.961066 osd.13 192.168.55.12:6800/20870 83 :
>>>>>>>>>>>>> cluster [WRN] slow request 32.697545 seconds old, received at
>>>>>>>>>>>>> 2015-10-06 11:52:14.263403: osd_op(client.600960.0:583
>>>>>>>>>>>>> rbd_data.3380f74b0dc51.000000000001ee75 [read 1016832~4096] 8.778d1be3
>>>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>>>> 2015-10-06 11:52:46.961074 osd.13 192.168.55.12:6800/20870 84 :
>>>>>>>>>>>>> cluster [WRN] slow request 32.668006 seconds old, received at
>>>>>>>>>>>>> 2015-10-06 11:52:14.292942: osd_op(client.600955.0:571
>>>>>>>>>>>>> rbd_data.3380f74b0dc51.0000000000019b09 [read 1034240~4096] 8.e87a6f58
>>>>>>>>>>>>> ack+read+known_if_redirected e58744) currently waiting for peered
>>>>>>>>>>>>> 
>>>>>>>>>>>>> But I'm not seeing the blocked messages when the OSD came back in. The
>>>>>>>>>>>>> OSD spindles have been running at 100% during this test. I have seen
>>>>>>>>>>>>> slowed I/O from the clients as expected from the extra load, but so
>>>>>>>>>>>>> far no blocked messages. I'm going to run some more tests.
>>>>>>>>>>>> 
>>>>>>>>>>>> Good to hear.
>>>>>>>>>>>> 
>>>>>>>>>>>> FWIW I looked through the logs and all of the slow request no flag point
>>>>>>>>>>>> messages came from osd.163... and the logs don't show when they arrived.
>>>>>>>>>>>> My guess is this OSD has a slower disk than the others, or something else
>>>>>>>>>>>> funny is going on?
>>>>>>>>>>>> 
>>>>>>>>>>>> I spot checked another OSD at random (60) where I saw a slow request.  It
>>>>>>>>>>>> was stuck peering for 10s of seconds... waiting on a pg log message from
>>>>>>>>>>>> osd.163.
>>>>>>>>>>>> 
>>>>>>>>>>>> sage
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> wsFcBAEBCAAQBQJWFAzRCRDmVDuy+mK58QAASRYP/jrbKy5mptq/cSqJvB47
>>>>>>>>>>>>> F/gEatsqU4/TwyIJg137DQTkONbHKnLgCZqsJLnCZRH8fFqtvY6g/Q/AA7Ks
>>>>>>>>>>>>> ouo5gvbjKM7pOm/uUn8kU44Xe15f/bkVHvWBECZzg8YJwinPAisp5R0m1HBC
>>>>>>>>>>>>> HLvsbeqV00m72TyfsZX4aj7lHdyvcdcIH2EVgX/db092VVXczK4q2gRoNr0Y
>>>>>>>>>>>>> 77BEr2Y/gPj5LM4b/aDG5AWY8dJZRlNz+B1CyLS+kIDXSaAbzul2UbAG6jNE
>>>>>>>>>>>>> KJEVxndMPfHLIdwg55+q8VTMIjqXcCM47cQhWFrKChgVD8byJxpc6E0TqOxs
>>>>>>>>>>>>> 1gtNE8AILoCSYKnwQZan+TBDGxki7rQxzMdNI+NLfhy1Mwd3lSCPsDtD7W/i
>>>>>>>>>>>>> tzNTr6aGz+wr+OPDQV5zrzLaPZYF3FLWN4n6RYNfnDramYzD76v+7kjdW4dE
>>>>>>>>>>>>> 5UVCtE7KGLCZ21fu6sln1b9q6lYXNtohAmAunIdqpo3FmHusRySyZzYKu1+9
>>>>>>>>>>>>> zg/LHiArD/ddjkPxVWCTFBS17g/bESRcv2MsA30GS8J6k1zlQaLX5KeGg6Ql
>>>>>>>>>>>>> WJSmW8gFfEbXj/7JTrVtQWTdgjsegaySFnDisTWUR/hEM/NuKii4xfjI32M/
>>>>>>>>>>>>> luUMXHZ8lTHk9C8MfZcpyPGvwp2FliD9LqaWOVPWtWZJcerEWcZVlEApg4qb
>>>>>>>>>>>>> fo5a
>>>>>>>>>>>>> =ahEi
>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Oct 6, 2015 at 6:37 AM, Sage Weil  wrote:
>>>>>>>>>>>>>> On Mon, 5 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> With some off-list help, we have adjusted
>>>>>>>>>>>>>>> osd_client_message_cap=10000. This seems to have helped a bit and we
>>>>>>>>>>>>>>> have seen some OSDs have a value up to 4,000 for client messages. But
>>>>>>>>>>>>>>> it does not solve the problem with the blocked I/O.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> One thing that I have noticed is that almost exactly 30 seconds elapse
>>>>>>>>>>>>>>> between when an OSD boots and the first blocked I/O message. I don't know
>>>>>>>>>>>>>>> if the OSD doesn't have time to get its brain right about a PG before
>>>>>>>>>>>>>>> it starts servicing it, or what exactly.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm downloading the logs from yesterday now; sorry it's taking so long.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On another note, I tried upgrading our CentOS dev cluster from Hammer
>>>>>>>>>>>>>>> to master and things didn't go so well. The OSDs would not start
>>>>>>>>>>>>>>> because /var/lib/ceph was not owned by ceph. I chowned the directory
>>>>>>>>>>>>>>> and all the OSDs, and the OSDs then started, but never became active in the
>>>>>>>>>>>>>>> cluster. They just sat there after reading all the PGs. There were
>>>>>>>>>>>>>>> sockets open to the monitor, but no OSD to OSD sockets. I tried
>>>>>>>>>>>>>>> downgrading to the Infernalis branch and still no luck getting the
>>>>>>>>>>>>>>> OSDs to come up. The OSD processes were idle after the initial boot.
>>>>>>>>>>>>>>> All packages were installed from gitbuilder.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Did you chown -R ?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>       https://github.com/ceph/ceph/blob/infernalis/doc/release-notes.rst#upgrading-from-hammer
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> My guess is you only chowned the root dir, and the OSD didn't throw
>>>>>>>>>>>>>> an error when it encountered the other files?  If you can generate a debug
>>>>>>>>>>>>>> osd = 20 log, that would be helpful.. thanks!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>> 
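The recipe from those release notes boils down to something like the
following (service handling differs between the sysvinit and systemd setups):

  # stop the daemons first, then hand the whole tree to the ceph user
  service ceph stop            # or: systemctl stop ceph-osd.target
  chown -R ceph:ceph /var/lib/ceph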
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWE0F5CRDmVDuy+mK58QAAaCYQAJuFcCvRUJ46k0rYrMcc
>>>>>>>>>>>>>>> YlrSrGwS57GJS/JjaFHsvBV7KTobEMNeMkSv4PTGpwylNV9Dx4Ad74DDqX4g
>>>>>>>>>>>>>>> 6hZDe0rE+uEI7tW9Lqp+MN7eaU2lDuwLt/pOzZI14jTskUYTlNi3HjlN67mQ
>>>>>>>>>>>>>>> aiX1rbrJL6FFkuMOn/YqHpMbxI5ZOUZc1s7RDhASOPIs4z/CxpDfluW6fZA/
>>>>>>>>>>>>>>> y8C+pW6zzS9U/6jZwtGhBq4dvDBO41Lxb9WOehD8Aa/Qt6XNDzGw2KEkEkw7
>>>>>>>>>>>>>>> 8dBc7UFa2Wx3Tnzy238a/nKhtz6O6OrHsroA+HGWwCoxPWjOsz/xOoOmfwp+
>>>>>>>>>>>>>>> ALkY3id+t2uJEqzbL8/MgJ2RV1A+AZ7W1VWIJUOkDz0wR+KxQsxduHoD6rQy
>>>>>>>>>>>>>>> zg0fj2KSAlmVusYOPM1s1+jBsqNF3wcNxpbRoVuFqk0xMgGPrIdUNdZHg6bs
>>>>>>>>>>>>>>> D5sfkjNKexFe0ifFJ0cfv6UaGIKv4dK2eq3jUKgXHfh/qZmJbEB+zHaqJNyg
>>>>>>>>>>>>>>> CN6w6xu1FHLeVobKAWe5ZzKY5lxw6b8YG+ce/E2dvW73gSASPTvtv68gaT04
>>>>>>>>>>>>>>> 2SPF9Ql0fERL5EDY9Pc4MHpQVcS0XxxJA69CgnWgaG6fzq2eY7fALeMBVWlB
>>>>>>>>>>>>>>> fRj3zQwqJls/X8JZ3c4P4G0R6DP9bmMwGr++oYc3gWGrvgzxw3N7+ornd0jd
>>>>>>>>>>>>>>> GdXC
>>>>>>>>>>>>>>> =Aigq
>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sun, Oct 4, 2015 at 3:04 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I have eight nodes running the fio job rbd_test_real to different RBD
>>>>>>>>>>>>>>>> volumes. I've included the CRUSH map in the tarball.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I stopped one OSD process and marked it out. I let it recover for a
>>>>>>>>>>>>>>>> few minutes and then I started the process again and marked it in. I
>>>>>>>>>>>>>>>> started getting blocked I/O messages during the recovery.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The logs are located at http://162.144.87.113/files/ushou1.tar.xz
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWEZRcCRDmVDuy+mK58QAALbEQAK5pFiixJarUdLm50zp/
>>>>>>>>>>>>>>>> 3AGgGBPrieExKmoZZLCoMGfOLfxZDbN2ybtopKDQDfrTqndE/6Xi9UXqTOdW
>>>>>>>>>>>>>>>> jDc9U1wusgG0CKPsY1SMYnB9akvaDwtdh5q5k4VpN2zsG9R6lRojHeNQR3Nf
>>>>>>>>>>>>>>>> 56QevJL4/e5lC3sLhVnxXXi2XKnHCVOHT+PYgNour2ZWt6OTLoFFxuSU3zLN
>>>>>>>>>>>>>>>> OtfXgrFiiNF0mrDpm0gg2l8a8N5SwP9mM233S2U/JiGAqsqoqkfd0okjDenC
>>>>>>>>>>>>>>>> ksesU/n7zordFpfLN3yjL6+X9pQ4YA6otZrq4wWtjWKO/H0b+6iIsf/AE131
>>>>>>>>>>>>>>>> R6a4Vufndpd3Ce+FNfM+iu3FmKk0KVfDAaF/tIP6S6XUzGVMAbpvpmqNL17o
>>>>>>>>>>>>>>>> boh3wPZEyK+7KiF4Qlt2KoI/FV24Yj8XiyMnKin3MbMYbammb4ER977VH7iI
>>>>>>>>>>>>>>>> sZyelNPSsYmmw/MF+AkA5KVgzQ4DAPflaejIgC5uw3dYKrn2AQE5CE9nN8Gz
>>>>>>>>>>>>>>>> GVVaGItu1Bvrz21QoT9o5v0dZ85zttFvtrKIYgSi4mdpC6XkzUbg9s9EB1/T
>>>>>>>>>>>>>>>> SEY+fau7W7TtiLpzCAIQ3zDvgsvkx2P6tKg5U8e93LVv9B+YI8i8mUxxv1j5
>>>>>>>>>>>>>>>> PHFi7KTgRUPm1FPMJDSyzvOgqyMj9AzaESl1Na6k529ILFIcyfko0niTT1oZ
>>>>>>>>>>>>>>>> 3EPx
>>>>>>>>>>>>>>>> =UDIV
>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sun, Oct 4, 2015 at 7:48 AM, Sage Weil  wrote:
>>>>>>>>>>>>>>>>> On Sat, 3 Oct 2015, Robert LeBlanc wrote:
>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> We are still struggling with this and have tried a lot of different
>>>>>>>>>>>>>>>>>> things. Unfortunately, Inktank (now Red Hat) no longer provides
>>>>>>>>>>>>>>>>>> consulting services for non-Red Hat systems. If there are any
>>>>>>>>>>>>>>>>>> certified Ceph consultants in the US with whom we can do both remote and
>>>>>>>>>>>>>>>>>> on-site engagements, please let us know.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> This certainly seems to be network related, but somewhere in the
>>>>>>>>>>>>>>>>>> kernel. We have tried increasing the network and TCP buffers and the
>>>>>>>>>>>>>>>>>> number of TCP sockets, and reduced the FIN_WAIT2 timeout. There is
>>>>>>>>>>>>>>>>>> about 25% idle on the boxes; the disks are busy, but not constantly at
>>>>>>>>>>>>>>>>>> 100% (they cycle from <10% up to 100%, but not 100% for more than a
>>>>>>>>>>>>>>>>>> few seconds at a time). There seems to be no reasonable explanation
>>>>>>>>>>>>>>>>>> why I/O is blocked fairly frequently for longer than 30 seconds. We
>>>>>>>>>>>>>>>>>> have verified jumbo frames by pinging from/to each node with 9000-byte
>>>>>>>>>>>>>>>>>> packets. The network admins have verified that packets are not being
>>>>>>>>>>>>>>>>>> dropped in the switches for these nodes. We have tried different
>>>>>>>>>>>>>>>>>> kernels, including the recent Google patch to CUBIC. This is showing
>>>>>>>>>>>>>>>>>> up on three clusters (two Ethernet and one IPoIB). I booted one
>>>>>>>>>>>>>>>>>> cluster into Debian Jessie (from CentOS 7.1) with similar results.
>>>>>>>>>>>>>>>>>> 
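For reference, the buffer and FIN_WAIT2 tuning described above is
normally done with sysctls along these lines. This is only a sketch;
the values are illustrative and are not the ones used on these
clusters:

    # /etc/sysctl.d/99-net-tuning.conf -- illustrative values only
    net.core.rmem_max = 67108864
    net.core.wmem_max = 67108864
    net.ipv4.tcp_rmem = 4096 87380 67108864
    net.ipv4.tcp_wmem = 4096 65536 67108864
    net.ipv4.tcp_fin_timeout = 10
    # apply with: sysctl --system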
>>>>>>>>>>>>>>>>>> The messages seem slightly different:
>>>>>>>>>>>>>>>>>> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
>>>>>>>>>>>>>>>>>> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>> 100.087155 secs
>>>>>>>>>>>>>>>>>> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.041999 seconds old, received at
>>>>>>>>>>>>>>>>>> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
>>>>>>>>>>>>>>>>>> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
>>>>>>>>>>>>>>>>>> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
>>>>>>>>>>>>>>>>>> points reached
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I don't know what "no flag points reached" means.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Just that the op hasn't been marked as reaching any interesting points
>>>>>>>>>>>>>>>>> (op->mark_*() calls).
>>>>>>>>>>>>>>>>> 
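(For anyone following along: the flag points an op has reached can be
inspected on a live OSD through its admin socket; osd.13 below is just
an example, and the command has to be run on the node hosting that
OSD.)

    ceph daemon osd.13 dump_ops_in_flight   # ops currently in flight, with their event history
    ceph daemon osd.13 dump_historic_ops    # recently completed slow ops and per-event timestamps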
>>>>>>>>>>>>>>>>> Is it possible to gather a log with debug ms = 20 and debug osd = 20?
>>>>>>>>>>>>>>>>> It's extremely verbose but it'll let us see where the op is getting
>>>>>>>>>>>>>>>>> blocked.  If you see the "slow request" message it means the op is
>>>>>>>>>>>>>>>>> received by ceph (that's when the clock starts), so I suspect it's not
>>>>>>>>>>>>>>>>> something we can blame on the network stack.
>>>>>>>>>>>>>>>>> 
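One way to raise those debug levels temporarily, without restarting
any daemons, is injectargs; a sketch (expect a very large log volume):

    ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 20'
    # ... reproduce the blocked I/O ...
    ceph tell osd.* injectargs '--debug_osd 0 --debug_ms 0'   # quiet things back down afterwards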
>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The problem is most pronounced when we have to reboot an OSD node (1
>>>>>>>>>>>>>>>>>> of 13): we will have hundreds of I/Os blocked, sometimes for up to 300
>>>>>>>>>>>>>>>>>> seconds. It takes a good 15 minutes for things to settle down. The
>>>>>>>>>>>>>>>>>> production cluster is very busy, normally doing 8,000 I/Os and peaking
>>>>>>>>>>>>>>>>>> at 15,000. This is all 4TB spindles with SSD journals, and the disks
>>>>>>>>>>>>>>>>>> are between 25-50% full. We are currently splitting PGs to distribute
>>>>>>>>>>>>>>>>>> the load better across the disks, but we are having to do this 10 PGs
>>>>>>>>>>>>>>>>>> at a time because we get blocked I/O. We have max_backfills and
>>>>>>>>>>>>>>>>>> max_recovery set to 1, and client op priority is set higher than
>>>>>>>>>>>>>>>>>> recovery priority. We tried increasing the number of op threads but
>>>>>>>>>>>>>>>>>> this didn't seem to help. It seems that as soon as PGs are finished
>>>>>>>>>>>>>>>>>> being checked, they become active, and that could be the cause of slow
>>>>>>>>>>>>>>>>>> I/O while the other PGs are being checked.
>>>>>>>>>>>>>>>>>> 
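For the archives, the throttling referred to above normally lives in
ceph.conf under [osd]. A sketch, where "max_recovery" is assumed to
mean osd recovery max active and the priority numbers are only
illustrative (anything that puts client ops above recovery matches the
description):

    [osd]
        osd max backfills = 1
        osd recovery max active = 1
        osd client op priority = 63
        osd recovery op priority = 1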
>>>>>>>>>>>>>>>>>> What I don't understand is why the messages are delayed. As soon as
>>>>>>>>>>>>>>>>>> the message is received by the Ceph OSD process, it is very quickly
>>>>>>>>>>>>>>>>>> committed to the journal and a response is sent back to the primary
>>>>>>>>>>>>>>>>>> OSD, which is received very quickly as well. I've adjusted
>>>>>>>>>>>>>>>>>> min_free_kbytes and it seems to keep the OSDs from crashing, but it
>>>>>>>>>>>>>>>>>> doesn't solve the main problem. We don't have swap, and there is 64 GB
>>>>>>>>>>>>>>>>>> of RAM per node for 10 OSDs.
>>>>>>>>>>>>>>>>>> 
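min_free_kbytes is a plain VM sysctl; for reference (the value below
is illustrative, not a recommendation):

    sysctl vm.min_free_kbytes             # show the current reserve
    sysctl -w vm.min_free_kbytes=262144   # e.g. keep ~256 MB free; persist it in /etc/sysctl.conf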
>>>>>>>>>>>>>>>>>> Is there something that could cause the kernel to get a packet but not
>>>>>>>>>>>>>>>>>> be able to dispatch it to Ceph, which could explain why we are seeing
>>>>>>>>>>>>>>>>>> this blocked I/O for 30+ seconds? Are there any pointers on tracing
>>>>>>>>>>>>>>>>>> Ceph messages from the network buffer through the kernel to the Ceph
>>>>>>>>>>>>>>>>>> process?
>>>>>>>>>>>>>>>>>> 
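One rough way to bracket where the delay happens, assuming the
interface and OSD port are known (eth0 and 6800 below are
placeholders): capture the wire traffic and compare packet arrival
times against the OSD's "debug ms" receive timestamps, and watch the
socket queues while an I/O is blocked.

    tcpdump -i eth0 -w osd.pcap 'tcp port 6800'
    ss -tni '( sport = :6800 or dport = :6800 )'   # a large Recv-Q with an idle daemon suggests the
                                                   # data arrived but has not been read by the process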
>>>>>>>>>>>>>>>>>> We could really use some pointers, no matter how outrageous. We've had
>>>>>>>>>>>>>>>>>> over six people looking into this for weeks now and just can't think
>>>>>>>>>>>>>>>>>> of anything else.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
>>>>>>>>>>>>>>>>>> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
>>>>>>>>>>>>>>>>>> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
>>>>>>>>>>>>>>>>>> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
>>>>>>>>>>>>>>>>>> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
>>>>>>>>>>>>>>>>>> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
>>>>>>>>>>>>>>>>>> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
>>>>>>>>>>>>>>>>>> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
>>>>>>>>>>>>>>>>>> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
>>>>>>>>>>>>>>>>>> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
>>>>>>>>>>>>>>>>>> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
>>>>>>>>>>>>>>>>>> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
>>>>>>>>>>>>>>>>>> l7OF
>>>>>>>>>>>>>>>>>> =OI++
>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>> We dropped the replication on our cluster from 4 to 3 and it looks
>>>>>>>>>>>>>>>>>>> like all the blocked I/O has stopped (no entries in the log for the
>>>>>>>>>>>>>>>>>>> last 12 hours). This makes me believe that there is some issue with
>>>>>>>>>>>>>>>>>>> the number of sockets or some other TCP issue. We have not messed with
>>>>>>>>>>>>>>>>>>> ephemeral ports or TIME_WAIT at this point. There are 130 OSDs and 8
>>>>>>>>>>>>>>>>>>> KVM hosts hosting about 150 VMs. The open-files limit is set at 32K
>>>>>>>>>>>>>>>>>>> for the OSD processes and 16K system wide.
>>>>>>>>>>>>>>>>>>> 
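A quick sketch of how to double-check the limits and socket counts in
question (nothing cluster-specific is assumed here):

    ss -s                                                  # socket totals per TCP state, incl. timewait
    sysctl fs.file-max                                     # system-wide file handle limit
    grep 'open files' /proc/$(pidof -s ceph-osd)/limits    # effective per-OSD-process limit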
>>>>>>>>>>>>>>>>>>> Does this seem like the right spot to be looking? What are some
>>>>>>>>>>>>>>>>>>> configuration items we should be looking at?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> We were only able to get ~17Gb out of the XL710 (heavily tweaked)
>>>>>>>>>>>>>>>>>>>> until we went to the 4.x kernel, where we got ~36Gb (no tweaking). It
>>>>>>>>>>>>>>>>>>>> seems that there were some major reworks of the network handling in
>>>>>>>>>>>>>>>>>>>> the kernel to efficiently handle that network rate. If I remember
>>>>>>>>>>>>>>>>>>>> right, we also saw a drop in CPU utilization. I'm starting to think
>>>>>>>>>>>>>>>>>>>> that we did see packet loss while congesting our ISLs in our initial
>>>>>>>>>>>>>>>>>>>> testing, but we could not tell where the dropping was happening. We
>>>>>>>>>>>>>>>>>>>> saw some on the switches, but it didn't seem to be bad if we weren't
>>>>>>>>>>>>>>>>>>>> trying to congest things. We probably already saw this issue, we just
>>>>>>>>>>>>>>>>>>>> didn't know it.
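For comparison, these are the usual places such drops show up (the
interface name below is a placeholder):

    ethtool -S ens513f1 | grep -Ei 'drop|discard|err'   # NIC/driver counters
    ip -s link show ens513f1                            # kernel-level rx/tx drop counts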
>>>>>>>>>>>>>>>>>>>> - ----------------
>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
>>>>>>>>>>>>>>>>>>>>> FWIW, we've got some 40GbE Intel cards in the community performance cluster
>>>>>>>>>>>>>>>>>>>>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
>>>>>>>>>>>>>>>>>>>>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
>>>>>>>>>>>>>>>>>>>>> drivers might cause problems though.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Here's ifconfig from one of the nodes:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> ens513f1: flags=4163  mtu 1500
>>>>>>>>>>>>>>>>>>>>>       inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
>>>>>>>>>>>>>>>>>>>>>       inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20
>>>>>>>>>>>>>>>>>>>>>       ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
>>>>>>>>>>>>>>>>>>>>>       RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
>>>>>>>>>>>>>>>>>>>>>       RX errors 0  dropped 0  overruns 0  frame 0
>>>>>>>>>>>>>>>>>>>>>       TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
>>>>>>>>>>>>>>>>>>>>>       TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> OK, here is the update on the saga...
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I traced some more blocked I/Os and it seems that communication
>>>>>>>>>>>>>>>>>>>>>> between two particular hosts was worse than between others. I did a
>>>>>>>>>>>>>>>>>>>>>> two-way ping flood between the two hosts using the max packet size
>>>>>>>>>>>>>>>>>>>>>> (1500). After 1.5M packets, no lost pings. I then had the ping flood
>>>>>>>>>>>>>>>>>>>>>> running while I put Ceph load on the cluster, and the dropped pings
>>>>>>>>>>>>>>>>>>>>>> started increasing; after stopping the Ceph workload, the pings
>>>>>>>>>>>>>>>>>>>>>> stopped dropping.
>>>>>>>>>>>>>>>>>>>>>> 
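For reproducibility, the two-way flood described above amounts to
roughly the following, run in both directions at once (needs root;
-s 1472 fills a 1500-byte MTU once the IP/ICMP headers are added):

    ping -f -s 1472 -c 1500000 <peer>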
>>>>>>>>>>>>>>>>>>>>>> I then ran iperf between all the nodes with the same results, so that
>>>>>>>>>>>>>>>>>>>>>> ruled out Ceph to a large degree. I then booted into the
>>>>>>>>>>>>>>>>>>>>>> 3.10.0-229.14.1.el7.x86_64 kernel, and in an hour of testing so far
>>>>>>>>>>>>>>>>>>>>>> there haven't been any dropped pings or blocked I/O. Our 40 Gb NICs
>>>>>>>>>>>>>>>>>>>>>> really need the network enhancements in the 4.x series to work well.
>>>>>>>>>>>>>>>>>>>>>> 
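The iperf runs were presumably something along these lines (iperf3
syntax shown; iperf2 flags differ slightly):

    iperf3 -s                      # on the receiving node
    iperf3 -c <peer> -P 4 -t 60    # four parallel streams for 60 s from the sender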
>>>>>>>>>>>>>>>>>>>>>> Does this sound familiar to anyone? I'll probably start bisecting the
>>>>>>>>>>>>>>>>>>>>>> kernel to see where this issue is introduced. Both of the clusters
>>>>>>>>>>>>>>>>>>>>>> with this issue are running 4.x; other than that, they have pretty
>>>>>>>>>>>>>>>>>>>>>> different hardware and network configs.
>>>>>>>>>>>>>>>>>>>>>> 
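A kernel bisect for this would look roughly like the following. The
endpoints here are only an example; the real ones would be the nearest
mainline kernels that do and do not show the delayed messages:

    git bisect start
    git bisect bad  v4.2     # a kernel that shows the delays
    git bisect good v3.10    # a kernel that behaves
    # build and boot the commit git suggests, re-run the fio test, then:
    git bisect good          # or: git bisect bad
    # repeat until git names the first bad commit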
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
>>>>>>>>>>>>>>>>>>>>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
>>>>>>>>>>>>>>>>>>>>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
>>>>>>>>>>>>>>>>>>>>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
>>>>>>>>>>>>>>>>>>>>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
>>>>>>>>>>>>>>>>>>>>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
>>>>>>>>>>>>>>>>>>>>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
>>>>>>>>>>>>>>>>>>>>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
>>>>>>>>>>>>>>>>>>>>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
>>>>>>>>>>>>>>>>>>>>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
>>>>>>>>>>>>>>>>>>>>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
>>>>>>>>>>>>>>>>>>>>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
>>>>>>>>>>>>>>>>>>>>>> 4OEo
>>>>>>>>>>>>>>>>>>>>>> =P33I
>>>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
>>>>>>>>>>>>>>>>>>>>>>> pinging hosts with "No buffer space available" (hosts are currently
>>>>>>>>>>>>>>>>>>>>>>> configured for 4GB to test SSD caching rather than page cache). I
>>>>>>>>>>>>>>>>>>>>>>> found that an MTU under 32K worked reliably for ping, but I still had
>>>>>>>>>>>>>>>>>>>>>>> the blocked I/O.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
>>>>>>>>>>>>>>>>>>>>>>> the blocked I/O.
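For reference, checking and lowering the IPoIB MTU is just the
following (ib0 is a placeholder for the IPoIB interface; the maximum
usable MTU also depends on datagram vs. connected mode):

    ip link show ib0
    ip link set dev ib0 mtu 1500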
>>>>>>>>>>>>>>>>>>>>>>> - ----------------
>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I looked at the logs; it looks like there was a 53-second delay
>>>>>>>>>>>>>>>>>>>>>>>>> between when osd.17 started sending the osd_repop message and when
>>>>>>>>>>>>>>>>>>>>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
>>>>>>>>>>>>>>>>>>>>>>>>> once see a kernel issue which caused some messages to be mysteriously
>>>>>>>>>>>>>>>>>>>>>>>>> delayed for many tens of seconds?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Every time we have seen this behavior and diagnosed it in the wild it
>>>>>>>>>>>>>>>>>>>>>>>> has been a network misconfiguration.  Usually related to jumbo frames.
>>>>>>>>>>>>>>>>>>>>>>>> 
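(The standard way to prove jumbo frames end to end is to ping with
fragmentation disallowed; 8972 = 9000 minus the IP and ICMP headers.
This has to succeed between every pair of nodes, in both directions.)

    ping -M do -s 8972 -c 3 <peer>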
>>>>>>>>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> What kernel are you running?
>>>>>>>>>>>>>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
>>>>>>>>>>>>>>>>>>>>>>>>>> extracted what I think are important entries from the logs for the
>>>>>>>>>>>>>>>>>>>>>>>>>> first blocked request. NTP is running all the servers so the logs
>>>>>>>>>>>>>>>>>>>>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
>>>>>>>>>>>>>>>>>>>>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
>>>>>>>>>>>>>>>>>>>>>>>>>> osd.16 almost silmutaniously. osd.16 seems to get the I/O right away,
>>>>>>>>>>>>>>>>>>>>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
>>>>>>>>>>>>>>>>>>>>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual data
>>>>>>>>>>>>>>>>>>>>>>>>>> transfer).
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> It looks like osd.17 is receiving responses to start the communication
>>>>>>>>>>>>>>>>>>>>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
>>>>>>>>>>>>>>>>>>>>>>>>>> later. To me it seems that the message is getting received but not
>>>>>>>>>>>>>>>>>>>>>>>>>> passed to another thread right away or something. This test was done
>>>>>>>>>>>>>>>>>>>>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
>>>>>>>>>>>>>>>>>>>>>>>>>> thread.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
>>>>>>>>>>>>>>>>>>>>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
>>>>>>>>>>>>>>>>>>>>>>>>>> some help.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Single Test started about
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:52:36
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>>> 30.439150 secs
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:06.487451:
>>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
>>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,16
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
>>>>>>>>>>>>>>>>>>>>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>>> 30.379680 secs
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
>>>>>>>>>>>>>>>>>>>>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
>>>>>>>>>>>>>>>>>>>>>>>>>> 12:55:06.406303:
>>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
>>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,17
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
>>>>>>>>>>>>>>>>>>>>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
>>>>>>>>>>>>>>>>>>>>>>>>>> 12:55:06.318144:
>>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
>>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,14
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>>> 30.954212 secs
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:57:33.044003:
>>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
>>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 16,17
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
>>>>>>>>>>>>>>>>>>>>>>>>>> 30.704367 secs
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
>>>>>>>>>>>>>>>>>>>>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
>>>>>>>>>>>>>>>>>>>>>>>>>> 2015-09-22 12:57:33.055404:
>>>>>>>>>>>>>>>>>>>>>>>>>> osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
>>>>>>>>>>>>>>>>>>>>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>>>>>>>>>>>>>>>>>>>>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
>>>>>>>>>>>>>>>>>>>>>>>>>> currently waiting for subops from 13,17
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Server   IP addr              OSD
>>>>>>>>>>>>>>>>>>>>>>>>>> nodev  - 192.168.55.11 - 12
>>>>>>>>>>>>>>>>>>>>>>>>>> nodew  - 192.168.55.12 - 13
>>>>>>>>>>>>>>>>>>>>>>>>>> nodex  - 192.168.55.13 - 16
>>>>>>>>>>>>>>>>>>>>>>>>>> nodey  - 192.168.55.14 - 17
>>>>>>>>>>>>>>>>>>>>>>>>>> nodez  - 192.168.55.15 - 14
>>>>>>>>>>>>>>>>>>>>>>>>>> nodezz - 192.168.55.16 - 15
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> fio job:
>>>>>>>>>>>>>>>>>>>>>>>>>> [rbd-test]
>>>>>>>>>>>>>>>>>>>>>>>>>> readwrite=write
>>>>>>>>>>>>>>>>>>>>>>>>>> blocksize=4M
>>>>>>>>>>>>>>>>>>>>>>> ####runtime=60
>>>>>>>>>>>>>>>>>>>>>>>>>> name=rbd-test
>>>>>>>>>>>>>>>>>>>>>>> ####readwrite=randwrite
>>>>>>>>>>>>>>>>>>>>>>> ####bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
>>>>>>>>>>>>>>>>>>>>>>> ####rwmixread=72
>>>>>>>>>>>>>>>>>>>>>>> ####norandommap
>>>>>>>>>>>>>>>>>>>>>>> ####size=1T
>>>>>>>>>>>>>>>>>>>>>>> ####blocksize=4k
>>>>>>>>>>>>>>>>>>>>>>>>>> ioengine=rbd
>>>>>>>>>>>>>>>>>>>>>>>>>> rbdname=test2
>>>>>>>>>>>>>>>>>>>>>>>>>> pool=rbd
>>>>>>>>>>>>>>>>>>>>>>>>>> clientname=admin
>>>>>>>>>>>>>>>>>>>>>>>>>> iodepth=8
>>>>>>>>>>>>>>>>>>>>>>> ####numjobs=4
>>>>>>>>>>>>>>>>>>>>>>> ####thread
>>>>>>>>>>>>>>>>>>>>>>> ####group_reporting
>>>>>>>>>>>>>>>>>>>>>>> ####time_based
>>>>>>>>>>>>>>>>>>>>>>> ####direct=1
>>>>>>>>>>>>>>>>>>>>>>> ####ramp_time=60
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
>>>>>>>>>>>>>>>>>>>>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
>>>>>>>>>>>>>>>>>>>>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
>>>>>>>>>>>>>>>>>>>>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
>>>>>>>>>>>>>>>>>>>>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
>>>>>>>>>>>>>>>>>>>>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
>>>>>>>>>>>>>>>>>>>>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
>>>>>>>>>>>>>>>>>>>>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
>>>>>>>>>>>>>>>>>>>>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
>>>>>>>>>>>>>>>>>>>>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
>>>>>>>>>>>>>>>>>>>>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
>>>>>>>>>>>>>>>>>>>>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
>>>>>>>>>>>>>>>>>>>>>>>>>> J3hS
>>>>>>>>>>>>>>>>>>>>>>>>>> =0J7F
>>>>>>>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>>>>>> ----------------
>>>>>>>>>>>>>>>>>>>>>>>>>> Robert LeBlanc
>>>>>>>>>>>>>>>>>>>>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hash: SHA256
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there some way to tell in the logs that this is happening?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> You can search for the (mangled) name _split_collection
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm not
>>>>>>>>>>>>>>>>>>>>>>>>>>>> seeing much I/O or CPU usage during these times. Is there some way to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Bump up the split and merge thresholds. You can search the list for
>>>>>>>>>>>>>>>>>>>>>>>>>>> this, it was discussed not too long ago.
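For anyone who finds this later: the thresholds Greg is referring to
are the filestore ones. A sketch; the values are illustrative, and
larger numbers simply mean PG directories split less often:

    [osd]
        filestore merge threshold = 40
        filestore split multiple  = 8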
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> We've had I/O block for over 900 seconds and as soon as the sessions
>>>>>>>>>>>>>>>>>>>>>>>>>>>> are aborted, they are reestablished and complete immediately.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> The fio test is just a seq write; starting it over (rewriting from
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the beginning) is still causing the issue. I suspect that it does not
>>>>>>>>>>>>>>>>>>>>>>>>>>>> have to create new files and therefore does not split collections.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> This is on my test cluster with no other load.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hmm, that does make it seem less likely if you're really not creating
>>>>>>>>>>>>>>>>>>>>>>>>>>> new objects, if you're actually running fio in such a way that it's
>>>>>>>>>>>>>>>>>>>>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
>>>>>>>>>>>>>>>>>>>>>>>>>>>> would be the most helpful for tracking this issue down?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> If you want to go log diving, "debug osd = 20", "debug filestore = 20",
>>>>>>>>>>>>>>>>>>>>>>>>>>> "debug ms = 1" are what the OSD guys like to see. That should spit out
>>>>>>>>>>>>>>>>>>>>>>>>>>> everything you need to track exactly what each Op is doing.
>>>>>>>>>>>>>>>>>>>>>>>>>>> -Greg
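With those debug levels in place, a single op can then be followed
across nodes by grepping for its id; for example, using the op id from
the slow-request lines quoted above:

    grep 'client.250874.0:1388' /var/log/ceph/ceph-osd.*.log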
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
>>>>>>>>>>>>>>>>>>>>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
>>>>>>>>>>>>>>>>>>>>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
>>>>>>>>>>>>>>>>>>>>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
>>>>>>>>>>>>>>>>>>>>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
>>>>>>>>>>>>>>>>>>>>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
>>>>>>>>>>>>>>>>>>>>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
>>>>>>>>>>>>>>>>>>>>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
>>>>>>>>>>>>>>>>>>>>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
>>>>>>>>>>>>>>>>>>>>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
>>>>>>>>>>>>>>>>>>>>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
>>>>>>>>>>>>>>>>>>>>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
>>>>>>>>>>>>>>>>>>>>>>> gcZm
>>>>>>>>>>>>>>>>>>>>>>> =CjwB
>>>>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>>>> Version: Mailvelope v1.1.0
>>>>>>>>>>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
>>>>>>>>>>>>>>>>>>>> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
>>>>>>>>>>>>>>>>>>>> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
>>>>>>>>>>>>>>>>>>>> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
>>>>>>>>>>>>>>>>>>>> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
>>>>>>>>>>>>>>>>>>>> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
>>>>>>>>>>>>>>>>>>>> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
>>>>>>>>>>>>>>>>>>>> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
>>>>>>>>>>>>>>>>>>>> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
>>>>>>>>>>>>>>>>>>>> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
>>>>>>>>>>>>>>>>>>>> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
>>>>>>>>>>>>>>>>>>>> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
>>>>>>>>>>>>>>>>>>>> ae22
>>>>>>>>>>>>>>>>>>>> =AX+L
>>>>>>>>>>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>>>> 
>>>>>>>>>>> wsFcBAEBCAAQBQJWFBoOCRDmVDuy+mK58QAA7oYP/1yVPx66DovoUJiSDunA
>>>>>>>>>>> NjIXWnKzx77aQMDwueZ0woC8PvgsX4JpLVH90Gh1MOJWyt2L4Qp+n60loSiI
>>>>>>>>>>> Q5xU1NMYiup8YPlHqyslBxtqCPhcN1R8XhxN212R4uyVBIgjulkkEFiiQf8R
>>>>>>>>>>> 5Uq5rDy+Vqmbla3enekV9vpAJQhVdfxvhdnN9/tSC3I5JZm+6VW9PGmwvTL4
>>>>>>>>>>> HK5UIz8luvtBWCWXYm2m7ZCUKYq0oWfdVDGEpEV473yyYwoVyvTBFuNNNbpu
>>>>>>>>>>> kdxZ422Ztv2yj5phIQgU88Q/W5NY0awW25+16AMZNb6zCbF06hvQ9SjpydGu
>>>>>>>>>>> 6vokj3uCOImMZpdJlyMuj6IjIkB27bnJer7zVLM3tDzftPzwT8ia8M3LvMWE
>>>>>>>>>>> sD9Dl2jx5EdFZYPMxoHF4WnD4SQtUxr+cpcI/Ij96RfXz1cMbMbVdZbWXkfz
>>>>>>>>>>> gEY46SXuM8yMi7wzJHwd4kI9q8A+ZZDpsDuTyavMr1rqZX61H+Gzc3rNI7lc
>>>>>>>>>>> lkJ63hfYMPCdYggnUT8mAF+cwXxq66SclwbmBYM8lbrEPuuTZzZp7veLJr5g
>>>>>>>>>>> /PO1abPcJVYq5ZP7i1iELEac6WvDWcJgImvkF+JZAN57URNpdJA03KsVkIt7
>>>>>>>>>>> H5n1Y8zUv7QcVMwHo/Os30vfiPmUHxg9DFbtUU8otpcf3g+udDggWHeuiZiG
>>>>>>>>>>> 6Kfk
>>>>>>>>>>> =/gR6
>>>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>>> 
>>>>>>>>> wsFcBAEBCAAQBQJWFChuCRDmVDuy+mK58QAAfNsQAMGNu925hGNsCTuY4X7V
>>>>>>>>> x71rdicFIn41I12KYtmhWl0U/V9GpUwLkOAKzeAcQiK2FgBBYRle0pANqE2K
>>>>>>>>> Thf4YBJ5oEXZ72WOB14jaggiQkZwiTZLo6c69JLZADaM5NEXD/2mM77HyVLN
>>>>>>>>> SP5v7FSqtnlzA53aZ7hUZn5r20VfOl/peOJGJz7C393hy3gBjr+P4LKsLE2L
>>>>>>>>> QO0lNj4mJZVnVXbxqJp9Q8xn86vmfXK2sofqbAv2wjkT2C8gM9DkgLF+UJjc
>>>>>>>>> mCSL9EUDFHD82BGsWzvYYFci686bIUC9IxJXKLORYKjzH3ueGHhiK3/apIi4
>>>>>>>>> 7DA0159nObAVNNz8AvvJnnjK94KrfcqpD3inFT7++WiNWTWbYljC7eukEM8L
>>>>>>>>> QyrcMnbuomjT87I9wB9zNwa/Pt+AepdwSf7qAv1VVYrop3nJxp8bPVCzvkrr
>>>>>>>>> MV/gxv3esOF68nOoQ9yt8DyHFihpg0nqSPjY3xDS7qZ05u3jnWN4rgkNxmyR
>>>>>>>>> rOpwjVLUINAkVjfAM2FL2sW6wX1tKPd947CgMrAgcX0ChwZ1xYzt6xdS0p+R
>>>>>>>>> gciSgw7nfCvwFmpou0DnqUdTN3K0zvM9zDhQ/b9u7JW3CEZLJXMoi99C4n3g
>>>>>>>>> RfilE0rvScnx7uTI7mo94Pwy0MYFdGw04sNtFjwjIhRFPSsMUu+NSHDJe26U
>>>>>>>>> JFPi
>>>>>>>>> =ofgq
>>>>>>>>> -----END PGP SIGNATURE-----
>>>>>>>> 
>>>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>>>> Version: Mailvelope v1.2.0
>>>>>>>> Comment: https://www.mailvelope.com
>>>>>>>> 
>>>>>>>> wsFcBAEBCAAQBQJWFDDOCRDmVDuy+mK58QAA0kUP/1rfRQa5Us9b/VCvKrhk
>>>>>>>> BYrde1/FBybKBVXsuXVU8Dq124A1e4L682AhmQPUeVP8PQLoqS/VFSl0h7i6
>>>>>>>> 28AzydDaBTTjnrp6ZzVbtmKtm8WhmtSTFvWTlu/yJmRXAht9YozmFCByBfIY
>>>>>>>> GYvOhZzjvbxBKfwnwq97QkS7xfY2tss/BmaOvSVTX7naYaOF+HRwZMSt+BF4
>>>>>>>> 9vg9BLSL3Aic0BnvdM64TWkDaHp/3gwGSmyMn8Q2Sa9CqUTddKQx2HXN6doo
>>>>>>>> gIyxCj+dIw2Pt73u2NoiYv8ZhTuS3QYM4n0rRBxj8Wr/EeNwGAOwdDSgbOxf
>>>>>>>> OvDyozzmCpQyW3h/nkdQJW5mWsJmyDIiGxHDdUn7Vgemg+Bbod0ACdoJiwct
>>>>>>>> /BIRVQe2Ee1nZQFoKBOhvaWO6+ePJR7CVfLjMkZBTzKZBjt2tfkq17G5KTdS
>>>>>>>> EsehvG/+vfFJkANL5Xh6eo9ptlHbFW8I/44pvUtGi2JwsN487l56XR9DqEKM
>>>>>>>> 7Cmj9Ox205YxjqcBjhWIJQTok99lvrhDX9d7HHxIeTcmouvqPz4LTcCySRtC
>>>>>>>> xE/GcEGAAYWGPTwf9u8ULm9Rh2Z90OnKpqtCtuuWiwRRL9VU/tLlvqmHvEZM
>>>>>>>> 73qhiLQZka5I72B2SAEtJnDt2sX3NJ4unvH4zWKLRFTTm4M0qk6xUL1JfqNz
>>>>>>>> JYNo
>>>>>>>> =msX2
>>>>>>>> -----END PGP SIGNATURE-----
>>>>>> 
>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> Version: Mailvelope v1.2.0
>>>>>> Comment: https://www.mailvelope.com
>>>>>> 
>>>>>> wsFcBAEBCAAQBQJWFXGPCRDmVDuy+mK58QAAx38P/1sn6TA8hH+F2kd1A2Pq
>>>>>> IU2cg1pFcH+kw21G8VO+BavfBaBoSETHEEuMXg5SszTIcL/HyziBLJos0C0j
>>>>>> Vu9I0/YtblQ15enzFqKFPosdc7qij9DPJxXRkx41sJZsxvSVky+URcPpcKk6
>>>>>> w8Lwuq9IupesQ19ZeJkCEWFVhKz/i2E9/VXfylBgFVlkICD+5pfx6/Aq7nCP
>>>>>> 4gboyha07zpPlDqoA7xgT+6v2zlYC80saGcA1m2XaAUdPF/17l6Mq9+Glv7E
>>>>>> 3KeUf7jmMTJQRGBZSInFgUpPwUQKvF5OSGb3YQlzofUy5Es+wH3ccqZ+mlIY
>>>>>> szuBLAtN6zhFFPCs6016hiragiUhLk97PItXaKdDJKecuyRdShlJrXJmtX+j
>>>>>> NdM14TkBPTiLtAd/IZEEhIIpdvQH8YSl3LnEZ5gywggaY4Pk3JLFIJPgLpEb
>>>>>> T8hJnuiaQaYxERQ0nRoBL4LAXARseSrOuVt2EAD50Yb/5JEwB9FQlN758rb1
>>>>>> AE/xhpK6d53+RlkPODKxXx816hXvDP6NADaC78XGmx+A4FfepdxBijGBsmOQ
>>>>>> 7SxAZe469K0E6EAfClc664VzwuvBEZjwTg1eK5Z6VS/FDTH/RxTKeFhlbUIT
>>>>>> XpezlP7XZ1/YRrJ/Eg7nb1Dv0MYQdu18tQ6QBv+C1ZsmxYLlHlcf6BZ3gNar
>>>>>> rZW5
>>>>>> =dKn9
>>>>>> -----END PGP SIGNATURE-----
>>>> 
>>>> -- 
>>>> WBR, Max A. Krasilnikov
>>>> ColoCall Data Center
>> 
>> 
>> -- 
>> WBR, Max A. Krasilnikov
>> ColoCall Data Center


-- 
WBR, Max A. Krasilnikov
ColoCall Data Center


end of thread, other threads:[~2015-10-16  8:21 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAANLjFpAXaDWQzumg1hCs=SC_0YvdYDd6vyodz5_ZBNvGWdkSw@mail.gmail.com>
     [not found] ` <CAANLjFrbhiqjfUqiEPEOO+aZNTWwMEeTzZmro+OSrPmUg8AZLg@mail.gmail.com>
     [not found]   ` <CAANLjFq=RpNygfSf2YsmtrpHiogHnAXqa=09jkwvHJkHb0Xyfw@mail.gmail.com>
     [not found]     ` <CAANLjFob5ZsJzYop0RjQsXw-4UfV4OvdhBjT2sGF-u6gq9NrhQ@mail.gmail.com>
     [not found]       ` <CAJ4mKGZ8WhY5JWD2P=5fQ+9EYQazEKETCTg8gU6qcZ03V3jrVw@mail.gmail.com>
     [not found]         ` <CAANLjFrcx_M7BxW5UCZxUk_Y7-s3NneBevnP+JHU0PA8pazpAw@mail.gmail.com>
     [not found]           ` <CAJ4mKGZLD4QfAfhZEXAe6KSbq9Z6wPXdUes_GJunfgnTf+K0Yg@mail.gmail.com>
     [not found]             ` <CAANLjFqn8t+w3hACiCvgi2zpG4TGFYzmAW5ocsa1TOjqueY6QQ@mail.gmail.com>
     [not found]               ` <CAJ4mKGY-XcFvbzSCRH4d5qJuwNE1uQ0Kdj7UCq5p0saFyaoiBA@mail.gmail.com>
     [not found]                 ` <CAANLjFrDqHhAtWPqBV+QEDcFU6xNum37puO6XHZE0zhNp9reSQ@mail.gmail.com>
     [not found]                   ` <CAJ4mKGYdJJfFERrOrdN7T8SzhdcyKhSDqJ69wOiee2aVj8vpEA@mail.gmail.com>
     [not found]                     ` <CAJ4mKGYdJJfFERrOrdN7T8SzhdcyKhSDqJ69wOiee2aVj8vpEA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-22 21:22                       ` Potential OSD deadlock? Robert LeBlanc
     [not found]                         ` <CAANLjFqbt0y-Ri=q6hXuuS06Sgi1S6phRdb1MJTgR6LTyHPtvw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-22 21:41                           ` Samuel Just
2015-09-22 21:45                             ` [ceph-users] " Robert LeBlanc
     [not found]                             ` <CAN=+7FVo6D2AoufELCP_qeiJ23i0XEOqQs_yYLrFNXwiiQSthw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-22 21:52                               ` Sage Weil
     [not found]                                 ` <alpine.DEB.2.00.1509221450490.11876-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-09-22 22:15                                   ` Robert LeBlanc
2015-09-23 18:48                                     ` [ceph-users] " Robert LeBlanc
     [not found]                                       ` <CAANLjFrrAyhJ=JR0+K3SX6G1ZsdcUyhTqFzUn_A0+bSh0D=bkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-23 19:10                                         ` Mark Nelson
     [not found]                                           ` <5602F921.3010204-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-09-23 19:30                                             ` Robert LeBlanc
2015-09-25 20:40                                               ` [ceph-users] " Robert LeBlanc
     [not found]                                                 ` <CAANLjFrw0AtqvHu5ioRPH7yKjV5ZqfnGKou+fKLTZukLTnLbgw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-03 20:10                                                   ` Robert LeBlanc
     [not found]                                                     ` <CAANLjFoafHQ1X8U7LUrvhh2h8fu3WgNpUsMDR+QkSWpfW8ad0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-04  6:16                                                       ` Josef Johansson
     [not found]                                                         ` <CAOnYue-T061jkAvpe3cwH7Et4xXY_dkW73KyKcbwiefgzgs8cA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-04 15:08                                                           ` Alex Gorbachev
2015-10-04 16:13                                                           ` Robert LeBlanc
     [not found]                                                             ` <CAANLjFosyB_cqk19Ax=5wgXDnSGOOa-_5sY1FuCZyYTx3rS9JQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-04 16:16                                                               ` Josef Johansson
2015-10-04 13:48                                                       ` Sage Weil
     [not found]                                                         ` <alpine.DEB.2.00.1510040646020.5233-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-10-04 21:04                                                           ` Robert LeBlanc
     [not found]                                                             ` <CAANLjFqgJbLEsBYEW=bk0h+Lmop-MLX=eA7qx98h-3Z6M1x7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-06  3:35                                                               ` Robert LeBlanc
     [not found]                                                                 ` <CAANLjFruw-1yySqO=aY05c0bzuqdkBH0-WcKmyP_+JtSyA1kpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-06 12:37                                                                   ` Sage Weil
     [not found]                                                                     ` <alpine.DEB.2.00.1510060534010.32037-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-10-06 14:30                                                                       ` Robert LeBlanc
     [not found]                                                                         ` <CAANLjFo==i7wivrGR9LJFs3GOrD2iQHLdCpEfR-AruHyOMLi-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-06 14:38                                                                           ` Sage Weil
2015-10-06 15:51                                                                             ` [ceph-users] " Robert LeBlanc
     [not found]                                                                               ` <CAANLjFpEPQbvpnMREu-kcPORK28V1CdWBe7655wHp-74AwwQUg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-06 16:19                                                                                 ` Sage Weil
2015-10-06 17:47                                                                                   ` [ceph-users] " Robert LeBlanc
2015-10-06 16:26                                                                             ` Ken Dreyer
     [not found]                                                                               ` <CALqRxCw=tV5h-xfaxsCwwqhK=zLiP=m_iGTn+dF0Op=Uth_vPA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-06 16:40                                                                                 ` Sage Weil
2015-10-06 18:03                                                                     ` [ceph-users] " Robert LeBlanc
     [not found]                                                                       ` <CAANLjFosPfynahiTmC2r=wPGWKg8YQAak58XGt0MfVXC9bmXuw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-06 18:32                                                                         ` Sage Weil
     [not found]                                                                           ` <alpine.DEB.2.00.1510061122370.32037-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-10-06 19:06                                                                             ` Robert LeBlanc
2015-10-06 19:34                                                                               ` [ceph-users] " Sage Weil
     [not found]                                                                                 ` <alpine.DEB.2.00.1510061232360.32037-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-10-06 20:00                                                                                   ` Robert LeBlanc
     [not found]                                                                                     ` <CAANLjFqnhS5fYdS_2h-5hz0x_TiyYjhMUCZSdc9991kdAzxeqQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-06 20:36                                                                                       ` Robert LeBlanc
     [not found]                                                                                         ` <CAANLjFqXvWdHBVZUMVFMiQg_-55_ZQ_jxsFr9YquohnHR7M1cg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-06 23:40                                                                                           ` Robert LeBlanc
2015-10-07 19:25                                                                                             ` [ceph-users] " Robert LeBlanc
     [not found]                                                                                               ` <CAANLjFr=MEOyqUqjUVkZcNcW7KeN4rMUe9oJxu7ZiL64pGmT4Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-09  5:44                                                                                                 ` Robert LeBlanc
     [not found]                                                                                                   ` <CAANLjFoL2+wvP12v-ryg7Va6d7Cix_JFdVQ3ysSEtfxobkoCVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-09  8:48                                                                                                     ` Max A. Krasilnikov
     [not found]                                                                                                       ` <20151009084843.GL86022-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org>
2015-10-09  9:05                                                                                                         ` Jan Schermer
     [not found]                                                                                                           ` <F2832DEB-FB8D-47EE-B364-F92DAF711D35-SB6/BxVxTjHtwjQa/ONI9g@public.gmane.org>
2015-10-09 11:21                                                                                                             ` Max A. Krasilnikov
     [not found]                                                                                                               ` <20151009112124.GM86022-z2DuZ08HpnDk1uMJSBkQmQ@public.gmane.org>
2015-10-09 11:45                                                                                                                 ` Jan Schermer
     [not found]                                                                                                                   ` <2FD6AADF-88BB-4DFF-B6C2-E103F16B55A8-SB6/BxVxTjHtwjQa/ONI9g@public.gmane.org>
2015-10-09 14:14                                                                                                                     ` Max A. Krasilnikov
2015-10-16  8:21                                                                                                                     ` Max A. Krasilnikov
     [not found]                                                                                                   ` <CAANLjFquEvjDDT94ZL2mXQh5r_XWCxw3X=eFZ=c29gNHKt=2tw@mail.gmail.com>
2015-10-13 17:03                                                                                                     ` [ceph-users] " Sage Weil
     [not found]                                                                                                       ` <alpine.DEB.2.00.1510130956130.6589-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-10-14  6:00                                                                                                         ` Haomai Wang
     [not found]                                                                                                           ` <CACJqLyaeognJ479tjv3S8u1ZpfRr2=qFbgmW1fMu2BcVPt_gNw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-14 15:41                                                                                                             ` Robert LeBlanc
     [not found]                                                                                                               ` <CAANLjFrBNMeGkawcBUYqjWSjoWyQHCxjpEM291TmOp40HhCoSA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-14 17:08                                                                                                                 ` Sage Weil
     [not found]                                                                                                                   ` <alpine.DEB.2.00.1510140955240.6589-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2015-10-14 17:58                                                                                                                     ` Robert LeBlanc
