All of lore.kernel.org
 help / color / mirror / Atom feed
* Very slow recovery/peering with latest master
@ 2015-09-16  3:04 Somnath Roy
  2015-09-16 18:35 ` Gregory Farnum
  0 siblings, 1 reply; 17+ messages in thread
From: Somnath Roy @ 2015-09-16  3:04 UTC (permalink / raw)
  To: ceph-devel

Hi,
I am seeing very slow recovery when adding OSDs with the latest master.
Also, if I just restart all the OSDs (with no IO going on in the cluster), the cluster takes a significant amount of time to reach the active+clean state (and even to detect all the up OSDs).

I saw that the recovery/backfill default parameters have been changed (to lower values); this probably explains the recovery scenario, but will it affect the peering time during OSD startup as well?

Thanks & Regards
Somnath


________________________________

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Very slow recovery/peering with latest master
  2015-09-16  3:04 Very slow recovery/peering with latest master Somnath Roy
@ 2015-09-16 18:35 ` Gregory Farnum
  2015-09-16 21:19   ` Somnath Roy
  2015-09-23 22:48   ` Somnath Roy
  0 siblings, 2 replies; 17+ messages in thread
From: Gregory Farnum @ 2015-09-16 18:35 UTC (permalink / raw)
  To: Somnath Roy; +Cc: ceph-devel

On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi,
> I am seeing very slow recovery when I am adding OSDs with the latest master.
> Also, If I just restart all the OSDs (no IO is going on in the cluster) , cluster is taking a significant amount of time to reach in active+clean state (and even detecting all the up OSDs).
>
> I saw the recovery/backfill default parameters are now changed (to lower value) , this probably explains the recovery scenario , but, will it affect the peering time during OSD startup as well ?

I don't think these values should impact peering time, but you could
configure them back to the old defaults and see if it changes.
-Greg
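
For anyone trying that, the pre-change defaults can be put back in ceph.conf along these lines (the old values shown here are assumed from memory; double-check them against the Hammer source before relying on them):

```ini
[osd]
; Assumed pre-change (Hammer-era) defaults -- verify against src/common/config_opts.h
osd max backfills = 10
osd recovery max active = 15
osd recovery op priority = 10
```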

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Very slow recovery/peering with latest master
  2015-09-16 18:35 ` Gregory Farnum
@ 2015-09-16 21:19   ` Somnath Roy
  2015-09-16 22:59     ` Joao Eduardo Luis
  2015-09-23 22:48   ` Somnath Roy
  1 sibling, 1 reply; 17+ messages in thread
From: Somnath Roy @ 2015-09-16 21:19 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel


Sage/Greg,

Yeah, as we suspected, it is probably not happening because of the recovery settings. I reverted them in my ceph.conf, but I am still seeing this problem.

Some observations:
----------------------

1. First of all, I don't think it is related to my environment. I recreated the cluster with Hammer and the problem is not there.

2. I enabled the messenger/monclient log (couldn't attach it here) on one of the OSDs and found that the monitor is taking a long time to detect the up OSDs. In the log, I started the OSD at 2015-09-16 16:13:07.042463, but there is no communication (only KEEP_ALIVE messages) until 2015-09-16 16:16:07.180482, so 3 minutes!

3. During this period, I saw the monclient trying to communicate with the monitor but apparently failing. It only sends osd_boot at 2015-09-16 16:16:07.180482:

2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: _send_mon_message to mon.a at 10.60.194.10:6789/0
2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 remote, 10.60.194.10:6789/0, have pipe.

4. BTW, the OSD-down scenario is detected very quickly (per ceph -w output); the problem is during coming up, I guess.


So, is something related to mon communication getting slower?
Let me know if more verbose logging is required and how I should share the log.

Thanks & Regards
Somnath
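
(As a sanity check on the 3-minute figure, the gap between the OSD start and the osd_boot send can be computed directly from the two timestamps in the log excerpt above; a quick sketch:)

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the log excerpt above.
osd_start = datetime.strptime("2015-09-16 16:13:07.042463", FMT)
osd_boot = datetime.strptime("2015-09-16 16:16:07.180482", FMT)

delay = (osd_boot - osd_start).total_seconds()
print(f"boot delay: {delay:.1f} s")  # roughly 180 s, i.e. ~3 minutes
```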

-----Original Message-----
From: Gregory Farnum [mailto:gfarnum@redhat.com]
Sent: Wednesday, September 16, 2015 11:35 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Very slow recovery/peering with latest master

On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi,
> I am seeing very slow recovery when I am adding OSDs with the latest master.
> Also, If I just restart all the OSDs (no IO is going on in the cluster) , cluster is taking a significant amount of time to reach in active+clean state (and even detecting all the up OSDs).
>
> I saw the recovery/backfill default parameters are now changed (to lower value) , this probably explains the recovery scenario , but, will it affect the peering time during OSD startup as well ?

I don't think these values should impact peering time, but you could configure them back to the old defaults and see if it changes.
-Greg



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Very slow recovery/peering with latest master
  2015-09-16 21:19   ` Somnath Roy
@ 2015-09-16 22:59     ` Joao Eduardo Luis
  0 siblings, 0 replies; 17+ messages in thread
From: Joao Eduardo Luis @ 2015-09-16 22:59 UTC (permalink / raw)
  To: Somnath Roy, Gregory Farnum; +Cc: ceph-devel

On 09/16/2015 10:19 PM, Somnath Roy wrote:
> 
> Sage/Greg,
> 
> Yeah, as we expected, it is not happening probably because of recovery settings. I reverted it back in my ceph.conf , but, still seeing this problem.
> 
> Some observation :
> ----------------------
> 
> 1. First of all, I don't think it is something related to my environment. I recreated the cluster with Hammer and this problem is not there.
> 
> 2. I have enabled the messenger/monclient log (Couldn't attach here) in one of the OSDs and found monitor is taking long time to detect the up OSDs. If you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, there is no communication (only getting KEEP_ALIVE) till 2015-09-16 16:16:07.180482 , so, 3 mins !!
> 
> 3. During this period, I saw monclient trying to communicate with monitor but not able to probably. It is sending osd_boot at 2015-09-16 16:16:07.180482 only..
> 
> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: _send_mon_message to mon.a at 10.60.194.10:6789/0
> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 remote, 10.60.194.10:6789/0, have pipe.

Is this the only monitor in your cluster, or are there others?

Logs would certainly be helpful. The more the merrier, I'd think.

If you can't send them to the list, please find some place where we can
reach them, or drop them in cephdrop and point us to them.

Thanks!

  -Joao



> 4. BTW, the osd down scenario is detected very quickly (ceph -w output) , problem is during coming up I guess.
> 
> 
> So, something related to mon communication getting slower ?
> Let me know if more verbose logging is required and how should I share the log..
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> Sent: Wednesday, September 16, 2015 11:35 AM
> To: Somnath Roy
> Cc: ceph-devel
> Subject: Re: Very slow recovery/peering with latest master
> 
> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> Hi,
>> I am seeing very slow recovery when I am adding OSDs with the latest master.
>> Also, If I just restart all the OSDs (no IO is going on in the cluster) , cluster is taking a significant amount of time to reach in active+clean state (and even detecting all the up OSDs).
>>
>> I saw the recovery/backfill default parameters are now changed (to lower value) , this probably explains the recovery scenario , but, will it affect the peering time during OSD startup as well ?
> 
> I don't think these values should impact peering time, but you could configure them back to the old defaults and see if it changes.
> -Greg
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Very slow recovery/peering with latest master
  2015-09-16 18:35 ` Gregory Farnum
  2015-09-16 21:19   ` Somnath Roy
@ 2015-09-23 22:48   ` Somnath Roy
  2015-09-23 23:06     ` Samuel Just
  1 sibling, 1 reply; 17+ messages in thread
From: Somnath Roy @ 2015-09-23 22:48 UTC (permalink / raw)
  To: Samuel Just (sam.just@inktank.com), Sage Weil (sage@newdream.net)
  Cc: ceph-devel

Sam/Sage,
I debugged it and found that the get_device_by_uuid->blkid_find_dev_with_tag() call within FileStore::collect_metadata() hangs for ~3 minutes before returning EINVAL. I saw that this portion was newly added after Hammer.
Commenting it out resolves the issue. BTW, I saw this value is stored as metadata but not used anywhere; am I missing anything?
Here are my Linux details:

root@emsnode5:~/wip-write-path-optimization/src# uname -a
Linux emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux


root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.2 LTS
Release:        14.04
Codename:       trusty

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Wednesday, September 16, 2015 2:20 PM
To: 'Gregory Farnum'
Cc: 'ceph-devel'
Subject: RE: Very slow recovery/peering with latest master


Sage/Greg,

Yeah, as we expected, it is not happening probably because of recovery settings. I reverted it back in my ceph.conf , but, still seeing this problem.

Some observation :
----------------------

1. First of all, I don't think it is something related to my environment. I recreated the cluster with Hammer and this problem is not there.

2. I have enabled the messenger/monclient log (Couldn't attach here) in one of the OSDs and found monitor is taking long time to detect the up OSDs. If you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, there is no communication (only getting KEEP_ALIVE) till 2015-09-16 16:16:07.180482 , so, 3 mins !!

3. During this period, I saw monclient trying to communicate with monitor but not able to probably. It is sending osd_boot at 2015-09-16 16:16:07.180482 only..

2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: _send_mon_message to mon.a at 10.60.194.10:6789/0
2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 remote, 10.60.194.10:6789/0, have pipe.

4. BTW, the osd down scenario is detected very quickly (ceph -w output) , problem is during coming up I guess.


So, something related to mon communication getting slower ?
Let me know if more verbose logging is required and how should I share the log..

Thanks & Regards
Somnath

-----Original Message-----
From: Gregory Farnum [mailto:gfarnum@redhat.com]
Sent: Wednesday, September 16, 2015 11:35 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Very slow recovery/peering with latest master

On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi,
> I am seeing very slow recovery when I am adding OSDs with the latest master.
> Also, If I just restart all the OSDs (no IO is going on in the cluster) , cluster is taking a significant amount of time to reach in active+clean state (and even detecting all the up OSDs).
>
> I saw the recovery/backfill default parameters are now changed (to lower value) , this probably explains the recovery scenario , but, will it affect the peering time during OSD startup as well ?

I don't think these values should impact peering time, but you could configure them back to the old defaults and see if it changes.
-Greg



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Very slow recovery/peering with latest master
  2015-09-23 22:48   ` Somnath Roy
@ 2015-09-23 23:06     ` Samuel Just
  2015-09-23 23:18       ` Somnath Roy
  2015-09-23 23:19       ` Handzik, Joe
  0 siblings, 2 replies; 17+ messages in thread
From: Samuel Just @ 2015-09-23 23:06 UTC (permalink / raw)
  To: Somnath Roy
  Cc: Samuel Just (sam.just@inktank.com), Sage Weil (sage@newdream.net),
	ceph-devel

Wow.  Why would that take so long?  I think you are correct that it's
only used for metadata; we could just add a config value to disable
it.
-Sam

On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Sam/Sage,
> I debugged it down and found out that the get_device_by_uuid->blkid_find_dev_with_tag() call within FileStore::collect_metadata() is hanging for ~3 mins before returning a EINVAL. I saw this portion is newly added after hammer.
> Commenting it out resolves the issue. BTW, I saw this value is stored as metadata but not used anywhere , am I missing anything ?
> Here is my Linux details..
>
> root@emsnode5:~/wip-write-path-optimization/src# uname -a
> Linux emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
>
> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 14.04.2 LTS
> Release:        14.04
> Codename:       trusty
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Wednesday, September 16, 2015 2:20 PM
> To: 'Gregory Farnum'
> Cc: 'ceph-devel'
> Subject: RE: Very slow recovery/peering with latest master
>
>
> Sage/Greg,
>
> Yeah, as we expected, it is not happening probably because of recovery settings. I reverted it back in my ceph.conf , but, still seeing this problem.
>
> Some observation :
> ----------------------
>
> 1. First of all, I don't think it is something related to my environment. I recreated the cluster with Hammer and this problem is not there.
>
> 2. I have enabled the messenger/monclient log (Couldn't attach here) in one of the OSDs and found monitor is taking long time to detect the up OSDs. If you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, there is no communication (only getting KEEP_ALIVE) till 2015-09-16 16:16:07.180482 , so, 3 mins !!
>
> 3. During this period, I saw monclient trying to communicate with monitor but not able to probably. It is sending osd_boot at 2015-09-16 16:16:07.180482 only..
>
> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: _send_mon_message to mon.a at 10.60.194.10:6789/0
> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 remote, 10.60.194.10:6789/0, have pipe.
>
> 4. BTW, the osd down scenario is detected very quickly (ceph -w output) , problem is during coming up I guess.
>
>
> So, something related to mon communication getting slower ?
> Let me know if more verbose logging is required and how should I share the log..
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> Sent: Wednesday, September 16, 2015 11:35 AM
> To: Somnath Roy
> Cc: ceph-devel
> Subject: Re: Very slow recovery/peering with latest master
>
> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> Hi,
>> I am seeing very slow recovery when I am adding OSDs with the latest master.
>> Also, If I just restart all the OSDs (no IO is going on in the cluster) , cluster is taking a significant amount of time to reach in active+clean state (and even detecting all the up OSDs).
>>
>> I saw the recovery/backfill default parameters are now changed (to lower value) , this probably explains the recovery scenario , but, will it affect the peering time during OSD startup as well ?
>
> I don't think these values should impact peering time, but you could configure them back to the old defaults and see if it changes.
> -Greg
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Very slow recovery/peering with latest master
  2015-09-23 23:06     ` Samuel Just
@ 2015-09-23 23:18       ` Somnath Roy
  2015-09-23 23:19       ` Handzik, Joe
  1 sibling, 0 replies; 17+ messages in thread
From: Somnath Roy @ 2015-09-23 23:18 UTC (permalink / raw)
  To: Samuel Just
  Cc: Samuel Just (sam.just@inktank.com), Sage Weil (sage@newdream.net),
	ceph-devel

I am not sure why it is taking so long; I installed the latest libblkid as well, but got the same result. Yeah, a config option will be better; I will add that along with my write-path pull request.

Thanks & Regards
Somnath

-----Original Message-----
From: Samuel Just [mailto:sjust@redhat.com] 
Sent: Wednesday, September 23, 2015 4:07 PM
To: Somnath Roy
Cc: Samuel Just (sam.just@inktank.com); Sage Weil (sage@newdream.net); ceph-devel
Subject: Re: Very slow recovery/peering with latest master

Wow.  Why would that take so long?  I think you are correct that it's only used for metadata, we could just add a config value to disable it.
-Sam

On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Sam/Sage,
> I debugged it down and found out that the get_device_by_uuid->blkid_find_dev_with_tag() call within FileStore::collect_metadata() is hanging for ~3 mins before returning a EINVAL. I saw this portion is newly added after hammer.
> Commenting it out resolves the issue. BTW, I saw this value is stored as metadata but not used anywhere , am I missing anything ?
> Here is my Linux details..
>
> root@emsnode5:~/wip-write-path-optimization/src# uname -a Linux 
> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 
> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
>
> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a No LSB 
> modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 14.04.2 LTS
> Release:        14.04
> Codename:       trusty
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Wednesday, September 16, 2015 2:20 PM
> To: 'Gregory Farnum'
> Cc: 'ceph-devel'
> Subject: RE: Very slow recovery/peering with latest master
>
>
> Sage/Greg,
>
> Yeah, as we expected, it is not happening probably because of recovery settings. I reverted it back in my ceph.conf , but, still seeing this problem.
>
> Some observation :
> ----------------------
>
> 1. First of all, I don't think it is something related to my environment. I recreated the cluster with Hammer and this problem is not there.
>
> 2. I have enabled the messenger/monclient log (Couldn't attach here) in one of the OSDs and found monitor is taking long time to detect the up OSDs. If you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, there is no communication (only getting KEEP_ALIVE) till 2015-09-16 16:16:07.180482 , so, 3 mins !!
>
> 3. During this period, I saw monclient trying to communicate with monitor but not able to probably. It is sending osd_boot at 2015-09-16 16:16:07.180482 only..
>
> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: 
> _send_mon_message to mon.a at 10.60.194.10:6789/0
> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 
> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 
> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 remote, 10.60.194.10:6789/0, have pipe.
>
> 4. BTW, the osd down scenario is detected very quickly (ceph -w output) , problem is during coming up I guess.
>
>
> So, something related to mon communication getting slower ?
> Let me know if more verbose logging is required and how should I share the log..
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> Sent: Wednesday, September 16, 2015 11:35 AM
> To: Somnath Roy
> Cc: ceph-devel
> Subject: Re: Very slow recovery/peering with latest master
>
> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> Hi,
>> I am seeing very slow recovery when I am adding OSDs with the latest master.
>> Also, If I just restart all the OSDs (no IO is going on in the cluster) , cluster is taking a significant amount of time to reach in active+clean state (and even detecting all the up OSDs).
>>
>> I saw the recovery/backfill default parameters are now changed (to lower value) , this probably explains the recovery scenario , but, will it affect the peering time during OSD startup as well ?
>
> I don't think these values should impact peering time, but you could configure them back to the old defaults and see if it changes.
> -Greg
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Very slow recovery/peering with latest master
  2015-09-23 23:06     ` Samuel Just
  2015-09-23 23:18       ` Somnath Roy
@ 2015-09-23 23:19       ` Handzik, Joe
  2015-09-23 23:26         ` Somnath Roy
  1 sibling, 1 reply; 17+ messages in thread
From: Handzik, Joe @ 2015-09-23 23:19 UTC (permalink / raw)
  To: Samuel Just
  Cc: Somnath Roy, Samuel Just (sam.just@inktank.com),
	Sage Weil (sage@newdream.net),
	ceph-devel

I added that; there is code up the stack in calamari that consumes the provided path, which is intended to facilitate disk monitoring and management in the future.

Somnath, what does your disk configuration look like (filesystem, SSD/HDD, anything else you think could be relevant)? Did you configure your disks with ceph-disk, or by hand? I never saw this while testing my code; has anyone else heard of this behavior on master? The code has been in master for 2-3 months now, I believe.

It would be nice not to need to disable this, but if this behavior exists and can't be explained by a misconfiguration or something else, I'll need to figure out a different implementation.

Joe

> On Sep 23, 2015, at 6:07 PM, Samuel Just <sjust@redhat.com> wrote:
> 
> Wow.  Why would that take so long?  I think you are correct that it's
> only used for metadata, we could just add a config value to disable
> it.
> -Sam
> 
>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> Sam/Sage,
>> I debugged it down and found out that the get_device_by_uuid->blkid_find_dev_with_tag() call within FileStore::collect_metadata() is hanging for ~3 mins before returning a EINVAL. I saw this portion is newly added after hammer.
>> Commenting it out resolves the issue. BTW, I saw this value is stored as metadata but not used anywhere , am I missing anything ?
>> Here is my Linux details..
>> 
>> root@emsnode5:~/wip-write-path-optimization/src# uname -a
>> Linux emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> 
>> 
>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
>> No LSB modules are available.
>> Distributor ID: Ubuntu
>> Description:    Ubuntu 14.04.2 LTS
>> Release:        14.04
>> Codename:       trusty
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -----Original Message-----
>> From: Somnath Roy
>> Sent: Wednesday, September 16, 2015 2:20 PM
>> To: 'Gregory Farnum'
>> Cc: 'ceph-devel'
>> Subject: RE: Very slow recovery/peering with latest master
>> 
>> 
>> Sage/Greg,
>> 
>> Yeah, as we expected, it is not happening probably because of recovery settings. I reverted it back in my ceph.conf , but, still seeing this problem.
>> 
>> Some observation :
>> ----------------------
>> 
>> 1. First of all, I don't think it is something related to my environment. I recreated the cluster with Hammer and this problem is not there.
>> 
>> 2. I have enabled the messenger/monclient log (Couldn't attach here) in one of the OSDs and found monitor is taking long time to detect the up OSDs. If you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, there is no communication (only getting KEEP_ALIVE) till 2015-09-16 16:16:07.180482 , so, 3 mins !!
>> 
>> 3. During this period, I saw monclient trying to communicate with monitor but not able to probably. It is sending osd_boot at 2015-09-16 16:16:07.180482 only..
>> 
>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: _send_mon_message to mon.a at 10.60.194.10:6789/0
>> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
>> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 remote, 10.60.194.10:6789/0, have pipe.
>> 
>> 4. BTW, the osd down scenario is detected very quickly (ceph -w output) , problem is during coming up I guess.
>> 
>> 
>> So, something related to mon communication getting slower ?
>> Let me know if more verbose logging is required and how should I share the log..
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -----Original Message-----
>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
>> Sent: Wednesday, September 16, 2015 11:35 AM
>> To: Somnath Roy
>> Cc: ceph-devel
>> Subject: Re: Very slow recovery/peering with latest master
>> 
>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>> Hi,
>>> I am seeing very slow recovery when I am adding OSDs with the latest master.
>>> Also, If I just restart all the OSDs (no IO is going on in the cluster) , cluster is taking a significant amount of time to reach in active+clean state (and even detecting all the up OSDs).
>>> 
>>> I saw the recovery/backfill default parameters are now changed (to lower value) , this probably explains the recovery scenario , but, will it affect the peering time during OSD startup as well ?
>> 
>> I don't think these values should impact peering time, but you could configure them back to the old defaults and see if it changes.
>> -Greg
>> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Very slow recovery/peering with latest master
  2015-09-23 23:19       ` Handzik, Joe
@ 2015-09-23 23:26         ` Somnath Roy
  2015-09-23 23:42           ` Handzik, Joe
  0 siblings, 1 reply; 17+ messages in thread
From: Somnath Roy @ 2015-09-23 23:26 UTC (permalink / raw)
  To: Handzik, Joe, Samuel Just
  Cc: Samuel Just (sam.just@inktank.com), Sage Weil (sage@newdream.net),
	ceph-devel

<<inline

-----Original Message-----
From: Handzik, Joe [mailto:joseph.t.handzik@hpe.com] 
Sent: Wednesday, September 23, 2015 4:20 PM
To: Samuel Just
Cc: Somnath Roy; Samuel Just (sam.just@inktank.com); Sage Weil (sage@newdream.net); ceph-devel
Subject: Re: Very slow recovery/peering with latest master

I added that, there is code up the stack in calamari that consumes the path provided, which is intended in the future to facilitate disk monitoring and management.

[Somnath] Ok

Somnath, what does your disk configuration look like (filesystem, SSD/HDD, anything else you think could be relevant)? Did you configure your disks with ceph-disk, or by hand? I never saw this while testing my code; has anyone else heard of this behavior on master? The code has been in master for 2-3 months now, I believe.
[Somnath] All SSD. I use mkcephfs to create the cluster, and I partitioned the disks with fdisk beforehand. I am using XFS. Are you testing with an Ubuntu 3.16.* kernel? It could be Linux distribution/kernel specific.

It would be nice not to need to disable this, but if this behavior exists and can't be explained by a misconfiguration or something else, I'll need to figure out a different implementation.

Joe

> On Sep 23, 2015, at 6:07 PM, Samuel Just <sjust@redhat.com> wrote:
> 
> Wow.  Why would that take so long?  I think you are correct that it's 
> only used for metadata, we could just add a config value to disable 
> it.
> -Sam
> 
>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> Sam/Sage,
>> I debugged it and found that the get_device_by_uuid->blkid_find_dev_with_tag() call within FileStore::collect_metadata() hangs for ~3 minutes before returning EINVAL. I saw this code was newly added after hammer.
>> Commenting it out resolves the issue. BTW, I saw this value is stored as metadata but not used anywhere; am I missing anything?
>> Here are my Linux details:
>> 
>> root@emsnode5:~/wip-write-path-optimization/src# uname -a Linux 
>> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 
>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>> 
>> 
>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a No 
>> LSB modules are available.
>> Distributor ID: Ubuntu
>> Description:    Ubuntu 14.04.2 LTS
>> Release:        14.04
>> Codename:       trusty
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -----Original Message-----
>> From: Somnath Roy
>> Sent: Wednesday, September 16, 2015 2:20 PM
>> To: 'Gregory Farnum'
>> Cc: 'ceph-devel'
>> Subject: RE: Very slow recovery/peering with latest master
>> 
>> 
>> Sage/Greg,
>> 
>> Yeah, as we expected, it is probably not caused by the recovery settings. I reverted them in my ceph.conf, but I am still seeing this problem.
>> 
>> Some observations:
>> ----------------------
>> 
>> 1. First of all, I don't think it is related to my environment: I recreated the cluster with Hammer and the problem is not there.
>> 
>> 2. I enabled the messenger/monclient log (couldn't attach here) on one of the OSDs and found the monitor takes a long time to detect the up OSDs. In the log, I started the OSD at 2015-09-16 16:13:07.042463, but there is no communication (only KEEP_ALIVE) until 2015-09-16 16:16:07.180482, so 3 minutes!
>> 
>> 3. During this period, the monclient was trying to communicate with the monitor but apparently could not get through. It only sends osd_boot at 2015-09-16 16:16:07.180482:
>> 
>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: 
>> _send_mon_message to mon.a at 10.60.194.10:6789/0
>> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 
>> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 
>> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
>> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 remote, 10.60.194.10:6789/0, have pipe.
>> 
>> 4. BTW, the OSD-down scenario is detected very quickly (ceph -w output); the problem is during startup, I guess.
>> 
>> 
>> So, is something related to mon communication getting slower?
>> Let me know if more verbose logging is required and how I should share the log.
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -----Original Message-----
>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
>> Sent: Wednesday, September 16, 2015 11:35 AM
>> To: Somnath Roy
>> Cc: ceph-devel
>> Subject: Re: Very slow recovery/peering with latest master
>> 
>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>> Hi,
>>> I am seeing very slow recovery when I am adding OSDs with the latest master.
>>> Also, If I just restart all the OSDs (no IO is going on in the cluster) , cluster is taking a significant amount of time to reach in active+clean state (and even detecting all the up OSDs).
>>> 
>>> I saw the recovery/backfill default parameters are now changed (to lower value) , this probably explains the recovery scenario , but, will it affect the peering time during OSD startup as well ?
>> 
>> I don't think these values should impact peering time, but you could configure them back to the old defaults and see if it changes.
>> -Greg
>> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Very slow recovery/peering with latest master
  2015-09-23 23:26         ` Somnath Roy
@ 2015-09-23 23:42           ` Handzik, Joe
  2015-09-24  1:31             ` Sage Weil
  2015-09-24 18:27             ` Gregory Farnum
  0 siblings, 2 replies; 17+ messages in thread
From: Handzik, Joe @ 2015-09-23 23:42 UTC (permalink / raw)
  To: Somnath Roy
  Cc: Samuel Just, Samuel Just (sam.just@inktank.com),
	Sage Weil (sage@newdream.net),
	ceph-devel

Ok. When configuring with ceph-disk, it does something nifty: it actually gives the OSD the uuid of the disk's partition as its fsid. I bootstrap off that to get an argument to pass into the function you have identified as the bottleneck. I ran it by Sage and we both realized there would be cases where it wouldn't work... I'm sure neither of us realized the failure would take three minutes, though.

In the short term, it makes sense to create an option to disable or short-circuit the blkid code. I would prefer that the default be left with the code enabled, but I'm open to default-disabled if others think this will be a widespread problem. You could also make sure your OSD fsids match your disk partition uuids for now, if that's a faster workaround for you (it will get rid of the failure).

Joe

> On Sep 23, 2015, at 6:26 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> <<inline
> 
> -----Original Message-----
> From: Handzik, Joe [mailto:joseph.t.handzik@hpe.com] 
> Sent: Wednesday, September 23, 2015 4:20 PM
> To: Samuel Just
> Cc: Somnath Roy; Samuel Just (sam.just@inktank.com); Sage Weil (sage@newdream.net); ceph-devel
> Subject: Re: Very slow recovery/peering with latest master
> 
> I added that, there is code up the stack in calamari that consumes the path provided, which is intended in the future to facilitate disk monitoring and management.
> 
> [Somnath] Ok
> 
> Somnath, what does your disk configuration look like (filesystem, SSD/HDD, anything else you think could be relevant)? Did you configure your disks with ceph-disk, or by hand? I never saw this while testing my code, has anyone else heard of this behavior on master? The code has been in master for 2-3 months now I believe.
> [Somnath] All SSD , I use mkcephfs to create cluster , I partitioned the disk with fdisk beforehand. I am using XFS. Are you trying with Ubuntu 3.16.* kernel ? It could be Linux distribution/kernel specific.
> 
> It would be nice to not need to disable this, but if this behavior exists and can't be explained by a misconfiguration or something else I'll need to figure out a different implementation.
> 
> Joe
> 
>> On Sep 23, 2015, at 6:07 PM, Samuel Just <sjust@redhat.com> wrote:
>> 
>> Wow.  Why would that take so long?  I think you are correct that it's 
>> only used for metadata, we could just add a config value to disable 
>> it.
>> -Sam
>> 
>>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>> Sam/Sage,
>>> I debugged it down and found out that the get_device_by_uuid->blkid_find_dev_with_tag() call within FileStore::collect_metadata() is hanging for ~3 mins before returning a EINVAL. I saw this portion is newly added after hammer.
>>> Commenting it out resolves the issue. BTW, I saw this value is stored as metadata but not used anywhere , am I missing anything ?
>>> Here is my Linux details..
>>> 
>>> root@emsnode5:~/wip-write-path-optimization/src# uname -a Linux 
>>> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 
>>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>> 
>>> 
>>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a No 
>>> LSB modules are available.
>>> Distributor ID: Ubuntu
>>> Description:    Ubuntu 14.04.2 LTS
>>> Release:        14.04
>>> Codename:       trusty
>>> 
>>> Thanks & Regards
>>> Somnath
>>> 
>>> -----Original Message-----
>>> From: Somnath Roy
>>> Sent: Wednesday, September 16, 2015 2:20 PM
>>> To: 'Gregory Farnum'
>>> Cc: 'ceph-devel'
>>> Subject: RE: Very slow recovery/peering with latest master
>>> 
>>> 
>>> Sage/Greg,
>>> 
>>> Yeah, as we expected, it is not happening probably because of recovery settings. I reverted it back in my ceph.conf , but, still seeing this problem.
>>> 
>>> Some observation :
>>> ----------------------
>>> 
>>> 1. First of all, I don't think it is something related to my environment. I recreated the cluster with Hammer and this problem is not there.
>>> 
>>> 2. I have enabled the messenger/monclient log (Couldn't attach here) in one of the OSDs and found monitor is taking long time to detect the up OSDs. If you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, there is no communication (only getting KEEP_ALIVE) till 2015-09-16 16:16:07.180482 , so, 3 mins !!
>>> 
>>> 3. During this period, I saw monclient trying to communicate with monitor but not able to probably. It is sending osd_boot at 2015-09-16 16:16:07.180482 only..
>>> 
>>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: 
>>> _send_mon_message to mon.a at 10.60.194.10:6789/0
>>> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 
>>> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 
>>> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
>>> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 remote, 10.60.194.10:6789/0, have pipe.
>>> 
>>> 4. BTW, the osd down scenario is detected very quickly (ceph -w output) , problem is during coming up I guess.
>>> 
>>> 
>>> So, something related to mon communication getting slower ?
>>> Let me know if more verbose logging is required and how should I share the log..
>>> 
>>> Thanks & Regards
>>> Somnath
>>> 
>>> -----Original Message-----
>>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
>>> Sent: Wednesday, September 16, 2015 11:35 AM
>>> To: Somnath Roy
>>> Cc: ceph-devel
>>> Subject: Re: Very slow recovery/peering with latest master
>>> 
>>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>>> Hi,
>>>> I am seeing very slow recovery when I am adding OSDs with the latest master.
>>>> Also, If I just restart all the OSDs (no IO is going on in the cluster) , cluster is taking a significant amount of time to reach in active+clean state (and even detecting all the up OSDs).
>>>> 
>>>> I saw the recovery/backfill default parameters are now changed (to lower value) , this probably explains the recovery scenario , but, will it affect the peering time during OSD startup as well ?
>>> 
>>> I don't think these values should impact peering time, but you could configure them back to the old defaults and see if it changes.
>>> -Greg
>>> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Very slow recovery/peering with latest master
  2015-09-23 23:42           ` Handzik, Joe
@ 2015-09-24  1:31             ` Sage Weil
  2015-09-24  6:32               ` Podoski, Igor
  2015-09-24 18:27             ` Gregory Farnum
  1 sibling, 1 reply; 17+ messages in thread
From: Sage Weil @ 2015-09-24  1:31 UTC (permalink / raw)
  To: Handzik, Joe
  Cc: Somnath Roy, Samuel Just, Samuel Just (sam.just@inktank.com), ceph-devel

On Wed, 23 Sep 2015, Handzik, Joe wrote:
> Ok. When configuring with ceph-disk, it does something nifty and 
> actually gives the OSD the uuid of the disk's partition as its fsid. I 
> bootstrap off that to get an argument to pass into the function you have 
> identified as the bottleneck. I ran it by sage and we both realized 
> there would be cases where it wouldn't work...I'm sure neither of us 
> realized the failure would take three minutes though.
> 
> In the short term, it makes sense to create an option to disable or 
> short-circuit the blkid code. I would prefer that the default be left 
> with the code enabled, but I'm open to default disabled if others think 
> this will be a widespread problem. You could also make sure your OSD 
> fsids are set to match your disk partition uuids for now too, if that's 
> a faster workaround for you (it'll get rid of the failure).

I think we should try to figure out where it is hanging.  Can you strace 
the blkid process to see what it is up to?

I opened http://tracker.ceph.com/issues/13219

I think as long as it behaves reliably with ceph-disk OSDs then we can 
have it on by default.

sage


> 
> Joe
> 
> > On Sep 23, 2015, at 6:26 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> > 
> > <<inline
> > 
> > -----Original Message-----
> > From: Handzik, Joe [mailto:joseph.t.handzik@hpe.com] 
> > Sent: Wednesday, September 23, 2015 4:20 PM
> > To: Samuel Just
> > Cc: Somnath Roy; Samuel Just (sam.just@inktank.com); Sage Weil (sage@newdream.net); ceph-devel
> > Subject: Re: Very slow recovery/peering with latest master
> > 
> > I added that, there is code up the stack in calamari that consumes the path provided, which is intended in the future to facilitate disk monitoring and management.
> > 
> > [Somnath] Ok
> > 
> > Somnath, what does your disk configuration look like (filesystem, SSD/HDD, anything else you think could be relevant)? Did you configure your disks with ceph-disk, or by hand? I never saw this while testing my code, has anyone else heard of this behavior on master? The code has been in master for 2-3 months now I believe.
> > [Somnath] All SSD , I use mkcephfs to create cluster , I partitioned the disk with fdisk beforehand. I am using XFS. Are you trying with Ubuntu 3.16.* kernel ? It could be Linux distribution/kernel specific.
> > 
> > It would be nice to not need to disable this, but if this behavior exists and can't be explained by a misconfiguration or something else I'll need to figure out a different implementation.
> > 
> > Joe
> > 
> >> On Sep 23, 2015, at 6:07 PM, Samuel Just <sjust@redhat.com> wrote:
> >> 
> >> Wow.  Why would that take so long?  I think you are correct that it's 
> >> only used for metadata, we could just add a config value to disable 
> >> it.
> >> -Sam
> >> 
> >>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> >>> Sam/Sage,
> >>> I debugged it down and found out that the get_device_by_uuid->blkid_find_dev_with_tag() call within FileStore::collect_metadata() is hanging for ~3 mins before returning a EINVAL. I saw this portion is newly added after hammer.
> >>> Commenting it out resolves the issue. BTW, I saw this value is stored as metadata but not used anywhere , am I missing anything ?
> >>> Here is my Linux details..
> >>> 
> >>> root@emsnode5:~/wip-write-path-optimization/src# uname -a Linux 
> >>> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 
> >>> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> >>> 
> >>> 
> >>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a No 
> >>> LSB modules are available.
> >>> Distributor ID: Ubuntu
> >>> Description:    Ubuntu 14.04.2 LTS
> >>> Release:        14.04
> >>> Codename:       trusty
> >>> 
> >>> Thanks & Regards
> >>> Somnath
> >>> 
> >>> -----Original Message-----
> >>> From: Somnath Roy
> >>> Sent: Wednesday, September 16, 2015 2:20 PM
> >>> To: 'Gregory Farnum'
> >>> Cc: 'ceph-devel'
> >>> Subject: RE: Very slow recovery/peering with latest master
> >>> 
> >>> 
> >>> Sage/Greg,
> >>> 
> >>> Yeah, as we expected, it is not happening probably because of recovery settings. I reverted it back in my ceph.conf , but, still seeing this problem.
> >>> 
> >>> Some observation :
> >>> ----------------------
> >>> 
> >>> 1. First of all, I don't think it is something related to my environment. I recreated the cluster with Hammer and this problem is not there.
> >>> 
> >>> 2. I have enabled the messenger/monclient log (Couldn't attach here) in one of the OSDs and found monitor is taking long time to detect the up OSDs. If you see the log, I have started OSD at 2015-09-16 16:13:07.042463 , but, there is no communication (only getting KEEP_ALIVE) till 2015-09-16 16:16:07.180482 , so, 3 mins !!
> >>> 
> >>> 3. During this period, I saw monclient trying to communicate with monitor but not able to probably. It is sending osd_boot at 2015-09-16 16:16:07.180482 only..
> >>> 
> >>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: 
> >>> _send_mon_message to mon.a at 10.60.194.10:6789/0
> >>> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 
> >>> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 
> >>> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
> >>> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 remote, 10.60.194.10:6789/0, have pipe.
> >>> 
> >>> 4. BTW, the osd down scenario is detected very quickly (ceph -w output) , problem is during coming up I guess.
> >>> 
> >>> 
> >>> So, something related to mon communication getting slower ?
> >>> Let me know if more verbose logging is required and how should I share the log..
> >>> 
> >>> Thanks & Regards
> >>> Somnath
> >>> 
> >>> -----Original Message-----
> >>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> >>> Sent: Wednesday, September 16, 2015 11:35 AM
> >>> To: Somnath Roy
> >>> Cc: ceph-devel
> >>> Subject: Re: Very slow recovery/peering with latest master
> >>> 
> >>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> >>>> Hi,
> >>>> I am seeing very slow recovery when I am adding OSDs with the latest master.
> >>>> Also, If I just restart all the OSDs (no IO is going on in the cluster) , cluster is taking a significant amount of time to reach in active+clean state (and even detecting all the up OSDs).
> >>>> 
> >>>> I saw the recovery/backfill default parameters are now changed (to lower value) , this probably explains the recovery scenario , but, will it affect the peering time during OSD startup as well ?
> >>> 
> >>> I don't think these values should impact peering time, but you could configure them back to the old defaults and see if it changes.
> >>> -Greg
> >>> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Very slow recovery/peering with latest master
  2015-09-24  1:31             ` Sage Weil
@ 2015-09-24  6:32               ` Podoski, Igor
  2015-09-24 16:09                 ` Somnath Roy
  0 siblings, 1 reply; 17+ messages in thread
From: Podoski, Igor @ 2015-09-24  6:32 UTC (permalink / raw)
  To: Somnath Roy
  Cc: Samuel Just, Samuel Just (sam.just@inktank.com),
	ceph-devel, Sage Weil, Handzik, Joe

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, September 24, 2015 3:32 AM
> To: Handzik, Joe
> Cc: Somnath Roy; Samuel Just; Samuel Just (sam.just@inktank.com); ceph-
> devel
> Subject: Re: Very slow recovery/peering with latest master
> 
> On Wed, 23 Sep 2015, Handzik, Joe wrote:
> > Ok. When configuring with ceph-disk, it does something nifty and
> > actually gives the OSD the uuid of the disk's partition as its fsid. I
> > bootstrap off that to get an argument to pass into the function you
> > have identified as the bottleneck. I ran it by sage and we both
> > realized there would be cases where it wouldn't work...I'm sure
> > neither of us realized the failure would take three minutes though.
> >
> > In the short term, it makes sense to create an option to disable or
> > short-circuit the blkid code. I would prefer that the default be left
> > with the code enabled, but I'm open to default disabled if others
> > think this will be a widespread problem. You could also make sure your
> > OSD fsids are set to match your disk partition uuids for now too, if
> > that's a faster workaround for you (it'll get rid of the failure).
> 
> I think we should try to figure out where it is hanging.  Can you strace the
> blkid process to see what it is up to?
> 
> I opened http://tracker.ceph.com/issues/13219
> 
> I think as long as it behaves reliably with ceph-disk OSDs then we can have it
> on by default.
> 
> sage
> 
> 
> >
> > Joe
> >
> > > On Sep 23, 2015, at 6:26 PM, Somnath Roy <Somnath.Roy@sandisk.com>
> wrote:
> > >
> > > <<inline
> > >
> > > -----Original Message-----
> > > From: Handzik, Joe [mailto:joseph.t.handzik@hpe.com]
> > > Sent: Wednesday, September 23, 2015 4:20 PM
> > > To: Samuel Just
> > > Cc: Somnath Roy; Samuel Just (sam.just@inktank.com); Sage Weil
> > > (sage@newdream.net); ceph-devel
> > > Subject: Re: Very slow recovery/peering with latest master
> > >
> > > I added that, there is code up the stack in calamari that consumes the
> path provided, which is intended in the future to facilitate disk monitoring
> and management.
> > >
> > > [Somnath] Ok
> > >
> > > Somnath, what does your disk configuration look like (filesystem,
> SSD/HDD, anything else you think could be relevant)? Did you configure your
> disks with ceph-disk, or by hand? I never saw this while testing my code, has
> anyone else heard of this behavior on master? The code has been in master
> for 2-3 months now I believe.
> > > [Somnath] All SSD , I use mkcephfs to create cluster , I partitioned the
> disk with fdisk beforehand. I am using XFS. Are you trying with Ubuntu 3.16.*
> kernel ? It could be Linux distribution/kernel specific.

Somnath, maybe it is GPT-related; what partition table do you have? I think parted and gdisk can create GPT partitions, but not fdisk (definitely not the version that I use).

You could back up and clear the blkid cache (/etc/blkid/blkid.tab); maybe it is stale.

Regards,
Igor.


> > >
> > > It would be nice to not need to disable this, but if this behavior exists and
> can't be explained by a misconfiguration or something else I'll need to figure
> out a different implementation.
> > >
> > > Joe
> > >
> > >> On Sep 23, 2015, at 6:07 PM, Samuel Just <sjust@redhat.com> wrote:
> > >>
> > >> Wow.  Why would that take so long?  I think you are correct that
> > >> it's only used for metadata, we could just add a config value to
> > >> disable it.
> > >> -Sam
> > >>
> > >>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy
> <Somnath.Roy@sandisk.com> wrote:
> > >>> Sam/Sage,
> > >>> I debugged it down and found out that the get_device_by_uuid-
> >blkid_find_dev_with_tag() call within FileStore::collect_metadata() is
> hanging for ~3 mins before returning a EINVAL. I saw this portion is newly
> added after hammer.
> > >>> Commenting it out resolves the issue. BTW, I saw this value is stored as
> metadata but not used anywhere , am I missing anything ?
> > >>> Here is my Linux details..
> > >>>
> > >>> root@emsnode5:~/wip-write-path-optimization/src# uname -a Linux
> > >>> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8
> > >>> 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> > >>>
> > >>>
> > >>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
> No
> > >>> LSB modules are available.
> > >>> Distributor ID: Ubuntu
> > >>> Description:    Ubuntu 14.04.2 LTS
> > >>> Release:        14.04
> > >>> Codename:       trusty
> > >>>
> > >>> Thanks & Regards
> > >>> Somnath
> > >>>
> > >>> -----Original Message-----
> > >>> From: Somnath Roy
> > >>> Sent: Wednesday, September 16, 2015 2:20 PM
> > >>> To: 'Gregory Farnum'
> > >>> Cc: 'ceph-devel'
> > >>> Subject: RE: Very slow recovery/peering with latest master
> > >>>
> > >>>
> > >>> Sage/Greg,
> > >>>
> > >>> Yeah, as we expected, it is not happening probably because of
> recovery settings. I reverted it back in my ceph.conf , but, still seeing this
> problem.
> > >>>
> > >>> Some observation :
> > >>> ----------------------
> > >>>
> > >>> 1. First of all, I don't think it is something related to my environment. I
> recreated the cluster with Hammer and this problem is not there.
> > >>>
> > >>> 2. I have enabled the messenger/monclient log (Couldn't attach here)
> in one of the OSDs and found monitor is taking long time to detect the up
> OSDs. If you see the log, I have started OSD at 2015-09-16 16:13:07.042463 ,
> but, there is no communication (only getting KEEP_ALIVE) till 2015-09-16
> 16:16:07.180482 , so, 3 mins !!
> > >>>
> > >>> 3. During this period, I saw monclient trying to communicate with
> monitor but not able to probably. It is sending osd_boot at 2015-09-16
> 16:16:07.180482 only..
> > >>>
> > >>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient:
> > >>> _send_mon_message to mon.a at 10.60.194.10:6789/0
> > >>> 2015-09-16 16:16:07.180482 7f65377fe700  1 --
> > >>> 10.60.194.10:6820/20102
> > >>> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features
> > >>> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
> > >>> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102
> submit_message osd_boot(osd.10 booted 0 features 72057594037927935
> v45) v6 remote, 10.60.194.10:6789/0, have pipe.
> > >>>
> > >>> 4. BTW, the osd down scenario is detected very quickly (ceph -w
> output) , problem is during coming up I guess.
> > >>>
> > >>>
> > >>> So, something related to mon communication getting slower ?
> > >>> Let me know if more verbose logging is required and how should I
> share the log..
> > >>>
> > >>> Thanks & Regards
> > >>> Somnath
> > >>>
> > >>> -----Original Message-----
> > >>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> > >>> Sent: Wednesday, September 16, 2015 11:35 AM
> > >>> To: Somnath Roy
> > >>> Cc: ceph-devel
> > >>> Subject: Re: Very slow recovery/peering with latest master
> > >>>
> > >>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy
> <Somnath.Roy@sandisk.com> wrote:
> > >>>> Hi,
> > >>>> I am seeing very slow recovery when I am adding OSDs with the latest
> master.
> > >>>> Also, If I just restart all the OSDs (no IO is going on in the cluster) ,
> cluster is taking a significant amount of time to reach in active+clean state
> (and even detecting all the up OSDs).
> > >>>>
> > >>>> I saw the recovery/backfill default parameters are now changed (to
> lower value) , this probably explains the recovery scenario , but, will it affect
> the peering time during OSD startup as well ?
> > >>>
> > >>> I don't think these values should impact peering time, but you could
> configure them back to the old defaults and see if it changes.
> > >>> -Greg
> > >>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Very slow recovery/peering with latest master
  2015-09-24  6:32               ` Podoski, Igor
@ 2015-09-24 16:09                 ` Somnath Roy
  2015-09-28  8:01                   ` Chen, Xiaoxi
  0 siblings, 1 reply; 17+ messages in thread
From: Somnath Roy @ 2015-09-24 16:09 UTC (permalink / raw)
  To: Podoski, Igor
  Cc: Samuel Just, Samuel Just (sam.just@inktank.com),
	ceph-devel, Sage Weil, Handzik, Joe

Yeah, Igor, maybe..
Meanwhile, I was able to get a gdb trace of the hang..

(gdb) bt
#0  0x00007f6f6bf043bd in read () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6f6af3b066 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
#2  0x00007f6f6af43ae2 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
#3  0x00007f6f6af42788 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
#4  0x00007f6f6af42a53 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
#5  0x00007f6f6af3c17b in blkid_do_safeprobe () from /lib/x86_64-linux-gnu/libblkid.so.1
#6  0x00007f6f6af3e0c4 in blkid_verify () from /lib/x86_64-linux-gnu/libblkid.so.1
#7  0x00007f6f6af387fb in blkid_get_dev () from /lib/x86_64-linux-gnu/libblkid.so.1
#8  0x00007f6f6af38acb in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
#9  0x00007f6f6af3946d in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
#10 0x00007f6f6af39892 in blkid_probe_all_new () from /lib/x86_64-linux-gnu/libblkid.so.1
#11 0x00007f6f6af3dc10 in blkid_find_dev_with_tag () from /lib/x86_64-linux-gnu/libblkid.so.1
#12 0x00007f6f6d3bf923 in get_device_by_uuid (dev_uuid=..., label=label@entry=0x7f6f6d535fe5 "PARTUUID", partition=partition@entry=0x7f6f347eb5a0 "", device=device@entry=0x7f6f347ec5a0 "")
    at common/blkdev.cc:193
#13 0x00007f6f6d147de5 in FileStore::collect_metadata (this=0x7f6f68893000, pm=0x7f6f21419598) at os/FileStore.cc:660
#14 0x00007f6f6cebfa9a in OSD::_collect_metadata (this=this@entry=0x7f6f6894f000, pm=pm@entry=0x7f6f21419598) at osd/OSD.cc:4586
#15 0x00007f6f6cec0614 in OSD::_send_boot (this=this@entry=0x7f6f6894f000) at osd/OSD.cc:4568
#16 0x00007f6f6cec203a in OSD::_maybe_boot (this=0x7f6f6894f000, oldest=1, newest=100) at osd/OSD.cc:4463
#17 0x00007f6f6cefc5e1 in Context::complete (this=0x7f6f3d3864e0, r=<optimized out>) at ./include/Context.h:64
#18 0x00007f6f6d2eed08 in Finisher::finisher_thread_entry (this=0x7ffee7272d70) at common/Finisher.cc:65
#19 0x00007f6f6befd182 in start_thread (arg=0x7f6f347ee700) at pthread_create.c:312
#20 0x00007f6f6a24347d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


strace was not much help, since the other threads are not blocked and keep printing futex traces..
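As a sketch of one way around the futex noise (the commands and pid are illustrative, not what was actually run here): strace can be attached to just the stuck thread instead of the whole process, since on Linux the per-thread IDs live under /proc/<pid>/task.

```shell
# List the thread IDs (TIDs) of a process. Each TID can then be straced
# individually with `strace -p <tid>`, which avoids the futex chatter
# coming from the other, healthy threads.
list_tids() {
    ls "/proc/$1/task"
}

# Illustrative use against a live OSD (<osd-pid> is a placeholder):
#   for tid in $(list_tids <osd-pid>); do
#       echo "=== thread $tid ==="
#       timeout 2 strace -p "$tid" 2>&1 | head -n 3
#   done
```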

Thanks & Regards
Somnath

-----Original Message-----
From: Podoski, Igor [mailto:Igor.Podoski@ts.fujitsu.com]
Sent: Wednesday, September 23, 2015 11:33 PM
To: Somnath Roy
Cc: Samuel Just; Samuel Just (sam.just@inktank.com); ceph-devel; Sage Weil; Handzik, Joe
Subject: RE: Very slow recovery/peering with latest master

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, September 24, 2015 3:32 AM
> To: Handzik, Joe
> Cc: Somnath Roy; Samuel Just; Samuel Just (sam.just@inktank.com);
> ceph- devel
> Subject: Re: Very slow recovery/peering with latest master
>
> On Wed, 23 Sep 2015, Handzik, Joe wrote:
> > Ok. When configuring with ceph-disk, it does something nifty and
> > actually gives the OSD the uuid of the disk's partition as its fsid.
> > I bootstrap off that to get an argument to pass into the function
> > you have identified as the bottleneck. I ran it by sage and we both
> > realized there would be cases where it wouldn't work...I'm sure
> > neither of us realized the failure would take three minutes though.
> >
> > In the short term, it makes sense to create an option to disable or
> > short-circuit the blkid code. I would prefer that the default be
> > left with the code enabled, but I'm open to default disabled if
> > others think this will be a widespread problem. You could also make
> > sure your OSD fsids are set to match your disk partition uuids for
> > now too, if that's a faster workaround for you (it'll get rid of the failure).
>
> I think we should try to figure out where it is hanging.  Can you
> strace the blkid process to see what it is up to?
>
> I opened http://tracker.ceph.com/issues/13219
>
> I think as long as it behaves reliably with ceph-disk OSDs then we can
> have it on by default.
>
> sage
>
>
> >
> > Joe
> >
> > > On Sep 23, 2015, at 6:26 PM, Somnath Roy <Somnath.Roy@sandisk.com>
> wrote:
> > >
> > > <<inline
> > >
> > > -----Original Message-----
> > > From: Handzik, Joe [mailto:joseph.t.handzik@hpe.com]
> > > Sent: Wednesday, September 23, 2015 4:20 PM
> > > To: Samuel Just
> > > Cc: Somnath Roy; Samuel Just (sam.just@inktank.com); Sage Weil
> > > (sage@newdream.net); ceph-devel
> > > Subject: Re: Very slow recovery/peering with latest master
> > >
> > > I added that, there is code up the stack in calamari that consumes
> > > the
> path provided, which is intended in the future to facilitate disk
> monitoring and management.
> > >
> > > [Somnath] Ok
> > >
> > > Somnath, what does your disk configuration look like (filesystem,
> SSD/HDD, anything else you think could be relevant)? Did you configure
> your disks with ceph-disk, or by hand? I never saw this while testing
> my code, has anyone else heard of this behavior on master? The code
> has been in master for 2-3 months now I believe.
> > > [Somnath] All SSD , I use mkcephfs to create cluster , I
> > > partitioned the
> disk with fdisk beforehand. I am using XFS. Are you trying with Ubuntu
> 3.16.* kernel ? It could be Linux distribution/kernel specific.

Somnath, maybe it is GPT-related; what partition table do you have? I think parted and gdisk can create GPT partitions, but fdisk cannot (definitely not in the version that I use).

You could back up and clear the blkid cache (/etc/blkid/blkid.tab); maybe it has stale entries.
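A concrete sketch of both checks (the blkid `-p`/`-s PTTYPE` flags and the /etc/blkid/blkid.tab cache path are assumptions and can vary by distro and util-linux version):

```shell
# Report the partition table type (gpt/dos) of a device, if any,
# using blkid's low-level probe (-p) and the PTTYPE tag.
pt_type() {
    blkid -p -o value -s PTTYPE "$1" 2>/dev/null || true
}

# Back up the blkid cache, then truncate it so blkid re-probes from scratch.
clear_blkid_cache() {
    cache="$1"
    [ -f "$cache" ] || return 0    # nothing to do if no cache file
    cp "$cache" "$cache.bak"       # keep a backup in case probing regresses
    : > "$cache"                   # truncate; blkid rebuilds on next run
}

# Typical use (device and path are examples):
#   pt_type /dev/sdy
#   clear_blkid_cache /etc/blkid/blkid.tab
```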

Regards,
Igor.


> > >
> > > It would be nice to not need to disable this, but if this behavior
> > > exists and
> can't be explained by a misconfiguration or something else I'll need
> to figure out a different implementation.
> > >
> > > Joe
> > >
> > >> On Sep 23, 2015, at 6:07 PM, Samuel Just <sjust@redhat.com> wrote:
> > >>
> > >> Wow.  Why would that take so long?  I think you are correct that
> > >> it's only used for metadata, we could just add a config value to
> > >> disable it.
> > >> -Sam
> > >>
> > >>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy
> <Somnath.Roy@sandisk.com> wrote:
> > >>> Sam/Sage,
> > >>> I debugged it down and found out that the get_device_by_uuid-
> >blkid_find_dev_with_tag() call within FileStore::collect_metadata()
> >is
> hanging for ~3 mins before returning a EINVAL. I saw this portion is
> newly added after hammer.
> > >>> Commenting it out resolves the issue. BTW, I saw this value is
> > >>> stored as
> metadata but not used anywhere , am I missing anything ?
> > >>> Here is my Linux details..
> > >>>
> > >>> root@emsnode5:~/wip-write-path-optimization/src# uname -a Linux
> > >>> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8
> > >>> 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> > >>>
> > >>>
> > >>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
> No
> > >>> LSB modules are available.
> > >>> Distributor ID: Ubuntu
> > >>> Description:    Ubuntu 14.04.2 LTS
> > >>> Release:        14.04
> > >>> Codename:       trusty
> > >>>
> > >>> Thanks & Regards
> > >>> Somnath
> > >>>
> > >>> -----Original Message-----
> > >>> From: Somnath Roy
> > >>> Sent: Wednesday, September 16, 2015 2:20 PM
> > >>> To: 'Gregory Farnum'
> > >>> Cc: 'ceph-devel'
> > >>> Subject: RE: Very slow recovery/peering with latest master
> > >>>
> > >>>
> > >>> Sage/Greg,
> > >>>
> > >>> Yeah, as we expected, it is not happening probably because of
> recovery settings. I reverted it back in my ceph.conf , but, still
> seeing this problem.
> > >>>
> > >>> Some observation :
> > >>> ----------------------
> > >>>
> > >>> 1. First of all, I don't think it is something related to my
> > >>> environment. I
> recreated the cluster with Hammer and this problem is not there.
> > >>>
> > >>> 2. I have enabled the messenger/monclient log (Couldn't attach
> > >>> here)
> in one of the OSDs and found monitor is taking long time to detect the
> up OSDs. If you see the log, I have started OSD at 2015-09-16
> 16:13:07.042463 , but, there is no communication (only getting
> KEEP_ALIVE) till 2015-09-16
> 16:16:07.180482 , so, 3 mins !!
> > >>>
> > >>> 3. During this period, I saw monclient trying to communicate
> > >>> with
> monitor but not able to probably. It is sending osd_boot at 2015-09-16
> 16:16:07.180482 only..
> > >>>
> > >>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient:
> > >>> _send_mon_message to mon.a at 10.60.194.10:6789/0
> > >>> 2015-09-16 16:16:07.180482 7f65377fe700  1 --
> > >>> 10.60.194.10:6820/20102
> > >>> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features
> > >>> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con
> > >>> 0x7f6542045680
> > >>> 2015-09-16 16:16:07.180496 7f65377fe700 20 --
> > >>> 10.60.194.10:6820/20102
> submit_message osd_boot(osd.10 booted 0 features 72057594037927935
> v45) v6 remote, 10.60.194.10:6789/0, have pipe.
> > >>>
> > >>> 4. BTW, the osd down scenario is detected very quickly (ceph -w
> output) , problem is during coming up I guess.
> > >>>
> > >>>
> > >>> So, something related to mon communication getting slower ?
> > >>> Let me know if more verbose logging is required and how should I
> share the log..
> > >>>
> > >>> Thanks & Regards
> > >>> Somnath
> > >>>
> > >>> -----Original Message-----
> > >>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> > >>> Sent: Wednesday, September 16, 2015 11:35 AM
> > >>> To: Somnath Roy
> > >>> Cc: ceph-devel
> > >>> Subject: Re: Very slow recovery/peering with latest master
> > >>>
> > >>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy
> <Somnath.Roy@sandisk.com> wrote:
> > >>>> Hi,
> > >>>> I am seeing very slow recovery when I am adding OSDs with the
> > >>>> latest
> master.
> > >>>> Also, If I just restart all the OSDs (no IO is going on in the
> > >>>> cluster) ,
> cluster is taking a significant amount of time to reach in
> active+clean state (and even detecting all the up OSDs).
> > >>>>
> > >>>> I saw the recovery/backfill default parameters are now changed
> > >>>> (to
> lower value) , this probably explains the recovery scenario , but,
> will it affect the peering time during OSD startup as well ?
> > >>>
> > >>> I don't think these values should impact peering time, but you
> > >>> could
> configure them back to the old defaults and see if it changes.
> > >>> -Greg
> > >>>

* Re: Very slow recovery/peering with latest master
  2015-09-23 23:42           ` Handzik, Joe
  2015-09-24  1:31             ` Sage Weil
@ 2015-09-24 18:27             ` Gregory Farnum
  1 sibling, 0 replies; 17+ messages in thread
From: Gregory Farnum @ 2015-09-24 18:27 UTC (permalink / raw)
  To: Handzik, Joe
  Cc: Somnath Roy, Samuel Just, Samuel Just (sam.just@inktank.com),
	Sage Weil (sage@newdream.net),
	ceph-devel

On Wed, Sep 23, 2015 at 4:42 PM, Handzik, Joe <joseph.t.handzik@hpe.com> wrote:
> Ok. When configuring with ceph-disk, it does something nifty and actually gives the OSD the uuid of the disk's partition as its fsid. I bootstrap off that to get an argument to pass into the function you have identified as the bottleneck. I ran it by sage and we both realized there would be cases where it wouldn't work...I'm sure neither of us realized the failure would take three minutes though.
>
> In the short term, it makes sense to create an option to disable or short-circuit the blkid code. I would prefer that the default be left with the code enabled, but I'm open to default disabled if others think this will be a widespread problem. You could also make sure your OSD fsids are set to match your disk partition uuids for now too, if that's a faster workaround for you (it'll get rid of the failure).

I'd leave it enabled by default — ceph-disk is the standard way of
setting up a Ceph cluster, and mkcephfs is definitely going bye-bye.
-Greg

* RE: Very slow recovery/peering with latest master
  2015-09-24 16:09                 ` Somnath Roy
@ 2015-09-28  8:01                   ` Chen, Xiaoxi
  2015-09-28  8:42                     ` Somnath Roy
  0 siblings, 1 reply; 17+ messages in thread
From: Chen, Xiaoxi @ 2015-09-28  8:01 UTC (permalink / raw)
  To: Somnath Roy, Podoski, Igor
  Cc: Samuel Just, Samuel Just (sam.just@inktank.com),
	ceph-devel, Sage Weil, Handzik, Joe

FWIW, blkid works well with both GPT (created by parted) and MSDOS (created by fdisk) partition tables in my environment.

But blkid doesn't show information for the disks in the external bay (connected via a JBOD controller) in my setup.

See below: SDB and SDH are SSDs attached to the front panel, but the rest of the OSD disks (0-9) are from the external bay.

/dev/sdc       976285652 294887592 681398060  31% /var/lib/ceph/mnt/osd-device-0-data
/dev/sdd       976285652 269840116 706445536  28% /var/lib/ceph/mnt/osd-device-1-data
/dev/sde       976285652 257610832 718674820  27% /var/lib/ceph/mnt/osd-device-2-data
/dev/sdf       976285652 293460620 682825032  31% /var/lib/ceph/mnt/osd-device-3-data
/dev/sdg       976285652 294444100 681841552  31% /var/lib/ceph/mnt/osd-device-4-data
/dev/sdi       976285652 288416840 687868812  30% /var/lib/ceph/mnt/osd-device-5-data
/dev/sdj       976285652 273090960 703194692  28% /var/lib/ceph/mnt/osd-device-6-data
/dev/sdk       976285652 302720828 673564824  32% /var/lib/ceph/mnt/osd-device-7-data
/dev/sdl       976285652 268207968 708077684  28% /var/lib/ceph/mnt/osd-device-8-data
/dev/sdm       976285652 293316752 682968900  31% /var/lib/ceph/mnt/osd-device-9-data
/dev/sdb1      292824376  10629024 282195352   4% /var/lib/ceph/mnt/osd-device-40-data
/dev/sdh1      292824376  11413956 281410420   4% /var/lib/ceph/mnt/osd-device-41-data



root@osd1:~# blkid 
/dev/sdb1: UUID="907806fe-1d29-4ef7-ad11-5a933a11601e" TYPE="xfs" 
/dev/sdh1: UUID="9dfe68ac-f297-4a02-8d21-50c194af4ff2" TYPE="xfs" 
/dev/sda1: UUID="cdf945ce-a345-4766-b89e-cecc33689016" TYPE="ext4" 
/dev/sda2: UUID="7a565029-deb9-4e68-835c-f097c2b1514e" TYPE="ext4" 
/dev/sda5: UUID="e61bfc35-932d-442f-a5ca-795897f62744" TYPE="swap"
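A sketch of checking for this mismatch automatically (the commands are assumptions, not taken from the thread): list the mounted /dev devices that blkid fails to report.

```shell
# List devices that are mounted but absent from blkid's output.
# Anything printed here is a device blkid is failing to report.
missing_from_blkid() {
    mounted=$(mktemp) && known=$(mktemp)
    awk '$1 ~ /^\/dev\// {print $1}' /proc/mounts | sort -u > "$mounted"
    blkid -o device 2>/dev/null | sort -u > "$known"
    comm -23 "$mounted" "$known"   # lines only in the mounted list
    rm -f "$mounted" "$known"
}
```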

 

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Friday, September 25, 2015 12:09 AM
> To: Podoski, Igor
> Cc: Samuel Just; Samuel Just (sam.just@inktank.com); ceph-devel; Sage Weil;
> Handzik, Joe
> Subject: RE: Very slow recovery/peering with latest master
> 
> Yeah , Igor may be..
> Meanwhile, I am able to get gdb trace of the hang..
> 
> (gdb) bt
> #0  0x00007f6f6bf043bd in read () at ../sysdeps/unix/syscall-template.S:81
> #1  0x00007f6f6af3b066 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
> #2  0x00007f6f6af43ae2 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
> #3  0x00007f6f6af42788 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
> #4  0x00007f6f6af42a53 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
> #5  0x00007f6f6af3c17b in blkid_do_safeprobe () from /lib/x86_64-linux-
> gnu/libblkid.so.1
> #6  0x00007f6f6af3e0c4 in blkid_verify () from /lib/x86_64-linux-
> gnu/libblkid.so.1
> #7  0x00007f6f6af387fb in blkid_get_dev () from /lib/x86_64-linux-
> gnu/libblkid.so.1
> #8  0x00007f6f6af38acb in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
> #9  0x00007f6f6af3946d in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
> #10 0x00007f6f6af39892 in blkid_probe_all_new () from /lib/x86_64-linux-
> gnu/libblkid.so.1
> #11 0x00007f6f6af3dc10 in blkid_find_dev_with_tag () from /lib/x86_64-
> linux-gnu/libblkid.so.1
> #12 0x00007f6f6d3bf923 in get_device_by_uuid (dev_uuid=...,
> label=label@entry=0x7f6f6d535fe5 "PARTUUID",
> partition=partition@entry=0x7f6f347eb5a0 "",
> device=device@entry=0x7f6f347ec5a0 "")
>     at common/blkdev.cc:193
> #13 0x00007f6f6d147de5 in FileStore::collect_metadata (this=0x7f6f68893000,
> pm=0x7f6f21419598) at os/FileStore.cc:660
> #14 0x00007f6f6cebfa9a in OSD::_collect_metadata
> (this=this@entry=0x7f6f6894f000, pm=pm@entry=0x7f6f21419598) at
> osd/OSD.cc:4586
> #15 0x00007f6f6cec0614 in OSD::_send_boot
> (this=this@entry=0x7f6f6894f000) at osd/OSD.cc:4568
> #16 0x00007f6f6cec203a in OSD::_maybe_boot (this=0x7f6f6894f000,
> oldest=1, newest=100) at osd/OSD.cc:4463
> #17 0x00007f6f6cefc5e1 in Context::complete (this=0x7f6f3d3864e0,
> r=<optimized out>) at ./include/Context.h:64
> #18 0x00007f6f6d2eed08 in Finisher::finisher_thread_entry
> (this=0x7ffee7272d70) at common/Finisher.cc:65
> #19 0x00007f6f6befd182 in start_thread (arg=0x7f6f347ee700) at
> pthread_create.c:312
> #20 0x00007f6f6a24347d in clone ()
> at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> 
> 
> Strace was not helpful much since other threads are not block and keep
> printing the futex traces..
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Podoski, Igor [mailto:Igor.Podoski@ts.fujitsu.com]
> Sent: Wednesday, September 23, 2015 11:33 PM
> To: Somnath Roy
> Cc: Samuel Just; Samuel Just (sam.just@inktank.com); ceph-devel; Sage Weil;
> Handzik, Joe
> Subject: RE: Very slow recovery/peering with latest master
> 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Thursday, September 24, 2015 3:32 AM
> > To: Handzik, Joe
> > Cc: Somnath Roy; Samuel Just; Samuel Just (sam.just@inktank.com);
> > ceph- devel
> > Subject: Re: Very slow recovery/peering with latest master
> >
> > On Wed, 23 Sep 2015, Handzik, Joe wrote:
> > > Ok. When configuring with ceph-disk, it does something nifty and
> > > actually gives the OSD the uuid of the disk's partition as its fsid.
> > > I bootstrap off that to get an argument to pass into the function
> > > you have identified as the bottleneck. I ran it by sage and we both
> > > realized there would be cases where it wouldn't work...I'm sure
> > > neither of us realized the failure would take three minutes though.
> > >
> > > In the short term, it makes sense to create an option to disable or
> > > short-circuit the blkid code. I would prefer that the default be
> > > left with the code enabled, but I'm open to default disabled if
> > > others think this will be a widespread problem. You could also make
> > > sure your OSD fsids are set to match your disk partition uuids for
> > > now too, if that's a faster workaround for you (it'll get rid of the failure).
> >
> > I think we should try to figure out where it is hanging.  Can you
> > strace the blkid process to see what it is up to?
> >
> > I opened http://tracker.ceph.com/issues/13219
> >
> > I think as long as it behaves reliably with ceph-disk OSDs then we can
> > have it on by default.
> >
> > sage
> >
> >
> > >
> > > Joe
> > >
> > > > On Sep 23, 2015, at 6:26 PM, Somnath Roy <Somnath.Roy@sandisk.com>
> > wrote:
> > > >
> > > > <<inline
> > > >
> > > > -----Original Message-----
> > > > From: Handzik, Joe [mailto:joseph.t.handzik@hpe.com]
> > > > Sent: Wednesday, September 23, 2015 4:20 PM
> > > > To: Samuel Just
> > > > Cc: Somnath Roy; Samuel Just (sam.just@inktank.com); Sage Weil
> > > > (sage@newdream.net); ceph-devel
> > > > Subject: Re: Very slow recovery/peering with latest master
> > > >
> > > > I added that, there is code up the stack in calamari that consumes
> > > > the
> > path provided, which is intended in the future to facilitate disk
> > monitoring and management.
> > > >
> > > > [Somnath] Ok
> > > >
> > > > Somnath, what does your disk configuration look like (filesystem,
> > SSD/HDD, anything else you think could be relevant)? Did you configure
> > your disks with ceph-disk, or by hand? I never saw this while testing
> > my code, has anyone else heard of this behavior on master? The code
> > has been in master for 2-3 months now I believe.
> > > > [Somnath] All SSD , I use mkcephfs to create cluster , I
> > > > partitioned the
> > disk with fdisk beforehand. I am using XFS. Are you trying with Ubuntu
> > 3.16.* kernel ? It could be Linux distribution/kernel specific.
> 
> Somnath, maybe it is GPT related, what partition table do you have? I think
> parted and gdisk can create GPT partitions, but not fdisk (definitely not in
> version that I use).
> 
> You could backup and clear blkid cache /etc/blkid/blkid.tab, maybe there is a
> mess.
> 
> Regards,
> Igor.
> 
> 
> > > >
> > > > It would be nice to not need to disable this, but if this behavior
> > > > exists and
> > can't be explained by a misconfiguration or something else I'll need
> > to figure out a different implementation.
> > > >
> > > > Joe
> > > >
> > > >> On Sep 23, 2015, at 6:07 PM, Samuel Just <sjust@redhat.com> wrote:
> > > >>
> > > >> Wow.  Why would that take so long?  I think you are correct that
> > > >> it's only used for metadata, we could just add a config value to
> > > >> disable it.
> > > >> -Sam
> > > >>
> > > >>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy
> > <Somnath.Roy@sandisk.com> wrote:
> > > >>> Sam/Sage,
> > > >>> I debugged it down and found out that the get_device_by_uuid-
> > >blkid_find_dev_with_tag() call within FileStore::collect_metadata()
> > >is
> > hanging for ~3 mins before returning a EINVAL. I saw this portion is
> > newly added after hammer.
> > > >>> Commenting it out resolves the issue. BTW, I saw this value is
> > > >>> stored as
> > metadata but not used anywhere , am I missing anything ?
> > > >>> Here is my Linux details..
> > > >>>
> > > >>> root@emsnode5:~/wip-write-path-optimization/src# uname -a
> Linux
> > > >>> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8
> > > >>> 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> > > >>>
> > > >>>
> > > >>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
> > No
> > > >>> LSB modules are available.
> > > >>> Distributor ID: Ubuntu
> > > >>> Description:    Ubuntu 14.04.2 LTS
> > > >>> Release:        14.04
> > > >>> Codename:       trusty
> > > >>>
> > > >>> Thanks & Regards
> > > >>> Somnath
> > > >>>
> > > >>> -----Original Message-----
> > > >>> From: Somnath Roy
> > > >>> Sent: Wednesday, September 16, 2015 2:20 PM
> > > >>> To: 'Gregory Farnum'
> > > >>> Cc: 'ceph-devel'
> > > >>> Subject: RE: Very slow recovery/peering with latest master
> > > >>>
> > > >>>
> > > >>> Sage/Greg,
> > > >>>
> > > >>> Yeah, as we expected, it is not happening probably because of
> > recovery settings. I reverted it back in my ceph.conf , but, still
> > seeing this problem.
> > > >>>
> > > >>> Some observation :
> > > >>> ----------------------
> > > >>>
> > > >>> 1. First of all, I don't think it is something related to my
> > > >>> environment. I
> > recreated the cluster with Hammer and this problem is not there.
> > > >>>
> > > >>> 2. I have enabled the messenger/monclient log (Couldn't attach
> > > >>> here)
> > in one of the OSDs and found monitor is taking long time to detect the
> > up OSDs. If you see the log, I have started OSD at 2015-09-16
> > 16:13:07.042463 , but, there is no communication (only getting
> > KEEP_ALIVE) till 2015-09-16
> > 16:16:07.180482 , so, 3 mins !!
> > > >>>
> > > >>> 3. During this period, I saw monclient trying to communicate
> > > >>> with
> > monitor but not able to probably. It is sending osd_boot at 2015-09-16
> > 16:16:07.180482 only..
> > > >>>
> > > >>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient:
> > > >>> _send_mon_message to mon.a at 10.60.194.10:6789/0
> > > >>> 2015-09-16 16:16:07.180482 7f65377fe700  1 --
> > > >>> 10.60.194.10:6820/20102
> > > >>> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features
> > > >>> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con
> > > >>> 0x7f6542045680
> > > >>> 2015-09-16 16:16:07.180496 7f65377fe700 20 --
> > > >>> 10.60.194.10:6820/20102
> > submit_message osd_boot(osd.10 booted 0 features 72057594037927935
> > v45) v6 remote, 10.60.194.10:6789/0, have pipe.
> > > >>>
> > > >>> 4. BTW, the osd down scenario is detected very quickly (ceph -w
> > output) , problem is during coming up I guess.
> > > >>>
> > > >>>
> > > >>> So, something related to mon communication getting slower ?
> > > >>> Let me know if more verbose logging is required and how should I
> > share the log..
> > > >>>
> > > >>> Thanks & Regards
> > > >>> Somnath
> > > >>>
> > > >>> -----Original Message-----
> > > >>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> > > >>> Sent: Wednesday, September 16, 2015 11:35 AM
> > > >>> To: Somnath Roy
> > > >>> Cc: ceph-devel
> > > >>> Subject: Re: Very slow recovery/peering with latest master
> > > >>>
> > > >>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy
> > <Somnath.Roy@sandisk.com> wrote:
> > > >>>> Hi,
> > > >>>> I am seeing very slow recovery when I am adding OSDs with the
> > > >>>> latest
> > master.
> > > >>>> Also, If I just restart all the OSDs (no IO is going on in the
> > > >>>> cluster) ,
> > cluster is taking a significant amount of time to reach in
> > active+clean state (and even detecting all the up OSDs).
> > > >>>>
> > > >>>> I saw the recovery/backfill default parameters are now changed
> > > >>>> (to
> > lower value) , this probably explains the recovery scenario , but,
> > will it affect the peering time during OSD startup as well ?
> > > >>>
> > > >>> I don't think these values should impact peering time, but you
> > > >>> could
> > configure them back to the old defaults and see if it changes.
> > > >>> -Greg
> > > >>>

* RE: Very slow recovery/peering with latest master
  2015-09-28  8:01                   ` Chen, Xiaoxi
@ 2015-09-28  8:42                     ` Somnath Roy
  2015-09-28 12:21                       ` Handzik, Joe
  0 siblings, 1 reply; 17+ messages in thread
From: Somnath Roy @ 2015-09-28  8:42 UTC (permalink / raw)
  To: Chen, Xiaoxi, Podoski, Igor
  Cc: Samuel Just, Samuel Just (sam.just@inktank.com),
	ceph-devel, Sage Weil, Handzik, Joe

Xiaoxi,
Thanks for giving me some pointers.
Now, with the help of strace, I was able to figure out why the blkid* calls are taking so long to complete in my setup.
In my case, the partitions show up properly even though they are connected to a JBOD controller.

root@emsnode10:~/wip-write-path-optimization/src/os# strace -t -o /root/strace_blkid.txt blkid
/dev/sda1: UUID="d2060642-1af4-424f-9957-6a8dc77ff301" TYPE="ext4"
/dev/sda5: UUID="2a987cc0-e3cd-43d4-99cd-b8d8e58617e7" TYPE="swap"
/dev/sdy2: UUID="0ebd1631-52e7-4dc2-8bff-07102b877bfc" TYPE="xfs"
/dev/sdw2: UUID="29f1203b-6f44-45e3-8f6a-8ad1d392a208" TYPE="xfs"
/dev/sdt2: UUID="94f6bb55-ac61-499c-8552-600581e13dfa" TYPE="xfs"
/dev/sdr2: UUID="b629710e-915d-4c56-b6a5-4782e6d6215d" TYPE="xfs"
/dev/sdv2: UUID="69623b7f-9036-4a35-8298-dc7f5cecdb21" TYPE="xfs"
/dev/sds2: UUID="75d941c5-a85c-4c37-b409-02de34483314" TYPE="xfs"
/dev/sdx: UUID="cc84bc66-208b-4387-8470-071ec71532f2" TYPE="xfs"
/dev/sdu2: UUID="c9817831-8362-48a9-9a6c-920e0f04d029" TYPE="xfs"

But it is taking time on the drives that are not reserved for this host. Basically, I am using 2 heads in front of a JBOF and using sg_persist to reserve the drives between the 2 hosts.
Here is the strace output of blkid.

http://pastebin.com/qz2Z7Phj

You can see a lot of input/output errors when accessing the drives that are not reserved for this host.

This looks like an inefficiency in the blkid* calls (?), since tools like fdisk/lsscsi do not take this long.
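A quick way to confirm which devices the probe stalls on (a diagnostic sketch; the device list and the 10-second timeout are arbitrary choices, not from this setup) is to time a low-level blkid probe per device:

```shell
# Probe each given device with a bounded timeout and report how long
# each probe took; devices that hit the timeout are the likely culprits.
probe_times() {
    for dev in "$@"; do
        start=$(date +%s)
        timeout 10 blkid -p "$dev" >/dev/null 2>&1
        printf '%s\t%ss\n' "$dev" "$(( $(date +%s) - start ))"
    done
}

# Example (device glob is illustrative):
#   probe_times /dev/sd?
```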

Regards
Somnath


-----Original Message-----
From: Chen, Xiaoxi [mailto:xiaoxi.chen@intel.com]
Sent: Monday, September 28, 2015 1:02 AM
To: Somnath Roy; Podoski, Igor
Cc: Samuel Just; Samuel Just (sam.just@inktank.com); ceph-devel; Sage Weil; Handzik, Joe
Subject: RE: Very slow recovery/peering with latest master

FWIW, blkid works well in both GPT(created by parted) and MSDOS(created by fdisk) in my environment.

But blkid doesn't show the information of disk in external bay (which is connected by a JBOD controller) in my setup.

See below, SDB and SDH are SSDs attached to the front panel but the rest osd disks(0-9) are from an external bay.

/dev/sdc       976285652 294887592 681398060  31% /var/lib/ceph/mnt/osd-device-0-data
/dev/sdd       976285652 269840116 706445536  28% /var/lib/ceph/mnt/osd-device-1-data
/dev/sde       976285652 257610832 718674820  27% /var/lib/ceph/mnt/osd-device-2-data
/dev/sdf       976285652 293460620 682825032  31% /var/lib/ceph/mnt/osd-device-3-data
/dev/sdg       976285652 294444100 681841552  31% /var/lib/ceph/mnt/osd-device-4-data
/dev/sdi       976285652 288416840 687868812  30% /var/lib/ceph/mnt/osd-device-5-data
/dev/sdj       976285652 273090960 703194692  28% /var/lib/ceph/mnt/osd-device-6-data
/dev/sdk       976285652 302720828 673564824  32% /var/lib/ceph/mnt/osd-device-7-data
/dev/sdl       976285652 268207968 708077684  28% /var/lib/ceph/mnt/osd-device-8-data
/dev/sdm       976285652 293316752 682968900  31% /var/lib/ceph/mnt/osd-device-9-data
/dev/sdb1      292824376  10629024 282195352   4% /var/lib/ceph/mnt/osd-device-40-data
/dev/sdh1      292824376  11413956 281410420   4% /var/lib/ceph/mnt/osd-device-41-data



root@osd1:~# blkid
/dev/sdb1: UUID="907806fe-1d29-4ef7-ad11-5a933a11601e" TYPE="xfs"
/dev/sdh1: UUID="9dfe68ac-f297-4a02-8d21-50c194af4ff2" TYPE="xfs"
/dev/sda1: UUID="cdf945ce-a345-4766-b89e-cecc33689016" TYPE="ext4"
/dev/sda2: UUID="7a565029-deb9-4e68-835c-f097c2b1514e" TYPE="ext4"
/dev/sda5: UUID="e61bfc35-932d-442f-a5ca-795897f62744" TYPE="swap"



> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Friday, September 25, 2015 12:09 AM
> To: Podoski, Igor
> Cc: Samuel Just; Samuel Just (sam.just@inktank.com); ceph-devel; Sage
> Weil; Handzik, Joe
> Subject: RE: Very slow recovery/peering with latest master
>
> Yeah , Igor may be..
> Meanwhile, I am able to get gdb trace of the hang..
>
> (gdb) bt
> #0  0x00007f6f6bf043bd in read () at
> ../sysdeps/unix/syscall-template.S:81
> #1  0x00007f6f6af3b066 in ?? () from
> /lib/x86_64-linux-gnu/libblkid.so.1
> #2  0x00007f6f6af43ae2 in ?? () from
> /lib/x86_64-linux-gnu/libblkid.so.1
> #3  0x00007f6f6af42788 in ?? () from
> /lib/x86_64-linux-gnu/libblkid.so.1
> #4  0x00007f6f6af42a53 in ?? () from
> /lib/x86_64-linux-gnu/libblkid.so.1
> #5  0x00007f6f6af3c17b in blkid_do_safeprobe () from
> /lib/x86_64-linux-
> gnu/libblkid.so.1
> #6  0x00007f6f6af3e0c4 in blkid_verify () from /lib/x86_64-linux-
> gnu/libblkid.so.1
> #7  0x00007f6f6af387fb in blkid_get_dev () from /lib/x86_64-linux-
> gnu/libblkid.so.1
> #8  0x00007f6f6af38acb in ?? () from
> /lib/x86_64-linux-gnu/libblkid.so.1
> #9  0x00007f6f6af3946d in ?? () from
> /lib/x86_64-linux-gnu/libblkid.so.1
> #10 0x00007f6f6af39892 in blkid_probe_all_new () from
> /lib/x86_64-linux-
> gnu/libblkid.so.1
> #11 0x00007f6f6af3dc10 in blkid_find_dev_with_tag () from /lib/x86_64-
> linux-gnu/libblkid.so.1
> #12 0x00007f6f6d3bf923 in get_device_by_uuid (dev_uuid=...,
> label=label@entry=0x7f6f6d535fe5 "PARTUUID",
> partition=partition@entry=0x7f6f347eb5a0 "",
> device=device@entry=0x7f6f347ec5a0 "")
>     at common/blkdev.cc:193
> #13 0x00007f6f6d147de5 in FileStore::collect_metadata
> (this=0x7f6f68893000,
> pm=0x7f6f21419598) at os/FileStore.cc:660
> #14 0x00007f6f6cebfa9a in OSD::_collect_metadata
> (this=this@entry=0x7f6f6894f000, pm=pm@entry=0x7f6f21419598) at
> osd/OSD.cc:4586
> #15 0x00007f6f6cec0614 in OSD::_send_boot
> (this=this@entry=0x7f6f6894f000) at osd/OSD.cc:4568
> #16 0x00007f6f6cec203a in OSD::_maybe_boot (this=0x7f6f6894f000,
> oldest=1, newest=100) at osd/OSD.cc:4463
> #17 0x00007f6f6cefc5e1 in Context::complete (this=0x7f6f3d3864e0,
> r=<optimized out>) at ./include/Context.h:64
> #18 0x00007f6f6d2eed08 in Finisher::finisher_thread_entry
> (this=0x7ffee7272d70) at common/Finisher.cc:65
> #19 0x00007f6f6befd182 in start_thread (arg=0x7f6f347ee700) at
> pthread_create.c:312
> #20 0x00007f6f6a24347d in clone ()
> at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
>
> Strace was not helpful much since other threads are not block and keep
> printing the futex traces..
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Podoski, Igor [mailto:Igor.Podoski@ts.fujitsu.com]
> Sent: Wednesday, September 23, 2015 11:33 PM
> To: Somnath Roy
> Cc: Samuel Just; Samuel Just (sam.just@inktank.com); ceph-devel; Sage
> Weil; Handzik, Joe
> Subject: RE: Very slow recovery/peering with latest master
>
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Thursday, September 24, 2015 3:32 AM
> > To: Handzik, Joe
> > Cc: Somnath Roy; Samuel Just; Samuel Just (sam.just@inktank.com);
> > ceph- devel
> > Subject: Re: Very slow recovery/peering with latest master
> >
> > On Wed, 23 Sep 2015, Handzik, Joe wrote:
> > > Ok. When configuring with ceph-disk, it does something nifty and
> > > actually gives the OSD the uuid of the disk's partition as its fsid.
> > > I bootstrap off that to get an argument to pass into the function
> > > you have identified as the bottleneck. I ran it by sage and we
> > > both realized there would be cases where it wouldn't work...I'm
> > > sure neither of us realized the failure would take three minutes though.
> > >
> > > In the short term, it makes sense to create an option to disable
> > > or short-circuit the blkid code. I would prefer that the default
> > > be left with the code enabled, but I'm open to default disabled if
> > > others think this will be a widespread problem. You could also
> > > make sure your OSD fsids are set to match your disk partition
> > > uuids for now too, if that's a faster workaround for you (it'll get rid of the failure).
> >
> > I think we should try to figure out where it is hanging.  Can you
> > strace the blkid process to see what it is up to?
> >
> > I opened http://tracker.ceph.com/issues/13219
> >
> > I think as long as it behaves reliably with ceph-disk OSDs then we
> > can have it on by default.
> >
> > sage
> >
> >
> > >
> > > Joe
> > >
> > > > On Sep 23, 2015, at 6:26 PM, Somnath Roy
> > > > <Somnath.Roy@sandisk.com>
> > wrote:
> > > >
> > > > <<inline
> > > >
> > > > -----Original Message-----
> > > > From: Handzik, Joe [mailto:joseph.t.handzik@hpe.com]
> > > > Sent: Wednesday, September 23, 2015 4:20 PM
> > > > To: Samuel Just
> > > > Cc: Somnath Roy; Samuel Just (sam.just@inktank.com); Sage Weil
> > > > (sage@newdream.net); ceph-devel
> > > > Subject: Re: Very slow recovery/peering with latest master
> > > >
> > > > I added that, there is code up the stack in calamari that
> > > > consumes the
> > path provided, which is intended in the future to facilitate disk
> > monitoring and management.
> > > >
> > > > [Somnath] Ok
> > > >
> > > > Somnath, what does your disk configuration look like
> > > > (filesystem,
> > SSD/HDD, anything else you think could be relevant)? Did you
> > configure your disks with ceph-disk, or by hand? I never saw this
> > while testing my code, has anyone else heard of this behavior on
> > master? The code has been in master for 2-3 months now I believe.
> > > > [Somnath] All SSD , I use mkcephfs to create cluster , I
> > > > partitioned the
> > disk with fdisk beforehand. I am using XFS. Are you trying with
> > Ubuntu
> > 3.16.* kernel ? It could be Linux distribution/kernel specific.
>
> Somnath, maybe it is GPT related, what partition table do you have? I
> think parted and gdisk can create GPT partitions, but not fdisk
> (definitely not in version that I use).
>
> You could backup and clear blkid cache /etc/blkid/blkid.tab, maybe
> there is a mess.
>
> Regards,
> Igor.
>
>
> > > >
> > > > It would be nice to not need to disable this, but if this
> > > > behavior exists and
> > can't be explained by a misconfiguration or something else I'll need
> > to figure out a different implementation.
> > > >
> > > > Joe
> > > >
> > > >> On Sep 23, 2015, at 6:07 PM, Samuel Just <sjust@redhat.com> wrote:
> > > >>
> > > >> Wow.  Why would that take so long?  I think you are correct
> > > >> that it's only used for metadata, we could just add a config
> > > >> value to disable it.
> > > >> -Sam
> > > >>
> > > >>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy
> > <Somnath.Roy@sandisk.com> wrote:
> > > >>> Sam/Sage,
> > > >>> I debugged it down and found out that the get_device_by_uuid-
> > >blkid_find_dev_with_tag() call within FileStore::collect_metadata()
> > >is
> > hanging for ~3 mins before returning a EINVAL. I saw this portion is
> > newly added after hammer.
> > > >>> Commenting it out resolves the issue. BTW, I saw this value is
> > > >>> stored as
> > metadata but not used anywhere , am I missing anything ?
> > > >>> Here is my Linux details..
> > > >>>
> > > >>> root@emsnode5:~/wip-write-path-optimization/src# uname -a
> Linux
> > > >>> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8
> > > >>> 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> > > >>>
> > > >>>
> > > >>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release
> > > >>> -a
> > No
> > > >>> LSB modules are available.
> > > >>> Distributor ID: Ubuntu
> > > >>> Description:    Ubuntu 14.04.2 LTS
> > > >>> Release:        14.04
> > > >>> Codename:       trusty
> > > >>>
> > > >>> Thanks & Regards
> > > >>> Somnath
> > > >>>
> > > >>> -----Original Message-----
> > > >>> From: Somnath Roy
> > > >>> Sent: Wednesday, September 16, 2015 2:20 PM
> > > >>> To: 'Gregory Farnum'
> > > >>> Cc: 'ceph-devel'
> > > >>> Subject: RE: Very slow recovery/peering with latest master
> > > >>>
> > > >>>
> > > >>> Sage/Greg,
> > > >>>
> > > >>> Yeah, as we expected, it is not happening probably because of
> > recovery settings. I reverted it back in my ceph.conf , but, still
> > seeing this problem.
> > > >>>
> > > >>> Some observation :
> > > >>> ----------------------
> > > >>>
> > > >>> 1. First of all, I don't think it is something related to my
> > > >>> environment. I
> > recreated the cluster with Hammer and this problem is not there.
> > > >>>
> > > >>> 2. I have enabled the messenger/monclient log (Couldn't attach
> > > >>> here)
> > in one of the OSDs and found monitor is taking long time to detect
> > the up OSDs. If you see the log, I have started OSD at 2015-09-16
> > 16:13:07.042463 , but, there is no communication (only getting
> > KEEP_ALIVE) till 2015-09-16
> > 16:16:07.180482 , so, 3 mins !!
> > > >>>
> > > >>> 3. During this period, I saw monclient trying to communicate
> > > >>> with
> > monitor but not able to probably. It is sending osd_boot at
> > 2015-09-16
> > 16:16:07.180482 only..
> > > >>>
> > > >>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient:
> > > >>> _send_mon_message to mon.a at 10.60.194.10:6789/0
> > > >>> 2015-09-16 16:16:07.180482 7f65377fe700  1 --
> > > >>> 10.60.194.10:6820/20102
> > > >>> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features
> > > >>> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con
> > > >>> 0x7f6542045680
> > > >>> 2015-09-16 16:16:07.180496 7f65377fe700 20 --
> > > >>> 10.60.194.10:6820/20102
> > submit_message osd_boot(osd.10 booted 0 features 72057594037927935
> > v45) v6 remote, 10.60.194.10:6789/0, have pipe.
> > > >>>
> > > >>> 4. BTW, the osd down scenario is detected very quickly (ceph
> > > >>> -w
> > output) , problem is during coming up I guess.
> > > >>>
> > > >>>
> > > >>> So, something related to mon communication getting slower ?
> > > >>> Let me know if more verbose logging is required and how should
> > > >>> I
> > share the log..
> > > >>>
> > > >>> Thanks & Regards
> > > >>> Somnath
> > > >>>
> > > >>> -----Original Message-----
> > > >>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> > > >>> Sent: Wednesday, September 16, 2015 11:35 AM
> > > >>> To: Somnath Roy
> > > >>> Cc: ceph-devel
> > > >>> Subject: Re: Very slow recovery/peering with latest master
> > > >>>
> > > >>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy
> > <Somnath.Roy@sandisk.com> wrote:
> > > >>>> Hi,
> > > >>>> I am seeing very slow recovery when I am adding OSDs with the
> > > >>>> latest
> > master.
> > > >>>> Also, If I just restart all the OSDs (no IO is going on in
> > > >>>> the
> > > >>>> cluster) ,
> > cluster is taking a significant amount of time to reach in
> > active+clean state (and even detecting all the up OSDs).
> > > >>>>
> > > >>>> I saw the recovery/backfill default parameters are now
> > > >>>> changed (to
> > lower value) , this probably explains the recovery scenario , but,
> > will it affect the peering time during OSD startup as well ?
> > > >>>
> > > >>> I don't think these values should impact peering time, but you
> > > >>> could
> > configure them back to the old defaults and see if it changes.
> > > >>> -Greg
> > > >>>
> > > >>> ________________________________
> > > >>>
> > > >>> PLEASE NOTE: The information contained in this electronic mail
> > message is intended only for the use of the designated recipient(s)
> > named above. If the reader of this message is not the intended
> > recipient, you are hereby notified that you have received this
> > message in error and that any review, dissemination, distribution,
> > or copying of this message is strictly prohibited. If you have
> > received this communication in error, please notify the sender by
> > telephone or e-mail (as shown above) immediately and destroy any and
> > all copies of this message in your possession (whether hard copies
> > or electronically
> stored copies).
> > > >> --
> > > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > >> in the body of a message to majordomo@vger.kernel.org More
> > > >> majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > in the body of a message to majordomo@vger.kernel.org More
> > majordomo
> > > info at  http://vger.kernel.org/majordomo-info.html
> > >
> > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at http://vger.kernel.org/majordomo-info.html
>
> ________________________________
>
> PLEASE NOTE: The information contained in this electronic mail message
> is intended only for the use of the designated recipient(s) named
> above. If the reader of this message is not the intended recipient,
> you are hereby notified that you have received this message in error
> and that any review, dissemination, distribution, or copying of this
> message is strictly prohibited. If you have received this
> communication in error, please notify the sender by telephone or
> e-mail (as shown above) immediately and destroy any and all copies of
> this message in your possession (whether hard copies or electronically stored copies).
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Very slow recovery/peering with latest master
  2015-09-28  8:42                     ` Somnath Roy
@ 2015-09-28 12:21                       ` Handzik, Joe
  0 siblings, 0 replies; 17+ messages in thread
From: Handzik, Joe @ 2015-09-28 12:21 UTC (permalink / raw)
  To: Somnath Roy
  Cc: Chen, Xiaoxi, Podoski, Igor, Samuel Just,
	Samuel Just (sam.just@inktank.com),
	ceph-devel, Sage Weil

That's really good info, thanks for tracking that down. Do you expect this to be a common configuration going forward in Ceph deployments? 

Joe

> On Sep 28, 2015, at 3:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> Xiaoxi,
> Thanks for giving me some pointers.
> Now, with the help of strace I am able to figure out why it is taking so long in my setup to complete blkid* calls.
> In my case, the partitions are showing properly even if it is connected to JBOD controller.
> 
> root@emsnode10:~/wip-write-path-optimization/src/os# strace -t -o /root/strace_blkid.txt blkid
> /dev/sda1: UUID="d2060642-1af4-424f-9957-6a8dc77ff301" TYPE="ext4"
> /dev/sda5: UUID="2a987cc0-e3cd-43d4-99cd-b8d8e58617e7" TYPE="swap"
> /dev/sdy2: UUID="0ebd1631-52e7-4dc2-8bff-07102b877bfc" TYPE="xfs"
> /dev/sdw2: UUID="29f1203b-6f44-45e3-8f6a-8ad1d392a208" TYPE="xfs"
> /dev/sdt2: UUID="94f6bb55-ac61-499c-8552-600581e13dfa" TYPE="xfs"
> /dev/sdr2: UUID="b629710e-915d-4c56-b6a5-4782e6d6215d" TYPE="xfs"
> /dev/sdv2: UUID="69623b7f-9036-4a35-8298-dc7f5cecdb21" TYPE="xfs"
> /dev/sds2: UUID="75d941c5-a85c-4c37-b409-02de34483314" TYPE="xfs"
> /dev/sdx: UUID="cc84bc66-208b-4387-8470-071ec71532f2" TYPE="xfs"
> /dev/sdu2: UUID="c9817831-8362-48a9-9a6c-920e0f04d029" TYPE="xfs"
> 
> But, it is taking time on the drives those are not reserved for this host. Basically, I am using 2 heads in front of a JBOF and I am using sg_persist to reserve the drives between 2 hosts.
> Here is the strace output of blkid.
> 
> http://pastebin.com/qz2Z7Phj
> 
> You can see lot of input/output errors on accessing the drives which are not reserved for this host.
> 
> This is an inefficiency part of blkid* calls (?) since calls like fdisk/lsscsi are not taking time.
> 
> Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Chen, Xiaoxi [mailto:xiaoxi.chen@intel.com]
> Sent: Monday, September 28, 2015 1:02 AM
> To: Somnath Roy; Podoski, Igor
> Cc: Samuel Just; Samuel Just (sam.just@inktank.com); ceph-devel; Sage Weil; Handzik, Joe
> Subject: RE: Very slow recovery/peering with latest master
> 
> FWIW, blkid works well in both GPT(created by parted) and MSDOS(created by fdisk) in my environment.
> 
> But blkid doesn't show the information of disk in external bay (which is connected by a JBOD controller) in my setup.
> 
> See below, SDB and SDH are SSDs attached to the front panel but the rest osd disks(0-9) are from an external bay.
> 
> /dev/sdc       976285652 294887592 681398060  31% /var/lib/ceph/mnt/osd-device-0-data
> /dev/sdd       976285652 269840116 706445536  28% /var/lib/ceph/mnt/osd-device-1-data
> /dev/sde       976285652 257610832 718674820  27% /var/lib/ceph/mnt/osd-device-2-data
> /dev/sdf       976285652 293460620 682825032  31% /var/lib/ceph/mnt/osd-device-3-data
> /dev/sdg       976285652 294444100 681841552  31% /var/lib/ceph/mnt/osd-device-4-data
> /dev/sdi       976285652 288416840 687868812  30% /var/lib/ceph/mnt/osd-device-5-data
> /dev/sdj       976285652 273090960 703194692  28% /var/lib/ceph/mnt/osd-device-6-data
> /dev/sdk       976285652 302720828 673564824  32% /var/lib/ceph/mnt/osd-device-7-data
> /dev/sdl       976285652 268207968 708077684  28% /var/lib/ceph/mnt/osd-device-8-data
> /dev/sdm       976285652 293316752 682968900  31% /var/lib/ceph/mnt/osd-device-9-data
> /dev/sdb1      292824376  10629024 282195352   4% /var/lib/ceph/mnt/osd-device-40-data
> /dev/sdh1      292824376  11413956 281410420   4% /var/lib/ceph/mnt/osd-device-41-data
> 
> 
> 
> root@osd1:~# blkid
> /dev/sdb1: UUID="907806fe-1d29-4ef7-ad11-5a933a11601e" TYPE="xfs"
> /dev/sdh1: UUID="9dfe68ac-f297-4a02-8d21-50c194af4ff2" TYPE="xfs"
> /dev/sda1: UUID="cdf945ce-a345-4766-b89e-cecc33689016" TYPE="ext4"
> /dev/sda2: UUID="7a565029-deb9-4e68-835c-f097c2b1514e" TYPE="ext4"
> /dev/sda5: UUID="e61bfc35-932d-442f-a5ca-795897f62744" TYPE="swap"
> 
> 
> 
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Somnath Roy
>> Sent: Friday, September 25, 2015 12:09 AM
>> To: Podoski, Igor
>> Cc: Samuel Just; Samuel Just (sam.just@inktank.com); ceph-devel; Sage
>> Weil; Handzik, Joe
>> Subject: RE: Very slow recovery/peering with latest master
>> 
>> Yeah , Igor may be..
>> Meanwhile, I am able to get gdb trace of the hang..
>> 
>> (gdb) bt
>> #0  0x00007f6f6bf043bd in read () at
>> ../sysdeps/unix/syscall-template.S:81
>> #1  0x00007f6f6af3b066 in ?? () from
>> /lib/x86_64-linux-gnu/libblkid.so.1
>> #2  0x00007f6f6af43ae2 in ?? () from
>> /lib/x86_64-linux-gnu/libblkid.so.1
>> #3  0x00007f6f6af42788 in ?? () from
>> /lib/x86_64-linux-gnu/libblkid.so.1
>> #4  0x00007f6f6af42a53 in ?? () from
>> /lib/x86_64-linux-gnu/libblkid.so.1
>> #5  0x00007f6f6af3c17b in blkid_do_safeprobe () from
>> /lib/x86_64-linux-
>> gnu/libblkid.so.1
>> #6  0x00007f6f6af3e0c4 in blkid_verify () from /lib/x86_64-linux-
>> gnu/libblkid.so.1
>> #7  0x00007f6f6af387fb in blkid_get_dev () from /lib/x86_64-linux-
>> gnu/libblkid.so.1
>> #8  0x00007f6f6af38acb in ?? () from
>> /lib/x86_64-linux-gnu/libblkid.so.1
>> #9  0x00007f6f6af3946d in ?? () from
>> /lib/x86_64-linux-gnu/libblkid.so.1
>> #10 0x00007f6f6af39892 in blkid_probe_all_new () from
>> /lib/x86_64-linux-
>> gnu/libblkid.so.1
>> #11 0x00007f6f6af3dc10 in blkid_find_dev_with_tag () from /lib/x86_64-
>> linux-gnu/libblkid.so.1
>> #12 0x00007f6f6d3bf923 in get_device_by_uuid (dev_uuid=...,
>> label=label@entry=0x7f6f6d535fe5 "PARTUUID",
>> partition=partition@entry=0x7f6f347eb5a0 "",
>> device=device@entry=0x7f6f347ec5a0 "")
>>    at common/blkdev.cc:193
>> #13 0x00007f6f6d147de5 in FileStore::collect_metadata
>> (this=0x7f6f68893000,
>> pm=0x7f6f21419598) at os/FileStore.cc:660
>> #14 0x00007f6f6cebfa9a in OSD::_collect_metadata
>> (this=this@entry=0x7f6f6894f000, pm=pm@entry=0x7f6f21419598) at
>> osd/OSD.cc:4586
>> #15 0x00007f6f6cec0614 in OSD::_send_boot
>> (this=this@entry=0x7f6f6894f000) at osd/OSD.cc:4568
>> #16 0x00007f6f6cec203a in OSD::_maybe_boot (this=0x7f6f6894f000,
>> oldest=1, newest=100) at osd/OSD.cc:4463
>> #17 0x00007f6f6cefc5e1 in Context::complete (this=0x7f6f3d3864e0,
>> r=<optimized out>) at ./include/Context.h:64
>> #18 0x00007f6f6d2eed08 in Finisher::finisher_thread_entry
>> (this=0x7ffee7272d70) at common/Finisher.cc:65
>> #19 0x00007f6f6befd182 in start_thread (arg=0x7f6f347ee700) at
>> pthread_create.c:312
>> #20 0x00007f6f6a24347d in clone ()
>> at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>> 
>> 
>> Strace was not helpful much since other threads are not block and keep
>> printing the futex traces..
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -----Original Message-----
>> From: Podoski, Igor [mailto:Igor.Podoski@ts.fujitsu.com]
>> Sent: Wednesday, September 23, 2015 11:33 PM
>> To: Somnath Roy
>> Cc: Samuel Just; Samuel Just (sam.just@inktank.com); ceph-devel; Sage
>> Weil; Handzik, Joe
>> Subject: RE: Very slow recovery/peering with latest master
>> 
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>> Sent: Thursday, September 24, 2015 3:32 AM
>>> To: Handzik, Joe
>>> Cc: Somnath Roy; Samuel Just; Samuel Just (sam.just@inktank.com);
>>> ceph- devel
>>> Subject: Re: Very slow recovery/peering with latest master
>>> 
>>>> On Wed, 23 Sep 2015, Handzik, Joe wrote:
>>>> Ok. When configuring with ceph-disk, it does something nifty and
>>>> actually gives the OSD the uuid of the disk's partition as its fsid.
>>>> I bootstrap off that to get an argument to pass into the function
>>>> you have identified as the bottleneck. I ran it by sage and we
>>>> both realized there would be cases where it wouldn't work...I'm
>>>> sure neither of us realized the failure would take three minutes though.
>>>> 
>>>> In the short term, it makes sense to create an option to disable
>>>> or short-circuit the blkid code. I would prefer that the default
>>>> be left with the code enabled, but I'm open to default disabled if
>>>> others think this will be a widespread problem. You could also
>>>> make sure your OSD fsids are set to match your disk partition
>>>> uuids for now too, if that's a faster workaround for you (it'll get rid of the failure).
>>> 
>>> I think we should try to figure out where it is hanging.  Can you
>>> strace the blkid process to see what it is up to?
>>> 
>>> I opened http://tracker.ceph.com/issues/13219
>>> 
>>> I think as long as it behaves reliably with ceph-disk OSDs then we
>>> can have it on by default.
>>> 
>>> sage
>>> 
>>> 
>>>> 
>>>> Joe
>>>> 
>>>>> On Sep 23, 2015, at 6:26 PM, Somnath Roy
>>>>> <Somnath.Roy@sandisk.com>
>>> wrote:
>>>>> 
>>>>> <<inline
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Handzik, Joe [mailto:joseph.t.handzik@hpe.com]
>>>>> Sent: Wednesday, September 23, 2015 4:20 PM
>>>>> To: Samuel Just
>>>>> Cc: Somnath Roy; Samuel Just (sam.just@inktank.com); Sage Weil
>>>>> (sage@newdream.net); ceph-devel
>>>>> Subject: Re: Very slow recovery/peering with latest master
>>>>> 
>>>>> I added that, there is code up the stack in calamari that
>>>>> consumes the
>>> path provided, which is intended in the future to facilitate disk
>>> monitoring and management.
>>>>> 
>>>>> [Somnath] Ok
>>>>> 
>>>>> Somnath, what does your disk configuration look like
>>>>> (filesystem,
>>> SSD/HDD, anything else you think could be relevant)? Did you
>>> configure your disks with ceph-disk, or by hand? I never saw this
>>> while testing my code, has anyone else heard of this behavior on
>>> master? The code has been in master for 2-3 months now I believe.
>>>>> [Somnath] All SSD , I use mkcephfs to create cluster , I
>>>>> partitioned the
>>> disk with fdisk beforehand. I am using XFS. Are you trying with
>>> Ubuntu
>>> 3.16.* kernel ? It could be Linux distribution/kernel specific.
>> 
>> Somnath, maybe it is GPT related, what partition table do you have? I
>> think parted and gdisk can create GPT partitions, but not fdisk
>> (definitely not in version that I use).
>> 
>> You could backup and clear blkid cache /etc/blkid/blkid.tab, maybe
>> there is a mess.
>> 
>> Regards,
>> Igor.
>> 
>> 
>>>>> 
>>>>> It would be nice to not need to disable this, but if this
>>>>> behavior exists and
>>> can't be explained by a misconfiguration or something else I'll need
>>> to figure out a different implementation.
>>>>> 
>>>>> Joe
>>>>> 
>>>>>> On Sep 23, 2015, at 6:07 PM, Samuel Just <sjust@redhat.com> wrote:
>>>>>> 
>>>>>> Wow.  Why would that take so long?  I think you are correct
>>>>>> that it's only used for metadata, we could just add a config
>>>>>> value to disable it.
>>>>>> -Sam
>>>>>> 
>>>>>>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy
>>> <Somnath.Roy@sandisk.com> wrote:
>>>>>>> Sam/Sage,
>>>>>>> I debugged it down and found out that the get_device_by_uuid-
>>>> blkid_find_dev_with_tag() call within FileStore::collect_metadata()
>>>> is
>>> hanging for ~3 mins before returning a EINVAL. I saw this portion is
>>> newly added after hammer.
>>>>>>> Commenting it out resolves the issue. BTW, I saw this value is
>>>>>>> stored as
>>> metadata but not used anywhere , am I missing anything ?
>>>>>>> Here is my Linux details..
>>>>>>> 
>>>>>>> root@emsnode5:~/wip-write-path-optimization/src# uname -a
>> Linux
>>>>>>> emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8
>>>>>>> 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>> 
>>>>>>> 
>>>>>>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release
>>>>>>> -a
>>> No
>>>>>>> LSB modules are available.
>>>>>>> Distributor ID: Ubuntu
>>>>>>> Description:    Ubuntu 14.04.2 LTS
>>>>>>> Release:        14.04
>>>>>>> Codename:       trusty
>>>>>>> 
>>>>>>> Thanks & Regards
>>>>>>> Somnath
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Somnath Roy
>>>>>>> Sent: Wednesday, September 16, 2015 2:20 PM
>>>>>>> To: 'Gregory Farnum'
>>>>>>> Cc: 'ceph-devel'
>>>>>>> Subject: RE: Very slow recovery/peering with latest master
>>>>>>> 
>>>>>>> 
>>>>>>> Sage/Greg,
>>>>>>> 
>>>>>>> Yeah, as we expected, it is not happening probably because of
>>> recovery settings. I reverted it back in my ceph.conf , but, still
>>> seeing this problem.
>>>>>>> 
>>>>>>> Some observation :
>>>>>>> ----------------------
>>>>>>> 
>>>>>>> 1. First of all, I don't think it is something related to my
>>>>>>> environment. I
>>> recreated the cluster with Hammer and this problem is not there.
>>>>>>> 
>>>>>>> 2. I have enabled the messenger/monclient log (Couldn't attach
>>>>>>> here)
>>> in one of the OSDs and found monitor is taking long time to detect
>>> the up OSDs. If you see the log, I have started OSD at 2015-09-16
>>> 16:13:07.042463 , but, there is no communication (only getting
>>> KEEP_ALIVE) till 2015-09-16
>>> 16:16:07.180482 , so, 3 mins !!
>>>>>>> 
>>>>>>> 3. During this period, I saw monclient trying to communicate
>>>>>>> with
>>> monitor but not able to probably. It is sending osd_boot at
>>> 2015-09-16
>>> 16:16:07.180482 only..
>>>>>>> 
>>>>>>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient:
>>>>>>> _send_mon_message to mon.a at 10.60.194.10:6789/0
>>>>>>> 2015-09-16 16:16:07.180482 7f65377fe700  1 --
>>>>>>> 10.60.194.10:6820/20102
>>>>>>> --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features
>>>>>>> 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con
>>>>>>> 0x7f6542045680
>>>>>>> 2015-09-16 16:16:07.180496 7f65377fe700 20 --
>>>>>>> 10.60.194.10:6820/20102
>>> submit_message osd_boot(osd.10 booted 0 features 72057594037927935
>>> v45) v6 remote, 10.60.194.10:6789/0, have pipe.
>>>>>>> 
>>>>>>> 4. BTW, the osd down scenario is detected very quickly (ceph
>>>>>>> -w
>>> output) , problem is during coming up I guess.
>>>>>>> 
>>>>>>> 
>>>>>>> So, something related to mon communication getting slower ?
>>>>>>> Let me know if more verbose logging is required and how should
>>>>>>> I
>>> share the log..
>>>>>>> 
>>>>>>> Thanks & Regards
>>>>>>> Somnath
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
>>>>>>> Sent: Wednesday, September 16, 2015 11:35 AM
>>>>>>> To: Somnath Roy
>>>>>>> Cc: ceph-devel
>>>>>>> Subject: Re: Very slow recovery/peering with latest master
>>>>>>> 
>>>>>>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy
>>> <Somnath.Roy@sandisk.com> wrote:
>>>>>>>> Hi,
>>>>>>>> I am seeing very slow recovery when I am adding OSDs with the
>>>>>>>> latest
>>> master.
>>>>>>>> Also, If I just restart all the OSDs (no IO is going on in
>>>>>>>> the
>>>>>>>> cluster) ,
>>> cluster is taking a significant amount of time to reach in
>>> active+clean state (and even detecting all the up OSDs).
>>>>>>>> 
>>>>>>>> I saw the recovery/backfill default parameters are now
>>>>>>>> changed (to
>>> lower value) , this probably explains the recovery scenario , but,
>>> will it affect the peering time during OSD startup as well ?
>>>>>>> 
>>>>>>> I don't think these values should impact peering time, but you
>>>>>>> could
>>> configure them back to the old defaults and see if it changes.
>>>>>>> -Greg
>>>>>>> 
>>>>>>> ________________________________
>>>>>>> 
>>>>>>> PLEASE NOTE: The information contained in this electronic mail
>>> message is intended only for the use of the designated recipient(s)
>>> named above. If the reader of this message is not the intended
>>> recipient, you are hereby notified that you have received this
>>> message in error and that any review, dissemination, distribution,
>>> or copying of this message is strictly prohibited. If you have
>>> received this communication in error, please notify the sender by
>>> telephone or e-mail (as shown above) immediately and destroy any and
>>> all copies of this message in your possession (whether hard copies
>>> or electronically
>> stored copies).
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
> 
> 


end of thread, other threads:[~2015-09-28 12:21 UTC | newest]

Thread overview: 17+ messages
-- links below jump to the message on this page --
2015-09-16  3:04 Very slow recovery/peering with latest master Somnath Roy
2015-09-16 18:35 ` Gregory Farnum
2015-09-16 21:19   ` Somnath Roy
2015-09-16 22:59     ` Joao Eduardo Luis
2015-09-23 22:48   ` Somnath Roy
2015-09-23 23:06     ` Samuel Just
2015-09-23 23:18       ` Somnath Roy
2015-09-23 23:19       ` Handzik, Joe
2015-09-23 23:26         ` Somnath Roy
2015-09-23 23:42           ` Handzik, Joe
2015-09-24  1:31             ` Sage Weil
2015-09-24  6:32               ` Podoski, Igor
2015-09-24 16:09                 ` Somnath Roy
2015-09-28  8:01                   ` Chen, Xiaoxi
2015-09-28  8:42                     ` Somnath Roy
2015-09-28 12:21                       ` Handzik, Joe
2015-09-24 18:27             ` Gregory Farnum
