* Re: Mimic offline problem
From: by morphin @ 2018-10-02 14:13 UTC
  To: goktug.yildirim@gmail.com
  Cc: ceph-devel, ceph-users

One of the Ceph experts indicated that BlueStore is still somewhat of a
preview technology (at least as far as Red Hat is concerned).
So it might be best to check BlueStore and RocksDB. There are some
tools to check their health and also to repair them (a rough example
follows below), but the documentation is limited.
Does anyone have experience with them?
Any lead or help towards a proper check would be great.
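
For anyone who wants to try, a minimal sketch of such a check with
ceph-bluestore-tool (assuming osd.0 keeps its data in the default
/var/lib/ceph/osd/ceph-0; the OSD has to be stopped first):

  systemctl stop ceph-osd@0                                    # the tool needs exclusive access
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0     # read-only consistency check
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0   # try to fix what fsck reported
  systemctl start ceph-osd@0

fsck also takes a --deep option that reads object data as well, at the
cost of a much longer run.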
Goktug Yildirim <goktug.yildirim@gmail.com> wrote on Mon, 1 Oct 2018 at 22:55:
>
> Hi all,
>
> We have recently upgraded from Luminous to Mimic. It’s been 6 days since this cluster went offline. Long story short, the details are here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>
> I’ve also CC’ed the developers since I believe this is a bug. If this is not the correct way to do it, I apologise; please let me know.
>
> Over these 6 days a lot of things happened and we drew some conclusions about the problem. Some of them were misjudged and some were not examined deeply enough.
> However, the most certain diagnosis is this: each OSD generates very high disk I/O on its BlueStore data disk (WAL and DB are fine). After that the OSDs become unresponsive or respond very slowly. For example, "ceph tell osd.x version" gets stuck seemingly forever.
>
> So, because of the unresponsive OSDs, the cluster does not settle. This is our problem!
>
> This is the part we are very sure of; we are just not sure of the reason.
>
> Here is the latest ceph status:
> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>
> This is the status after we started all of the OSDs 24 hours ago.
> Some of the OSDs did not start; however, it made no difference when all of them were online.
>
> Here is the debug=20 log of one OSD, which looks the same as all the others:
> https://paste.ubuntu.com/p/8n2kTvwnG6/
> As far as we can tell there is a loop pattern, though I am sure it is hard to catch by eye.
>
> This is the full log of the same OSD:
> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>
> Here is the strace of the same OSD process:
> https://paste.ubuntu.com/p/8n2kTvwnG6/
>
> Lately we hear more and more recommendations to upgrade to Mimic. I hope no one else gets hurt the way we did. I am sure we made plenty of mistakes to let this happen, and this situation may serve as an example for other users and point to a potential bug for the Ceph developers.
>
> Any help to figure out what is going on would be great.
>
> Best Regards,
> Goktug Yildirim

* Re: Mimic offline problem
From: Darius Kasparavičius @ 2018-10-02 16:16 UTC
  To: morphinwithyou@gmail.com
  Cc: ceph-devel, ceph-users

Hello,

Currently you have 15 objects missing. I would recommend finding them
and making backups of them. Ditch all the other OSDs that are failing
to start and concentrate on bringing online the ones that hold the
missing objects. Then slowly turn off nodown and noout on the cluster
and see if it stabilises. If it does, leave those settings off; if
not, turn them back on. (A rough command sequence is sketched below.)
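
Something along these lines -- the PG id is only an example taken from
this thread, and on recent releases the recovery flag is spelled
norecover rather than norecovery:

  ceph health detail | grep unfound     # which PGs report unfound objects?
  ceph osd unset nodown
  ceph osd unset noout
  ceph pg dump_stuck                    # PGs stuck inactive/unclean/stale
  ceph pg 67.54f query                  # why is this particular PG blocked?
  ceph osd unset norebalance
  ceph osd unset norecover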
Next, take some of the PGs that are blocked and query them to check
why they are blocked. Try clearing as many of the blockers as possible,
then remove the norebalance/norecovery flags and see if the cluster
starts to fix itself.

On Tue, Oct 2, 2018 at 5:14 PM by morphin <morphinwithyou@gmail.com> wrote:
>
> One of ceph experts indicated that bluestore is somewhat preview tech
> (as for Redhat).
> So it could be best to checkout bluestore and rocksdb. There are some
> tools to check health and also repair. But there are limited
> documentation.
> Anyone who has experince with it?
> Anyone lead/help to a proper check would be great.
> Goktug Yildirim <goktug.yildirim@gmail.com>, 1 Eki 2018 Pzt, 22:55
> tarihinde şunu yazdı:
> >
> > Hi all,
> >
> > We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >
> > I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
> >
> > For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> > However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
> >
> > So due to unresponsive OSDs cluster does not settle. This is our problem!
> >
> > This is the one we are very sure of. But we are not sure of the reason.
> >
> > Here is the latest ceph status:
> > https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >
> > This is the status after we started all of the OSDs 24 hours ago.
> > Some of the OSDs are not started. However it didnt make any difference when all of them was online.
> >
> > Here is the debug=20 log of an OSD which is same for all others:
> > https://paste.ubuntu.com/p/8n2kTvwnG6/
> > As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >
> > This the full log the same OSD.
> > https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >
> > Here is the strace of the same OSD process:
> > https://paste.ubuntu.com/p/8n2kTvwnG6/
> >
> > Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
> >
> > Any help to figure out what is going on would be great.
> >
> > Best Regards,
> > Goktug Yildirim
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

* Re: Mimic offline problem
From: Goktug Yildirim @ 2018-10-02 18:02 UTC
  To: Darius Kasparavičius; +Cc: ceph-devel, ceph-users

Hello Darius,

Thanks for the reply!

The main problem is that we cannot query the PGs: “ceph pg 67.54f query” gets stuck and waits forever because the OSD is unresponsive.
We are certain that an OSD becomes unresponsive as soon as it comes up, and we are certain that it responds again once its disk utilization stops.

So we ran a small test like this (the sketch below shows how we drive and observe one round of it):
* Stop all OSDs (168 of them).
* Start OSD1. 95% OSD disk utilization starts immediately and takes 8 minutes to finish. Only after that does “ceph pg 67.54f query” work!
* While OSD1 is “up", start OSD2. As soon as OSD2 starts, both OSD1 and OSD2 go to 95% disk utilization. This takes 17 minutes to finish.
* Now start OSD3 and it is the same: all started OSDs go back to high I/O, and it takes 25 minutes to settle.
* If you happen to start 5 of them at the same time, all of the started OSDs begin the high I/O again, and it takes 1 hour to finish.
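
One round of that test looks roughly like this (OSD id and device name are just examples from one of our nodes):

  systemctl start ceph-osd@1
  iostat -x 5 /dev/sdb       # %util sits around 95 until the OSD calms down
  ceph pg 67.54f query       # only returns once the utilization drops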

In the light of these findings we set the noup flag and started all OSDs. At first there was no I/O. After 10 minutes we unset noup, and all 168 OSDs started doing heavy I/O. We thought that if we waited long enough it would finish and the OSDs would become responsive again. After 24 hours they still had not, because the I/O never finished or even slowed down.
One could think there is a lot of data to scan, but it is just 33TB.

So, in short, we do not know which PG is stuck, so we cannot remove it.

However, we ran into something weird half an hour ago. We exported the same PG from two different OSDs: one export was 4.2GB and the other was 500KB! So we decided to export the PGs from all OSDs as a backup. Then we will delete the strangely sized ones and start the cluster all over. Maybe then we can deal with the stuck or unfound PGs as you advise.
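
For reference, the export we are doing looks roughly like this (OSD id, PG id and target path are only examples; the OSD is stopped before running the tool):

  systemctl stop ceph-osd@90
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-90 --op list-pgs
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-90 \
      --op export --pgid 67.54f --file /backup/osd90-pg67.54f.export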

Any thoughts would be greatly appreciated.


> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com> wrote:
> 
> Hello,
> 
> Currently you have 15 objects missing. I would recommend finding them
> and making backups of them. Ditch all other osds that are failing to
> start and concentrate on bringing online those that have missing
> objects. Then slowly turn off nodown and noout on the cluster and see
> if it stabilises. If it stabilises leave these setting if not turn
> them back on.
> Now get some of the pg's that are blocked and querry the pgs to check
> why they are blocked. Try removing as much blocks as possible and then
> remove the norebalance/norecovery flags and see if it starts to fix
> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> <morphinwithyou@gmail.com> wrote:
>> 
>> One of ceph experts indicated that bluestore is somewhat preview tech
>> (as for Redhat).
>> So it could be best to checkout bluestore and rocksdb. There are some
>> tools to check health and also repair. But there are limited
>> documentation.
>> Anyone who has experince with it?
>> Anyone lead/help to a proper check would be great.
>> Goktug Yildirim <goktug.yildirim@gmail.com>, 1 Eki 2018 Pzt, 22:55
>> tarihinde şunu yazdı:
>>> 
>>> Hi all,
>>> 
>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>> 
>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>>> 
>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>>> 
>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>> 
>>> This is the one we are very sure of. But we are not sure of the reason.
>>> 
>>> Here is the latest ceph status:
>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>> 
>>> This is the status after we started all of the OSDs 24 hours ago.
>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>>> 
>>> Here is the debug=20 log of an OSD which is same for all others:
>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>> 
>>> This the full log the same OSD.
>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>> 
>>> Here is the strace of the same OSD process:
>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>> 
>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>>> 
>>> Any help to figure out what is going on would be great.
>>> 
>>> Best Regards,
>>> Goktug Yildirim
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


* Re: Mimic offline problem
From: Paul Emmerich @ 2018-10-02 19:51 UTC
  To: goktug.yildirim@gmail.com; +Cc: ceph-devel, Ceph Users

(I didn't follow the whole story, so you might have already answered this.)
Did you check what the OSDs are actually doing during the period of high
disk utilization?
As in (a rough command sketch follows the list):

* running perf top
* sampling a few stack traces from procfs or gdb
* or just turning up the log levels
* running "status" on the admin socket locally


Paul

Am Di., 2. Okt. 2018 um 20:02 Uhr schrieb Goktug Yildirim
<goktug.yildirim@gmail.com>:
>
> Hello Darius,
>
> Thanks for reply!
>
> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>
> So we have a small test like that:
> * Stop all OSDs (168 of them)
> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>
> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
> One can think that is a lot of data there to scan. But it is just 33TB.
>
> So at short we dont know which PG is stuck so we can remove it.
>
> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>
> Any thought would be greatly appreciated.
>
>
> > On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com> wrote:
> >
> > Hello,
> >
> > Currently you have 15 objects missing. I would recommend finding them
> > and making backups of them. Ditch all other osds that are failing to
> > start and concentrate on bringing online those that have missing
> > objects. Then slowly turn off nodown and noout on the cluster and see
> > if it stabilises. If it stabilises leave these setting if not turn
> > them back on.
> > Now get some of the pg's that are blocked and querry the pgs to check
> > why they are blocked. Try removing as much blocks as possible and then
> > remove the norebalance/norecovery flags and see if it starts to fix
> > itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> > <morphinwithyou@gmail.com> wrote:
> >>
> >> One of ceph experts indicated that bluestore is somewhat preview tech
> >> (as for Redhat).
> >> So it could be best to checkout bluestore and rocksdb. There are some
> >> tools to check health and also repair. But there are limited
> >> documentation.
> >> Anyone who has experince with it?
> >> Anyone lead/help to a proper check would be great.
> >> Goktug Yildirim <goktug.yildirim@gmail.com>, 1 Eki 2018 Pzt, 22:55
> >> tarihinde şunu yazdı:
> >>>
> >>> Hi all,
> >>>
> >>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>
> >>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
> >>>
> >>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> >>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
> >>>
> >>> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >>>
> >>> This is the one we are very sure of. But we are not sure of the reason.
> >>>
> >>> Here is the latest ceph status:
> >>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>
> >>> This is the status after we started all of the OSDs 24 hours ago.
> >>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
> >>>
> >>> Here is the debug=20 log of an OSD which is same for all others:
> >>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >>>
> >>> This the full log the same OSD.
> >>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>>
> >>> Here is the strace of the same OSD process:
> >>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>
> >>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
> >>>
> >>> Any help to figure out what is going on would be great.
> >>>
> >>> Best Regards,
> >>> Goktug Yildirim
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

* Re: Mimic offline problem
From: Darius Kasparavičius @ 2018-10-02 19:53 UTC
  To: goktug.yildirim@gmail.com
  Cc: ceph-devel, ceph-users

Hi,


I can see some issues from the OSD log file. You have extremely small
DB and WAL partitions: only 1GB for the DB and 576MB for the WAL. I
would recommend cranking up the RocksDB cache size as much as possible.
If you have the RAM, you can also increase BlueStore's cache size for
HDDs; the default is 1GB, so be as liberal as you can without getting
OOM kills (a sample of the settings involved is sketched below). You
also have lots of OSD map loading and decoding in the log. Are you sure
all monitors/managers/OSDs are up to date? Also make sure you aren't
forcing jemalloc loading; I had a funny interaction with it after
upgrading to Mimic.
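
Roughly the knobs I mean -- the 8 GiB value is only an example, so size
it to your own RAM budget and restart the OSDs afterwards:

  # /etc/ceph/ceph.conf on the OSD hosts
  [osd]
  bluestore_cache_size_hdd = 8589934592    # ~8 GiB per HDD OSD; default is 1 GiB

  # and double-check that every daemon really runs the same Mimic build
  ceph versions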
On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
<goktug.yildirim@gmail.com> wrote:
>
> Hello Darius,
>
> Thanks for reply!
>
> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>
> So we have a small test like that:
> * Stop all OSDs (168 of them)
> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>
> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
> One can think that is a lot of data there to scan. But it is just 33TB.
>
> So at short we dont know which PG is stuck so we can remove it.
>
> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>
> Any thought would be greatly appreciated.
>
>
> > On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com> wrote:
> >
> > Hello,
> >
> > Currently you have 15 objects missing. I would recommend finding them
> > and making backups of them. Ditch all other osds that are failing to
> > start and concentrate on bringing online those that have missing
> > objects. Then slowly turn off nodown and noout on the cluster and see
> > if it stabilises. If it stabilises leave these setting if not turn
> > them back on.
> > Now get some of the pg's that are blocked and querry the pgs to check
> > why they are blocked. Try removing as much blocks as possible and then
> > remove the norebalance/norecovery flags and see if it starts to fix
> > itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> > <morphinwithyou@gmail.com> wrote:
> >>
> >> One of ceph experts indicated that bluestore is somewhat preview tech
> >> (as for Redhat).
> >> So it could be best to checkout bluestore and rocksdb. There are some
> >> tools to check health and also repair. But there are limited
> >> documentation.
> >> Anyone who has experince with it?
> >> Anyone lead/help to a proper check would be great.
> >> Goktug Yildirim <goktug.yildirim@gmail.com>, 1 Eki 2018 Pzt, 22:55
> >> tarihinde şunu yazdı:
> >>>
> >>> Hi all,
> >>>
> >>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>
> >>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
> >>>
> >>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> >>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
> >>>
> >>> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >>>
> >>> This is the one we are very sure of. But we are not sure of the reason.
> >>>
> >>> Here is the latest ceph status:
> >>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>
> >>> This is the status after we started all of the OSDs 24 hours ago.
> >>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
> >>>
> >>> Here is the debug=20 log of an OSD which is same for all others:
> >>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >>>
> >>> This the full log the same OSD.
> >>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>>
> >>> Here is the strace of the same OSD process:
> >>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>
> >>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
> >>>
> >>> Any help to figure out what is going on would be great.
> >>>
> >>> Best Regards,
> >>> Goktug Yildirim
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

* Re: Mimic offline problem
From: Goktug Yildirim @ 2018-10-02 20:01 UTC
  To: Paul Emmerich; +Cc: ceph-devel, Ceph Users

Thanks for the reply! My answers are inline.

> On 2 Oct 2018, at 21:51, Paul Emmerich <paul.emmerich@croit.io> wrote:
> 
> (Didn't follow the whole story, so you might have already answered that)
> Did you check what the OSDs are doing during the period of high disk
> utilization?
> As in:
> 
> * running perf top
That did not cross my mind. Thanks for the heads-up! Will do.
> * sampling a few stack traces from procfs or gdb
I have an strace of the OSD: https://paste.ubuntu.com/p/8n2kTvwnG6/
> * or just high log settings
They have the default debug settings, and the log disk is separate. It is actually a fairly fast system: the OS disks are mirrored SSDs, the WALs+DBs are on mirrored NVMe, and the OSD disks are NL-SAS. All hardware came from Dell (R730), with 28 cores and 256GB RAM per server, plus 2x10GbE for the cluster network and 2x10GbE for the public network.
> * running "status" on the admin socket locally
I can run the daemon command and see the status. I am sure I checked it before, but I will do it again.
> 
> 
> Paul
> 
> Am Di., 2. Okt. 2018 um 20:02 Uhr schrieb Goktug Yildirim
> <goktug.yildirim@gmail.com>:
>> 
>> Hello Darius,
>> 
>> Thanks for reply!
>> 
>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>> 
>> So we have a small test like that:
>> * Stop all OSDs (168 of them)
>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>> 
>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
>> One can think that is a lot of data there to scan. But it is just 33TB.
>> 
>> So at short we dont know which PG is stuck so we can remove it.
>> 
>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>> 
>> Any thought would be greatly appreciated.
>> 
>> 
>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com> wrote:
>>> 
>>> Hello,
>>> 
>>> Currently you have 15 objects missing. I would recommend finding them
>>> and making backups of them. Ditch all other osds that are failing to
>>> start and concentrate on bringing online those that have missing
>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>> if it stabilises. If it stabilises leave these setting if not turn
>>> them back on.
>>> Now get some of the pg's that are blocked and querry the pgs to check
>>> why they are blocked. Try removing as much blocks as possible and then
>>> remove the norebalance/norecovery flags and see if it starts to fix
>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>> <morphinwithyou@gmail.com> wrote:
>>>> 
>>>> One of ceph experts indicated that bluestore is somewhat preview tech
>>>> (as for Redhat).
>>>> So it could be best to checkout bluestore and rocksdb. There are some
>>>> tools to check health and also repair. But there are limited
>>>> documentation.
>>>> Anyone who has experince with it?
>>>> Anyone lead/help to a proper check would be great.
>>>> Goktug Yildirim <goktug.yildirim@gmail.com>, 1 Eki 2018 Pzt, 22:55
>>>> tarihinde şunu yazdı:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>>>> 
>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>>>>> 
>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>>>>> 
>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>>>> 
>>>>> This is the one we are very sure of. But we are not sure of the reason.
>>>>> 
>>>>> Here is the latest ceph status:
>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>>>> 
>>>>> This is the status after we started all of the OSDs 24 hours ago.
>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>>>>> 
>>>>> Here is the debug=20 log of an OSD which is same for all others:
>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>>>> 
>>>>> This the full log the same OSD.
>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>>>> 
>>>>> Here is the strace of the same OSD process:
>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>> 
>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>>>>> 
>>>>> Any help to figure out what is going on would be great.
>>>>> 
>>>>> Best Regards,
>>>>> Goktug Yildirim
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90


* Re: Mimic offline problem
From: Goktug Yildirim @ 2018-10-02 20:11 UTC
  To: Darius Kasparavičius; +Cc: ceph-devel, ceph-users

Hi,

Indeed, I let ceph-disk decide the WAL and DB partition sizes, since I had read somewhere that it would do the proper sizing.
For the BlueStore cache size I have plenty of RAM. I will increase it to 8GB for each OSD and work out a more carefully calculated number after the cluster settles.

I had also noticed the OSD map loading, and it is in a loop. For that reason I started the cluster with the noup flag set and waited for the OSDs to reach the up-to-date epoch number; after that I unset noup. But I did not pay attention to the manager logs. Let me check them, thank you!

I am not forcing jemalloc or anything else, really. It is a very standard installation with no tweaks or tuning. All we ever asked for was stability over speed from the beginning, and here we are :/

> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis@gmail.com> wrote:
> 
> Hi,
> 
> 
> I can see some issues from the osd log file. You have an extremely low
> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> recommend cranking up rocksdb cache size as much as possible. If you
> have RAM you can also increase bluestores cache size for hdd. Default
> is 1GB be as liberal as you can without getting OOM kills. You also
> have lots of osd map loading and decoding in the log. Are you sure all
> monitors/managers/osds are up to date? Plus make sure you aren't
> forcing jemalloc loading. I had a funny interaction after upgrading to
> mimic.
> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> <goktug.yildirim@gmail.com> wrote:
>> 
>> Hello Darius,
>> 
>> Thanks for reply!
>> 
>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>> 
>> So we have a small test like that:
>> * Stop all OSDs (168 of them)
>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>> 
>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
>> One can think that is a lot of data there to scan. But it is just 33TB.
>> 
>> So at short we dont know which PG is stuck so we can remove it.
>> 
>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>> 
>> Any thought would be greatly appreciated.
>> 
>> 
>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com> wrote:
>>> 
>>> Hello,
>>> 
>>> Currently you have 15 objects missing. I would recommend finding them
>>> and making backups of them. Ditch all other osds that are failing to
>>> start and concentrate on bringing online those that have missing
>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>> if it stabilises. If it stabilises leave these setting if not turn
>>> them back on.
>>> Now get some of the pg's that are blocked and querry the pgs to check
>>> why they are blocked. Try removing as much blocks as possible and then
>>> remove the norebalance/norecovery flags and see if it starts to fix
>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>> <morphinwithyou@gmail.com> wrote:
>>>> 
>>>> One of ceph experts indicated that bluestore is somewhat preview tech
>>>> (as for Redhat).
>>>> So it could be best to checkout bluestore and rocksdb. There are some
>>>> tools to check health and also repair. But there are limited
>>>> documentation.
>>>> Anyone who has experince with it?
>>>> Anyone lead/help to a proper check would be great.
>>>> Goktug Yildirim <goktug.yildirim@gmail.com>, 1 Eki 2018 Pzt, 22:55
>>>> tarihinde şunu yazdı:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>>>> 
>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>>>>> 
>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>>>>> 
>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>>>> 
>>>>> This is the one we are very sure of. But we are not sure of the reason.
>>>>> 
>>>>> Here is the latest ceph status:
>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>>>> 
>>>>> This is the status after we started all of the OSDs 24 hours ago.
>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>>>>> 
>>>>> Here is the debug=20 log of an OSD which is same for all others:
>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>>>> 
>>>>> This the full log the same OSD.
>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>>>> 
>>>>> Here is the strace of the same OSD process:
>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>> 
>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>>>>> 
>>>>> Any help to figure out what is going on would be great.
>>>>> 
>>>>> Best Regards,
>>>>> Goktug Yildirim
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 


* Re: Mimic offline problem
From: Göktuğ Yıldırım @ 2018-10-02 21:53 UTC
  To: Darius Kasparavičius; +Cc: ceph-devel, ceph-users

Has anyone heard about osd_find_best_info_ignore_history_les = true?
Would it be useful here? There is very little information about it.

Goktug Yildirim <goktug.yildirim@gmail.com> wrote (2 Oct 2018, 22:11):

> Hi,
> 
> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing. 
> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
> 
> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
> 
> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
> 
>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis@gmail.com> wrote:
>> 
>> Hi,
>> 
>> 
>> I can see some issues from the osd log file. You have an extremely low
>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
>> recommend cranking up rocksdb cache size as much as possible. If you
>> have RAM you can also increase bluestores cache size for hdd. Default
>> is 1GB be as liberal as you can without getting OOM kills. You also
>> have lots of osd map loading and decoding in the log. Are you sure all
>> monitors/managers/osds are up to date? Plus make sure you aren't
>> forcing jemalloc loading. I had a funny interaction after upgrading to
>> mimic.
>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>> <goktug.yildirim@gmail.com> wrote:
>>> 
>>> Hello Darius,
>>> 
>>> Thanks for reply!
>>> 
>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>>> 
>>> So we have a small test like that:
>>> * Stop all OSDs (168 of them)
>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>>> 
>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
>>> One can think that is a lot of data there to scan. But it is just 33TB.
>>> 
>>> So at short we dont know which PG is stuck so we can remove it.
>>> 
>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>>> 
>>> Any thought would be greatly appreciated.
>>> 
>>> 
>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> Currently you have 15 objects missing. I would recommend finding them
>>>> and making backups of them. Ditch all other osds that are failing to
>>>> start and concentrate on bringing online those that have missing
>>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>>> if it stabilises. If it stabilises leave these setting if not turn
>>>> them back on.
>>>> Now get some of the pg's that are blocked and querry the pgs to check
>>>> why they are blocked. Try removing as much blocks as possible and then
>>>> remove the norebalance/norecovery flags and see if it starts to fix
>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>>> <morphinwithyou@gmail.com> wrote:
>>>>> 
>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
>>>>> (as for Redhat).
>>>>> So it could be best to checkout bluestore and rocksdb. There are some
>>>>> tools to check health and also repair. But there are limited
>>>>> documentation.
>>>>> Anyone who has experince with it?
>>>>> Anyone lead/help to a proper check would be great.
>>>>> Goktug Yildirim <goktug.yildirim@gmail.com>, 1 Eki 2018 Pzt, 22:55
>>>>> tarihinde şunu yazdı:
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>>>>> 
>>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>>>>>> 
>>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>>>>>> 
>>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>>>>> 
>>>>>> This is the one we are very sure of. But we are not sure of the reason.
>>>>>> 
>>>>>> Here is the latest ceph status:
>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>>>>> 
>>>>>> This is the status after we started all of the OSDs 24 hours ago.
>>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>>>>>> 
>>>>>> Here is the debug=20 log of an OSD which is same for all others:
>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>>>>> 
>>>>>> This the full log the same OSD.
>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>>>>> 
>>>>>> Here is the strace of the same OSD process:
>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>>> 
>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>>>>>> 
>>>>>> Any help to figure out what is going on would be great.
>>>>>> 
>>>>>> Best Regards,
>>>>>> Goktug Yildirim
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
> 

* Re: Mimic offline problem
From: Sage Weil @ 2018-10-03  3:14 UTC
  To: Göktuğ Yıldırım
  Cc: ceph-devel, ceph-users

osd_find_best_info_ignore_history_les is a dangerous option and you should 
only use it in very specific circumstances when directed by a developer.  
In such cases it will allow a stuck PG to peer.  But you're not getting to 
that point...you're seeing some sort of resource exhaustion.

The noup trick works when OSDs are way behind on maps and all need to 
catch up.  The way to tell if they are behind is by looking at the 'ceph 
daemon osd.NNN status' output and comparing to the latest OSDMap epoch 
that the mons have.  Were they really caught up when you unset noup?
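
A quick way to make that comparison (osd.90 is just an example; the 
daemon command has to run on the host that owns the OSD):

  ceph daemon osd.90 status     # note the "newest_map" field
  ceph osd dump | head -n 1     # "epoch NNN" as the mons see it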

I'm just catching up and haven't read the whole thread but I haven't seen 
anything that explains why the OSDs are doing lots of disk IO.  Catching up 
on maps could explain it but not why they wouldn't peer once they were all 
marked up...

sage


On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:

> Anyone heart about osd_find_best_info_ignore_history_les = true ?
> Is that be usefull here? There is such a less information about it.
> 
> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
> 
> > Hi,
> > 
> > Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing. 
> > For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
> > 
> > For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
> > 
> > I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
> > 
> >> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> 
> >> Hi,
> >> 
> >> 
> >> I can see some issues from the osd log file. You have an extremely low
> >> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> >> recommend cranking up rocksdb cache size as much as possible. If you
> >> have RAM you can also increase bluestores cache size for hdd. Default
> >> is 1GB be as liberal as you can without getting OOM kills. You also
> >> have lots of osd map loading and decoding in the log. Are you sure all
> >> monitors/managers/osds are up to date? Plus make sure you aren't
> >> forcing jemalloc loading. I had a funny interaction after upgrading to
> >> mimic.
> >> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> >> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>> 
> >>> Hello Darius,
> >>> 
> >>> Thanks for reply!
> >>> 
> >>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
> >>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
> >>> 
> >>> So we have a small test like that:
> >>> * Stop all OSDs (168 of them)
> >>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
> >>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
> >>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
> >>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
> >>> 
> >>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
> >>> One can think that is a lot of data there to scan. But it is just 33TB.
> >>> 
> >>> So at short we dont know which PG is stuck so we can remove it.
> >>> 
> >>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
> >>> 
> >>> Any thought would be greatly appreciated.
> >>> 
> >>> 
> >>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>> 
> >>>> Hello,
> >>>> 
> >>>> Currently you have 15 objects missing. I would recommend finding them
> >>>> and making backups of them. Ditch all other osds that are failing to
> >>>> start and concentrate on bringing online those that have missing
> >>>> objects. Then slowly turn off nodown and noout on the cluster and see
> >>>> if it stabilises. If it stabilises leave these setting if not turn
> >>>> them back on.
> >>>> Now get some of the pg's that are blocked and querry the pgs to check
> >>>> why they are blocked. Try removing as much blocks as possible and then
> >>>> remove the norebalance/norecovery flags and see if it starts to fix
> >>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>> 
> >>>>> One of ceph experts indicated that bluestore is somewhat preview tech
> >>>>> (as for Redhat).
> >>>>> So it could be best to checkout bluestore and rocksdb. There are some
> >>>>> tools to check health and also repair. But there are limited
> >>>>> documentation.
> >>>>> Anyone who has experince with it?
> >>>>> Anyone lead/help to a proper check would be great.
> >>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
> >>>>> tarihinde şunu yazdı:
> >>>>>> 
> >>>>>> Hi all,
> >>>>>> 
> >>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>>>> 
> >>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
> >>>>>> 
> >>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> >>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
> >>>>>> 
> >>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >>>>>> 
> >>>>>> This is the one we are very sure of. But we are not sure of the reason.
> >>>>>> 
> >>>>>> Here is the latest ceph status:
> >>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>>>> 
> >>>>>> This is the status after we started all of the OSDs 24 hours ago.
> >>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
> >>>>>> 
> >>>>>> Here is the debug=20 log of an OSD which is same for all others:
> >>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >>>>>> 
> >>>>>> This the full log the same OSD.
> >>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>>>>> 
> >>>>>> Here is the strace of the same OSD process:
> >>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>> 
> >>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
> >>>>>> 
> >>>>>> Any help to figure out what is going on would be great.
> >>>>>> 
> >>>>>> Best Regards,
> >>>>>> Goktug Yildirim
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> 
> > 
> 
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                             ` <alpine.DEB.2.11.1810030311230.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2018-10-03 12:16                               ` Goktug Yildirim
       [not found]                                 ` <683EC5EE-8112-4D9F-B4A8-DBCDE46A7AD8-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Goktug Yildirim @ 2018-10-03 12:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users


[-- Attachment #1.1: Type: text/plain, Size: 14313 bytes --]

Hi Sage,

Thank you for your response. Now I am sure this incident is going to be resolved.

The problem started when 7 servers crashed at the same time; they came back after ~5 minutes.

Two of our 3 mon services were restarted in this crash. Since the mon services are enabled, they start at nearly the same time after a reboot. I don't know if this makes any difference, but some of the guys on IRC said the mons should be started one after another rather than all at once, otherwise it could break things badly.

After 9 days we still see 3400-3500 active+clean PGs. But in the end we have so many stuck requests and our cluster cannot heal itself.

When we set the noup flag, the OSDs can catch up with the epoch easily. But when we unset the flag we see so many stuck requests and slow ops within 1 hour.
I/O load on all of my OSD disks stays at around 95% utilization and never ends. CPU and RAM usage are OK.
The OSDs get so stuck that we can't even run a "ceph pg query" against osd.0.
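(For reference, this is roughly how we toggle the flag while watching the cluster settle; plain ceph CLI, the exact sequence is just a sketch:)

  ceph osd set noup       # keep OSDs from being marked up while they replay osdmaps
  ceph -s                 # "noup" shows in the flags line; watch the PG states here
  ceph osd unset noup     # let the OSDs be marked up once they have caught up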

Also, we tried to change the RBD pool replication size from 2 to 1. Our goal was to eliminate the older PGs and leave the cluster with the good ones.
With replication size=1 we saw "13% PGs not active". But it didn't solve our problem.

Of course we have to save 100% of the data. But we feel like even saving 50% of our data would make us very happy right now.

This is what happens when the cluster starts. I believe it explains the whole story very nicely.
https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing

This is our ceph.conf:
https://paste.ubuntu.com/p/8sQhfPDXnW/

This is the output of "osd stat && osd epochs && ceph -s && ceph health":
https://paste.ubuntu.com/p/g5t8xnrjjZ/

This is the pg dump:
https://paste.ubuntu.com/p/zYqsN5T95h/

This is iostat & perf top:
https://paste.ubuntu.com/p/Pgf3mcXXX8/

This is the strace output of ceph-osd:
https://paste.ubuntu.com/p/YCdtfh5qX8/

This is the OSD log (default debug):
https://paste.ubuntu.com/p/Z2JrrBzzkM/

This is the leader MON log (default debug):
https://paste.ubuntu.com/p/RcGmsVKmzG/

These are the OSDs that failed to start. The total number is 58.
https://paste.ubuntu.com/p/ZfRD5ZtvpS/
https://paste.ubuntu.com/p/pkRdVjCH4D/
https://paste.ubuntu.com/p/zJTf2fzSj9/
https://paste.ubuntu.com/p/xpJRK6YhRX/
https://paste.ubuntu.com/p/SY3576dNbJ/
https://paste.ubuntu.com/p/smyT6Y976b/


This is an OSD video with debug osd = 20, debug ms = 1 and debug_filestore = 20.
https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing

This is the OSD logfile with debug osd = 20, debug ms = 1 and debug_filestore = 20.
https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
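(In case anyone wants the same log level, we raise it at runtime roughly like this; osd.90 is just an example id:)

  ceph daemon osd.90 config set debug_osd 20/20         # via the admin socket on the OSD host
  ceph daemon osd.90 config set debug_ms 1/1
  ceph daemon osd.90 config set debug_filestore 20/20
  # or, while the daemon still answers: ceph tell osd.90 injectargs '--debug_osd 20/20 --debug_ms 1/1'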

As far as I understand, the OSD catches up with the mon epoch and then somehow exceeds it?? (How we compare the epochs is sketched right after the log below.)

2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
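(This is how we compare the two epochs, in case we are misreading it; osd.150 is just the example from the log above:)

  ceph osd dump | head -1         # current osdmap epoch according to the mons
  ceph daemon osd.150 status      # run on the OSD host: shows oldest_map / newest_map for that OSD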


> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> 
> osd_find_best_info_ignore_history_les is a dangerous option and you should 
> only use it in very specific circumstances when directed by a developer.  
> In such cases it will allow a stuck PG to peer.  But you're not getting to 
> that point...you're seeing some sort of resource exhaustion.
> 
> The noup trick works when OSDs are way behind on maps and all need to 
> catch up.  The way to tell if they are behind is by looking at the 'ceph 
> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha 
> t the mons have.  Were they really caught up when you unset noup?
> 
> I'm just catching up and haven't read the whole thread but I haven't seen 
> anything that explains why teh OSDs are dong lots of disk IO.  Catching up 
> on maps could explain it but not why they wouldn't peer once they were all 
> marked up...
> 
> sage
> 
> 
> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> 
>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
>> Is that be usefull here? There is such a less information about it.
>> 
>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
>> 
>>> Hi,
>>> 
>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing. 
>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
>>> 
>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
>>> 
>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
>>> 
>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> 
>>>> I can see some issues from the osd log file. You have an extremely low
>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
>>>> recommend cranking up rocksdb cache size as much as possible. If you
>>>> have RAM you can also increase bluestores cache size for hdd. Default
>>>> is 1GB be as liberal as you can without getting OOM kills. You also
>>>> have lots of osd map loading and decoding in the log. Are you sure all
>>>> monitors/managers/osds are up to date? Plus make sure you aren't
>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
>>>> mimic.
>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>> 
>>>>> Hello Darius,
>>>>> 
>>>>> Thanks for reply!
>>>>> 
>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
>>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>>>>> 
>>>>> So we have a small test like that:
>>>>> * Stop all OSDs (168 of them)
>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
>>>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>>>>> 
>>>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
>>>>> One can think that is a lot of data there to scan. But it is just 33TB.
>>>>> 
>>>>> So at short we dont know which PG is stuck so we can remove it.
>>>>> 
>>>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>>>>> 
>>>>> Any thought would be greatly appreciated.
>>>>> 
>>>>> 
>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> Currently you have 15 objects missing. I would recommend finding them
>>>>>> and making backups of them. Ditch all other osds that are failing to
>>>>>> start and concentrate on bringing online those that have missing
>>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>>>>> if it stabilises. If it stabilises leave these setting if not turn
>>>>>> them back on.
>>>>>> Now get some of the pg's that are blocked and querry the pgs to check
>>>>>> why they are blocked. Try removing as much blocks as possible and then
>>>>>> remove the norebalance/norecovery flags and see if it starts to fix
>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>>> 
>>>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
>>>>>>> (as for Redhat).
>>>>>>> So it could be best to checkout bluestore and rocksdb. There are some
>>>>>>> tools to check health and also repair. But there are limited
>>>>>>> documentation.
>>>>>>> Anyone who has experince with it?
>>>>>>> Anyone lead/help to a proper check would be great.
>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
>>>>>>> tarihinde şunu yazdı:
>>>>>>>> 
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>>>>>>> 
>>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>>>>>>>> 
>>>>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>>>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>>>>>>>> 
>>>>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>>>>>>> 
>>>>>>>> This is the one we are very sure of. But we are not sure of the reason.
>>>>>>>> 
>>>>>>>> Here is the latest ceph status:
>>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>>>>>>> 
>>>>>>>> This is the status after we started all of the OSDs 24 hours ago.
>>>>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>>>>>>>> 
>>>>>>>> Here is the debug=20 log of an OSD which is same for all others:
>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>>>>>>> 
>>>>>>>> This the full log the same OSD.
>>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>>>>>>> 
>>>>>>>> Here is the strace of the same OSD process:
>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>>>>> 
>>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>>>>>>>> 
>>>>>>>> Any help to figure out what is going on would be great.
>>>>>>>> 
>>>>>>>> Best Regards,
>>>>>>>> Goktug Yildirim
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>> 
>>> 
>> 


[-- Attachment #1.2: Type: text/html, Size: 18769 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                 ` <683EC5EE-8112-4D9F-B4A8-DBCDE46A7AD8-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-10-03 12:27                                   ` Darius Kasparavičius
       [not found]                                     ` <CANrNMwVcE=Zyp6YbAk-1MB0zYYUt9_tqZSxkTBF5QVShjGqPZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Darius Kasparavičius @ 2018-10-03 12:27 UTC (permalink / raw)
  To: Goktug YILDIRIM; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users

Hello,


You can also reduce the osd map updates by adding this to your ceph
config file: "osd crush update on start = false". This removes an
update that is generated every time an osd starts.
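As a sketch, in ceph.conf on the OSD hosts (the option defaults to true):

  [osd]
      osd crush update on start = false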

2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
e14 handle_command mon_command({"prefix": "osd crush
set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
"osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
dispatch
2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
e14 handle_command mon_command({"prefix": "osd crush create-or-move",
"id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
"root=default"]} v 0) v1
2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
"osd crush create-or-move", "id": 47, "weight":3.6396, "args":
["host=SRV-SEKUARK8", "root=default"]}]: dispatch
2018-10-03 14:03:21.538 7fe15eddb700  0
mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
'osd.47' initial_weight 3.6396 at location
{host=SRV-SEKUARK8,root=default}
2018-10-03 14:03:22.250 7fe1615e0700  1
mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
enabled


On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
<goktug.yildirim@gmail.com> wrote:
>
> Hi Sage,
>
> Thank you for your response. Now I am sure this incident is going to be resolved.
>
> The problem started when 7 server crashed same time and they came back after ~5 minutes.
>
> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
>
> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
>
> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
>
> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
>
> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
>
> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>
> This is our ceph.conf:
> https://paste.ubuntu.com/p/8sQhfPDXnW/
>
> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>
> This is pg dump:
> https://paste.ubuntu.com/p/zYqsN5T95h/
>
> This is iostat & perf top:
> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>
> This strace output of ceph-osd:
> https://paste.ubuntu.com/p/YCdtfh5qX8/
>
> This is OSD log (default debug):
> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>
> This is leader MON log (default debug):
> https://paste.ubuntu.com/p/RcGmsVKmzG/
>
> These are OSDs failed to start. Total number is 58.
> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> https://paste.ubuntu.com/p/pkRdVjCH4D/
> https://paste.ubuntu.com/p/zJTf2fzSj9/
> https://paste.ubuntu.com/p/xpJRK6YhRX/
> https://paste.ubuntu.com/p/SY3576dNbJ/
> https://paste.ubuntu.com/p/smyT6Y976b/
>
>
> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>
> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>
> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
>
> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
>
>
> On 3 Oct 2018, at 05:14, Sage Weil <sage@newdream.net> wrote:
>
> osd_find_best_info_ignore_history_les is a dangerous option and you should
> only use it in very specific circumstances when directed by a developer.
> In such cases it will allow a stuck PG to peer.  But you're not getting to
> that point...you're seeing some sort of resource exhaustion.
>
> The noup trick works when OSDs are way behind on maps and all need to
> catch up.  The way to tell if they are behind is by looking at the 'ceph
> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
> t the mons have.  Were they really caught up when you unset noup?
>
> I'm just catching up and haven't read the whole thread but I haven't seen
> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
> on maps could explain it but not why they wouldn't peer once they were all
> marked up...
>
> sage
>
>
> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>
> Anyone heart about osd_find_best_info_ignore_history_les = true ?
> Is that be usefull here? There is such a less information about it.
>
> Goktug Yildirim <goktug.yildirim@gmail.com> şunları yazdı (2 Eki 2018 22:11):
>
> Hi,
>
> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
>
> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
>
> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
>
> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis@gmail.com> wrote:
>
> Hi,
>
>
> I can see some issues from the osd log file. You have an extremely low
> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> recommend cranking up rocksdb cache size as much as possible. If you
> have RAM you can also increase bluestores cache size for hdd. Default
> is 1GB be as liberal as you can without getting OOM kills. You also
> have lots of osd map loading and decoding in the log. Are you sure all
> monitors/managers/osds are up to date? Plus make sure you aren't
> forcing jemalloc loading. I had a funny interaction after upgrading to
> mimic.
> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> <goktug.yildirim@gmail.com> wrote:
>
>
> Hello Darius,
>
> Thanks for reply!
>
> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>
> So we have a small test like that:
> * Stop all OSDs (168 of them)
> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>
> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
> One can think that is a lot of data there to scan. But it is just 33TB.
>
> So at short we dont know which PG is stuck so we can remove it.
>
> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>
> Any thought would be greatly appreciated.
>
>
> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com> wrote:
>
> Hello,
>
> Currently you have 15 objects missing. I would recommend finding them
> and making backups of them. Ditch all other osds that are failing to
> start and concentrate on bringing online those that have missing
> objects. Then slowly turn off nodown and noout on the cluster and see
> if it stabilises. If it stabilises leave these setting if not turn
> them back on.
> Now get some of the pg's that are blocked and querry the pgs to check
> why they are blocked. Try removing as much blocks as possible and then
> remove the norebalance/norecovery flags and see if it starts to fix
> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> <morphinwithyou@gmail.com> wrote:
>
>
> One of ceph experts indicated that bluestore is somewhat preview tech
> (as for Redhat).
> So it could be best to checkout bluestore and rocksdb. There are some
> tools to check health and also repair. But there are limited
> documentation.
> Anyone who has experince with it?
> Anyone lead/help to a proper check would be great.
> Goktug Yildirim <goktug.yildirim@gmail.com>, 1 Eki 2018 Pzt, 22:55
> tarihinde şunu yazdı:
>
>
> Hi all,
>
> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>
> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>
> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>
> So due to unresponsive OSDs cluster does not settle. This is our problem!
>
> This is the one we are very sure of. But we are not sure of the reason.
>
> Here is the latest ceph status:
> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>
> This is the status after we started all of the OSDs 24 hours ago.
> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>
> Here is the debug=20 log of an OSD which is same for all others:
> https://paste.ubuntu.com/p/8n2kTvwnG6/
> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>
> This the full log the same OSD.
> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>
> Here is the strace of the same OSD process:
> https://paste.ubuntu.com/p/8n2kTvwnG6/
>
> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>
> Any help to figure out what is going on would be great.
>
> Best Regards,
> Goktug Yildirim
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                     ` <CANrNMwVcE=Zyp6YbAk-1MB0zYYUt9_tqZSxkTBF5QVShjGqPZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-10-03 13:43                                       ` Goktug Yildirim
       [not found]                                         ` <F58B4646-D64C-469E-BA46-4ED6B5C96F8A-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Goktug Yildirim @ 2018-10-03 13:43 UTC (permalink / raw)
  To: Darius Kasparavičius; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users


[-- Attachment #1.1: Type: text/plain, Size: 15250 bytes --]

Hello,

It seems nothing has changed.

OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/
OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/
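(If it helps, a config dump like the one above can be pulled straight from the daemon; roughly something like this on the OSD host, the osd id is only an example:)

  ceph daemon osd.90 config show        # dump the running configuration of one OSD
  ceph daemon osd.90 config diff        # only the values that differ from the defaults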


> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
> Hello,
> 
> 
> You can also reduce the osd map updates by adding this to your ceph
> config file. "osd crush update on start = false". This should remove
> and update that is generated when osd starts.
> 
> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> e14 handle_command mon_command({"prefix": "osd crush
> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> dispatch
> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> "root=default"]} v 0) v1
> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> 2018-10-03 14:03:21.538 7fe15eddb700  0
> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> 'osd.47' initial_weight 3.6396 at location
> {host=SRV-SEKUARK8,root=default}
> 2018-10-03 14:03:22.250 7fe1615e0700  1
> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> enabled
> 
> 
> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> 
>> Hi Sage,
>> 
>> Thank you for your response. Now I am sure this incident is going to be resolved.
>> 
>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
>> 
>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
>> 
>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
>> 
>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
>> 
>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
>> 
>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
>> 
>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>> 
>> This is our ceph.conf:
>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>> 
>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>> 
>> This is pg dump:
>> https://paste.ubuntu.com/p/zYqsN5T95h/
>> 
>> This is iostat & perf top:
>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>> 
>> This strace output of ceph-osd:
>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>> 
>> This is OSD log (default debug):
>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>> 
>> This is leader MON log (default debug):
>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>> 
>> These are OSDs failed to start. Total number is 58.
>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>> https://paste.ubuntu.com/p/SY3576dNbJ/
>> https://paste.ubuntu.com/p/smyT6Y976b/
>> 
>> 
>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>> 
>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>> 
>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
>> 
>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
>> 
>> 
>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>> 
>> osd_find_best_info_ignore_history_les is a dangerous option and you should
>> only use it in very specific circumstances when directed by a developer.
>> In such cases it will allow a stuck PG to peer.  But you're not getting to
>> that point...you're seeing some sort of resource exhaustion.
>> 
>> The noup trick works when OSDs are way behind on maps and all need to
>> catch up.  The way to tell if they are behind is by looking at the 'ceph
>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
>> t the mons have.  Were they really caught up when you unset noup?
>> 
>> I'm just catching up and haven't read the whole thread but I haven't seen
>> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
>> on maps could explain it but not why they wouldn't peer once they were all
>> marked up...
>> 
>> sage
>> 
>> 
>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>> 
>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
>> Is that be usefull here? There is such a less information about it.
>> 
>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
>> 
>> Hi,
>> 
>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
>> 
>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
>> 
>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
>> 
>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> 
>> Hi,
>> 
>> 
>> I can see some issues from the osd log file. You have an extremely low
>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
>> recommend cranking up rocksdb cache size as much as possible. If you
>> have RAM you can also increase bluestores cache size for hdd. Default
>> is 1GB be as liberal as you can without getting OOM kills. You also
>> have lots of osd map loading and decoding in the log. Are you sure all
>> monitors/managers/osds are up to date? Plus make sure you aren't
>> forcing jemalloc loading. I had a funny interaction after upgrading to
>> mimic.
>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> 
>> 
>> Hello Darius,
>> 
>> Thanks for reply!
>> 
>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>> 
>> So we have a small test like that:
>> * Stop all OSDs (168 of them)
>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>> 
>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
>> One can think that is a lot of data there to scan. But it is just 33TB.
>> 
>> So at short we dont know which PG is stuck so we can remove it.
>> 
>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>> 
>> Any thought would be greatly appreciated.
>> 
>> 
>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> 
>> Hello,
>> 
>> Currently you have 15 objects missing. I would recommend finding them
>> and making backups of them. Ditch all other osds that are failing to
>> start and concentrate on bringing online those that have missing
>> objects. Then slowly turn off nodown and noout on the cluster and see
>> if it stabilises. If it stabilises leave these setting if not turn
>> them back on.
>> Now get some of the pg's that are blocked and querry the pgs to check
>> why they are blocked. Try removing as much blocks as possible and then
>> remove the norebalance/norecovery flags and see if it starts to fix
>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> 
>> 
>> One of ceph experts indicated that bluestore is somewhat preview tech
>> (as for Redhat).
>> So it could be best to checkout bluestore and rocksdb. There are some
>> tools to check health and also repair. But there are limited
>> documentation.
>> Anyone who has experince with it?
>> Anyone lead/help to a proper check would be great.
>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
>> tarihinde şunu yazdı:
>> 
>> 
>> Hi all,
>> 
>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>> 
>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>> 
>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>> 
>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>> 
>> This is the one we are very sure of. But we are not sure of the reason.
>> 
>> Here is the latest ceph status:
>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>> 
>> This is the status after we started all of the OSDs 24 hours ago.
>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>> 
>> Here is the debug=20 log of an OSD which is same for all others:
>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>> 
>> This the full log the same OSD.
>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>> 
>> Here is the strace of the same OSD process:
>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>> 
>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>> 
>> Any help to figure out what is going on would be great.
>> 
>> Best Regards,
>> Goktug Yildirim
>> 
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> 
>> 
>> 


[-- Attachment #1.2: Type: text/html, Size: 18445 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                         ` <F58B4646-D64C-469E-BA46-4ED6B5C96F8A-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-10-03 14:46                                           ` Sage Weil
       [not found]                                             ` <alpine.DEB.2.11.1810031443440.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2018-10-03 14:46 UTC (permalink / raw)
  To: Goktug Yildirim
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA,
	morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w, ceph-users

[-- Attachment #1: Type: TEXT/PLAIN, Size: 16387 bytes --]

Oh... I think this is the problem:

2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916

You are in the midst of creating new pgs, and unfortunately pg create is 
one of the last remaining places where the OSDs need to look at a full 
history of map changes between then and the current map epoch.  In this 
case, the pool was created in 60196 and it is now 72883, ~12k epochs 
later.

What is this new pool for?  Is it still empty, and if so, can we delete 
it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
and restarting the OSDs will make these pg_creates go away.
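Something along these lines would confirm it; the pool name below is only a placeholder for whatever pool id 66 maps to, and pool deletion is irreversible, so only do it if the pool really is empty and unneeded:

  ceph osd pool ls detail        # find the name of the pool with id 66
  rados df                       # check whether that pool actually holds any objects
  ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
  ceph osd pool rm POOLNAME POOLNAME --yes-i-really-really-mean-it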

s

On Wed, 3 Oct 2018, Goktug Yildirim wrote:

> Hello,
> 
> It seems nothing has changed.
> 
> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
> 
> 
> > On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > 
> > Hello,
> > 
> > 
> > You can also reduce the osd map updates by adding this to your ceph
> > config file. "osd crush update on start = false". This should remove
> > and update that is generated when osd starts.
> > 
> > 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> > e14 handle_command mon_command({"prefix": "osd crush
> > set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> > 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> > from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> > "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> > dispatch
> > 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> > e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> > "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> > "root=default"]} v 0) v1
> > 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> > from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> > "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> > ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> > 2018-10-03 14:03:21.538 7fe15eddb700  0
> > mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> > 'osd.47' initial_weight 3.6396 at location
> > {host=SRV-SEKUARK8,root=default}
> > 2018-10-03 14:03:22.250 7fe1615e0700  1
> > mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> > enabled
> > 
> > 
> > On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> > <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> 
> >> Hi Sage,
> >> 
> >> Thank you for your response. Now I am sure this incident is going to be resolved.
> >> 
> >> The problem started when 7 server crashed same time and they came back after ~5 minutes.
> >> 
> >> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
> >> 
> >> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
> >> 
> >> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
> >> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
> >> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
> >> 
> >> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
> >> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
> >> 
> >> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
> >> 
> >> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
> >> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> >> 
> >> This is our ceph.conf:
> >> https://paste.ubuntu.com/p/8sQhfPDXnW/
> >> 
> >> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
> >> https://paste.ubuntu.com/p/g5t8xnrjjZ/
> >> 
> >> This is pg dump:
> >> https://paste.ubuntu.com/p/zYqsN5T95h/
> >> 
> >> This is iostat & perf top:
> >> https://paste.ubuntu.com/p/Pgf3mcXXX8/
> >> 
> >> This strace output of ceph-osd:
> >> https://paste.ubuntu.com/p/YCdtfh5qX8/
> >> 
> >> This is OSD log (default debug):
> >> https://paste.ubuntu.com/p/Z2JrrBzzkM/
> >> 
> >> This is leader MON log (default debug):
> >> https://paste.ubuntu.com/p/RcGmsVKmzG/
> >> 
> >> These are OSDs failed to start. Total number is 58.
> >> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >> https://paste.ubuntu.com/p/pkRdVjCH4D/
> >> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >> https://paste.ubuntu.com/p/xpJRK6YhRX/
> >> https://paste.ubuntu.com/p/SY3576dNbJ/
> >> https://paste.ubuntu.com/p/smyT6Y976b/
> >> 
> >> 
> >> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
> >> 
> >> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
> >> 
> >> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
> >> 
> >> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
> >> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
> >> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
> >> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
> >> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
> >> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
> >> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
> >> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
> >> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> >> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
> >> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
> >> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
> >> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
> >> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
> >> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
> >> 
> >> 
> >> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >> 
> >> osd_find_best_info_ignore_history_les is a dangerous option and you should
> >> only use it in very specific circumstances when directed by a developer.
> >> In such cases it will allow a stuck PG to peer.  But you're not getting to
> >> that point...you're seeing some sort of resource exhaustion.
> >> 
> >> The noup trick works when OSDs are way behind on maps and all need to
> >> catch up.  The way to tell if they are behind is by looking at the 'ceph
> >> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
> >> t the mons have.  Were they really caught up when you unset noup?
> >> 
> >> I'm just catching up and haven't read the whole thread but I haven't seen
> >> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
> >> on maps could explain it but not why they wouldn't peer once they were all
> >> marked up...
> >> 
> >> sage
> >> 
> >> 
> >> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> >> 
> >> Anyone heart about osd_find_best_info_ignore_history_les = true ?
> >> Is that be usefull here? There is such a less information about it.
> >> 
> >> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
> >> 
> >> Hi,
> >> 
> >> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
> >> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
> >> 
> >> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
> >> 
> >> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
> >> 
> >> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> 
> >> Hi,
> >> 
> >> 
> >> I can see some issues from the osd log file. You have an extremely low
> >> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> >> recommend cranking up rocksdb cache size as much as possible. If you
> >> have RAM you can also increase bluestores cache size for hdd. Default
> >> is 1GB be as liberal as you can without getting OOM kills. You also
> >> have lots of osd map loading and decoding in the log. Are you sure all
> >> monitors/managers/osds are up to date? Plus make sure you aren't
> >> forcing jemalloc loading. I had a funny interaction after upgrading to
> >> mimic.
> >> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> >> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> 
> >> 
> >> Hello Darius,
> >> 
> >> Thanks for reply!
> >> 
> >> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
> >> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
> >> 
> >> So we have a small test like that:
> >> * Stop all OSDs (168 of them)
> >> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
> >> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
> >> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
> >> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
> >> 
> >> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
> >> One can think that is a lot of data there to scan. But it is just 33TB.
> >> 
> >> So at short we dont know which PG is stuck so we can remove it.
> >> 
> >> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
> >> 
> >> Any thought would be greatly appreciated.
> >> 
> >> 
> >> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> 
> >> Hello,
> >> 
> >> Currently you have 15 objects missing. I would recommend finding them
> >> and making backups of them. Ditch all other osds that are failing to
> >> start and concentrate on bringing online those that have missing
> >> objects. Then slowly turn off nodown and noout on the cluster and see
> >> if it stabilises. If it stabilises leave these setting if not turn
> >> them back on.
> >> Now get some of the pg's that are blocked and querry the pgs to check
> >> why they are blocked. Try removing as much blocks as possible and then
> >> remove the norebalance/norecovery flags and see if it starts to fix
> >> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> 
> >> 
> >> One of ceph experts indicated that bluestore is somewhat preview tech
> >> (as for Redhat).
> >> So it could be best to checkout bluestore and rocksdb. There are some
> >> tools to check health and also repair. But there are limited
> >> documentation.
> >> Anyone who has experince with it?
> >> Anyone lead/help to a proper check would be great.
> >> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
> >> tarihinde şunu yazdı:
> >> 
> >> 
> >> Hi all,
> >> 
> >> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >> 
> >> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
> >> 
> >> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> >> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
> >> 
> >> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >> 
> >> This is the one we are very sure of. But we are not sure of the reason.
> >> 
> >> Here is the latest ceph status:
> >> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >> 
> >> This is the status after we started all of the OSDs 24 hours ago.
> >> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
> >> 
> >> Here is the debug=20 log of an OSD which is same for all others:
> >> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >> As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >> 
> >> This the full log the same OSD.
> >> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >> 
> >> Here is the strace of the same OSD process:
> >> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >> 
> >> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
> >> 
> >> Any help to figure out what is going on would be great.
> >> 
> >> Best Regards,
> >> Goktug Yildirim
> >> 
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> 
> >> 
> >> 
> >> 
> >> 
> 
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                             ` <alpine.DEB.2.11.1810031443440.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2018-10-03 15:22                                               ` Goktug Yildirim
       [not found]                                                 ` <8C3EC489-68A9-4622-A283-CCAB9AE68D93-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Goktug Yildirim @ 2018-10-03 15:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, by morphin, ceph-users


[-- Attachment #1.1: Type: text/plain, Size: 17845 bytes --]

Sage,

Pool 66 is the only pool it shows right now. This is a pool created months ago.
ceph osd lspools
66 mypool

As we recreated the mon DB from the OSDs, the pools for MDS were unusable, so we deleted them.
After we created another cephfs filesystem and its pools, we started the MDS and it got stuck on creation. So we stopped the MDS and removed the filesystem and its pools. Right now we have no MDS running, nor anything cephfs-related.

ceph fs dump
dumped fsmap epoch 1 e1
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: -1

No filesystems configured

ceph fs ls
No filesystems enabled

Now pool 66 seems to be the only pool we have, and it was created months ago. So I guess there is something hidden out there.

Is there any way to find and delete it?
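
Just to be safe, before deleting anything we plan to double-check from the mon/mgr side whether anything is really still being created for pool 66 and whether the pool holds any data. A rough sketch of what we have in mind (standard CLI calls; the counters may lag given the state of the cluster):

  # mon-side view of the pools that exist, their pg counts and last_change epochs
  ceph osd pool ls detail

  # per-pool usage as reported by the mgr; a pool with real data should show non-zero objects
  ceph df detail

  # count any pgs the cluster still reports in a creating state
  ceph pg dump pgs_brief 2>/dev/null | grep -c creating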


> On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> 
> Oh... I think this is the problem:
> 
> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> 
> You are in the midst of creating new pgs, and unfortunately pg create is 
> one of the last remaining places where the OSDs need to look at a full 
> history of map changes between then and the current map epoch.  In this 
> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
> later.
> 
> What is this new pool for?  Is it still empty, and if so, can we delete 
> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
> and restarting the OSDs will make these pg_creates go away.
> 
> s
> 
> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> 
>> Hello,
>> 
>> It seems nothing has changed.
>> 
>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
>> 
>> 
>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> 
>>> Hello,
>>> 
>>> 
>>> You can also reduce the osd map updates by adding this to your ceph
>>> config file. "osd crush update on start = false". This should remove
>>> and update that is generated when osd starts.
>>> 
>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>> e14 handle_command mon_command({"prefix": "osd crush
>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
>>> dispatch
>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
>>> "root=default"]} v 0) v1
>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
>>> 'osd.47' initial_weight 3.6396 at location
>>> {host=SRV-SEKUARK8,root=default}
>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
>>> enabled
>>> 
>>> 
>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> 
>>>> Hi Sage,
>>>> 
>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
>>>> 
>>>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
>>>> 
>>>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
>>>> 
>>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
>>>> 
>>>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
>>>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
>>>> 
>>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
>>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
>>>> 
>>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
>>>> 
>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>>>> 
>>>> This is our ceph.conf:
>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>>>> 
>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>>>> 
>>>> This is pg dump:
>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
>>>> 
>>>> This is iostat & perf top:
>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>>>> 
>>>> This strace output of ceph-osd:
>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>>>> 
>>>> This is OSD log (default debug):
>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>>>> 
>>>> This is leader MON log (default debug):
>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>>>> 
>>>> These are OSDs failed to start. Total number is 58.
>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
>>>> https://paste.ubuntu.com/p/smyT6Y976b/
>>>> 
>>>> 
>>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>>>> 
>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>>>> 
>>>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
>>>> 
>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
>>>> 
>>>> 
>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>>>> 
>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
>>>> only use it in very specific circumstances when directed by a developer.
>>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
>>>> that point...you're seeing some sort of resource exhaustion.
>>>> 
>>>> The noup trick works when OSDs are way behind on maps and all need to
>>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
>>>> t the mons have.  Were they really caught up when you unset noup?
>>>> 
>>>> I'm just catching up and haven't read the whole thread but I haven't seen
>>>> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
>>>> on maps could explain it but not why they wouldn't peer once they were all
>>>> marked up...
>>>> 
>>>> sage
>>>> 
>>>> 
>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>>>> 
>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
>>>> Is that be usefull here? There is such a less information about it.
>>>> 
>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
>>>> 
>>>> Hi,
>>>> 
>>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
>>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
>>>> 
>>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
>>>> 
>>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
>>>> 
>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> 
>>>> I can see some issues from the osd log file. You have an extremely low
>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
>>>> recommend cranking up rocksdb cache size as much as possible. If you
>>>> have RAM you can also increase bluestores cache size for hdd. Default
>>>> is 1GB be as liberal as you can without getting OOM kills. You also
>>>> have lots of osd map loading and decoding in the log. Are you sure all
>>>> monitors/managers/osds are up to date? Plus make sure you aren't
>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
>>>> mimic.
>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> 
>>>> 
>>>> Hello Darius,
>>>> 
>>>> Thanks for reply!
>>>> 
>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>>>> 
>>>> So we have a small test like that:
>>>> * Stop all OSDs (168 of them)
>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
>>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>>>> 
>>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
>>>> One can think that is a lot of data there to scan. But it is just 33TB.
>>>> 
>>>> So at short we dont know which PG is stuck so we can remove it.
>>>> 
>>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>>>> 
>>>> Any thought would be greatly appreciated.
>>>> 
>>>> 
>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> Currently you have 15 objects missing. I would recommend finding them
>>>> and making backups of them. Ditch all other osds that are failing to
>>>> start and concentrate on bringing online those that have missing
>>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>>> if it stabilises. If it stabilises leave these setting if not turn
>>>> them back on.
>>>> Now get some of the pg's that are blocked and querry the pgs to check
>>>> why they are blocked. Try removing as much blocks as possible and then
>>>> remove the norebalance/norecovery flags and see if it starts to fix
>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> 
>>>> 
>>>> One of ceph experts indicated that bluestore is somewhat preview tech
>>>> (as for Redhat).
>>>> So it could be best to checkout bluestore and rocksdb. There are some
>>>> tools to check health and also repair. But there are limited
>>>> documentation.
>>>> Anyone who has experince with it?
>>>> Anyone lead/help to a proper check would be great.
>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
>>>> tarihinde şunu yazdı:
>>>> 
>>>> 
>>>> Hi all,
>>>> 
>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>>> 
>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>>>> 
>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>>>> 
>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>>> 
>>>> This is the one we are very sure of. But we are not sure of the reason.
>>>> 
>>>> Here is the latest ceph status:
>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>>> 
>>>> This is the status after we started all of the OSDs 24 hours ago.
>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>>>> 
>>>> Here is the debug=20 log of an OSD which is same for all others:
>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>>> 
>>>> This the full log the same OSD.
>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>>> 
>>>> Here is the strace of the same OSD process:
>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>> 
>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>>>> 
>>>> Any help to figure out what is going on would be great.
>>>> 
>>>> Best Regards,
>>>> Goktug Yildirim
>>>> 
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 


[-- Attachment #1.2: Type: text/html, Size: 21687 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                 ` <8C3EC489-68A9-4622-A283-CCAB9AE68D93-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-10-03 15:52                                                   ` Sage Weil
       [not found]                                                     ` <alpine.DEB.2.11.1810031543160.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2018-10-03 15:52 UTC (permalink / raw)
  To: Goktug Yildirim; +Cc: ceph-devel, by morphin, ceph-users

[-- Attachment #1: Type: TEXT/PLAIN, Size: 19369 bytes --]

On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> Sage,
> 
> Pool 66 is the only pool it shows right now. This a pool created months ago.
> ceph osd lspools
> 66 mypool
> 
> As we recreated mon db from OSDs, the pools for MDS was unusable. So we deleted them.
> After we create another cephfs fs and pools we started MDS and it stucked on creation. So we stopped MDS and removed fs and fs pools. Right now we do not have MDS running nor we have cephfs related things.
> 
> ceph fs dump
> dumped fsmap epoch 1 e1
> enable_multiple, ever_enabled_multiple: 0,0
> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> legacy client fscid: -1
> 
> No filesystems configured
> 
> ceph fs ls
> No filesystems enabled
> 
> Now pool 66 seems to only pool we have and it has been created months ago. Then I guess there is something hidden out there.
> 
> Is there any way to find and delete it?

Ok, I'm concerned that the creating pgs are in there at all if this is an 
old pool... did you perhaps run force_create_pg at some point?  Assuming 
you didn't, I think this is a bug in the process for rebuilding the mon 
store... one that doesn't normally come up because the resulting osdmap 
scan is cheap in our test scenarios but clearly not cheap for your aged 
cluster.
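
A quick way to confirm a given OSD is hitting this is to compare the epoch 
stamped on the stuck pg_create ops with the current osdmap epoch; a sketch, 
assuming default log locations and that osd.150 is one of the affected OSDs:

  # the telltale sign: osd_pg_create ops carrying an epoch far behind the current one
  grep 'slow op osd_pg_create' /var/log/ceph/ceph-osd.150.log | tail -5

  # the first line of 'ceph osd dump' shows the current osdmap epoch for comparison
  ceph osd dump | head -1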

In any case, there is a way to clear those out of the mon, but it's a bit 
dicey. 

1. stop all mons
2. make a backup of all mons
3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
key=creating key on one of the mons
4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
5. hex edit the structure to remove all of the creating pgs, and add pool 
66 to the created_pgs member.
6. verify with ceph-dencoder dump that the edit was correct...
7. inject the updated structure into all of the mons
8. start all mons

4-6 will probably be an iterative process... let's start by getting the 
structure out and dumping the current value?  
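
For steps 1-4, something along these lines should work; the mon name and 
store path below are placeholders (taken from your logs), and this assumes 
your ceph-dencoder build knows the creating_pgs_t type, so treat it as a 
sketch rather than a recipe:

  # stop the mon and back up its store before touching anything
  systemctl stop ceph-mon@SRV-SBKUARK14
  cp -a /var/lib/ceph/mon/ceph-SRV-SBKUARK14 /root/mon-SRV-SBKUARK14.bak

  # step 3: pull the prefix=osd_pg_creating, key=creating record out of the mon store
  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db \
      get osd_pg_creating creating out /tmp/creating.bin

  # step 4: decode the blob so the creating pgs and created_pools can be inspected
  ceph-dencoder type creating_pgs_t import /tmp/creating.bin decode dump_json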

The code to refer to in order to understand the structure is the 
encode/decode methods in src/mon/CreatingPGs.h.
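
Once the edited blob round-trips cleanly through ceph-dencoder, injecting 
it back (step 7) on each mon would look roughly like this, again with the 
same placeholder paths:

  # write the edited structure back under the same prefix/key
  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db \
      set osd_pg_creating creating in /tmp/creating.edited

  # sanity-check what the store now holds before starting the mon again
  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db \
      get osd_pg_creating creating out /tmp/creating.check
  ceph-dencoder type creating_pgs_t import /tmp/creating.check decode dump_json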

sage


> 
> 
> > On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> > 
> > Oh... I think this is the problem:
> > 
> > 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
> > 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
> > 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> > 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> > 
> > You are in the midst of creating new pgs, and unfortunately pg create is 
> > one of the last remaining places where the OSDs need to look at a full 
> > history of map changes between then and the current map epoch.  In this 
> > case, the pool was created in 60196 and it is now 72883, ~12k epochs 
> > later.
> > 
> > What is this new pool for?  Is it still empty, and if so, can we delete 
> > it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
> > and restarting the OSDs will make these pg_creates go away.
> > 
> > s
> > 
> > On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> > 
> >> Hello,
> >> 
> >> It seems nothing has changed.
> >> 
> >> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
> >> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
> >> 
> >> 
> >>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> 
> >>> You can also reduce the osd map updates by adding this to your ceph
> >>> config file. "osd crush update on start = false". This should remove
> >>> and update that is generated when osd starts.
> >>> 
> >>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>> e14 handle_command mon_command({"prefix": "osd crush
> >>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> >>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> >>> dispatch
> >>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> >>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> >>> "root=default"]} v 0) v1
> >>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> >>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> >>> 2018-10-03 14:03:21.538 7fe15eddb700  0
> >>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> >>> 'osd.47' initial_weight 3.6396 at location
> >>> {host=SRV-SEKUARK8,root=default}
> >>> 2018-10-03 14:03:22.250 7fe1615e0700  1
> >>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> >>> enabled
> >>> 
> >>> 
> >>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> >>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>> 
> >>>> Hi Sage,
> >>>> 
> >>>> Thank you for your response. Now I am sure this incident is going to be resolved.
> >>>> 
> >>>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
> >>>> 
> >>>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
> >>>> 
> >>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
> >>>> 
> >>>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
> >>>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
> >>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
> >>>> 
> >>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
> >>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
> >>>> 
> >>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
> >>>> 
> >>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
> >>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> >>>> 
> >>>> This is our ceph.conf:
> >>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
> >>>> 
> >>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
> >>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
> >>>> 
> >>>> This is pg dump:
> >>>> https://paste.ubuntu.com/p/zYqsN5T95h/
> >>>> 
> >>>> This is iostat & perf top:
> >>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
> >>>> 
> >>>> This strace output of ceph-osd:
> >>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
> >>>> 
> >>>> This is OSD log (default debug):
> >>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
> >>>> 
> >>>> This is leader MON log (default debug):
> >>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
> >>>> 
> >>>> These are OSDs failed to start. Total number is 58.
> >>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> >>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> >>>> https://paste.ubuntu.com/p/SY3576dNbJ/
> >>>> https://paste.ubuntu.com/p/smyT6Y976b/
> >>>> 
> >>>> 
> >>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
> >>>> 
> >>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
> >>>> 
> >>>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
> >>>> 
> >>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
> >>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
> >>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
> >>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
> >>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
> >>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
> >>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
> >>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
> >>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> >>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
> >>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
> >>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
> >>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
> >>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
> >>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
> >>>> 
> >>>> 
> >>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>>> 
> >>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
> >>>> only use it in very specific circumstances when directed by a developer.
> >>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
> >>>> that point...you're seeing some sort of resource exhaustion.
> >>>> 
> >>>> The noup trick works when OSDs are way behind on maps and all need to
> >>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
> >>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
> >>>> t the mons have.  Were they really caught up when you unset noup?
> >>>> 
> >>>> I'm just catching up and haven't read the whole thread but I haven't seen
> >>>> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
> >>>> on maps could explain it but not why they wouldn't peer once they were all
> >>>> marked up...
> >>>> 
> >>>> sage
> >>>> 
> >>>> 
> >>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> >>>> 
> >>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
> >>>> Is that be usefull here? There is such a less information about it.
> >>>> 
> >>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
> >>>> 
> >>>> Hi,
> >>>> 
> >>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
> >>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
> >>>> 
> >>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
> >>>> 
> >>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
> >>>> 
> >>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>> 
> >>>> Hi,
> >>>> 
> >>>> 
> >>>> I can see some issues from the osd log file. You have an extremely low
> >>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> >>>> recommend cranking up rocksdb cache size as much as possible. If you
> >>>> have RAM you can also increase bluestores cache size for hdd. Default
> >>>> is 1GB be as liberal as you can without getting OOM kills. You also
> >>>> have lots of osd map loading and decoding in the log. Are you sure all
> >>>> monitors/managers/osds are up to date? Plus make sure you aren't
> >>>> forcing jemalloc loading. I had a funny interaction after upgrading to
> >>>> mimic.
> >>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> >>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>> 
> >>>> 
> >>>> Hello Darius,
> >>>> 
> >>>> Thanks for reply!
> >>>> 
> >>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
> >>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
> >>>> 
> >>>> So we have a small test like that:
> >>>> * Stop all OSDs (168 of them)
> >>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
> >>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
> >>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
> >>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
> >>>> 
> >>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
> >>>> One can think that is a lot of data there to scan. But it is just 33TB.
> >>>> 
> >>>> So at short we dont know which PG is stuck so we can remove it.
> >>>> 
> >>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
> >>>> 
> >>>> Any thought would be greatly appreciated.
> >>>> 
> >>>> 
> >>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>> 
> >>>> Hello,
> >>>> 
> >>>> Currently you have 15 objects missing. I would recommend finding them
> >>>> and making backups of them. Ditch all other osds that are failing to
> >>>> start and concentrate on bringing online those that have missing
> >>>> objects. Then slowly turn off nodown and noout on the cluster and see
> >>>> if it stabilises. If it stabilises leave these setting if not turn
> >>>> them back on.
> >>>> Now get some of the pg's that are blocked and querry the pgs to check
> >>>> why they are blocked. Try removing as much blocks as possible and then
> >>>> remove the norebalance/norecovery flags and see if it starts to fix
> >>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>> 
> >>>> 
> >>>> One of ceph experts indicated that bluestore is somewhat preview tech
> >>>> (as for Redhat).
> >>>> So it could be best to checkout bluestore and rocksdb. There are some
> >>>> tools to check health and also repair. But there are limited
> >>>> documentation.
> >>>> Anyone who has experince with it?
> >>>> Anyone lead/help to a proper check would be great.
> >>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
> >>>> tarihinde şunu yazdı:
> >>>> 
> >>>> 
> >>>> Hi all,
> >>>> 
> >>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>> 
> >>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
> >>>> 
> >>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> >>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
> >>>> 
> >>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >>>> 
> >>>> This is the one we are very sure of. But we are not sure of the reason.
> >>>> 
> >>>> Here is the latest ceph status:
> >>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>> 
> >>>> This is the status after we started all of the OSDs 24 hours ago.
> >>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
> >>>> 
> >>>> Here is the debug=20 log of an OSD which is same for all others:
> >>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >>>> 
> >>>> This the full log the same OSD.
> >>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>>> 
> >>>> Here is the strace of the same OSD process:
> >>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>> 
> >>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
> >>>> 
> >>>> Any help to figure out what is going on would be great.
> >>>> 
> >>>> Best Regards,
> >>>> Goktug Yildirim
> >>>> 
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >> 
> 
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                     ` <alpine.DEB.2.11.1810031543160.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2018-10-03 16:09                                                       ` Goktug Yildirim
       [not found]                                                         ` <C10AD823-76D4-4E2B-BE2E-BF434373C81B-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Goktug Yildirim @ 2018-10-03 16:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, by morphin, ceph-users


[-- Attachment #1.1: Type: text/plain, Size: 20024 bytes --]

We are starting to work on it. The first step is getting the structure out and dumping the current value, as you say.

And you were correct, we did not run force_create_pg.

> On 3 Oct 2018, at 17:52, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> 
> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> Sage,
>> 
>> Pool 66 is the only pool it shows right now. This a pool created months ago.
>> ceph osd lspools
>> 66 mypool
>> 
>> As we recreated mon db from OSDs, the pools for MDS was unusable. So we deleted them.
>> After we create another cephfs fs and pools we started MDS and it stucked on creation. So we stopped MDS and removed fs and fs pools. Right now we do not have MDS running nor we have cephfs related things.
>> 
>> ceph fs dump
>> dumped fsmap epoch 1 e1
>> enable_multiple, ever_enabled_multiple: 0,0
>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> legacy client fscid: -1
>> 
>> No filesystems configured
>> 
>> ceph fs ls
>> No filesystems enabled
>> 
>> Now pool 66 seems to only pool we have and it has been created months ago. Then I guess there is something hidden out there.
>> 
>> Is there any way to find and delete it?
> 
> Ok, I'm concerned that the creating pg is in there if this is an old 
> pool... did you perhaps run force_create_pg at some point?  Assuming you 
> didn't, I think this is a bug in the process for rebuilding the mon 
> store.. one that doesn't normally come up because the impact is this 
> osdmap scan that is cheap in our test scenarios but clearly not cheap for 
> your aged cluster.
> 
> In any case, there is a way to clear those out of the mon, but it's a bit 
> dicey. 
> 
> 1. stop all mons
> 2. make a backup of all mons
> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
> key=creating key on one of the mons
> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
> 5. hex edit the structure to remove all of the creating pgs, and adds pool 
> 66 to the created_pgs member.
> 6. verify with ceph-dencoder dump that the edit was correct...
> 7. inject the updated structure into all of the mons
> 8. start all mons
> 
> 4-6 will probably be an iterative process... let's start by getting the 
> structure out and dumping the current value?  
> 
> The code to refer to to understand the structure is src/mon/CreatingPGs.h 
> encode/decode methods.
> 
> sage
> 
> 
>> 
>> 
>>> On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>>> 
>>> Oh... I think this is the problem:
>>> 
>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
>>> 
>>> You are in the midst of creating new pgs, and unfortunately pg create is 
>>> one of the last remaining places where the OSDs need to look at a full 
>>> history of map changes between then and the current map epoch.  In this 
>>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
>>> later.
>>> 
>>> What is this new pool for?  Is it still empty, and if so, can we delete 
>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
>>> and restarting the OSDs will make these pg_creates go away.
>>> 
>>> s
>>> 
>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>> 
>>>> Hello,
>>>> 
>>>> It seems nothing has changed.
>>>> 
>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
>>>> 
>>>> 
>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> 
>>>>> You can also reduce the osd map updates by adding this to your ceph
>>>>> config file. "osd crush update on start = false". This should remove
>>>>> and update that is generated when osd starts.
>>>>> 
>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>>>> e14 handle_command mon_command({"prefix": "osd crush
>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
>>>>> dispatch
>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
>>>>> "root=default"]} v 0) v1
>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
>>>>> 'osd.47' initial_weight 3.6396 at location
>>>>> {host=SRV-SEKUARK8,root=default}
>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
>>>>> enabled
>>>>> 
>>>>> 
>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>> 
>>>>>> Hi Sage,
>>>>>> 
>>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
>>>>>> 
>>>>>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
>>>>>> 
>>>>>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
>>>>>> 
>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
>>>>>> 
>>>>>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
>>>>>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
>>>>>> 
>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
>>>>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
>>>>>> 
>>>>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
>>>>>> 
>>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>>>>>> 
>>>>>> This is our ceph.conf:
>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>>>>>> 
>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>>>>>> 
>>>>>> This is pg dump:
>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
>>>>>> 
>>>>>> This is iostat & perf top:
>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>>>>>> 
>>>>>> This strace output of ceph-osd:
>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>>>>>> 
>>>>>> This is OSD log (default debug):
>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>>>>>> 
>>>>>> This is leader MON log (default debug):
>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>>>>>> 
>>>>>> These are OSDs failed to start. Total number is 58.
>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
>>>>>> 
>>>>>> 
>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>>>>>> 
>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>>>>>> 
>>>>>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
>>>>>> 
>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
>>>>>> 
>>>>>> 
>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>>>>>> 
>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
>>>>>> only use it in very specific circumstances when directed by a developer.
>>>>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
>>>>>> that point...you're seeing some sort of resource exhaustion.
>>>>>> 
>>>>>> The noup trick works when OSDs are way behind on maps and all need to
>>>>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
>>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
>>>>>> t the mons have.  Were they really caught up when you unset noup?
>>>>>> 
>>>>>> I'm just catching up and haven't read the whole thread but I haven't seen
>>>>>> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
>>>>>> on maps could explain it but not why they wouldn't peer once they were all
>>>>>> marked up...
>>>>>> 
>>>>>> sage
>>>>>> 
>>>>>> 
>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>>>>>> 
>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
>>>>>> Is that be usefull here? There is such a less information about it.
>>>>>> 
>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
>>>>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
>>>>>> 
>>>>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
>>>>>> 
>>>>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
>>>>>> 
>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> 
>>>>>> I can see some issues from the osd log file. You have an extremely low
>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
>>>>>> recommend cranking up rocksdb cache size as much as possible. If you
>>>>>> have RAM you can also increase bluestores cache size for hdd. Default
>>>>>> is 1GB be as liberal as you can without getting OOM kills. You also
>>>>>> have lots of osd map loading and decoding in the log. Are you sure all
>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
>>>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
>>>>>> mimic.
>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>> 
>>>>>> 
>>>>>> Hello Darius,
>>>>>> 
>>>>>> Thanks for reply!
>>>>>> 
>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
>>>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>>>>>> 
>>>>>> So we have a small test like that:
>>>>>> * Stop all OSDs (168 of them)
>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
>>>>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>>>>>> 
>>>>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
>>>>>> One can think that is a lot of data there to scan. But it is just 33TB.
>>>>>> 
>>>>>> So at short we dont know which PG is stuck so we can remove it.
>>>>>> 
>>>>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>>>>>> 
>>>>>> Any thought would be greatly appreciated.
>>>>>> 
>>>>>> 
>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> Currently you have 15 objects missing. I would recommend finding them
>>>>>> and making backups of them. Ditch all other osds that are failing to
>>>>>> start and concentrate on bringing online those that have missing
>>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>>>>> if it stabilises. If it stabilises leave these setting if not turn
>>>>>> them back on.
>>>>>> Now get some of the pg's that are blocked and querry the pgs to check
>>>>>> why they are blocked. Try removing as much blocks as possible and then
>>>>>> remove the norebalance/norecovery flags and see if it starts to fix
>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>> 
>>>>>> 
>>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
>>>>>> (as for Redhat).
>>>>>> So it could be best to checkout bluestore and rocksdb. There are some
>>>>>> tools to check health and also repair. But there are limited
>>>>>> documentation.
>>>>>> Anyone who has experince with it?
>>>>>> Anyone lead/help to a proper check would be great.
>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
>>>>>> tarihinde şunu yazdı:
>>>>>> 
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>>>>> 
>>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>>>>>> 
>>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>>>>>> 
>>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>>>>> 
>>>>>> This is the one we are very sure of. But we are not sure of the reason.
>>>>>> 
>>>>>> Here is the latest ceph status:
>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>>>>> 
>>>>>> This is the status after we started all of the OSDs 24 hours ago.
>>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>>>>>> 
>>>>>> Here is the debug=20 log of an OSD which is same for all others:
>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>>>>> 
>>>>>> This the full log the same OSD.
>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>>>>> 
>>>>>> Here is the strace of the same OSD process:
>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>>> 
>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>>>>>> 
>>>>>> Any help to figure out what is going on would be great.
>>>>>> 
>>>>>> Best Regards,
>>>>>> Goktug Yildirim
>>>>>> 
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[-- Attachment #1.2: Type: text/html, Size: 41769 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                         ` <C10AD823-76D4-4E2B-BE2E-BF434373C81B-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-10-03 16:45                                                           ` Sage Weil
       [not found]                                                             ` <7C03273B-DC38-40CC-99F4-AFECE672D80A@usishi.com>
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2018-10-03 16:45 UTC (permalink / raw)
  To: Goktug Yildirim; +Cc: ceph-devel, by morphin, ceph-users

[-- Attachment #1: Type: TEXT/PLAIN, Size: 21167 bytes --]

On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> We are starting to work on it. First step is getting the structure out and dumping the current value as you say.
> 
> And you were correct we did not run force_create_pg.

Great.

So, I'm eager to see what the current structure is... please attach it once
you have it.

The new replacement one should look like this (when hexdump -C'd):

00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
0000001e

...except that, starting at byte 6, you want to put in a recent OSDMap epoch
as a 4-byte value, in hex, little endian (least significant byte first), in
place of the 0x10 that is there now.  It should then dump like this:

$ ceph-dencoder type creating_pgs_t import myfile decode dump_json
{
    "last_scan_epoch": 16,   <--- but with a recent epoch here
    "creating_pgs": [],
    "queue": [],
    "created_pools": [
        66
    ]
}
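
A minimal sketch of one way to produce such a blob, assuming the byte layout
implied by the sample hexdump above (a 02 01 version/compat header, a 4-byte
payload length, then the 32-bit epoch, an empty creating-PGs map, a
created_pools set holding only pool 66 as a 64-bit id, and an empty queue).
The script name and output filename are placeholders, not Ceph tools; verify
the result with hexdump -C and the ceph-dencoder command above before
injecting anything.

#!/usr/bin/env python3
# make_creating.py -- rough sketch, not an official Ceph tool.
# Writes a replacement creating_pgs_t value matching the sample hexdump,
# with a caller-supplied last_scan_epoch.
import struct
import sys

epoch = int(sys.argv[1])                 # a recent OSDMap epoch, e.g. 72883

payload  = struct.pack('<I', epoch)      # last_scan_epoch (u32, little endian)
payload += struct.pack('<I', 0)          # creating_pgs: empty map (count = 0)
payload += struct.pack('<I', 1)          # created_pools: one entry
payload += struct.pack('<q', 66)         # pool id 66 as 64-bit little endian
payload += struct.pack('<I', 0)          # queue: empty (count = 0)

# ENCODE_START(2, 1) style header: struct version, compat version, length
blob = struct.pack('<BBI', 2, 1, len(payload)) + payload

with open('creating.new', 'wb') as f:
    f.write(blob)

print('wrote %d bytes' % len(blob))      # should be 30 (0x1e)

Run it as, say, python3 make_creating.py 72883 with whatever recent epoch the
mons report, then confirm the output byte-for-byte matches the pattern above
apart from the epoch field.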

sage


 > 
> > On 3 Oct 2018, at 17:52, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> > 
> > On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >> Sage,
> >> 
> >> Pool 66 is the only pool it shows right now. This a pool created months ago.
> >> ceph osd lspools
> >> 66 mypool
> >> 
> >> As we recreated mon db from OSDs, the pools for MDS was unusable. So we deleted them.
> >> After we create another cephfs fs and pools we started MDS and it stucked on creation. So we stopped MDS and removed fs and fs pools. Right now we do not have MDS running nor we have cephfs related things.
> >> 
> >> ceph fs dump
> >> dumped fsmap epoch 1 e1
> >> enable_multiple, ever_enabled_multiple: 0,0
> >> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >> legacy client fscid: -1
> >> 
> >> No filesystems configured
> >> 
> >> ceph fs ls
> >> No filesystems enabled
> >> 
> >> Now pool 66 seems to only pool we have and it has been created months ago. Then I guess there is something hidden out there.
> >> 
> >> Is there any way to find and delete it?
> > 
> > Ok, I'm concerned that the creating pg is in there if this is an old 
> > pool... did you perhaps run force_create_pg at some point?  Assuming you 
> > didn't, I think this is a bug in the process for rebuilding the mon 
> > store.. one that doesn't normally come up because the impact is this 
> > osdmap scan that is cheap in our test scenarios but clearly not cheap for 
> > your aged cluster.
> > 
> > In any case, there is a way to clear those out of the mon, but it's a bit 
> > dicey. 
> > 
> > 1. stop all mons
> > 2. make a backup of all mons
> > 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
> > key=creating key on one of the mons
> > 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
> > 5. hex edit the structure to remove all of the creating pgs, and adds pool 
> > 66 to the created_pgs member.
> > 6. verify with ceph-dencoder dump that the edit was correct...
> > 7. inject the updated structure into all of the mons
> > 8. start all mons
> > 
> > 4-6 will probably be an iterative process... let's start by getting the 
> > structure out and dumping the current value?  
> > 
> > The code to refer to to understand the structure is src/mon/CreatingPGs.h 
> > encode/decode methods.
> > 
> > sage
> > 
> > 
> >> 
> >> 
> >>> On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>> 
> >>> Oh... I think this is the problem:
> >>> 
> >>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
> >>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
> >>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> >>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> >>> 
> >>> You are in the midst of creating new pgs, and unfortunately pg create is 
> >>> one of the last remaining places where the OSDs need to look at a full 
> >>> history of map changes between then and the current map epoch.  In this 
> >>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
> >>> later.
> >>> 
> >>> What is this new pool for?  Is it still empty, and if so, can we delete 
> >>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
> >>> and restarting the OSDs will make these pg_creates go away.
> >>> 
> >>> s
> >>> 
> >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>> 
> >>>> Hello,
> >>>> 
> >>>> It seems nothing has changed.
> >>>> 
> >>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
> >>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
> >>>> 
> >>>> 
> >>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>> 
> >>>>> Hello,
> >>>>> 
> >>>>> 
> >>>>> You can also reduce the osd map updates by adding this to your ceph
> >>>>> config file. "osd crush update on start = false". This should remove
> >>>>> and update that is generated when osd starts.
> >>>>> 
> >>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>>>> e14 handle_command mon_command({"prefix": "osd crush
> >>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> >>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> >>>>> dispatch
> >>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> >>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> >>>>> "root=default"]} v 0) v1
> >>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> >>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> >>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
> >>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> >>>>> 'osd.47' initial_weight 3.6396 at location
> >>>>> {host=SRV-SEKUARK8,root=default}
> >>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
> >>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> >>>>> enabled
> >>>>> 
> >>>>> 
> >>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> >>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>> 
> >>>>>> Hi Sage,
> >>>>>> 
> >>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
> >>>>>> 
> >>>>>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
> >>>>>> 
> >>>>>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
> >>>>>> 
> >>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
> >>>>>> 
> >>>>>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
> >>>>>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
> >>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
> >>>>>> 
> >>>>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
> >>>>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
> >>>>>> 
> >>>>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
> >>>>>> 
> >>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
> >>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> >>>>>> 
> >>>>>> This is our ceph.conf:
> >>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
> >>>>>> 
> >>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
> >>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
> >>>>>> 
> >>>>>> This is pg dump:
> >>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
> >>>>>> 
> >>>>>> This is iostat & perf top:
> >>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
> >>>>>> 
> >>>>>> This strace output of ceph-osd:
> >>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
> >>>>>> 
> >>>>>> This is OSD log (default debug):
> >>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
> >>>>>> 
> >>>>>> This is leader MON log (default debug):
> >>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
> >>>>>> 
> >>>>>> These are OSDs failed to start. Total number is 58.
> >>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> >>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> >>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
> >>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
> >>>>>> 
> >>>>>> 
> >>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
> >>>>>> 
> >>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
> >>>>>> 
> >>>>>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
> >>>>>> 
> >>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
> >>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
> >>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
> >>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
> >>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
> >>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
> >>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
> >>>>>> 
> >>>>>> 
> >>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>>>>> 
> >>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
> >>>>>> only use it in very specific circumstances when directed by a developer.
> >>>>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
> >>>>>> that point...you're seeing some sort of resource exhaustion.
> >>>>>> 
> >>>>>> The noup trick works when OSDs are way behind on maps and all need to
> >>>>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
> >>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
> >>>>>> t the mons have.  Were they really caught up when you unset noup?
> >>>>>> 
> >>>>>> I'm just catching up and haven't read the whole thread but I haven't seen
> >>>>>> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
> >>>>>> on maps could explain it but not why they wouldn't peer once they were all
> >>>>>> marked up...
> >>>>>> 
> >>>>>> sage
> >>>>>> 
> >>>>>> 
> >>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> >>>>>> 
> >>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
> >>>>>> Is that be usefull here? There is such a less information about it.
> >>>>>> 
> >>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
> >>>>>> 
> >>>>>> Hi,
> >>>>>> 
> >>>>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
> >>>>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
> >>>>>> 
> >>>>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
> >>>>>> 
> >>>>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
> >>>>>> 
> >>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>> 
> >>>>>> Hi,
> >>>>>> 
> >>>>>> 
> >>>>>> I can see some issues from the osd log file. You have an extremely low
> >>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> >>>>>> recommend cranking up rocksdb cache size as much as possible. If you
> >>>>>> have RAM you can also increase bluestores cache size for hdd. Default
> >>>>>> is 1GB be as liberal as you can without getting OOM kills. You also
> >>>>>> have lots of osd map loading and decoding in the log. Are you sure all
> >>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
> >>>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
> >>>>>> mimic.
> >>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> >>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>> 
> >>>>>> 
> >>>>>> Hello Darius,
> >>>>>> 
> >>>>>> Thanks for reply!
> >>>>>> 
> >>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
> >>>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
> >>>>>> 
> >>>>>> So we have a small test like that:
> >>>>>> * Stop all OSDs (168 of them)
> >>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
> >>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
> >>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
> >>>>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
> >>>>>> 
> >>>>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
> >>>>>> One can think that is a lot of data there to scan. But it is just 33TB.
> >>>>>> 
> >>>>>> So at short we dont know which PG is stuck so we can remove it.
> >>>>>> 
> >>>>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
> >>>>>> 
> >>>>>> Any thought would be greatly appreciated.
> >>>>>> 
> >>>>>> 
> >>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>> 
> >>>>>> Hello,
> >>>>>> 
> >>>>>> Currently you have 15 objects missing. I would recommend finding them
> >>>>>> and making backups of them. Ditch all other osds that are failing to
> >>>>>> start and concentrate on bringing online those that have missing
> >>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
> >>>>>> if it stabilises. If it stabilises leave these setting if not turn
> >>>>>> them back on.
> >>>>>> Now get some of the pg's that are blocked and querry the pgs to check
> >>>>>> why they are blocked. Try removing as much blocks as possible and then
> >>>>>> remove the norebalance/norecovery flags and see if it starts to fix
> >>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >>>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>> 
> >>>>>> 
> >>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
> >>>>>> (as for Redhat).
> >>>>>> So it could be best to checkout bluestore and rocksdb. There are some
> >>>>>> tools to check health and also repair. But there are limited
> >>>>>> documentation.
> >>>>>> Anyone who has experince with it?
> >>>>>> Anyone lead/help to a proper check would be great.
> >>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
> >>>>>> tarihinde şunu yazdı:
> >>>>>> 
> >>>>>> 
> >>>>>> Hi all,
> >>>>>> 
> >>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>>>> 
> >>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
> >>>>>> 
> >>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> >>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
> >>>>>> 
> >>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >>>>>> 
> >>>>>> This is the one we are very sure of. But we are not sure of the reason.
> >>>>>> 
> >>>>>> Here is the latest ceph status:
> >>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>>>> 
> >>>>>> This is the status after we started all of the OSDs 24 hours ago.
> >>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
> >>>>>> 
> >>>>>> Here is the debug=20 log of an OSD which is same for all others:
> >>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >>>>>> 
> >>>>>> This the full log the same OSD.
> >>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>>>>> 
> >>>>>> Here is the strace of the same OSD process:
> >>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>> 
> >>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
> >>>>>> 
> >>>>>> Any help to figure out what is going on would be great.
> >>>>>> 
> >>>>>> Best Regards,
> >>>>>> Goktug Yildirim
> >>>>>> 
> >>>>>> _______________________________________________
> >>>>>> ceph-users mailing list
> >>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                               ` <7C03273B-DC38-40CC-99F4-AFECE672D80A-DtrUWsE0is3QT0dZR+AlfA@public.gmane.org>
@ 2018-10-03 18:26                                                                 ` Sage Weil
       [not found]                                                                   ` <alpine.DEB.2.11.1810031826290.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2018-10-03 18:26 UTC (permalink / raw)
  To: Göktuğ Yıldırım
  Cc: ceph-devel, by morphin, ceph-users

[-- Attachment #1: Type: TEXT/PLAIN, Size: 22795 bytes --]

On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> If I didn't do it wrong, I got the output as below.
> 
> ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating > dump
> 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: [default]
> 
> ceph-dencoder type creating_pgs_t import dump dump_json

Sorry, that should be:

ceph-dencoder type creating_pgs_t import dump decode dump_json

s

> {
>     "last_scan_epoch": 0,
>     "creating_pgs": [],
>     "queue": [],
>     "created_pools": []
> }
> 
> You can find the "dump" link below.
> 
> dump: https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
> 
> 
> Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunları yazdı (3 Eki 2018 18:45):
> 
> >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >> We are starting to work on it. First step is getting the structure out and dumping the current value as you say.
> >> 
> >> And you were correct we did not run force_create_pg.
> > 
> > Great.
> > 
> > So, eager to see what the current structure is... please attach once you 
> > have it.
> > 
> > The new replacement one should look like this (when hexdump -C'd):
> > 
> > 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
> > 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
> > 0000001e
> > 
> > ...except that from byte 6 you want to put in a recent OSDMap epoch, in 
> > hex, little endian (least significant byte first), in place of the 0x10 
> > that is there now.  It should dump like this:
> > 
> > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> > {
> >    "last_scan_epoch": 16,   <--- but with a recent epoch here
> >    "creating_pgs": [],
> >    "queue": [],
> >    "created_pools": [
> >        66
> >    ]
> > }
> > 
> > sage
> > 
> > 
> >> 
> >>> On 3 Oct 2018, at 17:52, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>> 
> >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>> Sage,
> >>>> 
> >>>> Pool 66 is the only pool it shows right now. This a pool created months ago.
> >>>> ceph osd lspools
> >>>> 66 mypool
> >>>> 
> >>>> As we recreated mon db from OSDs, the pools for MDS was unusable. So we deleted them.
> >>>> After we create another cephfs fs and pools we started MDS and it stucked on creation. So we stopped MDS and removed fs and fs pools. Right now we do not have MDS running nor we have cephfs related things.
> >>>> 
> >>>> ceph fs dump
> >>>> dumped fsmap epoch 1 e1
> >>>> enable_multiple, ever_enabled_multiple: 0,0
> >>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >>>> legacy client fscid: -1
> >>>> 
> >>>> No filesystems configured
> >>>> 
> >>>> ceph fs ls
> >>>> No filesystems enabled
> >>>> 
> >>>> Now pool 66 seems to only pool we have and it has been created months ago. Then I guess there is something hidden out there.
> >>>> 
> >>>> Is there any way to find and delete it?
> >>> 
> >>> Ok, I'm concerned that the creating pg is in there if this is an old 
> >>> pool... did you perhaps run force_create_pg at some point?  Assuming you 
> >>> didn't, I think this is a bug in the process for rebuilding the mon 
> >>> store.. one that doesn't normally come up because the impact is this 
> >>> osdmap scan that is cheap in our test scenarios but clearly not cheap for 
> >>> your aged cluster.
> >>> 
> >>> In any case, there is a way to clear those out of the mon, but it's a bit 
> >>> dicey. 
> >>> 
> >>> 1. stop all mons
> >>> 2. make a backup of all mons
> >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
> >>> key=creating key on one of the mons
> >>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
> >>> 5. hex edit the structure to remove all of the creating pgs, and adds pool 
> >>> 66 to the created_pgs member.
> >>> 6. verify with ceph-dencoder dump that the edit was correct...
> >>> 7. inject the updated structure into all of the mons
> >>> 8. start all mons
> >>> 
> >>> 4-6 will probably be an iterative process... let's start by getting the 
> >>> structure out and dumping the current value?  
> >>> 
> >>> The code to refer to to understand the structure is src/mon/CreatingPGs.h 
> >>> encode/decode methods.
> >>> 
> >>> sage
> >>> 
> >>> 
> >>>> 
> >>>> 
> >>>>> On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>>>> 
> >>>>> Oh... I think this is the problem:
> >>>>> 
> >>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
> >>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
> >>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> >>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> >>>>> 
> >>>>> You are in the midst of creating new pgs, and unfortunately pg create is 
> >>>>> one of the last remaining places where the OSDs need to look at a full 
> >>>>> history of map changes between then and the current map epoch.  In this 
> >>>>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
> >>>>> later.
> >>>>> 
> >>>>> What is this new pool for?  Is it still empty, and if so, can we delete 
> >>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
> >>>>> and restarting the OSDs will make these pg_creates go away.
> >>>>> 
> >>>>> s
> >>>>> 
> >>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>>>> 
> >>>>>> Hello,
> >>>>>> 
> >>>>>> It seems nothing has changed.
> >>>>>> 
> >>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
> >>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
> >>>>>> 
> >>>>>> 
> >>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>> 
> >>>>>>> Hello,
> >>>>>>> 
> >>>>>>> 
> >>>>>>> You can also reduce the osd map updates by adding this to your ceph
> >>>>>>> config file. "osd crush update on start = false". This should remove
> >>>>>>> and update that is generated when osd starts.
> >>>>>>> 
> >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>>>>>> e14 handle_command mon_command({"prefix": "osd crush
> >>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> >>>>>>> dispatch
> >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>>>>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> >>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> >>>>>>> "root=default"]} v 0) v1
> >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> >>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
> >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> >>>>>>> 'osd.47' initial_weight 3.6396 at location
> >>>>>>> {host=SRV-SEKUARK8,root=default}
> >>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
> >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> >>>>>>> enabled
> >>>>>>> 
> >>>>>>> 
> >>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> >>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>> 
> >>>>>>>> Hi Sage,
> >>>>>>>> 
> >>>>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
> >>>>>>>> 
> >>>>>>>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
> >>>>>>>> 
> >>>>>>>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
> >>>>>>>> 
> >>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
> >>>>>>>> 
> >>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
> >>>>>>>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
> >>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
> >>>>>>>> 
> >>>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
> >>>>>>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
> >>>>>>>> 
> >>>>>>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
> >>>>>>>> 
> >>>>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
> >>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> >>>>>>>> 
> >>>>>>>> This is our ceph.conf:
> >>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
> >>>>>>>> 
> >>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
> >>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
> >>>>>>>> 
> >>>>>>>> This is pg dump:
> >>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
> >>>>>>>> 
> >>>>>>>> This is iostat & perf top:
> >>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
> >>>>>>>> 
> >>>>>>>> This strace output of ceph-osd:
> >>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
> >>>>>>>> 
> >>>>>>>> This is OSD log (default debug):
> >>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
> >>>>>>>> 
> >>>>>>>> This is leader MON log (default debug):
> >>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
> >>>>>>>> 
> >>>>>>>> These are OSDs failed to start. Total number is 58.
> >>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> >>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> >>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
> >>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
> >>>>>>>> 
> >>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
> >>>>>>>> 
> >>>>>>>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
> >>>>>>>> 
> >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
> >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
> >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
> >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
> >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
> >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
> >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
> >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
> >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
> >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
> >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
> >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
> >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
> >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>>>>>>> 
> >>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
> >>>>>>>> only use it in very specific circumstances when directed by a developer.
> >>>>>>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
> >>>>>>>> that point...you're seeing some sort of resource exhaustion.
> >>>>>>>> 
> >>>>>>>> The noup trick works when OSDs are way behind on maps and all need to
> >>>>>>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
> >>>>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
> >>>>>>>> t the mons have.  Were they really caught up when you unset noup?
> >>>>>>>> 
> >>>>>>>> I'm just catching up and haven't read the whole thread but I haven't seen
> >>>>>>>> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
> >>>>>>>> on maps could explain it but not why they wouldn't peer once they were all
> >>>>>>>> marked up...
> >>>>>>>> 
> >>>>>>>> sage
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> >>>>>>>> 
> >>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
> >>>>>>>> Is that be usefull here? There is such a less information about it.
> >>>>>>>> 
> >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
> >>>>>>>> 
> >>>>>>>> Hi,
> >>>>>>>> 
> >>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
> >>>>>>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
> >>>>>>>> 
> >>>>>>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
> >>>>>>>> 
> >>>>>>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
> >>>>>>>> 
> >>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>> 
> >>>>>>>> Hi,
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> I can see some issues from the osd log file. You have an extremely low
> >>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> >>>>>>>> recommend cranking up rocksdb cache size as much as possible. If you
> >>>>>>>> have RAM you can also increase bluestores cache size for hdd. Default
> >>>>>>>> is 1GB be as liberal as you can without getting OOM kills. You also
> >>>>>>>> have lots of osd map loading and decoding in the log. Are you sure all
> >>>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
> >>>>>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
> >>>>>>>> mimic.
> >>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> >>>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> Hello Darius,
> >>>>>>>> 
> >>>>>>>> Thanks for reply!
> >>>>>>>> 
> >>>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
> >>>>>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
> >>>>>>>> 
> >>>>>>>> So we have a small test like that:
> >>>>>>>> * Stop all OSDs (168 of them)
> >>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
> >>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
> >>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
> >>>>>>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
> >>>>>>>> 
> >>>>>>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
> >>>>>>>> One can think that is a lot of data there to scan. But it is just 33TB.
> >>>>>>>> 
> >>>>>>>> So at short we dont know which PG is stuck so we can remove it.
> >>>>>>>> 
> >>>>>>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
> >>>>>>>> 
> >>>>>>>> Any thought would be greatly appreciated.
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>> 
> >>>>>>>> Hello,
> >>>>>>>> 
> >>>>>>>> Currently you have 15 objects missing. I would recommend finding them
> >>>>>>>> and making backups of them. Ditch all other osds that are failing to
> >>>>>>>> start and concentrate on bringing online those that have missing
> >>>>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
> >>>>>>>> if it stabilises. If it stabilises leave these setting if not turn
> >>>>>>>> them back on.
> >>>>>>>> Now get some of the pg's that are blocked and querry the pgs to check
> >>>>>>>> why they are blocked. Try removing as much blocks as possible and then
> >>>>>>>> remove the norebalance/norecovery flags and see if it starts to fix
> >>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >>>>>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
> >>>>>>>> (as for Redhat).
> >>>>>>>> So it could be best to checkout bluestore and rocksdb. There are some
> >>>>>>>> tools to check health and also repair. But there are limited
> >>>>>>>> documentation.
> >>>>>>>> Anyone who has experince with it?
> >>>>>>>> Anyone lead/help to a proper check would be great.
> >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
> >>>>>>>> tarihinde şunu yazdı:
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> Hi all,
> >>>>>>>> 
> >>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>>>>>> 
> >>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
> >>>>>>>> 
> >>>>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> >>>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
> >>>>>>>> 
> >>>>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >>>>>>>> 
> >>>>>>>> This is the one we are very sure of. But we are not sure of the reason.
> >>>>>>>> 
> >>>>>>>> Here is the latest ceph status:
> >>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>>>>>> 
> >>>>>>>> This is the status after we started all of the OSDs 24 hours ago.
> >>>>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
> >>>>>>>> 
> >>>>>>>> Here is the debug=20 log of an OSD which is same for all others:
> >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >>>>>>>> 
> >>>>>>>> This the full log the same OSD.
> >>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>>>>>>> 
> >>>>>>>> Here is the strace of the same OSD process:
> >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>>>> 
> >>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
> >>>>>>>> 
> >>>>>>>> Any help to figure out what is going on would be great.
> >>>>>>>> 
> >>>>>>>> Best Regards,
> >>>>>>>> Goktug Yildirim
> >>>>>>>> 
> >>>>>>>> _______________________________________________
> >>>>>>>> ceph-users mailing list
> >>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> 
> 
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                                   ` <alpine.DEB.2.11.1810031826290.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2018-10-03 19:01                                                                     ` Goktug YILDIRIM
       [not found]                                                                       ` <CACz15iKdK47N0mBJrRyDeEx7BH-uRKmvZ+OJ6Cc6u6-qxTeA6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Goktug YILDIRIM @ 2018-10-03 19:01 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-devel, by morphin, Göktuğ Yıldırım,
	ceph-users


[-- Attachment #1.1: Type: text/plain, Size: 24742 bytes --]

I changed the file name to make it clearer.
When I run your command with "decode" added, I get an error like this:

ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
error: buffer::malformed_input: void creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 2 < 111

My ceph version: 13.2.2
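
A rough way to see what ceph-dencoder is actually being fed is to peek at the
raw bytes of the extracted value. The sketch below is not a Ceph tool, only a
sanity check, and assumes the same v2 header layout as the sample hexdump
earlier in the thread (02 01 followed by a 4-byte payload length); DUMPFILE
is whatever path the value was written to.

#!/usr/bin/env python3
# check_creating.py -- rough diagnostic sketch, not a Ceph tool.
# Shows the size, the first bytes, and the decoded header of an
# extracted creating_pgs_t value.
import struct
import sys

data = open(sys.argv[1], 'rb').read()
print('size: %d bytes' % len(data))
print('first bytes: ' + ' '.join('%02x' % b for b in data[:32]))

if len(data) >= 6:
    struct_v, compat, payload_len = struct.unpack_from('<BBI', data, 0)
    print('struct_v=%d compat=%d payload_len=%d'
          % (struct_v, compat, payload_len))
    # A well-formed v2 value should report struct_v=2, compat=1 and a
    # payload_len equal to the remaining bytes (len(data) - 6).

If the header does not start with 02 01, the file probably does not contain
the bare value that ceph-dencoder expects.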

On Wed, 3 Oct 2018 at 20:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
wrote:

> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> > If I didn't do it wrong, I got the output as below.
> >
> > ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/
> get osd_pg_creating creating > dump
> > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column
> families: [default]
> >
> > ceph-dencoder type creating_pgs_t import dump dump_json
>
> Sorry, should be
>
> ceph-dencoder type creating_pgs_t import dump decode dump_json
>
> s
>
> > {
> >     "last_scan_epoch": 0,
> >     "creating_pgs": [],
> >     "queue": [],
> >     "created_pools": []
> > }
> >
> > You can find the "dump" link below.
> >
> > dump:
> https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
> >
> >
> > Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunları yazdı (3 Eki 2018 18:45):
> >
> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> > >> We are starting to work on it. First step is getting the structure
> out and dumping the current value as you say.
> > >>
> > >> And you were correct we did not run force_create_pg.
> > >
> > > Great.
> > >
> > > So, eager to see what the current structure is... please attach once
> you
> > > have it.
> > >
> > > The new replacement one should look like this (when hexdump -C'd):
> > >
> > > 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00
> |................|
> > > 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00
> |..B...........|
> > > 0000001e
> > >
> > > ...except that from byte 6 you want to put in a recent OSDMap epoch,
> in
> > > hex, little endian (least significant byte first), in place of the
> 0x10
> > > that is there now.  It should dump like this:
> > >
> > > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> > > {
> > >    "last_scan_epoch": 16,   <--- but with a recent epoch here
> > >    "creating_pgs": [],
> > >    "queue": [],
> > >    "created_pools": [
> > >        66
> > >    ]
> > > }
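
A sketch of that hex edit without a hex editor, using 72900 = 0x00011cc4 purely as an example epoch (substitute a recent epoch from your own osdmap). Everything else is byte-for-byte the reference dump above, with pool 66 = 0x42 already present in created_pools:

  # bytes 6-9 carry the epoch, little-endian: 72900 -> c4 1c 01 00
  echo '02 01 18 00 00 00 c4 1c 01 00 00 00 00 00 01 00' \
       '00 00 42 00 00 00 00 00 00 00 00 00 00 00' | xxd -r -p > creating.new

  # this should round-trip with last_scan_epoch 72900 and created_pools [66]
  # before it goes anywhere near a mon store
  ceph-dencoder type creating_pgs_t import creating.new decode dump_json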
> > >
> > > sage
> > >
> > >
> > >>
> > >>> On 3 Oct 2018, at 17:52, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> > >>>
> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> > >>>> Sage,
> > >>>>
> > >>>> Pool 66 is the only pool it shows right now. This a pool created
> months ago.
> > >>>> ceph osd lspools
> > >>>> 66 mypool
> > >>>>
> > >>>> As we recreated mon db from OSDs, the pools for MDS was unusable.
> So we deleted them.
> > >>>> After we create another cephfs fs and pools we started MDS and it
> stucked on creation. So we stopped MDS and removed fs and fs pools. Right
> now we do not have MDS running nor we have cephfs related things.
> > >>>>
> > >>>> ceph fs dump
> > >>>> dumped fsmap epoch 1 e1
> > >>>> enable_multiple, ever_enabled_multiple: 0,0
> > >>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> > >>>> legacy client fscid: -1
> > >>>>
> > >>>> No filesystems configured
> > >>>>
> > >>>> ceph fs ls
> > >>>> No filesystems enabled
> > >>>>
> > >>>> Now pool 66 seems to only pool we have and it has been created
> months ago. Then I guess there is something hidden out there.
> > >>>>
> > >>>> Is there any way to find and delete it?
> > >>>
> > >>> Ok, I'm concerned that the creating pg is in there if this is an old
> > >>> pool... did you perhaps run force_create_pg at some point?  Assuming
> you
> > >>> didn't, I think this is a bug in the process for rebuilding the mon
> > >>> store.. one that doesn't normally come up because the impact is this
> > >>> osdmap scan that is cheap in our test scenarios but clearly not
> cheap for
> > >>> your aged cluster.
> > >>>
> > >>> In any case, there is a way to clear those out of the mon, but it's
> a bit
> > >>> dicey.
> > >>>
> > >>> 1. stop all mons
> > >>> 2. make a backup of all mons
> > >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating
> > >>> key=creating key on one of the mons
> > >>> 4. dump the object with ceph-dencoder type creating_pgs_t import
> FILE dump_json
> > >>> 5. hex edit the structure to remove all of the creating pgs, and
> adds pool
> > >>> 66 to the created_pgs member.
> > >>> 6. verify with ceph-dencoder dump that the edit was correct...
> > >>> 7. inject the updated structure into all of the mons
> > >>> 8. start all mons
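
Steps 1, 2 and 7 in rough shell form, for reference. The store path is the one used elsewhere in this thread, and the "in <file>" form of set is my reading of ceph-kvstore-tool's usage, so double-check it against --help on 13.2.2 before running:

  systemctl stop ceph-mon.target                                            # step 1, on every mon
  cp -a /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db /root/mon-store.db.bak   # step 2

  # step 7: once the edited blob decodes cleanly with ceph-dencoder, write it
  # back into the store on each mon
  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db \
      set osd_pg_creating creating in creating.new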
> > >>>
> > >>> 4-6 will probably be an iterative process... let's start by getting
> the
> > >>> structure out and dumping the current value?
> > >>>
> > >>> The code to refer to to understand the structure is
> src/mon/CreatingPGs.h
> > >>> encode/decode methods.
> > >>>
> > >>> sage
> > >>>
> > >>>
> > >>>>
> > >>>>
> > >>>>> On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> > >>>>>
> > >>>>> Oh... I think this is the problem:
> > >>>>>
> > >>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op
> osd_pg_create(e72883
> > >>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196
> 66.124:60196
> > >>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196
> > >>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> > >>>>>
> > >>>>> You are in the midst of creating new pgs, and unfortunately pg
> create is
> > >>>>> one of the last remaining places where the OSDs need to look at a
> full
> > >>>>> history of map changes between then and the current map epoch.  In
> this
> > >>>>> case, the pool was created in 60196 and it is now 72883, ~12k
> epochs
> > >>>>> later.
> > >>>>>
> > >>>>> What is this new pool for?  Is it still empty, and if so, can we
> delete
> > >>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the
> mon end
> > >>>>> and restarting the OSDs will make these pg_creates go away.
> > >>>>>
> > >>>>> s
> > >>>>>
> > >>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> > >>>>>>
> > >>>>>> Hello,
> > >>>>>>
> > >>>>>> It seems nothing has changed.
> > >>>>>>
> > >>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <
> https://paste.ubuntu.com/p/MtvTr5HYW4/>
> > >>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <
> https://paste.ubuntu.com/p/7Sx64xGzkR/>
> > >>>>>>
> > >>>>>>
> > >>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis@gmail.com>
> wrote:
> > >>>>>>>
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> You can also reduce the osd map updates by adding this to your
> ceph
> > >>>>>>> config file. "osd crush update on start = false". This should
> remove
> > >>>>>>> and update that is generated when osd starts.
> > >>>>>>>
> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0
> (leader)
> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush
> > >>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log
> [INF] :
> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47'
> cmd=[{"prefix":
> > >>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> > >>>>>>> dispatch
> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0
> (leader)
> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush
> create-or-move",
> > >>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> > >>>>>>> "root=default"]} v 0) v1
> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log
> [INF] :
> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47'
> cmd=[{"prefix":
> > >>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> > >>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush
> item name
> > >>>>>>> 'osd.47' initial_weight 3.6396 at location
> > >>>>>>> {host=SRV-SEKUARK8,root=default}
> > >>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full
> prune
> > >>>>>>> enabled
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> > >>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi Sage,
> > >>>>>>>>
> > >>>>>>>> Thank you for your response. Now I am sure this incident is
> going to be resolved.
> > >>>>>>>>
> > >>>>>>>> The problem started when 7 server crashed same time and they
> came back after ~5 minutes.
> > >>>>>>>>
> > >>>>>>>> Two of our 3 mon services were restarted in this crash. Since
> mon services are enabled they should be started nearly at the same time. I
> dont know if this makes any difference but some of the guys on IRC told it
> is required that they start in order not at the same time. Otherwise it
> could break things badly.
> > >>>>>>>>
> > >>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the
> end we have so many STUCK request and our cluster can not heal itself.
> > >>>>>>>>
> > >>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But when
> we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
> > >>>>>>>> I/O load on all of my OSD disks are at around %95 utilization
> and never ends. CPU and RAM usage are OK.
> > >>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
> > >>>>>>>>
> > >>>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our
> goal was the eliminate older PG's and leaving cluster with good ones.
> > >>>>>>>> With replication size=1 we saw "%13 PGS not active”. But it
> didn’t solve our problem.
> > >>>>>>>>
> > >>>>>>>> Of course we have to save %100 of data. But we feel like even
> saving %50 of our data will be make us very happy right now.
> > >>>>>>>>
> > >>>>>>>> This is what happens when the cluster starts. I believe it
> explains the whole story very nicely.
> > >>>>>>>>
> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> > >>>>>>>>
> > >>>>>>>> This is our ceph.conf:
> > >>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
> > >>>>>>>>
> > >>>>>>>> This is the output of "osd stat && osd epochs && ceph -s &&
> ceph health”:
> > >>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
> > >>>>>>>>
> > >>>>>>>> This is pg dump:
> > >>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
> > >>>>>>>>
> > >>>>>>>> This is iostat & perf top:
> > >>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
> > >>>>>>>>
> > >>>>>>>> This strace output of ceph-osd:
> > >>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
> > >>>>>>>>
> > >>>>>>>> This is OSD log (default debug):
> > >>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
> > >>>>>>>>
> > >>>>>>>> This is leader MON log (default debug):
> > >>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
> > >>>>>>>>
> > >>>>>>>> These are OSDs failed to start. Total number is 58.
> > >>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> > >>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> > >>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
> > >>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> > >>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
> > >>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and
> debug_filestore = 20.
> > >>>>>>>>
> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
> > >>>>>>>>
> > >>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and
> debug_filestore = 20.
> > >>>>>>>>
> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
> > >>>>>>>>
> > >>>>>>>> As far as I understand OSD catchs up with the mon epoch and
> exceeds mon epoch somehow??
> > >>>>>>>>
> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg
> 66.f8 e60196@2018-09-28 23:57:08.251119
> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642
> build_initial_pg_history 66.f8 created 60196
> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map
> 60196 - loading and decoding 0x19da8400
> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
> _process 66.d8 to_process <> waiting <> waiting_peering {}
> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642
> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
> _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent:
> 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10
> e72642)> waiting <> waiting_peering {}
> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642
> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg
> 0xb579400
> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642
> pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934]
> local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f
> 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1
> crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]]
> do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt
> +create_info
> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642
> queue_want_up_thru want 72642 <= queued 72642, currently 72206
> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
> _process empty q, waiting
> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642
> add_map_bl 60196 50012 bytes
> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map
> 60197 - loading and decoding 0x19da8880
> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642
> add_map_bl 60197 50012 bytes
> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map
> 60198 - loading and decoding 0x19da9180
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> > >>>>>>>>
> > >>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and
> you should
> > >>>>>>>> only use it in very specific circumstances when directed by a
> developer.
> > >>>>>>>> In such cases it will allow a stuck PG to peer.  But you're not
> getting to
> > >>>>>>>> that point...you're seeing some sort of resource exhaustion.
> > >>>>>>>>
> > >>>>>>>> The noup trick works when OSDs are way behind on maps and all
> need to
> > >>>>>>>> catch up.  The way to tell if they are behind is by looking at
> the 'ceph
> > >>>>>>>> daemon osd.NNN status' output and comparing to the latest
> OSDMap epoch tha
> > >>>>>>>> t the mons have.  Were they really caught up when you unset
> noup?
> > >>>>>>>>
> > >>>>>>>> I'm just catching up and haven't read the whole thread but I
> haven't seen
> > >>>>>>>> anything that explains why teh OSDs are dong lots of disk IO.
> Catching up
> > >>>>>>>> on maps could explain it but not why they wouldn't peer once
> they were all
> > >>>>>>>> marked up...
> > >>>>>>>>
> > >>>>>>>> sage
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> > >>>>>>>>
> > >>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true
> ?
> > >>>>>>>> Is that be usefull here? There is such a less information about
> it.
> > >>>>>>>>
> > >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2
> Eki 2018 22:11):
> > >>>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions
> when I read somewhere that that will do the proper sizing.
> > >>>>>>>> For the blustore cache size I have plenty of RAM. I will
> increase 8GB for each and decide a more calculated number    after cluster
> settles.
> > >>>>>>>>
> > >>>>>>>> For the osd map loading I’ve also figured it out. And it is in
> loop. For that reason I started cluster with noup flag and waited OSDs to
> reach the uptodate epoch number. After that I unset noup. But I did not pay
> attention to manager logs. Let me check it, thank you!
> > >>>>>>>>
> > >>>>>>>> I am not forcing jmellac or anything else really. I have a very
> standard installation and no tweaks or tunings. All we ask for the
> stability versus speed from the begining. And here we are :/
> > >>>>>>>>
> > >>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis@gmail.com>
> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> I can see some issues from the osd log file. You have an
> extremely low
> > >>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal.
> I would
> > >>>>>>>> recommend cranking up rocksdb cache size as much as possible.
> If you
> > >>>>>>>> have RAM you can also increase bluestores cache size for hdd.
> Default
> > >>>>>>>> is 1GB be as liberal as you can without getting OOM kills. You
> also
> > >>>>>>>> have lots of osd map loading and decoding in the log. Are you
> sure all
> > >>>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
> > >>>>>>>> forcing jemalloc loading. I had a funny interaction after
> upgrading to
> > >>>>>>>> mimic.
> > >>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> > >>>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Hello Darius,
> > >>>>>>>>
> > >>>>>>>> Thanks for reply!
> > >>>>>>>>
> > >>>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f
> query” does stucks and wait forever since OSD is unresponsive.
> > >>>>>>>> We are certain that OSD gets unresponsive as soon as it UP. And
> we are certain that OSD responds again after its disk utilization stops.
> > >>>>>>>>
> > >>>>>>>> So we have a small test like that:
> > >>>>>>>> * Stop all OSDs (168 of them)
> > >>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It
> takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
> > >>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 &
> OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
> > >>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O
> and it takes 25 mins to settle.
> > >>>>>>>> * If you happen to start 5 of them at the same all of the OSDs
> start high I/O again. And it takes 1 hour to finish.
> > >>>>>>>>
> > >>>>>>>> So in the light of these findings we flagged noup, started all
> OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168
> OSD started to make high I/O. And we thought that if we wait long enough it
> will finish & OSDs will be responsive again. After 24hours they did not
> because I/O did not finish or even slowed down.
> > >>>>>>>> One can think that is a lot of data there to scan. But it is
> just 33TB.
> > >>>>>>>>
> > >>>>>>>> So at short we dont know which PG is stuck so we can remove it.
> > >>>>>>>>
> > >>>>>>>> However we met an weird thing half an hour ago. We exported the
> same PG from two different OSDs. One was 4.2GB and the other is 500KB! So
> we decided to export all OSDs for backup. Then we will delete strange sized
> ones and start the cluster all over. Maybe then we could solve the stucked
> or unfound PGs as you advise.
> > >>>>>>>>
> > >>>>>>>> Any thought would be greatly appreciated.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com>
> wrote:
> > >>>>>>>>
> > >>>>>>>> Hello,
> > >>>>>>>>
> > >>>>>>>> Currently you have 15 objects missing. I would recommend
> finding them
> > >>>>>>>> and making backups of them. Ditch all other osds that are
> failing to
> > >>>>>>>> start and concentrate on bringing online those that have missing
> > >>>>>>>> objects. Then slowly turn off nodown and noout on the cluster
> and see
> > >>>>>>>> if it stabilises. If it stabilises leave these setting if not
> turn
> > >>>>>>>> them back on.
> > >>>>>>>> Now get some of the pg's that are blocked and querry the pgs to
> check
> > >>>>>>>> why they are blocked. Try removing as much blocks as possible
> and then
> > >>>>>>>> remove the norebalance/norecovery flags and see if it starts to
> fix
> > >>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> > >>>>>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> One of ceph experts indicated that bluestore is somewhat
> preview tech
> > >>>>>>>> (as for Redhat).
> > >>>>>>>> So it could be best to checkout bluestore and rocksdb. There
> are some
> > >>>>>>>> tools to check health and also repair. But there are limited
> > >>>>>>>> documentation.
> > >>>>>>>> Anyone who has experince with it?
> > >>>>>>>> Anyone lead/help to a proper check would be great.
> > >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt,
> 22:55
> > >>>>>>>> tarihinde şunu yazdı:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Hi all,
> > >>>>>>>>
> > >>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6
> days since this cluster is offline. The long short story is here:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> > >>>>>>>>
> > >>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If
> this is not to correct way I apology and please let me know.
> > >>>>>>>>
> > >>>>>>>> For the 6 days lots of thing happened and there were some
> outcomes about the problem. Some of them was misjudged and some of them are
> not looked deeper.
> > >>>>>>>> However the most certain diagnosis is this: each OSD causes
> very high disk I/O to its bluestore disk (WAL and DB are fine). After that
> OSDs become unresponsive or very very less responsive. For example "ceph
> tell osd.x version” stucks like for ever.
> > >>>>>>>>
> > >>>>>>>> So due to unresponsive OSDs cluster does not settle. This is
> our problem!
> > >>>>>>>>
> > >>>>>>>> This is the one we are very sure of. But we are not sure of the
> reason.
> > >>>>>>>>
> > >>>>>>>> Here is the latest ceph status:
> > >>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> > >>>>>>>>
> > >>>>>>>> This is the status after we started all of the OSDs 24 hours
> ago.
> > >>>>>>>> Some of the OSDs are not started. However it didnt make any
> difference when all of them was online.
> > >>>>>>>>
> > >>>>>>>> Here is the debug=20 log of an OSD which is same for all others:
> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> > >>>>>>>> As we figure out there is a loop pattern. I am sure it wont
> caught from eye.
> > >>>>>>>>
> > >>>>>>>> This the full log the same OSD.
> > >>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> > >>>>>>>>
> > >>>>>>>> Here is the strace of the same OSD process:
> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> > >>>>>>>>
> > >>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as
> we do. I am sure we have done lots of mistakes to let this happening. And
> this situation may be a example for other user and could be a potential bug
> for ceph developer.
> > >>>>>>>>
> > >>>>>>>> Any help to figure out what is going on would be great.
> > >>>>>>>>
> > >>>>>>>> Best Regards,
> > >>>>>>>> Goktug Yildirim
> > >>>>>>>>
> > >>>>>>>> _______________________________________________
> > >>>>>>>> ceph-users mailing list
> > >>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>
> >
> >

[-- Attachment #1.2: Type: text/html, Size: 38988 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                                       ` <CACz15iKdK47N0mBJrRyDeEx7BH-uRKmvZ+OJ6Cc6u6-qxTeA6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-10-03 19:13                                                                         ` Sage Weil
       [not found]                                                                           ` <f9cf0773-cce5-4989-8f78-de3a01080450-2ueSQiBKiTY7tOexoI0I+QC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2018-10-03 19:13 UTC (permalink / raw)
  To: Goktug YILDIRIM
  Cc: ceph-devel, by morphin, Göktuğ Yıldırım,
	ceph-users


[-- Attachment #1.1: Type: text/plain, Size: 25183 bytes --]

I bet the kvstore output is in a hexdump format?  There is another option to get the raw data, IIRC.

On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>I changed the file name to make it clear.
>When I use your command with "+decode"  I'm getting an error like this:
>
>ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>error: buffer::malformed_input: void
>creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer
>understand
>old encoding version 2 < 111
>
>My ceph version: 13.2.2
>
>3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunu
>yazdı:
>
>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>> > If I didn't do it wrong, I got the output as below.
>> >
>> > ceph-kvstore-tool rocksdb
>/var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/
>> get osd_pg_creating creating > dump
>> > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column
>> families: [default]
>> >
>> > ceph-dencoder type creating_pgs_t import dump dump_json
>>
>> Sorry, should be
>>
>> ceph-dencoder type creating_pgs_t import dump decode dump_json
>>
>> s
>>
>> > {
>> >     "last_scan_epoch": 0,
>> >     "creating_pgs": [],
>> >     "queue": [],
>> >     "created_pools": []
>> > }
>> >
>> > You can find the "dump" link below.
>> >
>> > dump:
>>
>https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
>> >
>> >
>> > Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunları yazdı (3 Eki 2018 18:45):
>> >
>> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> > >> We are starting to work on it. First step is getting the
>structure
>> out and dumping the current value as you say.
>> > >>
>> > >> And you were correct we did not run force_create_pg.
>> > >
>> > > Great.
>> > >
>> > > So, eager to see what the current structure is... please attach
>once
>> you
>> > > have it.
>> > >
>> > > The new replacement one should look like this (when hexdump
>-C'd):
>> > >
>> > > 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00
>> |................|
>> > > 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00
>> |..B...........|
>> > > 0000001e
>> > >
>> > > ...except that from byte 6 you want to put in a recent OSDMap
>epoch,
>> in
>> > > hex, little endian (least significant byte first), in place of
>the
>> 0x10
>> > > that is there now.  It should dump like this:
>> > >
>> > > $ ceph-dencoder type creating_pgs_t import myfile decode
>dump_json
>> > > {
>> > >    "last_scan_epoch": 16,   <--- but with a recent epoch here
>> > >    "creating_pgs": [],
>> > >    "queue": [],
>> > >    "created_pools": [
>> > >        66
>> > >    ]
>> > > }
>> > >
>> > > sage
>> > >
>> > >
>> > >>
>> > >>> On 3 Oct 2018, at 17:52, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>> > >>>
>> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> > >>>> Sage,
>> > >>>>
>> > >>>> Pool 66 is the only pool it shows right now. This a pool
>created
>> months ago.
>> > >>>> ceph osd lspools
>> > >>>> 66 mypool
>> > >>>>
>> > >>>> As we recreated mon db from OSDs, the pools for MDS was
>unusable.
>> So we deleted them.
>> > >>>> After we create another cephfs fs and pools we started MDS and
>it
>> stucked on creation. So we stopped MDS and removed fs and fs pools.
>Right
>> now we do not have MDS running nor we have cephfs related things.
>> > >>>>
>> > >>>> ceph fs dump
>> > >>>> dumped fsmap epoch 1 e1
>> > >>>> enable_multiple, ever_enabled_multiple: 0,0
>> > >>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client
>> writeable ranges,3=default file layouts on dirs,4=dir inode in
>separate
>> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
>> anchor table,9=file layout v2,10=snaprealm v2}
>> > >>>> legacy client fscid: -1
>> > >>>>
>> > >>>> No filesystems configured
>> > >>>>
>> > >>>> ceph fs ls
>> > >>>> No filesystems enabled
>> > >>>>
>> > >>>> Now pool 66 seems to only pool we have and it has been created
>> months ago. Then I guess there is something hidden out there.
>> > >>>>
>> > >>>> Is there any way to find and delete it?
>> > >>>
>> > >>> Ok, I'm concerned that the creating pg is in there if this is
>an old
>> > >>> pool... did you perhaps run force_create_pg at some point? 
>Assuming
>> you
>> > >>> didn't, I think this is a bug in the process for rebuilding the
>mon
>> > >>> store.. one that doesn't normally come up because the impact is
>this
>> > >>> osdmap scan that is cheap in our test scenarios but clearly not
>> cheap for
>> > >>> your aged cluster.
>> > >>>
>> > >>> In any case, there is a way to clear those out of the mon, but
>it's
>> a bit
>> > >>> dicey.
>> > >>>
>> > >>> 1. stop all mons
>> > >>> 2. make a backup of all mons
>> > >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating
>> > >>> key=creating key on one of the mons
>> > >>> 4. dump the object with ceph-dencoder type creating_pgs_t
>import
>> FILE dump_json
>> > >>> 5. hex edit the structure to remove all of the creating pgs,
>and
>> adds pool
>> > >>> 66 to the created_pgs member.
>> > >>> 6. verify with ceph-dencoder dump that the edit was correct...
>> > >>> 7. inject the updated structure into all of the mons
>> > >>> 8. start all mons
>> > >>>
>> > >>> 4-6 will probably be an iterative process... let's start by
>getting
>> the
>> > >>> structure out and dumping the current value?
>> > >>>
>> > >>> The code to refer to to understand the structure is
>> src/mon/CreatingPGs.h
>> > >>> encode/decode methods.
>> > >>>
>> > >>> sage
>> > >>>
>> > >>>
>> > >>>>
>> > >>>>
>> > >>>>> On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>> > >>>>>
>> > >>>>> Oh... I think this is the problem:
>> > >>>>>
>> > >>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op
>> osd_pg_create(e72883
>> > >>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196
>> 66.124:60196
>> > >>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196
>66.271:60196
>> > >>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03
>16:20:01.915916
>> > >>>>>
>> > >>>>> You are in the midst of creating new pgs, and unfortunately
>pg
>> create is
>> > >>>>> one of the last remaining places where the OSDs need to look
>at a
>> full
>> > >>>>> history of map changes between then and the current map
>epoch.  In
>> this
>> > >>>>> case, the pool was created in 60196 and it is now 72883, ~12k
>> epochs
>> > >>>>> later.
>> > >>>>>
>> > >>>>> What is this new pool for?  Is it still empty, and if so, can
>we
>> delete
>> > >>>>> it? If yes, I'm ~70% sure that will then get cleaned out at
>the
>> mon end
>> > >>>>> and restarting the OSDs will make these pg_creates go away.
>> > >>>>>
>> > >>>>> s
>> > >>>>>
>> > >>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> > >>>>>>
>> > >>>>>> Hello,
>> > >>>>>>
>> > >>>>>> It seems nothing has changed.
>> > >>>>>>
>> > >>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <
>> https://paste.ubuntu.com/p/MtvTr5HYW4/>
>> > >>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <
>> https://paste.ubuntu.com/p/7Sx64xGzkR/>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius
><daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> wrote:
>> > >>>>>>>
>> > >>>>>>> Hello,
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> You can also reduce the osd map updates by adding this to
>your
>> ceph
>> > >>>>>>> config file. "osd crush update on start = false". This
>should
>> remove
>> > >>>>>>> and update that is generated when osd starts.
>> > >>>>>>>
>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0
>> (leader)
>> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush
>> > >>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit)
>log
>> [INF] :
>> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47'
>> cmd=[{"prefix":
>> > >>>>>>> "osd crush set-device-class", "class": "hdd", "ids":
>["47"]}]:
>> > >>>>>>> dispatch
>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0
>> (leader)
>> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush
>> create-or-move",
>> > >>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
>> > >>>>>>> "root=default"]} v 0) v1
>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit)
>log
>> [INF] :
>> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47'
>> cmd=[{"prefix":
>> > >>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396,
>"args":
>> > >>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
>> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush
>> item name
>> > >>>>>>> 'osd.47' initial_weight 3.6396 at location
>> > >>>>>>> {host=SRV-SEKUARK8,root=default}
>> > >>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
>> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full
>> prune
>> > >>>>>>> enabled
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
>> > >>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hi Sage,
>> > >>>>>>>>
>> > >>>>>>>> Thank you for your response. Now I am sure this incident
>is
>> going to be resolved.
>> > >>>>>>>>
>> > >>>>>>>> The problem started when 7 server crashed same time and
>they
>> came back after ~5 minutes.
>> > >>>>>>>>
>> > >>>>>>>> Two of our 3 mon services were restarted in this crash.
>Since
>> mon services are enabled they should be started nearly at the same
>time. I
>> dont know if this makes any difference but some of the guys on IRC
>told it
>> is required that they start in order not at the same time. Otherwise
>it
>> could break things badly.
>> > >>>>>>>>
>> > >>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But
>in the
>> end we have so many STUCK request and our cluster can not heal
>itself.
>> > >>>>>>>>
>> > >>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But
>when
>> we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
>> > >>>>>>>> I/O load on all of my OSD disks are at around %95
>utilization
>> and never ends. CPU and RAM usage are OK.
>> > >>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0
>query”.
>> > >>>>>>>>
>> > >>>>>>>> Also we tried to change RBD pool replication size 2 to 1.
>Our
>> goal was the eliminate older PG's and leaving cluster with good ones.
>> > >>>>>>>> With replication size=1 we saw "%13 PGS not active”. But
>it
>> didn’t solve our problem.
>> > >>>>>>>>
>> > >>>>>>>> Of course we have to save %100 of data. But we feel like
>even
>> saving %50 of our data will be make us very happy right now.
>> > >>>>>>>>
>> > >>>>>>>> This is what happens when the cluster starts. I believe it
>> explains the whole story very nicely.
>> > >>>>>>>>
>>
>https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>> > >>>>>>>>
>> > >>>>>>>> This is our ceph.conf:
>> > >>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>> > >>>>>>>>
>> > >>>>>>>> This is the output of "osd stat && osd epochs && ceph -s
>&&
>> ceph health”:
>> > >>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>> > >>>>>>>>
>> > >>>>>>>> This is pg dump:
>> > >>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
>> > >>>>>>>>
>> > >>>>>>>> This is iostat & perf top:
>> > >>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>> > >>>>>>>>
>> > >>>>>>>> This strace output of ceph-osd:
>> > >>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>> > >>>>>>>>
>> > >>>>>>>> This is OSD log (default debug):
>> > >>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>> > >>>>>>>>
>> > >>>>>>>> This is leader MON log (default debug):
>> > >>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>> > >>>>>>>>
>> > >>>>>>>> These are OSDs failed to start. Total number is 58.
>> > >>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>> > >>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>> > >>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>> > >>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>> > >>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
>> > >>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and
>> debug_filestore = 20.
>> > >>>>>>>>
>>
>https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>> > >>>>>>>>
>> > >>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1
>and
>> debug_filestore = 20.
>> > >>>>>>>>
>>
>https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>> > >>>>>>>>
>> > >>>>>>>> As far as I understand OSD catchs up with the mon epoch
>and
>> exceeds mon epoch somehow??
>> > >>>>>>>>
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg
>> 66.f8 e60196@2018-09-28 23:57:08.251119
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642
>> build_initial_pg_history 66.f8 created 60196
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642
>get_map
>> 60196 - loading and decoding 0x19da8400
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
>> _process 66.d8 to_process <> waiting <> waiting_peering {}
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642
>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)
>queued
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
>> _process 66.d8 to_process <OpQueueItem(66.d8
>PGPeeringEvent(epoch_sent:
>> 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10
>> e72642)> waiting <> waiting_peering {}
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642
>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)
>pg
>> 0xb579400
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch:
>72642
>> pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934]
>> local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206
>les/c/f
>> 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642
>pi=[72206,72642)/1
>> crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]]
>> do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt
>> +create_info
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642
>> queue_want_up_thru want 72642 <= queued 72642, currently 72206
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
>> _process empty q, waiting
>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642
>> add_map_bl 60196 50012 bytes
>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642
>get_map
>> 60197 - loading and decoding 0x19da8880
>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642
>> add_map_bl 60197 50012 bytes
>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642
>get_map
>> 60198 - loading and decoding 0x19da9180
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
>wrote:
>> > >>>>>>>>
>> > >>>>>>>> osd_find_best_info_ignore_history_les is a dangerous
>option and
>> you should
>> > >>>>>>>> only use it in very specific circumstances when directed
>by a
>> developer.
>> > >>>>>>>> In such cases it will allow a stuck PG to peer.  But
>you're not
>> getting to
>> > >>>>>>>> that point...you're seeing some sort of resource
>exhaustion.
>> > >>>>>>>>
>> > >>>>>>>> The noup trick works when OSDs are way behind on maps and
>all
>> need to
>> > >>>>>>>> catch up.  The way to tell if they are behind is by
>looking at
>> the 'ceph
>> > >>>>>>>> daemon osd.NNN status' output and comparing to the latest
>> OSDMap epoch tha
>> > >>>>>>>> t the mons have.  Were they really caught up when you
>unset
>> noup?
>> > >>>>>>>>
>> > >>>>>>>> I'm just catching up and haven't read the whole thread but
>I
>> haven't seen
>> > >>>>>>>> anything that explains why teh OSDs are dong lots of disk
>IO.
>> Catching up
>> > >>>>>>>> on maps could explain it but not why they wouldn't peer
>once
>> they were all
>> > >>>>>>>> marked up...
>> > >>>>>>>>
>> > >>>>>>>> sage
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>> > >>>>>>>>
>> > >>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les =
>true
>> ?
>> > >>>>>>>> Is that be usefull here? There is such a less information
>about
>> it.
>> > >>>>>>>>
>> > >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı
>(2
>> Eki 2018 22:11):
>> > >>>>>>>>
>> > >>>>>>>> Hi,
>> > >>>>>>>>
>> > >>>>>>>> Indeed I left ceph-disk to decide the wal and db
>partitions
>> when I read somewhere that that will do the proper sizing.
>> > >>>>>>>> For the blustore cache size I have plenty of RAM. I will
>> increase 8GB for each and decide a more calculated number    after
>cluster
>> settles.
>> > >>>>>>>>
>> > >>>>>>>> For the osd map loading I’ve also figured it out. And it
>is in
>> loop. For that reason I started cluster with noup flag and waited
>OSDs to
>> reach the uptodate epoch number. After that I unset noup. But I did
>not pay
>> attention to manager logs. Let me check it, thank you!
>> > >>>>>>>>
>> > >>>>>>>> I am not forcing jmellac or anything else really. I have a
>very
>> standard installation and no tweaks or tunings. All we ask for the
>> stability versus speed from the begining. And here we are :/
>> > >>>>>>>>
>> > >>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius
><daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hi,
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> I can see some issues from the osd log file. You have an
>> extremely low
>> > >>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for
>wal.
>> I would
>> > >>>>>>>> recommend cranking up rocksdb cache size as much as
>possible.
>> If you
>> > >>>>>>>> have RAM you can also increase bluestores cache size for
>hdd.
>> Default
>> > >>>>>>>> is 1GB be as liberal as you can without getting OOM kills.
>You
>> also
>> > >>>>>>>> have lots of osd map loading and decoding in the log. Are
>you
>> sure all
>> > >>>>>>>> monitors/managers/osds are up to date? Plus make sure you
>aren't
>> > >>>>>>>> forcing jemalloc loading. I had a funny interaction after
>> upgrading to
>> > >>>>>>>> mimic.
>> > >>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>> > >>>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Hello Darius,
>> > >>>>>>>>
>> > >>>>>>>> Thanks for reply!
>> > >>>>>>>>
>> > >>>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f
>> query” does stucks and wait forever since OSD is unresponsive.
>> > >>>>>>>> We are certain that OSD gets unresponsive as soon as it
>UP. And
>> we are certain that OSD responds again after its disk utilization
>stops.
>> > >>>>>>>>
>> > >>>>>>>> So we have a small test like that:
>> > >>>>>>>> * Stop all OSDs (168 of them)
>> > >>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts.
>It
>> takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>> > >>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts
>OSD1 &
>> OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>> > >>>>>>>> * Now start OSD3 and it is the same. All OSDs start high
>I/O
>> and it takes 25 mins to settle.
>> > >>>>>>>> * If you happen to start 5 of them at the same all of the
>OSDs
>> start high I/O again. And it takes 1 hour to finish.
>> > >>>>>>>>
>> > >>>>>>>> So in the light of these findings we flagged noup, started
>all
>> OSDs. At first there was no I/O. After 10 minutes we unset noup. All
>of 168
>> OSD started to make high I/O. And we thought that if we wait long
>enough it
>> will finish & OSDs will be responsive again. After 24hours they did
>not
>> because I/O did not finish or even slowed down.
>> > >>>>>>>> One can think that is a lot of data there to scan. But it
>is
>> just 33TB.
>> > >>>>>>>>
>> > >>>>>>>> So at short we dont know which PG is stuck so we can
>remove it.
>> > >>>>>>>>
>> > >>>>>>>> However we met an weird thing half an hour ago. We
>exported the
>> same PG from two different OSDs. One was 4.2GB and the other is
>500KB! So
>> we decided to export all OSDs for backup. Then we will delete strange
>sized
>> ones and start the cluster all over. Maybe then we could solve the
>stucked
>> or unfound PGs as you advise.
>> > >>>>>>>>
>> > >>>>>>>> Any thought would be greatly appreciated.
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius
><daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hello,
>> > >>>>>>>>
>> > >>>>>>>> Currently you have 15 objects missing. I would recommend
>> finding them
>> > >>>>>>>> and making backups of them. Ditch all other osds that are
>> failing to
>> > >>>>>>>> start and concentrate on bringing online those that have
>missing
>> > >>>>>>>> objects. Then slowly turn off nodown and noout on the
>cluster
>> and see
>> > >>>>>>>> if it stabilises. If it stabilises leave these setting if
>not
>> turn
>> > >>>>>>>> them back on.
>> > >>>>>>>> Now get some of the pg's that are blocked and querry the
>pgs to
>> check
>> > >>>>>>>> why they are blocked. Try removing as much blocks as
>possible
>> and then
>> > >>>>>>>> remove the norebalance/norecovery flags and see if it
>starts to
>> fix
>> > >>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>> > >>>>>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> One of ceph experts indicated that bluestore is somewhat
>> preview tech
>> > >>>>>>>> (as for Redhat).
>> > >>>>>>>> So it could be best to checkout bluestore and rocksdb.
>There
>> are some
>> > >>>>>>>> tools to check health and also repair. But there are
>limited
>> > >>>>>>>> documentation.
>> > >>>>>>>> Anyone who has experince with it?
>> > >>>>>>>> Anyone lead/help to a proper check would be great.
>> > >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018
>Pzt,
>> 22:55
>> > >>>>>>>> tarihinde şunu yazdı:
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Hi all,
>> > >>>>>>>>
>> > >>>>>>>> We have recently upgraded from luminous to mimic. It’s
>been 6
>> days since this cluster is offline. The long short story is here:
>>
>http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>> > >>>>>>>>
>> > >>>>>>>> I’ve also CC’ed developers since I believe this is a bug.
>If
>> this is not to correct way I apology and please let me know.
>> > >>>>>>>>
>> > >>>>>>>> For the 6 days lots of thing happened and there were some
>> outcomes about the problem. Some of them was misjudged and some of
>them are
>> not looked deeper.
>> > >>>>>>>> However the most certain diagnosis is this: each OSD
>causes
>> very high disk I/O to its bluestore disk (WAL and DB are fine). After
>that
>> OSDs become unresponsive or very very less responsive. For example
>"ceph
>> tell osd.x version” stucks like for ever.
>> > >>>>>>>>
>> > >>>>>>>> So due to unresponsive OSDs cluster does not settle. This
>is
>> our problem!
>> > >>>>>>>>
>> > >>>>>>>> This is the one we are very sure of. But we are not sure
>of the
>> reason.
>> > >>>>>>>>
>> > >>>>>>>> Here is the latest ceph status:
>> > >>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>> > >>>>>>>>
>> > >>>>>>>> This is the status after we started all of the OSDs 24
>hours
>> ago.
>> > >>>>>>>> Some of the OSDs are not started. However it didnt make
>any
>> difference when all of them was online.
>> > >>>>>>>>
>> > >>>>>>>> Here is the debug=20 log of an OSD which is same for all
>others:
>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>> > >>>>>>>> As we figure out there is a loop pattern. I am sure it
>wont
>> caught from eye.
>> > >>>>>>>>
>> > >>>>>>>> This the full log the same OSD.
>> > >>>>>>>>
>https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>> > >>>>>>>>
>> > >>>>>>>> Here is the strace of the same OSD process:
>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>> > >>>>>>>>
>> > >>>>>>>> Recently we hear more to uprade mimic. I hope none get
>hurts as
>> we do. I am sure we have done lots of mistakes to let this happening.
>And
>> this situation may be a example for other user and could be a
>potential bug
>> for ceph developer.
>> > >>>>>>>>
>> > >>>>>>>> Any help to figure out what is going on would be great.
>> > >>>>>>>>
>> > >>>>>>>> Best Regards,
>> > >>>>>>>> Goktug Yildirim
>> > >>>>>>>>
>> > >>>>>>>> _______________________________________________
>> > >>>>>>>> ceph-users mailing list
>> > >>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>> > >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >>
>> >
>> >

[-- Attachment #1.2: Type: text/html, Size: 39998 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                                           ` <f9cf0773-cce5-4989-8f78-de3a01080450-2ueSQiBKiTY7tOexoI0I+QC/G2K4zDHf@public.gmane.org>
@ 2018-10-03 19:34                                                                             ` Göktuğ Yıldırım
       [not found]                                                                               ` <5FBAA21A-8B48-4B22-9FE8-0E0766C363D9-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Göktuğ Yıldırım @ 2018-10-03 19:34 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-devel, by morphin, Göktuğ Yıldırım,
	ceph-users


[-- Attachment #1.1: Type: text/plain, Size: 25694 bytes --]

I'm so sorry about that, I missed the "out" parameter. My bad.
This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
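
For completeness, the invocation that produced that output was presumably the same get as before, just with the raw-output form, followed by the decode:

  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ \
      get osd_pg_creating creating out creating.raw
  ceph-dencoder type creating_pgs_t import creating.raw decode dump_json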


Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote (3 Oct 2018, 21:13):

> I bet the kvstore output it in a hexdump format?  There is another option to get the raw data iirc
> 
> 
> 
>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM <goktug.yildirim@gmail.com> wrote:
>> I changed the file name to make it clear.
>> When I use your command with "+decode"  I'm getting an error like this:
>> 
>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>> error: buffer::malformed_input: void creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 2 < 111
>> 
>> My ceph version: 13.2.2
>> 
>> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunu yazdı:
>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>>> > If I didn't do it wrong, I got the output as below.
>>> > 
>>> > ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating > dump
>>> > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: [default]
>>> > 
>>> > ceph-dencoder type creating_pgs_t import dump dump_json
>>> 
>>> Sorry, should be
>>> 
>>> ceph-dencoder type creating_pgs_t import dump decode dump_json
>>> 
>>> s
>>> 
>>> > {
>>> >     "last_scan_epoch": 0,
>>> >     "creating_pgs": [],
>>> >     "queue": [],
>>> >     "created_pools": []
>>> > }
>>> > 
>>> > You can find the "dump" link below.
>>> > 
>>> > dump: https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
>>> > 
>>> > 
>>> > Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunları yazdı (3 Eki 2018 18:45):
>>> > 
>>> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>> > >> We are starting to work on it. First step is getting the structure out and dumping the current value as you say.
>>> > >> 
>>> > >> And you were correct we did not run force_create_pg.
>>> > > 
>>> > > Great.
>>> > > 
>>> > > So, eager to see what the current structure is... please attach once you 
>>> > > have it.
>>> > > 
>>> > > The new replacement one should look like this (when hexdump -C'd):
>>> > > 
>>> > > 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
>>> > > 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
>>> > > 0000001e
>>> > > 
>>> > > ...except that from byte 6 you want to put in a recent OSDMap epoch, in 
>>> > > hex, little endian (least significant byte first), in place of the 0x10 
>>> > > that is there now.  It should dump like this:
>>> > > 
>>> > > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
>>> > > {
>>> > >    "last_scan_epoch": 16,   <--- but with a recent epoch here
>>> > >    "creating_pgs": [],
>>> > >    "queue": [],
>>> > >    "created_pools": [
>>> > >        66
>>> > >    ]
>>> > > }
>>> > > 
>>> > > sage
>>> > > 
>>> > > 
>>> > >> 
>>> > >>> On 3 Oct 2018, at 17:52, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>>> > >>> 
>>> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>> > >>>> Sage,
>>> > >>>> 
>>> > >>>> Pool 66 is the only pool it shows right now. This a pool created months ago.
>>> > >>>> ceph osd lspools
>>> > >>>> 66 mypool
>>> > >>>> 
>>> > >>>> As we recreated mon db from OSDs, the pools for MDS was unusable. So we deleted them.
>>> > >>>> After we create another cephfs fs and pools we started MDS and it stucked on creation. So we stopped MDS and removed fs and fs pools. Right now we do not have MDS running nor we have cephfs related things.
>>> > >>>> 
>>> > >>>> ceph fs dump
>>> > >>>> dumped fsmap epoch 1 e1
>>> > >>>> enable_multiple, ever_enabled_multiple: 0,0
>>> > >>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>>> > >>>> legacy client fscid: -1
>>> > >>>> 
>>> > >>>> No filesystems configured
>>> > >>>> 
>>> > >>>> ceph fs ls
>>> > >>>> No filesystems enabled
>>> > >>>> 
>>> > >>>> Now pool 66 seems to only pool we have and it has been created months ago. Then I guess there is something hidden out there.
>>> > >>>> 
>>> > >>>> Is there any way to find and delete it?
>>> > >>> 
>>> > >>> Ok, I'm concerned that the creating pg is in there if this is an old 
>>> > >>> pool... did you perhaps run force_create_pg at some point?  Assuming you 
>>> > >>> didn't, I think this is a bug in the process for rebuilding the mon 
>>> > >>> store.. one that doesn't normally come up because the impact is this 
>>> > >>> osdmap scan that is cheap in our test scenarios but clearly not cheap for 
>>> > >>> your aged cluster.
>>> > >>> 
>>> > >>> In any case, there is a way to clear those out of the mon, but it's a bit 
>>> > >>> dicey. 
>>> > >>> 
>>> > >>> 1. stop all mons
>>> > >>> 2. make a backup of all mons
>>> > >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
>>> > >>> key=creating key on one of the mons
>>> > >>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
>>> > >>> 5. hex edit the structure to remove all of the creating pgs, and adds pool 
>>> > >>> 66 to the created_pgs member.
>>> > >>> 6. verify with ceph-dencoder dump that the edit was correct...
>>> > >>> 7. inject the updated structure into all of the mons
>>> > >>> 8. start all mons
>>> > >>> 
>>> > >>> 4-6 will probably be an iterative process... let's start by getting the 
>>> > >>> structure out and dumping the current value?  
>>> > >>> 
>>> > >>> The code to refer to to understand the structure is src/mon/CreatingPGs.h 
>>> > >>> encode/decode methods.
>>> > >>> 
>>> > >>> sage
>>> > >>> 
>>> > >>> 
>>> > >>>> 
>>> > >>>> 
>>> > >>>>> On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>>> > >>>>> 
>>> > >>>>> Oh... I think this is the problem:
>>> > >>>>> 
>>> > >>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
>>> > >>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
>>> > >>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
>>> > >>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
>>> > >>>>> 
>>> > >>>>> You are in the midst of creating new pgs, and unfortunately pg create is 
>>> > >>>>> one of the last remaining places where the OSDs need to look at a full 
>>> > >>>>> history of map changes between then and the current map epoch.  In this 
>>> > >>>>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
>>> > >>>>> later.
>>> > >>>>> 
>>> > >>>>> What is this new pool for?  Is it still empty, and if so, can we delete 
>>> > >>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
>>> > >>>>> and restarting the OSDs will make these pg_creates go away.
>>> > >>>>> 
>>> > >>>>> s
>>> > >>>>> 
>>> > >>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>> > >>>>>> 
>>> > >>>>>> Hello,
>>> > >>>>>> 
>>> > >>>>>> It seems nothing has changed.
>>> > >>>>>> 
>>> > >>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
>>> > >>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
>>> > >>>>>> 
>>> > >>>>>> 
>>> > >>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis@gmail.com> wrote:
>>> > >>>>>>> 
>>> > >>>>>>> Hello,
>>> > >>>>>>> 
>>> > >>>>>>> 
>>> > >>>>>>> You can also reduce the osd map updates by adding this to your ceph
>>> > >>>>>>> config file. "osd crush update on start = false". This should remove
>>> > >>>>>>> and update that is generated when osd starts.
>>> > >>>>>>> 
>>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush
>>> > >>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
>>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
>>> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>> > >>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
>>> > >>>>>>> dispatch
>>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
>>> > >>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
>>> > >>>>>>> "root=default"]} v 0) v1
>>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
>>> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>> > >>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
>>> > >>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
>>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
>>> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
>>> > >>>>>>> 'osd.47' initial_weight 3.6396 at location
>>> > >>>>>>> {host=SRV-SEKUARK8,root=default}
>>> > >>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
>>> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
>>> > >>>>>>> enabled
>>> > >>>>>>> 
>>> > >>>>>>> 
>>> > >>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
>>> > >>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> > >>>>>>>> 
>>> > >>>>>>>> Hi Sage,
>>> > >>>>>>>> 
>>> > >>>>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
>>> > >>>>>>>> 
>>> > >>>>>>>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
>>> > >>>>>>>> 
>>> > >>>>>>>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
>>> > >>>>>>>> 
>>> > >>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
>>> > >>>>>>>> 
>>> > >>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
>>> > >>>>>>>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
>>> > >>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
>>> > >>>>>>>> 
>>> > >>>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
>>> > >>>>>>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
>>> > >>>>>>>> 
>>> > >>>>>>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
>>> > >>>>>>>> 
>>> > >>>>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
>>> > >>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>>> > >>>>>>>> 
>>> > >>>>>>>> This is our ceph.conf:
>>> > >>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>>> > >>>>>>>> 
>>> > >>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
>>> > >>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>>> > >>>>>>>> 
>>> > >>>>>>>> This is pg dump:
>>> > >>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
>>> > >>>>>>>> 
>>> > >>>>>>>> This is iostat & perf top:
>>> > >>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>>> > >>>>>>>> 
>>> > >>>>>>>> This strace output of ceph-osd:
>>> > >>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>>> > >>>>>>>> 
>>> > >>>>>>>> This is OSD log (default debug):
>>> > >>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>>> > >>>>>>>> 
>>> > >>>>>>>> This is leader MON log (default debug):
>>> > >>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>>> > >>>>>>>> 
>>> > >>>>>>>> These are OSDs failed to start. Total number is 58.
>>> > >>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>>> > >>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>>> > >>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>>> > >>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>>> > >>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
>>> > >>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
>>> > >>>>>>>> 
>>> > >>>>>>>> 
>>> > >>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>> > >>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>>> > >>>>>>>> 
>>> > >>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>> > >>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>>> > >>>>>>>> 
>>> > >>>>>>>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
>>> > >>>>>>>> 
>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
>>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
>>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
>>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
>>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
>>> > >>>>>>>> 
>>> > >>>>>>>> 
>>> > >>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>>> > >>>>>>>> 
>>> > >>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
>>> > >>>>>>>> only use it in very specific circumstances when directed by a developer.
>>> > >>>>>>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
>>> > >>>>>>>> that point...you're seeing some sort of resource exhaustion.
>>> > >>>>>>>> 
>>> > >>>>>>>> The noup trick works when OSDs are way behind on maps and all need to
>>> > >>>>>>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
>>> > >>>>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
>>> > >>>>>>>> t the mons have.  Were they really caught up when you unset noup?
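
As a concrete way to do that comparison (a sketch only; osd.0 is just an example id, and the admin socket command must be run on that OSD's host):

# oldest_map/newest_map known to this OSD, via its admin socket
ceph daemon osd.0 status

# current OSDMap epoch according to the mons
ceph osd stat
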
>>> > >>>>>>>> 
>>> > >>>>>>>> I'm just catching up and haven't read the whole thread but I haven't seen
>>> > >>>>>>>> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
>>> > >>>>>>>> on maps could explain it but not why they wouldn't peer once they were all
>>> > >>>>>>>> marked up...
>>> > >>>>>>>> 
>>> > >>>>>>>> sage
>>> > >>>>>>>> 
>>> > >>>>>>>> 
>>> > >>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>>> > >>>>>>>> 
>>> > >>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
>>> > >>>>>>>> Is that be usefull here? There is such a less information about it.
>>> > >>>>>>>> 
>>> > >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
>>> > >>>>>>>> 
>>> > >>>>>>>> Hi,
>>> > >>>>>>>> 
>>> > >>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
>>> > >>>>>>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
>>> > >>>>>>>> 
>>> > >>>>>>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
>>> > >>>>>>>> 
>>> > >>>>>>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
>>> > >>>>>>>> 
>>> > >>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis@gmail.com> wrote:
>>> > >>>>>>>> 
>>> > >>>>>>>> Hi,
>>> > >>>>>>>> 
>>> > >>>>>>>> 
>>> > >>>>>>>> I can see some issues from the osd log file. You have an extremely low
>>> > >>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
>>> > >>>>>>>> recommend cranking up rocksdb cache size as much as possible. If you
>>> > >>>>>>>> have RAM you can also increase bluestores cache size for hdd. Default
>>> > >>>>>>>> is 1GB be as liberal as you can without getting OOM kills. You also
>>> > >>>>>>>> have lots of osd map loading and decoding in the log. Are you sure all
>>> > >>>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
>>> > >>>>>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
>>> > >>>>>>>> mimic.
>>> > >>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>>> > >>>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> > >>>>>>>> 
>>> > >>>>>>>> 
>>> > >>>>>>>> Hello Darius,
>>> > >>>>>>>> 
>>> > >>>>>>>> Thanks for reply!
>>> > >>>>>>>> 
>>> > >>>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
>>> > >>>>>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>>> > >>>>>>>> 
>>> > >>>>>>>> So we have a small test like that:
>>> > >>>>>>>> * Stop all OSDs (168 of them)
>>> > >>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>>> > >>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>>> > >>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
>>> > >>>>>>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>>> > >>>>>>>> 
>>> > >>>>>>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
>>> > >>>>>>>> One can think that is a lot of data there to scan. But it is just 33TB.
>>> > >>>>>>>> 
>>> > >>>>>>>> So at short we dont know which PG is stuck so we can remove it.
>>> > >>>>>>>> 
>>> > >>>>>>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>>> > >>>>>>>> 
>>> > >>>>>>>> Any thought would be greatly appreciated.
>>> > >>>>>>>> 
>>> > >>>>>>>> 
>>> > >>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com> wrote:
>>> > >>>>>>>> 
>>> > >>>>>>>> Hello,
>>> > >>>>>>>> 
>>> > >>>>>>>> Currently you have 15 objects missing. I would recommend finding them
>>> > >>>>>>>> and making backups of them. Ditch all other osds that are failing to
>>> > >>>>>>>> start and concentrate on bringing online those that have missing
>>> > >>>>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>> > >>>>>>>> if it stabilises. If it stabilises leave these setting if not turn
>>> > >>>>>>>> them back on.
>>> > >>>>>>>> Now get some of the pg's that are blocked and querry the pgs to check
>>> > >>>>>>>> why they are blocked. Try removing as much blocks as possible and then
>>> > >>>>>>>> remove the norebalance/norecovery flags and see if it starts to fix
>>> > >>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>> > >>>>>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> > >>>>>>>> 
>>> > >>>>>>>> 
>>> > >>>>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
>>> > >>>>>>>> (as for Redhat).
>>> > >>>>>>>> So it could be best to checkout bluestore and rocksdb. There are some
>>> > >>>>>>>> tools to check health and also repair. But there are limited
>>> > >>>>>>>> documentation.
>>> > >>>>>>>> Anyone who has experince with it?
>>> > >>>>>>>> Anyone lead/help to a proper check would be great.
>>> > >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
>>> > >>>>>>>> tarihinde şunu yazdı:
>>> > >>>>>>>> 
>>> > >>>>>>>> 
>>> > >>>>>>>> Hi all,
>>> > >>>>>>>> 
>>> > >>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>> > >>>>>>>> 
>>> > >>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>>> > >>>>>>>> 
>>> > >>>>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>>> > >>>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>>> > >>>>>>>> 
>>> > >>>>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>> > >>>>>>>> 
>>> > >>>>>>>> This is the one we are very sure of. But we are not sure of the reason.
>>> > >>>>>>>> 
>>> > >>>>>>>> Here is the latest ceph status:
>>> > >>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>> > >>>>>>>> 
>>> > >>>>>>>> This is the status after we started all of the OSDs 24 hours ago.
>>> > >>>>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>>> > >>>>>>>> 
>>> > >>>>>>>> Here is the debug=20 log of an OSD which is same for all others:
>>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>> > >>>>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>> > >>>>>>>> 
>>> > >>>>>>>> This the full log the same OSD.
>>> > >>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>> > >>>>>>>> 
>>> > >>>>>>>> Here is the strace of the same OSD process:
>>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>> > >>>>>>>> 
>>> > >>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>>> > >>>>>>>> 
>>> > >>>>>>>> Any help to figure out what is going on would be great.
>>> > >>>>>>>> 
>>> > >>>>>>>> Best Regards,
>>> > >>>>>>>> Goktug Yildirim
>>> > >>>>>>>> 
>>> > >>>>>>>> _______________________________________________
>>> > >>>>>>>> ceph-users mailing list
>>> > >>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>> > >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> > >> 
>>> > 
>>> >

[-- Attachment #1.2: Type: text/html, Size: 39721 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                                               ` <5FBAA21A-8B48-4B22-9FE8-0E0766C363D9-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-10-03 19:37                                                                                 ` Sage Weil
       [not found]                                                                                   ` <alpine.DEB.2.11.1810031936110.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  2018-10-03 19:43                                                                                 ` Göktuğ Yıldırım
  1 sibling, 1 reply; 27+ messages in thread
From: Sage Weil @ 2018-10-03 19:37 UTC (permalink / raw)
  To: Göktuğ Yıldırım
  Cc: ceph-devel, by morphin, Göktuğ Yıldırım,
	ceph-users

[-- Attachment #1: Type: TEXT/PLAIN, Size: 26682 bytes --]

On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> I'm so sorry about that I missed "out" parameter. My bad..
> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/

Excellent, thanks.  That looks like it confirms the problem is that the 
recovery tool didn't repopulate the creating pgs properly.

If you take that 30 byte file I sent earlier (as hex) and update the 
osdmap epoch to the latest on the mon, confirm it decodes and dumps 
properly, and then inject it on the 3 mons, that should get you past this 
hump (and hopefully back up!).

sage
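
For reference, one possible shape of that patch-verify-inject sequence, strictly as a sketch: the 30 bytes are the hexdump from the earlier mail, epoch 72883 is only an example taken from the osd_pg_create log line (substitute the mons' current OSDMap epoch), the store path is the one used earlier in the thread, and the "set ... in" form of ceph-kvstore-tool is assumed as the counterpart of "get ... out".

# rebuild the 30-byte replacement structure from the hexdump
echo '02 01 18 00 00 00 10 00 00 00 00 00 00 00 01 00 00 00 42 00 00 00 00 00 00 00 00 00 00 00' \
  | xxd -r -p > creating.new

# overwrite bytes 6-9 with a recent OSDMap epoch, 32-bit little endian
# (example: epoch 72883 = 0x00011cb3 -> b3 1c 01 00)
printf '\xb3\x1c\x01\x00' | dd of=creating.new bs=1 seek=6 count=4 conv=notrunc

# confirm it decodes with the new epoch and created pool 66
ceph-dencoder type creating_pgs_t import creating.new decode dump_json

# with all mons stopped and backed up, inject it into each mon's store, then start the mons
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db \
  set osd_pg_creating creating in creating.new
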


> 
> Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunları yazdı (3 Eki 2018 21:13):
> 
> > I bet the kvstore output it in a hexdump format?  There is another option to get the raw data iirc
> > 
> > 
> > 
> >> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> I changed the file name to make it clear.
> >> When I use your command with "+decode"  I'm getting an error like this:
> >> 
> >> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
> >> error: buffer::malformed_input: void creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 2 < 111
> >> 
> >> My ceph version: 13.2.2
> >> 
> >> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunu yazdı:
> >>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> >>> > If I didn't do it wrong, I got the output as below.
> >>> > 
> >>> > ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating > dump
> >>> > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: [default]
> >>> > 
> >>> > ceph-dencoder type creating_pgs_t import dump dump_json
> >>> 
> >>> Sorry, should be
> >>> 
> >>> ceph-dencoder type creating_pgs_t import dump decode dump_json
> >>> 
> >>> s
> >>> 
> >>> > {
> >>> >     "last_scan_epoch": 0,
> >>> >     "creating_pgs": [],
> >>> >     "queue": [],
> >>> >     "created_pools": []
> >>> > }
> >>> > 
> >>> > You can find the "dump" link below.
> >>> > 
> >>> > dump: https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
> >>> > 
> >>> > 
> >>> > Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunları yazdı (3 Eki 2018 18:45):
> >>> > 
> >>> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>> > >> We are starting to work on it. First step is getting the structure out and dumping the current value as you say.
> >>> > >> 
> >>> > >> And you were correct we did not run force_create_pg.
> >>> > > 
> >>> > > Great.
> >>> > > 
> >>> > > So, eager to see what the current structure is... please attach once you 
> >>> > > have it.
> >>> > > 
> >>> > > The new replacement one should look like this (when hexdump -C'd):
> >>> > > 
> >>> > > 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
> >>> > > 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
> >>> > > 0000001e
> >>> > > 
> >>> > > ...except that from byte 6 you want to put in a recent OSDMap epoch, in 
> >>> > > hex, little endian (least significant byte first), in place of the 0x10 
> >>> > > that is there now.  It should dump like this:
> >>> > > 
> >>> > > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> >>> > > {
> >>> > >    "last_scan_epoch": 16,   <--- but with a recent epoch here
> >>> > >    "creating_pgs": [],
> >>> > >    "queue": [],
> >>> > >    "created_pools": [
> >>> > >        66
> >>> > >    ]
> >>> > > }
> >>> > > 
> >>> > > sage
> >>> > > 
> >>> > > 
> >>> > >> 
> >>> > >>> On 3 Oct 2018, at 17:52, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>> > >>> 
> >>> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>> > >>>> Sage,
> >>> > >>>> 
> >>> > >>>> Pool 66 is the only pool it shows right now. This a pool created months ago.
> >>> > >>>> ceph osd lspools
> >>> > >>>> 66 mypool
> >>> > >>>> 
> >>> > >>>> As we recreated mon db from OSDs, the pools for MDS was unusable. So we deleted them.
> >>> > >>>> After we create another cephfs fs and pools we started MDS and it stucked on creation. So we stopped MDS and removed fs and fs pools. Right now we do not have MDS running nor we have cephfs related things.
> >>> > >>>> 
> >>> > >>>> ceph fs dump
> >>> > >>>> dumped fsmap epoch 1 e1
> >>> > >>>> enable_multiple, ever_enabled_multiple: 0,0
> >>> > >>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >>> > >>>> legacy client fscid: -1
> >>> > >>>> 
> >>> > >>>> No filesystems configured
> >>> > >>>> 
> >>> > >>>> ceph fs ls
> >>> > >>>> No filesystems enabled
> >>> > >>>> 
> >>> > >>>> Now pool 66 seems to only pool we have and it has been created months ago. Then I guess there is something hidden out there.
> >>> > >>>> 
> >>> > >>>> Is there any way to find and delete it?
> >>> > >>> 
> >>> > >>> Ok, I'm concerned that the creating pg is in there if this is an old 
> >>> > >>> pool... did you perhaps run force_create_pg at some point?  Assuming you 
> >>> > >>> didn't, I think this is a bug in the process for rebuilding the mon 
> >>> > >>> store.. one that doesn't normally come up because the impact is this 
> >>> > >>> osdmap scan that is cheap in our test scenarios but clearly not cheap for 
> >>> > >>> your aged cluster.
> >>> > >>> 
> >>> > >>> In any case, there is a way to clear those out of the mon, but it's a bit 
> >>> > >>> dicey. 
> >>> > >>> 
> >>> > >>> 1. stop all mons
> >>> > >>> 2. make a backup of all mons
> >>> > >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
> >>> > >>> key=creating key on one of the mons
> >>> > >>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
> >>> > >>> 5. hex edit the structure to remove all of the creating pgs, and adds pool 
> >>> > >>> 66 to the created_pgs member.
> >>> > >>> 6. verify with ceph-dencoder dump that the edit was correct...
> >>> > >>> 7. inject the updated structure into all of the mons
> >>> > >>> 8. start all mons
> >>> > >>> 
> >>> > >>> 4-6 will probably be an iterative process... let's start by getting the 
> >>> > >>> structure out and dumping the current value?  
> >>> > >>> 
> >>> > >>> The code to refer to to understand the structure is src/mon/CreatingPGs.h 
> >>> > >>> encode/decode methods.
> >>> > >>> 
> >>> > >>> sage
> >>> > >>> 
> >>> > >>> 
> >>> > >>>> 
> >>> > >>>> 
> >>> > >>>>> On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>> > >>>>> 
> >>> > >>>>> Oh... I think this is the problem:
> >>> > >>>>> 
> >>> > >>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
> >>> > >>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
> >>> > >>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> >>> > >>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> >>> > >>>>> 
> >>> > >>>>> You are in the midst of creating new pgs, and unfortunately pg create is 
> >>> > >>>>> one of the last remaining places where the OSDs need to look at a full 
> >>> > >>>>> history of map changes between then and the current map epoch.  In this 
> >>> > >>>>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
> >>> > >>>>> later.
> >>> > >>>>> 
> >>> > >>>>> What is this new pool for?  Is it still empty, and if so, can we delete 
> >>> > >>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
> >>> > >>>>> and restarting the OSDs will make these pg_creates go away.
> >>> > >>>>> 
> >>> > >>>>> s
> >>> > >>>>> 
> >>> > >>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>> > >>>>>> 
> >>> > >>>>>> Hello,
> >>> > >>>>>> 
> >>> > >>>>>> It seems nothing has changed.
> >>> > >>>>>> 
> >>> > >>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
> >>> > >>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
> >>> > >>>>>> 
> >>> > >>>>>> 
> >>> > >>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>> > >>>>>>> 
> >>> > >>>>>>> Hello,
> >>> > >>>>>>> 
> >>> > >>>>>>> 
> >>> > >>>>>>> You can also reduce the osd map updates by adding this to your ceph
> >>> > >>>>>>> config file. "osd crush update on start = false". This should remove
> >>> > >>>>>>> and update that is generated when osd starts.
> >>> > >>>>>>> 
> >>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush
> >>> > >>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> >>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>> > >>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> >>> > >>>>>>> dispatch
> >>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> >>> > >>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> >>> > >>>>>>> "root=default"]} v 0) v1
> >>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>> > >>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> >>> > >>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> >>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
> >>> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> >>> > >>>>>>> 'osd.47' initial_weight 3.6396 at location
> >>> > >>>>>>> {host=SRV-SEKUARK8,root=default}
> >>> > >>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
> >>> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> >>> > >>>>>>> enabled
> >>> > >>>>>>> 
> >>> > >>>>>>> 
> >>> > >>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> >>> > >>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Hi Sage,
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
> >>> > >>>>>>>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
> >>> > >>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
> >>> > >>>>>>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
> >>> > >>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This is our ceph.conf:
> >>> > >>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
> >>> > >>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This is pg dump:
> >>> > >>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This is iostat & perf top:
> >>> > >>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This strace output of ceph-osd:
> >>> > >>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This is OSD log (default debug):
> >>> > >>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This is leader MON log (default debug):
> >>> > >>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
> >>> > >>>>>>>> 
> >>> > >>>>>>>> These are OSDs failed to start. Total number is 58.
> >>> > >>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >>> > >>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> >>> > >>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >>> > >>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> >>> > >>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
> >>> > >>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
> >>> > >>>>>>>> 
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>> > >>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>> > >>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
> >>> > >>>>>>>> 
> >>> > >>>>>>>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
> >>> > >>>>>>>> 
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
> >>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
> >>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
> >>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
> >>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
> >>> > >>>>>>>> 
> >>> > >>>>>>>> 
> >>> > >>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>> > >>>>>>>> 
> >>> > >>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
> >>> > >>>>>>>> only use it in very specific circumstances when directed by a developer.
> >>> > >>>>>>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
> >>> > >>>>>>>> that point...you're seeing some sort of resource exhaustion.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> The noup trick works when OSDs are way behind on maps and all need to
> >>> > >>>>>>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
> >>> > >>>>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
> >>> > >>>>>>>> t the mons have.  Were they really caught up when you unset noup?
> >>> > >>>>>>>> 
> >>> > >>>>>>>> I'm just catching up and haven't read the whole thread but I haven't seen
> >>> > >>>>>>>> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
> >>> > >>>>>>>> on maps could explain it but not why they wouldn't peer once they were all
> >>> > >>>>>>>> marked up...
> >>> > >>>>>>>> 
> >>> > >>>>>>>> sage
> >>> > >>>>>>>> 
> >>> > >>>>>>>> 
> >>> > >>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
> >>> > >>>>>>>> Is that be usefull here? There is such a less information about it.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Hi,
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
> >>> > >>>>>>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
> >>> > >>>>>>>> 
> >>> > >>>>>>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
> >>> > >>>>>>>> 
> >>> > >>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Hi,
> >>> > >>>>>>>> 
> >>> > >>>>>>>> 
> >>> > >>>>>>>> I can see some issues from the osd log file. You have an extremely low
> >>> > >>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> >>> > >>>>>>>> recommend cranking up rocksdb cache size as much as possible. If you
> >>> > >>>>>>>> have RAM you can also increase bluestores cache size for hdd. Default
> >>> > >>>>>>>> is 1GB be as liberal as you can without getting OOM kills. You also
> >>> > >>>>>>>> have lots of osd map loading and decoding in the log. Are you sure all
> >>> > >>>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
> >>> > >>>>>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
> >>> > >>>>>>>> mimic.
> >>> > >>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> >>> > >>>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>> > >>>>>>>> 
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Hello Darius,
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Thanks for reply!
> >>> > >>>>>>>> 
> >>> > >>>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
> >>> > >>>>>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> So we have a small test like that:
> >>> > >>>>>>>> * Stop all OSDs (168 of them)
> >>> > >>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
> >>> > >>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
> >>> > >>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
> >>> > >>>>>>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
> >>> > >>>>>>>> One can think that is a lot of data there to scan. But it is just 33TB.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> So at short we dont know which PG is stuck so we can remove it.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Any thought would be greatly appreciated.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> 
> >>> > >>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Hello,
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Currently you have 15 objects missing. I would recommend finding them
> >>> > >>>>>>>> and making backups of them. Ditch all other osds that are failing to
> >>> > >>>>>>>> start and concentrate on bringing online those that have missing
> >>> > >>>>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
> >>> > >>>>>>>> if it stabilises. If it stabilises leave these setting if not turn
> >>> > >>>>>>>> them back on.
> >>> > >>>>>>>> Now get some of the pg's that are blocked and querry the pgs to check
> >>> > >>>>>>>> why they are blocked. Try removing as much blocks as possible and then
> >>> > >>>>>>>> remove the norebalance/norecovery flags and see if it starts to fix
> >>> > >>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >>> > >>>>>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>> > >>>>>>>> 
> >>> > >>>>>>>> 
> >>> > >>>>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
> >>> > >>>>>>>> (as for Redhat).
> >>> > >>>>>>>> So it could be best to checkout bluestore and rocksdb. There are some
> >>> > >>>>>>>> tools to check health and also repair. But there are limited
> >>> > >>>>>>>> documentation.
> >>> > >>>>>>>> Anyone who has experince with it?
> >>> > >>>>>>>> Anyone lead/help to a proper check would be great.
> >>> > >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
> >>> > >>>>>>>> tarihinde şunu yazdı:
> >>> > >>>>>>>> 
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Hi all,
> >>> > >>>>>>>> 
> >>> > >>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>> > >>>>>>>> 
> >>> > >>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> >>> > >>>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This is the one we are very sure of. But we are not sure of the reason.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Here is the latest ceph status:
> >>> > >>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This is the status after we started all of the OSDs 24 hours ago.
> >>> > >>>>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Here is the debug=20 log of an OSD which is same for all others:
> >>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>> > >>>>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> This the full log the same OSD.
> >>> > >>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Here is the strace of the same OSD process:
> >>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Any help to figure out what is going on would be great.
> >>> > >>>>>>>> 
> >>> > >>>>>>>> Best Regards,
> >>> > >>>>>>>> Goktug Yildirim
> >>> > >>>>>>>> 
> >>> > >>>>>>>> _______________________________________________
> >>> > >>>>>>>> ceph-users mailing list
> >>> > >>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>> > >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> > >> 
> >>> > 
> >>> >
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                                               ` <5FBAA21A-8B48-4B22-9FE8-0E0766C363D9-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2018-10-03 19:37                                                                                 ` Sage Weil
@ 2018-10-03 19:43                                                                                 ` Göktuğ Yıldırım
  1 sibling, 0 replies; 27+ messages in thread
From: Göktuğ Yıldırım @ 2018-10-03 19:43 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-devel, by morphin, Göktuğ Yıldırım,
	ceph-users


[-- Attachment #1.1: Type: text/plain, Size: 26479 bytes --]

Also, you were asking for the RAW output.
I've been trying to fix this for days without sleep, so forgive the dumb mistakes.

RAW dump output: https://drive.google.com/file/d/1SzFNNjSK9Q_j4iyYJTRqOYuLWJcsFX9C/view?usp=sharing
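
For anyone hitting the same decode error in this thread: redirecting ceph-kvstore-tool's stdout captures its hexdump rendering of the value (as Sage suspected), while the "out" argument writes the raw bytes that ceph-dencoder can import. A sketch, using the same store path as earlier:

# write the raw value rather than a hexdump of it
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db \
  get osd_pg_creating creating out creating.raw
ceph-dencoder type creating_pgs_t import creating.raw decode dump_json
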

Göktuğ Yıldırım <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (3 Eki 2018 21:34):

> I'm so sorry about that I missed "out" parameter. My bad..
> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
> 
> 
> Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunları yazdı (3 Eki 2018 21:13):
> 
>> I bet the kvstore output it in a hexdump format?  There is another option to get the raw data iirc
>> 
>> 
>> 
>>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM <goktug.yildirim@gmail.com> wrote:
>>> I changed the file name to make it clear.
>>> When I use your command with "+decode"  I'm getting an error like this:
>>> 
>>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>>> error: buffer::malformed_input: void creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 2 < 111
>>> 
>>> My ceph version: 13.2.2
>>> 
>>> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunu yazdı:
>>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>>>> > If I didn't do it wrong, I got the output as below.
>>>> > 
>>>> > ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating > dump
>>>> > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: [default]
>>>> > 
>>>> > ceph-dencoder type creating_pgs_t import dump dump_json
>>>> 
>>>> Sorry, should be
>>>> 
>>>> ceph-dencoder type creating_pgs_t import dump decode dump_json
>>>> 
>>>> s
>>>> 
>>>> > {
>>>> >     "last_scan_epoch": 0,
>>>> >     "creating_pgs": [],
>>>> >     "queue": [],
>>>> >     "created_pools": []
>>>> > }
>>>> > 
>>>> > You can find the "dump" link below.
>>>> > 
>>>> > dump: https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
>>>> > 
>>>> > 
>>>> > Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunları yazdı (3 Eki 2018 18:45):
>>>> > 
>>>> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>>> > >> We are starting to work on it. First step is getting the structure out and dumping the current value as you say.
>>>> > >> 
>>>> > >> And you were correct we did not run force_create_pg.
>>>> > > 
>>>> > > Great.
>>>> > > 
>>>> > > So, eager to see what the current structure is... please attach once you 
>>>> > > have it.
>>>> > > 
>>>> > > The new replacement one should look like this (when hexdump -C'd):
>>>> > > 
>>>> > > 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
>>>> > > 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
>>>> > > 0000001e
>>>> > > 
>>>> > > ...except that from byte 6 you want to put in a recent OSDMap epoch, in 
>>>> > > hex, little endian (least significant byte first), in place of the 0x10 
>>>> > > that is there now.  It should dump like this:
>>>> > > 
>>>> > > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
>>>> > > {
>>>> > >    "last_scan_epoch": 16,   <--- but with a recent epoch here
>>>> > >    "creating_pgs": [],
>>>> > >    "queue": [],
>>>> > >    "created_pools": [
>>>> > >        66
>>>> > >    ]
>>>> > > }
>>>> > > 
>>>> > > sage
>>>> > > 
>>>> > > 
>>>> > >> 
>>>> > >>> On 3 Oct 2018, at 17:52, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>>>> > >>> 
>>>> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>>> > >>>> Sage,
>>>> > >>>> 
>>>> > >>>> Pool 66 is the only pool it shows right now. This a pool created months ago.
>>>> > >>>> ceph osd lspools
>>>> > >>>> 66 mypool
>>>> > >>>> 
>>>> > >>>> As we recreated mon db from OSDs, the pools for MDS was unusable. So we deleted them.
>>>> > >>>> After we create another cephfs fs and pools we started MDS and it stucked on creation. So we stopped MDS and removed fs and fs pools. Right now we do not have MDS running nor we have cephfs related things.
>>>> > >>>> 
>>>> > >>>> ceph fs dump
>>>> > >>>> dumped fsmap epoch 1 e1
>>>> > >>>> enable_multiple, ever_enabled_multiple: 0,0
>>>> > >>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>>>> > >>>> legacy client fscid: -1
>>>> > >>>> 
>>>> > >>>> No filesystems configured
>>>> > >>>> 
>>>> > >>>> ceph fs ls
>>>> > >>>> No filesystems enabled
>>>> > >>>> 
>>>> > >>>> Now pool 66 seems to only pool we have and it has been created months ago. Then I guess there is something hidden out there.
>>>> > >>>> 
>>>> > >>>> Is there any way to find and delete it?
>>>> > >>> 
>>>> > >>> Ok, I'm concerned that the creating pg is in there if this is an old 
>>>> > >>> pool... did you perhaps run force_create_pg at some point?  Assuming you 
>>>> > >>> didn't, I think this is a bug in the process for rebuilding the mon 
>>>> > >>> store.. one that doesn't normally come up because the impact is this 
>>>> > >>> osdmap scan that is cheap in our test scenarios but clearly not cheap for 
>>>> > >>> your aged cluster.
>>>> > >>> 
>>>> > >>> In any case, there is a way to clear those out of the mon, but it's a bit 
>>>> > >>> dicey. 
>>>> > >>> 
>>>> > >>> 1. stop all mons
>>>> > >>> 2. make a backup of all mons
>>>> > >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
>>>> > >>> key=creating key on one of the mons
>>>> > >>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
>>>> > >>> 5. hex edit the structure to remove all of the creating pgs, and adds pool 
>>>> > >>> 66 to the created_pgs member.
>>>> > >>> 6. verify with ceph-dencoder dump that the edit was correct...
>>>> > >>> 7. inject the updated structure into all of the mons
>>>> > >>> 8. start all mons
>>>> > >>> 
>>>> > >>> 4-6 will probably be an iterative process... let's start by getting the 
>>>> > >>> structure out and dumping the current value?  
>>>> > >>> 
>>>> > >>> The code to refer to to understand the structure is src/mon/CreatingPGs.h 
>>>> > >>> encode/decode methods.
>>>> > >>> 
>>>> > >>> sage
>>>> > >>> 
>>>> > >>> 
>>>> > >>>> 
>>>> > >>>> 
>>>> > >>>>> On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>>>> > >>>>> 
>>>> > >>>>> Oh... I think this is the problem:
>>>> > >>>>> 
>>>> > >>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
>>>> > >>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
>>>> > >>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
>>>> > >>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
>>>> > >>>>> 
>>>> > >>>>> You are in the midst of creating new pgs, and unfortunately pg create is 
>>>> > >>>>> one of the last remaining places where the OSDs need to look at a full 
>>>> > >>>>> history of map changes between then and the current map epoch.  In this 
>>>> > >>>>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
>>>> > >>>>> later.
>>>> > >>>>> 
>>>> > >>>>> What is this new pool for?  Is it still empty, and if so, can we delete 
>>>> > >>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
>>>> > >>>>> and restarting the OSDs will make these pg_creates go away.
>>>> > >>>>> 
>>>> > >>>>> s
>>>> > >>>>> 
>>>> > >>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>>> > >>>>>> 
>>>> > >>>>>> Hello,
>>>> > >>>>>> 
>>>> > >>>>>> It seems nothing has changed.
>>>> > >>>>>> 
>>>> > >>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
>>>> > >>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
>>>> > >>>>>> 
>>>> > >>>>>> 
>>>> > >>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis@gmail.com> wrote:
>>>> > >>>>>>> 
>>>> > >>>>>>> Hello,
>>>> > >>>>>>> 
>>>> > >>>>>>> 
>>>> > >>>>>>> You can also reduce the osd map updates by adding this to your ceph
>>>> > >>>>>>> config file. "osd crush update on start = false". This should remove
>>>> > >>>>>>> and update that is generated when osd starts.
>>>> > >>>>>>> 
>>>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>>> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush
>>>> > >>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
>>>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
>>>> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>>> > >>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
>>>> > >>>>>>> dispatch
>>>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>>> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
>>>> > >>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
>>>> > >>>>>>> "root=default"]} v 0) v1
>>>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
>>>> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>>> > >>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
>>>> > >>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
>>>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
>>>> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
>>>> > >>>>>>> 'osd.47' initial_weight 3.6396 at location
>>>> > >>>>>>> {host=SRV-SEKUARK8,root=default}
>>>> > >>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
>>>> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
>>>> > >>>>>>> enabled
>>>> > >>>>>>> 
>>>> > >>>>>>> 
>>>> > >>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
>>>> > >>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Hi Sage,
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
>>>> > >>>>>>>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
>>>> > >>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
>>>> > >>>>>>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
>>>> > >>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This is our ceph.conf:
>>>> > >>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
>>>> > >>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This is pg dump:
>>>> > >>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This is iostat & perf top:
>>>> > >>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This strace output of ceph-osd:
>>>> > >>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This is OSD log (default debug):
>>>> > >>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This is leader MON log (default debug):
>>>> > >>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>>>> > >>>>>>>> 
>>>> > >>>>>>>> These are OSDs failed to start. Total number is 58.
>>>> > >>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>>>> > >>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>>>> > >>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>>>> > >>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>>>> > >>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
>>>> > >>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
>>>> > >>>>>>>> 
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>>> > >>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>>> > >>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>>>> > >>>>>>>> 
>>>> > >>>>>>>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
>>>> > >>>>>>>> 
>>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
>>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
>>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
>>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
>>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
>>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
>>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
>>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
>>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
>>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
>>>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
>>>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
>>>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
>>>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
>>>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
>>>> > >>>>>>>> 
>>>> > >>>>>>>> 
>>>> > >>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
>>>> > >>>>>>>> 
>>>> > >>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
>>>> > >>>>>>>> only use it in very specific circumstances when directed by a developer.
>>>> > >>>>>>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
>>>> > >>>>>>>> that point...you're seeing some sort of resource exhaustion.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> The noup trick works when OSDs are way behind on maps and all need to
>>>> > >>>>>>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
>>>> > >>>>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
>>>> > >>>>>>>> t the mons have.  Were they really caught up when you unset noup?
>>>> > >>>>>>>> 
>>>> > >>>>>>>> I'm just catching up and haven't read the whole thread but I haven't seen
>>>> > >>>>>>>> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
>>>> > >>>>>>>> on maps could explain it but not why they wouldn't peer once they were all
>>>> > >>>>>>>> marked up...
>>>> > >>>>>>>> 
>>>> > >>>>>>>> sage
>>>> > >>>>>>>> 
>>>> > >>>>>>>> 
>>>> > >>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
>>>> > >>>>>>>> Is that be usefull here? There is such a less information about it.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Hi,
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
>>>> > >>>>>>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
>>>> > >>>>>>>> 
>>>> > >>>>>>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
>>>> > >>>>>>>> 
>>>> > >>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis@gmail.com> wrote:
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Hi,
>>>> > >>>>>>>> 
>>>> > >>>>>>>> 
>>>> > >>>>>>>> I can see some issues from the osd log file. You have an extremely low
>>>> > >>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
>>>> > >>>>>>>> recommend cranking up rocksdb cache size as much as possible. If you
>>>> > >>>>>>>> have RAM you can also increase bluestores cache size for hdd. Default
>>>> > >>>>>>>> is 1GB be as liberal as you can without getting OOM kills. You also
>>>> > >>>>>>>> have lots of osd map loading and decoding in the log. Are you sure all
>>>> > >>>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
>>>> > >>>>>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
>>>> > >>>>>>>> mimic.
>>>> > >>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>>>> > >>>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> > >>>>>>>> 
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Hello Darius,
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Thanks for reply!
>>>> > >>>>>>>> 
>>>> > >>>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
>>>> > >>>>>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> So we have a small test like that:
>>>> > >>>>>>>> * Stop all OSDs (168 of them)
>>>> > >>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>>>> > >>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>>>> > >>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
>>>> > >>>>>>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
>>>> > >>>>>>>> One can think that is a lot of data there to scan. But it is just 33TB.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> So at short we dont know which PG is stuck so we can remove it.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Any thought would be greatly appreciated.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> 
>>>> > >>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com> wrote:
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Hello,
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Currently you have 15 objects missing. I would recommend finding them
>>>> > >>>>>>>> and making backups of them. Ditch all other osds that are failing to
>>>> > >>>>>>>> start and concentrate on bringing online those that have missing
>>>> > >>>>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>>> > >>>>>>>> if it stabilises. If it stabilises leave these setting if not turn
>>>> > >>>>>>>> them back on.
>>>> > >>>>>>>> Now get some of the pg's that are blocked and querry the pgs to check
>>>> > >>>>>>>> why they are blocked. Try removing as much blocks as possible and then
>>>> > >>>>>>>> remove the norebalance/norecovery flags and see if it starts to fix
>>>> > >>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>>> > >>>>>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> > >>>>>>>> 
>>>> > >>>>>>>> 
>>>> > >>>>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
>>>> > >>>>>>>> (as for Redhat).
>>>> > >>>>>>>> So it could be best to checkout bluestore and rocksdb. There are some
>>>> > >>>>>>>> tools to check health and also repair. But there are limited
>>>> > >>>>>>>> documentation.
>>>> > >>>>>>>> Anyone who has experince with it?
>>>> > >>>>>>>> Anyone lead/help to a proper check would be great.
>>>> > >>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
>>>> > >>>>>>>> tarihinde şunu yazdı:
>>>> > >>>>>>>> 
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Hi all,
>>>> > >>>>>>>> 
>>>> > >>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>>> > >>>>>>>> 
>>>> > >>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>>>> > >>>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This is the one we are very sure of. But we are not sure of the reason.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Here is the latest ceph status:
>>>> > >>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This is the status after we started all of the OSDs 24 hours ago.
>>>> > >>>>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Here is the debug=20 log of an OSD which is same for all others:
>>>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>> > >>>>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> This the full log the same OSD.
>>>> > >>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Here is the strace of the same OSD process:
>>>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Any help to figure out what is going on would be great.
>>>> > >>>>>>>> 
>>>> > >>>>>>>> Best Regards,
>>>> > >>>>>>>> Goktug Yildirim
>>>> > >>>>>>>> 
>>>> > >>>>>>>> _______________________________________________
>>>> > >>>>>>>> ceph-users mailing list
>>>> > >>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>> > >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> > >> 
>>>> > 
>>>> >

[-- Attachment #1.2: Type: text/html, Size: 40547 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                                                   ` <alpine.DEB.2.11.1810031936110.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2018-10-04  1:35                                                                                     ` Goktug Yildirim
       [not found]                                                                                       ` <53664F54-EB52-4E9B-A46C-26C8E27CFE24-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Goktug Yildirim @ 2018-10-04  1:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, by morphin, ceph-users

This is our cluster state right now. I can run "rbd list" again and that's good! Thanks a lot, Sage!!!
ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/

As you can see, we have 2 unfound PGs since some of our OSDs cannot start; 58 OSDs give different errors.
How can I fix these OSDs? If I remember correctly it should not be too much trouble.

These are the failed OSDs' logs.
https://paste.ubuntu.com/p/ZfRD5ZtvpS/
https://paste.ubuntu.com/p/pkRdVjCH4D/
https://paste.ubuntu.com/p/zJTf2fzSj9/
https://paste.ubuntu.com/p/xpJRK6YhRX/
https://paste.ubuntu.com/p/SY3576dNbJ/
https://paste.ubuntu.com/p/smyT6Y976b/

> On 3 Oct 2018, at 21:37, Sage Weil <sage@newdream.net> wrote:
> 
> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>> I'm so sorry about that I missed "out" parameter. My bad..
>> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
> 
> Excellent, thanks.  That looks like it confirms the problem is that teh 
> recovery tool didn't repopulate the creating pgs properly.
> 
> If you take that 30 byte file I sent earlier (as hex) and update the 
> osdmap epoch to the latest on the mon, confirm it decodes and dumps 
> properly, and then inject it on the 3 mons, that should get you past this 
> hump (and hopefully back up!).
> 
> sage
> 
> 
>> 
>> Sage Weil <sage@newdream.net> şunları yazdı (3 Eki 2018 21:13):
>> 
>>> I bet the kvstore output it in a hexdump format?  There is another option to get the raw data iirc
>>> 
>>> 
>>> 
>>>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM <goktug.yildirim@gmail.com> wrote:
>>>> I changed the file name to make it clear.
>>>> When I use your command with "+decode"  I'm getting an error like this:
>>>> 
>>>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>>>> error: buffer::malformed_input: void creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 2 < 111
>>>> 
>>>> My ceph version: 13.2.2
>>>> 
>>>> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil <sage@newdream.net> şunu yazdı:
>>>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>>>>>> If I didn't do it wrong, I got the output as below.
>>>>>> 
>>>>>> ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating > dump
>>>>>> 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: [default]
>>>>>> 
>>>>>> ceph-dencoder type creating_pgs_t import dump dump_json
>>>>> 
>>>>> Sorry, should be
>>>>> 
>>>>> ceph-dencoder type creating_pgs_t import dump decode dump_json
>>>>> 
>>>>> s
>>>>> 
>>>>>> {
>>>>>>    "last_scan_epoch": 0,
>>>>>>    "creating_pgs": [],
>>>>>>    "queue": [],
>>>>>>    "created_pools": []
>>>>>> }
>>>>>> 
>>>>>> You can find the "dump" link below.
>>>>>> 
>>>>>> dump: https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
>>>>>> 
>>>>>> 
>>>>>> Sage Weil <sage@newdream.net> şunları yazdı (3 Eki 2018 18:45):
>>>>>> 
>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>>>>>>> We are starting to work on it. First step is getting the structure out and dumping the current value as you say.
>>>>>>>> 
>>>>>>>> And you were correct we did not run force_create_pg.
>>>>>>> 
>>>>>>> Great.
>>>>>>> 
>>>>>>> So, eager to see what the current structure is... please attach once you 
>>>>>>> have it.
>>>>>>> 
>>>>>>> The new replacement one should look like this (when hexdump -C'd):
>>>>>>> 
>>>>>>> 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
>>>>>>> 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
>>>>>>> 0000001e
>>>>>>> 
>>>>>>> ...except that from byte 6 you want to put in a recent OSDMap epoch, in 
>>>>>>> hex, little endian (least significant byte first), in place of the 0x10 
>>>>>>> that is there now.  It should dump like this:
>>>>>>> 
>>>>>>> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
>>>>>>> {
>>>>>>>   "last_scan_epoch": 16,   <--- but with a recent epoch here
>>>>>>>   "creating_pgs": [],
>>>>>>>   "queue": [],
>>>>>>>   "created_pools": [
>>>>>>>       66
>>>>>>>   ]
>>>>>>> }
>>>>>>> 
>>>>>>> sage
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 3 Oct 2018, at 17:52, Sage Weil <sage@newdream.net> wrote:
>>>>>>>>> 
>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>>>>>>>>> Sage,
>>>>>>>>>> 
>>>>>>>>>> Pool 66 is the only pool it shows right now. This a pool created months ago.
>>>>>>>>>> ceph osd lspools
>>>>>>>>>> 66 mypool
>>>>>>>>>> 
>>>>>>>>>> As we recreated mon db from OSDs, the pools for MDS was unusable. So we deleted them.
>>>>>>>>>> After we create another cephfs fs and pools we started MDS and it stucked on creation. So we stopped MDS and removed fs and fs pools. Right now we do not have MDS running nor we have cephfs related things.
>>>>>>>>>> 
>>>>>>>>>> ceph fs dump
>>>>>>>>>> dumped fsmap epoch 1 e1
>>>>>>>>>> enable_multiple, ever_enabled_multiple: 0,0
>>>>>>>>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>>>>>>>>>> legacy client fscid: -1
>>>>>>>>>> 
>>>>>>>>>> No filesystems configured
>>>>>>>>>> 
>>>>>>>>>> ceph fs ls
>>>>>>>>>> No filesystems enabled
>>>>>>>>>> 
>>>>>>>>>> Now pool 66 seems to only pool we have and it has been created months ago. Then I guess there is something hidden out there.
>>>>>>>>>> 
>>>>>>>>>> Is there any way to find and delete it?
>>>>>>>>> 
>>>>>>>>> Ok, I'm concerned that the creating pg is in there if this is an old 
>>>>>>>>> pool... did you perhaps run force_create_pg at some point?  Assuming you 
>>>>>>>>> didn't, I think this is a bug in the process for rebuilding the mon 
>>>>>>>>> store.. one that doesn't normally come up because the impact is this 
>>>>>>>>> osdmap scan that is cheap in our test scenarios but clearly not cheap for 
>>>>>>>>> your aged cluster.
>>>>>>>>> 
>>>>>>>>> In any case, there is a way to clear those out of the mon, but it's a bit 
>>>>>>>>> dicey. 
>>>>>>>>> 
>>>>>>>>> 1. stop all mons
>>>>>>>>> 2. make a backup of all mons
>>>>>>>>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
>>>>>>>>> key=creating key on one of the mons
>>>>>>>>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
>>>>>>>>> 5. hex edit the structure to remove all of the creating pgs, and adds pool 
>>>>>>>>> 66 to the created_pgs member.
>>>>>>>>> 6. verify with ceph-dencoder dump that the edit was correct...
>>>>>>>>> 7. inject the updated structure into all of the mons
>>>>>>>>> 8. start all mons
>>>>>>>>> 
>>>>>>>>> 4-6 will probably be an iterative process... let's start by getting the 
>>>>>>>>> structure out and dumping the current value?  
>>>>>>>>> 
>>>>>>>>> The code to refer to to understand the structure is src/mon/CreatingPGs.h 
>>>>>>>>> encode/decode methods.
>>>>>>>>> 
>>>>>>>>> sage
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 3 Oct 2018, at 16:46, Sage Weil <sage@newdream.net> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Oh... I think this is the problem:
>>>>>>>>>>> 
>>>>>>>>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
>>>>>>>>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
>>>>>>>>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
>>>>>>>>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
>>>>>>>>>>> 
>>>>>>>>>>> You are in the midst of creating new pgs, and unfortunately pg create is 
>>>>>>>>>>> one of the last remaining places where the OSDs need to look at a full 
>>>>>>>>>>> history of map changes between then and the current map epoch.  In this 
>>>>>>>>>>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
>>>>>>>>>>> later.
>>>>>>>>>>> 
>>>>>>>>>>> What is this new pool for?  Is it still empty, and if so, can we delete 
>>>>>>>>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
>>>>>>>>>>> and restarting the OSDs will make these pg_creates go away.
>>>>>>>>>>> 
>>>>>>>>>>> s
>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> 
>>>>>>>>>>>> It seems nothing has changed.
>>>>>>>>>>>> 
>>>>>>>>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
>>>>>>>>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You can also reduce the osd map updates by adding this to your ceph
>>>>>>>>>>>>> config file. "osd crush update on start = false". This should remove
>>>>>>>>>>>>> and update that is generated when osd starts.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush
>>>>>>>>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
>>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
>>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>>>>>>>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
>>>>>>>>>>>>> dispatch
>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
>>>>>>>>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
>>>>>>>>>>>>> "root=default"]} v 0) v1
>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
>>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>>>>>>>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
>>>>>>>>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
>>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
>>>>>>>>>>>>> 'osd.47' initial_weight 3.6396 at location
>>>>>>>>>>>>> {host=SRV-SEKUARK8,root=default}
>>>>>>>>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
>>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
>>>>>>>>>>>>> enabled
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
>>>>>>>>>>>>> <goktug.yildirim@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Sage,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
>>>>>>>>>>>>>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
>>>>>>>>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
>>>>>>>>>>>>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
>>>>>>>>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is our ceph.conf:
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is pg dump:
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is iostat & perf top:
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This strace output of ceph-osd:
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is OSD log (default debug):
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is leader MON log (default debug):
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> These are OSDs failed to start. Total number is 58.
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>>>>>>>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>>>>>>>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
>>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
>>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
>>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
>>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage@newdream.net> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
>>>>>>>>>>>>>> only use it in very specific circumstances when directed by a developer.
>>>>>>>>>>>>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
>>>>>>>>>>>>>> that point...you're seeing some sort of resource exhaustion.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The noup trick works when OSDs are way behind on maps and all need to
>>>>>>>>>>>>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
>>>>>>>>>>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
>>>>>>>>>>>>>> t the mons have.  Were they really caught up when you unset noup?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm just catching up and haven't read the whole thread but I haven't seen
>>>>>>>>>>>>>> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
>>>>>>>>>>>>>> on maps could explain it but not why they wouldn't peer once they were all
>>>>>>>>>>>>>> marked up...
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
>>>>>>>>>>>>>> Is that be usefull here? There is such a less information about it.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Goktug Yildirim <goktug.yildirim@gmail.com> şunları yazdı (2 Eki 2018 22:11):
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
>>>>>>>>>>>>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I can see some issues from the osd log file. You have an extremely low
>>>>>>>>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
>>>>>>>>>>>>>> recommend cranking up rocksdb cache size as much as possible. If you
>>>>>>>>>>>>>> have RAM you can also increase bluestores cache size for hdd. Default
>>>>>>>>>>>>>> is 1GB be as liberal as you can without getting OOM kills. You also
>>>>>>>>>>>>>> have lots of osd map loading and decoding in the log. Are you sure all
>>>>>>>>>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
>>>>>>>>>>>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
>>>>>>>>>>>>>> mimic.
>>>>>>>>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>>>>>>>>>>>>>> <goktug.yildirim@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hello Darius,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for reply!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
>>>>>>>>>>>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So we have a small test like that:
>>>>>>>>>>>>>> * Stop all OSDs (168 of them)
>>>>>>>>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>>>>>>>>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>>>>>>>>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
>>>>>>>>>>>>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
>>>>>>>>>>>>>> One can think that is a lot of data there to scan. But it is just 33TB.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So at short we dont know which PG is stuck so we can remove it.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Any thought would be greatly appreciated.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Currently you have 15 objects missing. I would recommend finding them
>>>>>>>>>>>>>> and making backups of them. Ditch all other osds that are failing to
>>>>>>>>>>>>>> start and concentrate on bringing online those that have missing
>>>>>>>>>>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>>>>>>>>>>>>> if it stabilises. If it stabilises leave these setting if not turn
>>>>>>>>>>>>>> them back on.
>>>>>>>>>>>>>> Now get some of the pg's that are blocked and querry the pgs to check
>>>>>>>>>>>>>> why they are blocked. Try removing as much blocks as possible and then
>>>>>>>>>>>>>> remove the norebalance/norecovery flags and see if it starts to fix
>>>>>>>>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>>>>>>>>>>>>> <morphinwithyou@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
>>>>>>>>>>>>>> (as for Redhat).
>>>>>>>>>>>>>> So it could be best to checkout bluestore and rocksdb. There are some
>>>>>>>>>>>>>> tools to check health and also repair. But there are limited
>>>>>>>>>>>>>> documentation.
>>>>>>>>>>>>>> Anyone who has experince with it?
>>>>>>>>>>>>>> Anyone lead/help to a proper check would be great.
>>>>>>>>>>>>>> Goktug Yildirim <goktug.yildirim@gmail.com>, 1 Eki 2018 Pzt, 22:55
>>>>>>>>>>>>>> tarihinde şunu yazdı:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>>>>>>>>>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is the one we are very sure of. But we are not sure of the reason.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Here is the latest ceph status:
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is the status after we started all of the OSDs 24 hours ago.
>>>>>>>>>>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Here is the debug=20 log of an OSD which is same for all others:
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>>>>>>>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This the full log the same OSD.
>>>>>>>>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Here is the strace of the same OSD process:
>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Any help to figure out what is going on would be great.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>> Goktug Yildirim
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>>> ceph-users@lists.ceph.com
>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>> 
>>>>>> 
>>>>>> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                                                       ` <53664F54-EB52-4E9B-A46C-26C8E27CFE24-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-10-04  4:17                                                                                         ` Sage Weil
       [not found]                                                                                           ` <alpine.DEB.2.11.1810040414290.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2018-10-04  4:17 UTC (permalink / raw)
  To: Goktug Yildirim; +Cc: ceph-devel, by morphin, ceph-users

[-- Attachment #1: Type: TEXT/PLAIN, Size: 28012 bytes --]

On Thu, 4 Oct 2018, Goktug Yildirim wrote:
> This is our cluster state right now. I can run "rbd list" again and that's good! Thanks a lot, Sage!!!
> ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/

Progress!  Not out of the woods yet, though...

> As you can see, we have 2 unfound PGs since some of our OSDs cannot start; 58 OSDs give different errors.
> How can I fix these OSDs? If I remember correctly it should not be too much trouble.
> 
> These are the failed OSDs' logs.
> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> https://paste.ubuntu.com/p/pkRdVjCH4D/

These are both failing in rocksdb code, with something like

Can't access /032949.sst: NotFound:

Can you check whether that .sst file actually exists?  Might be a 
weird path issue.
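
Since these are BlueStore OSDs the rocksdb files live inside BlueFS, so a 
plain ls of the data directory won't show them.  One rough way to check, 
as a sketch only (osd.90 and the scratch directory are just assumptions 
here, and the export needs enough free space for a copy of the whole DB):

  # with the failing OSD not running
  ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-90 --out-dir /tmp/osd90-bluefs
  find /tmp/osd90-bluefs -name '*032949*'    # does the missing .sst show up in the export?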

> https://paste.ubuntu.com/p/zJTf2fzSj9/
> https://paste.ubuntu.com/p/xpJRK6YhRX/

These are failing in the rocksdb CheckConsistency code.  Not sure what to 
make of that.

> https://paste.ubuntu.com/p/SY3576dNbJ/
> https://paste.ubuntu.com/p/smyT6Y976b/

These are failing in BlueStore code.  The ceph-bluestore-tool fsck may help 
here; can you give it a shot?
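
A minimal sketch of that, assuming osd.90 and the default 
/var/lib/ceph/osd/ceph-<id> data path (adjust the id/path for each 
failing OSD), run while the OSD is stopped:

  systemctl stop ceph-osd@90
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-90

There is also a --deep mode (validates object data, much slower) and a 
repair command documented for ceph-bluestore-tool, but given the state of 
the cluster I would export/back up the OSDs before attempting any repair.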

sage


> 
> > On 3 Oct 2018, at 21:37, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> > 
> > On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> >> I'm so sorry about that I missed "out" parameter. My bad..
> >> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
> > 
> > Excellent, thanks.  That looks like it confirms the problem is that teh 
> > recovery tool didn't repopulate the creating pgs properly.
> > 
> > If you take that 30 byte file I sent earlier (as hex) and update the 
> > osdmap epoch to the latest on the mon, confirm it decodes and dumps 
> > properly, and then inject it on the 3 mons, that should get you past this 
> > hump (and hopefully back up!).
> > 
> > sage
> > 
> > 
> >> 
> >> Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunları yazdı (3 Eki 2018 21:13):
> >> 
> >>> I bet the kvstore output it in a hexdump format?  There is another option to get the raw data iirc
> >>> 
> >>> 
> >>> 
> >>>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>> I changed the file name to make it clear.
> >>>> When I use your command with "+decode"  I'm getting an error like this:
> >>>> 
> >>>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
> >>>> error: buffer::malformed_input: void creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 2 < 111
> >>>> 
> >>>> My ceph version: 13.2.2
> >>>> 
> >>>> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunu yazdı:
> >>>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> >>>>>> If I didn't do it wrong, I got the output as below.
> >>>>>> 
> >>>>>> ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating > dump
> >>>>>> 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: [default]
> >>>>>> 
> >>>>>> ceph-dencoder type creating_pgs_t import dump dump_json
> >>>>> 
> >>>>> Sorry, should be
> >>>>> 
> >>>>> ceph-dencoder type creating_pgs_t import dump decode dump_json
> >>>>> 
> >>>>> s
> >>>>> 
> >>>>>> {
> >>>>>>    "last_scan_epoch": 0,
> >>>>>>    "creating_pgs": [],
> >>>>>>    "queue": [],
> >>>>>>    "created_pools": []
> >>>>>> }
> >>>>>> 
> >>>>>> You can find the "dump" link below.
> >>>>>> 
> >>>>>> dump: https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
> >>>>>> 
> >>>>>> 
> >>>>>> Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> şunları yazdı (3 Eki 2018 18:45):
> >>>>>> 
> >>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>>>>>> We are starting to work on it. First step is getting the structure out and dumping the current value as you say.
> >>>>>>>> 
> >>>>>>>> And you were correct we did not run force_create_pg.
> >>>>>>> 
> >>>>>>> Great.
> >>>>>>> 
> >>>>>>> So, eager to see what the current structure is... please attach once you 
> >>>>>>> have it.
> >>>>>>> 
> >>>>>>> The new replacement one should look like this (when hexdump -C'd):
> >>>>>>> 
> >>>>>>> 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
> >>>>>>> 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
> >>>>>>> 0000001e
> >>>>>>> 
> >>>>>>> ...except that from byte 6 you want to put in a recent OSDMap epoch, in 
> >>>>>>> hex, little endian (least significant byte first), in place of the 0x10 
> >>>>>>> that is there now.  It should dump like this:
> >>>>>>> 
> >>>>>>> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> >>>>>>> {
> >>>>>>>   "last_scan_epoch": 16,   <--- but with a recent epoch here
> >>>>>>>   "creating_pgs": [],
> >>>>>>>   "queue": [],
> >>>>>>>   "created_pools": [
> >>>>>>>       66
> >>>>>>>   ]
> >>>>>>> }
> >>>>>>> 
> >>>>>>> sage
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> 
> >>>>>>>>> On 3 Oct 2018, at 17:52, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>>>>>>>> 
> >>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>>>>>>>> Sage,
> >>>>>>>>>> 
> >>>>>>>>>> Pool 66 is the only pool it shows right now. This a pool created months ago.
> >>>>>>>>>> ceph osd lspools
> >>>>>>>>>> 66 mypool
> >>>>>>>>>> 
> >>>>>>>>>> As we recreated mon db from OSDs, the pools for MDS was unusable. So we deleted them.
> >>>>>>>>>> After we create another cephfs fs and pools we started MDS and it stucked on creation. So we stopped MDS and removed fs and fs pools. Right now we do not have MDS running nor we have cephfs related things.
> >>>>>>>>>> 
> >>>>>>>>>> ceph fs dump
> >>>>>>>>>> dumped fsmap epoch 1 e1
> >>>>>>>>>> enable_multiple, ever_enabled_multiple: 0,0
> >>>>>>>>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >>>>>>>>>> legacy client fscid: -1
> >>>>>>>>>> 
> >>>>>>>>>> No filesystems configured
> >>>>>>>>>> 
> >>>>>>>>>> ceph fs ls
> >>>>>>>>>> No filesystems enabled
> >>>>>>>>>> 
> >>>>>>>>>> Now pool 66 seems to only pool we have and it has been created months ago. Then I guess there is something hidden out there.
> >>>>>>>>>> 
> >>>>>>>>>> Is there any way to find and delete it?
> >>>>>>>>> 
> >>>>>>>>> Ok, I'm concerned that the creating pg is in there if this is an old 
> >>>>>>>>> pool... did you perhaps run force_create_pg at some point?  Assuming you 
> >>>>>>>>> didn't, I think this is a bug in the process for rebuilding the mon 
> >>>>>>>>> store.. one that doesn't normally come up because the impact is this 
> >>>>>>>>> osdmap scan that is cheap in our test scenarios but clearly not cheap for 
> >>>>>>>>> your aged cluster.
> >>>>>>>>> 
> >>>>>>>>> In any case, there is a way to clear those out of the mon, but it's a bit 
> >>>>>>>>> dicey. 
> >>>>>>>>> 
> >>>>>>>>> 1. stop all mons
> >>>>>>>>> 2. make a backup of all mons
> >>>>>>>>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
> >>>>>>>>> key=creating key on one of the mons
> >>>>>>>>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
> >>>>>>>>> 5. hex edit the structure to remove all of the creating pgs, and adds pool 
> >>>>>>>>> 66 to the created_pgs member.
> >>>>>>>>> 6. verify with ceph-dencoder dump that the edit was correct...
> >>>>>>>>> 7. inject the updated structure into all of the mons
> >>>>>>>>> 8. start all mons
> >>>>>>>>> 
> >>>>>>>>> 4-6 will probably be an iterative process... let's start by getting the 
> >>>>>>>>> structure out and dumping the current value?  
> >>>>>>>>> 
> >>>>>>>>> The code to refer to to understand the structure is src/mon/CreatingPGs.h 
> >>>>>>>>> encode/decode methods.
> >>>>>>>>> 
> >>>>>>>>> sage
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>> On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>>>>>>>>>> 
> >>>>>>>>>>> Oh... I think this is the problem:
> >>>>>>>>>>> 
> >>>>>>>>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
> >>>>>>>>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
> >>>>>>>>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> >>>>>>>>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> >>>>>>>>>>> 
> >>>>>>>>>>> You are in the midst of creating new pgs, and unfortunately pg create is 
> >>>>>>>>>>> one of the last remaining places where the OSDs need to look at a full 
> >>>>>>>>>>> history of map changes between then and the current map epoch.  In this 
> >>>>>>>>>>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
> >>>>>>>>>>> later.
> >>>>>>>>>>> 
> >>>>>>>>>>> What is this new pool for?  Is it still empty, and if so, can we delete 
> >>>>>>>>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
> >>>>>>>>>>> and restarting the OSDs will make these pg_creates go away.
> >>>>>>>>>>> 
> >>>>>>>>>>> s
> >>>>>>>>>>> 
> >>>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Hello,
> >>>>>>>>>>>> 
> >>>>>>>>>>>> It seems nothing has changed.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
> >>>>>>>>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> You can also reduce the osd map updates by adding this to your ceph
> >>>>>>>>>>>>> config file. "osd crush update on start = false". This should remove
> >>>>>>>>>>>>> and update that is generated when osd starts.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush
> >>>>>>>>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> >>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>>>>>>>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> >>>>>>>>>>>>> dispatch
> >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> >>>>>>>>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> >>>>>>>>>>>>> "root=default"]} v 0) v1
> >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>>>>>>>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> >>>>>>>>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
> >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> >>>>>>>>>>>>> 'osd.47' initial_weight 3.6396 at location
> >>>>>>>>>>>>> {host=SRV-SEKUARK8,root=default}
> >>>>>>>>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
> >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> >>>>>>>>>>>>> enabled
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> >>>>>>>>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hi Sage,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
> >>>>>>>>>>>>>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
> >>>>>>>>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
> >>>>>>>>>>>>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
> >>>>>>>>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is our ceph.conf:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is pg dump:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is iostat & perf top:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This strace output of ceph-osd:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is OSD log (default debug):
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is leader MON log (default debug):
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> These are OSDs failed to start. Total number is 58.
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>>>>>>>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>>>>>>>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
> >>>>>>>>>>>>>> only use it in very specific circumstances when directed by a developer.
> >>>>>>>>>>>>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
> >>>>>>>>>>>>>> that point...you're seeing some sort of resource exhaustion.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> The noup trick works when OSDs are way behind on maps and all need to
> >>>>>>>>>>>>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
> >>>>>>>>>>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch
> >>>>>>>>>>>>>> that the mons have.  Were they really caught up when you unset noup?
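
Concretely, that comparison looks something like this (osd.150 is only an example id; the daemon command has to run on the host that owns the OSD):

  # what map range does the OSD itself have?
  ceph daemon osd.150 status        # shows "oldest_map" / "newest_map"
  # what is the cluster's current epoch according to the mons?
  ceph osd dump | head -n 1         # prints "epoch NNNNN"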
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I'm just catching up and haven't read the whole thread but I haven't seen
> >>>>>>>>>>>>>> anything that explains why the OSDs are doing lots of disk IO.  Catching up
> >>>>>>>>>>>>>> on maps could explain it but not why they wouldn't peer once they were all
> >>>>>>>>>>>>>> marked up...
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> sage
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
> >>>>>>>>>>>>>> Is that be usefull here? There is such a less information about it.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
> >>>>>>>>>>>>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I can see some issues from the osd log file. You have an extremely low
> >>>>>>>>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> >>>>>>>>>>>>>> recommend cranking up rocksdb cache size as much as possible. If you
> >>>>>>>>>>>>>> have RAM you can also increase bluestores cache size for hdd. Default
> >>>>>>>>>>>>>> is 1GB be as liberal as you can without getting OOM kills. You also
> >>>>>>>>>>>>>> have lots of osd map loading and decoding in the log. Are you sure all
> >>>>>>>>>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
> >>>>>>>>>>>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
> >>>>>>>>>>>>>> mimic.
> >>>>>>>>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> >>>>>>>>>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hello Darius,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Thanks for reply!
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
> >>>>>>>>>>>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> So we have a small test like that:
> >>>>>>>>>>>>>> * Stop all OSDs (168 of them)
> >>>>>>>>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
> >>>>>>>>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
> >>>>>>>>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
> >>>>>>>>>>>>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
> >>>>>>>>>>>>>> One can think that is a lot of data there to scan. But it is just 33TB.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> So at short we dont know which PG is stuck so we can remove it.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Any thought would be greatly appreciated.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Currently you have 15 objects missing. I would recommend finding them
> >>>>>>>>>>>>>> and making backups of them. Ditch all other osds that are failing to
> >>>>>>>>>>>>>> start and concentrate on bringing online those that have missing
> >>>>>>>>>>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
> >>>>>>>>>>>>>> if it stabilises. If it stabilises leave these setting if not turn
> >>>>>>>>>>>>>> them back on.
> >>>>>>>>>>>>>> Now get some of the pg's that are blocked and querry the pgs to check
> >>>>>>>>>>>>>> why they are blocked. Try removing as much blocks as possible and then
> >>>>>>>>>>>>>> remove the norebalance/norecovery flags and see if it starts to fix
> >>>>>>>>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >>>>>>>>>>>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
> >>>>>>>>>>>>>> (as for Redhat).
> >>>>>>>>>>>>>> So it could be best to checkout bluestore and rocksdb. There are some
> >>>>>>>>>>>>>> tools to check health and also repair. But there are limited
> >>>>>>>>>>>>>> documentation.
> >>>>>>>>>>>>>> Anyone who has experince with it?
> >>>>>>>>>>>>>> Anyone lead/help to a proper check would be great.
> >>>>>>>>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
> >>>>>>>>>>>>>> tarihinde şunu yazdı:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> >>>>>>>>>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is the one we are very sure of. But we are not sure of the reason.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Here is the latest ceph status:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is the status after we started all of the OSDs 24 hours ago.
> >>>>>>>>>>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Here is the debug=20 log of an OSD which is same for all others:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>>>>>>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This the full log the same OSD.
> >>>>>>>>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Here is the strace of the same OSD process:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Any help to figure out what is going on would be great.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>> Goktug Yildirim
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>>> ceph-users mailing list
> >>>>>>>>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>> 
> >>>>>> 
> >>>>>> 
> 
> 

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                                                           ` <alpine.DEB.2.11.1810040414290.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2018-10-04  9:26                                                                                             ` Goktug Yildirim
       [not found]                                                                                               ` <71B2ECD8-EECF-48D5-844C-7C80EDD24028-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Goktug Yildirim @ 2018-10-04  9:26 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, by morphin, ceph-users

These are the ceph-objectstore-tool logs for OSD.0.

https://paste.ubuntu.com/p/jNwf4DC46H/

Something is wrong, but we are not sure whether we are using the tool incorrectly or whether something is wrong with the OSD itself.
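
For reference, the kind of invocations being discussed are of this general shape (OSD id, PG id and output path are illustrative, not taken from the log above):

  # the OSD has to be stopped first
  systemctl stop ceph-osd@0
  # list the PGs this OSD holds
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs
  # export a single PG to a file for safekeeping
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 66.d8 \
      --op export --file /backup/osd0-66.d8.export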


> On 4 Oct 2018, at 06:17, Sage Weil <sage@newdream.net> wrote:
> 
> On Thu, 4 Oct 2018, Goktug Yildirim wrote:
>> This is our cluster state right now. I can run "rbd list" and that's good! Thanks a lot Sage!!!
>> ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/
> 
> Progress!  Not out of the woods yet, though...
> 
>> As you can see we have 2 unfound PGs since some of our OSDs can not start. 58 OSDs give different errors.
>> How can I fix these OSDs? If I remember correctly it should not be too much trouble.
>> 
>> These are OSDs' failed logs.
>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> 
> These are both failing in rocksdb code, with something like
> 
> Can't access /032949.sst: NotFound:
> 
> Can you check whether that .sst file actually exists?  Might be a 
> weird path issue.
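
One way to check that (a sketch; the OSD path and output directory are placeholders) is to export the BlueFS contents, since with bluestore the rocksdb files live inside BlueFS rather than on a plain filesystem:

  # OSD stopped; copies the BlueFS file tree out to a normal directory
  ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-0 --out-dir /tmp/osd0-bluefs
  ls -l /tmp/osd0-bluefs/db/ | grep 032949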
> 
>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> 
> These are failing in the rocksdb CheckConsistency code.  Not sure what to
> make of that.
> 
>> https://paste.ubuntu.com/p/SY3576dNbJ/
>> https://paste.ubuntu.com/p/smyT6Y976b/
> 
> These are failing in BlueStore code.  The ceph-bluestore-tool fsck may help
> here, can you give it a shot?
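
The fsck being suggested would look roughly like this (OSD id illustrative):

  systemctl stop ceph-osd@0
  # read-only consistency check of the BlueStore instance
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0
  # a deeper pass (--deep) also reads and verifies object data, but is much slower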
> 
> sage
> 
> 
>> 
>>> On 3 Oct 2018, at 21:37, Sage Weil <sage@newdream.net> wrote:
>>> 
>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>>>> I'm so sorry about that I missed "out" parameter. My bad..
>>>> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
>>> 
>>> Excellent, thanks.  That looks like it confirms the problem is that the
>>> recovery tool didn't repopulate the creating pgs properly.
>>> 
>>> If you take that 30 byte file I sent earlier (as hex) and update the 
>>> osdmap epoch to the latest on the mon, confirm it decodes and dumps 
>>> properly, and then inject it on the 3 mons, that should get you past this 
>>> hump (and hopefully back up!).
>>> 
>>> sage
>>> 
>>> 
>>>> 
>>>> Sage Weil <sage@newdream.net> şunları yazdı (3 Eki 2018 21:13):
>>>> 
>>>>> I bet the kvstore output it in a hexdump format?  There is another option to get the raw data iirc
>>>>> 
>>>>> 
>>>>> 
>>>>>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM <goktug.yildirim@gmail.com> wrote:
>>>>>> I changed the file name to make it clear.
>>>>>> When I use your command with "+decode"  I'm getting an error like this:
>>>>>> 
>>>>>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>>>>>> error: buffer::malformed_input: void creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 2 < 111
>>>>>> 
>>>>>> My ceph version: 13.2.2
>>>>>> 
>>>>>> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil <sage@newdream.net> şunu yazdı:
>>>>>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>>>>>>>> If I didn't do it wrong, I got the output as below.
>>>>>>>> 
>>>>>>>> ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating > dump
>>>>>>>> 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: [default]
>>>>>>>> 
>>>>>>>> ceph-dencoder type creating_pgs_t import dump dump_json
>>>>>>> 
>>>>>>> Sorry, should be
>>>>>>> 
>>>>>>> ceph-dencoder type creating_pgs_t import dump decode dump_json
>>>>>>> 
>>>>>>> s
>>>>>>> 
>>>>>>>> {
>>>>>>>>   "last_scan_epoch": 0,
>>>>>>>>   "creating_pgs": [],
>>>>>>>>   "queue": [],
>>>>>>>>   "created_pools": []
>>>>>>>> }
>>>>>>>> 
>>>>>>>> You can find the "dump" link below.
>>>>>>>> 
>>>>>>>> dump: https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Sage Weil <sage@newdream.net> şunları yazdı (3 Eki 2018 18:45):
>>>>>>>> 
>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>>>>>>>>> We are starting to work on it. First step is getting the structure out and dumping the current value as you say.
>>>>>>>>>> 
>>>>>>>>>> And you were correct we did not run force_create_pg.
>>>>>>>>> 
>>>>>>>>> Great.
>>>>>>>>> 
>>>>>>>>> So, eager to see what the current structure is... please attach once you 
>>>>>>>>> have it.
>>>>>>>>> 
>>>>>>>>> The new replacement one should look like this (when hexdump -C'd):
>>>>>>>>> 
>>>>>>>>> 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
>>>>>>>>> 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
>>>>>>>>> 0000001e
>>>>>>>>> 
>>>>>>>>> ...except that from byte 6 you want to put in a recent OSDMap epoch, in 
>>>>>>>>> hex, little endian (least significant byte first), in place of the 0x10 
>>>>>>>>> that is there now.  It should dump like this:
>>>>>>>>> 
>>>>>>>>> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
>>>>>>>>> {
>>>>>>>>>  "last_scan_epoch": 16,   <--- but with a recent epoch here
>>>>>>>>>  "creating_pgs": [],
>>>>>>>>>  "queue": [],
>>>>>>>>>  "created_pools": [
>>>>>>>>>      66
>>>>>>>>>  ]
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> sage
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 3 Oct 2018, at 17:52, Sage Weil <sage@newdream.net> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>>>>>>>>>>> Sage,
>>>>>>>>>>>> 
>>>>>>>>>>>> Pool 66 is the only pool it shows right now. This a pool created months ago.
>>>>>>>>>>>> ceph osd lspools
>>>>>>>>>>>> 66 mypool
>>>>>>>>>>>> 
>>>>>>>>>>>> As we recreated mon db from OSDs, the pools for MDS was unusable. So we deleted them.
>>>>>>>>>>>> After we create another cephfs fs and pools we started MDS and it stucked on creation. So we stopped MDS and removed fs and fs pools. Right now we do not have MDS running nor we have cephfs related things.
>>>>>>>>>>>> 
>>>>>>>>>>>> ceph fs dump
>>>>>>>>>>>> dumped fsmap epoch 1 e1
>>>>>>>>>>>> enable_multiple, ever_enabled_multiple: 0,0
>>>>>>>>>>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>>>>>>>>>>>> legacy client fscid: -1
>>>>>>>>>>>> 
>>>>>>>>>>>> No filesystems configured
>>>>>>>>>>>> 
>>>>>>>>>>>> ceph fs ls
>>>>>>>>>>>> No filesystems enabled
>>>>>>>>>>>> 
>>>>>>>>>>>> Now pool 66 seems to only pool we have and it has been created months ago. Then I guess there is something hidden out there.
>>>>>>>>>>>> 
>>>>>>>>>>>> Is there any way to find and delete it?
>>>>>>>>>>> 
>>>>>>>>>>> Ok, I'm concerned that the creating pg is in there if this is an old 
>>>>>>>>>>> pool... did you perhaps run force_create_pg at some point?  Assuming you 
>>>>>>>>>>> didn't, I think this is a bug in the process for rebuilding the mon 
>>>>>>>>>>> store.. one that doesn't normally come up because the impact is this 
>>>>>>>>>>> osdmap scan that is cheap in our test scenarios but clearly not cheap for 
>>>>>>>>>>> your aged cluster.
>>>>>>>>>>> 
>>>>>>>>>>> In any case, there is a way to clear those out of the mon, but it's a bit 
>>>>>>>>>>> dicey. 
>>>>>>>>>>> 
>>>>>>>>>>> 1. stop all mons
>>>>>>>>>>> 2. make a backup of all mons
>>>>>>>>>>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
>>>>>>>>>>> key=creating key on one of the mons
>>>>>>>>>>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
>>>>>>>>>>> 5. hex edit the structure to remove all of the creating pgs, and adds pool 
>>>>>>>>>>> 66 to the created_pgs member.
>>>>>>>>>>> 6. verify with ceph-dencoder dump that the edit was correct...
>>>>>>>>>>> 7. inject the updated structure into all of the mons
>>>>>>>>>>> 8. start all mons
>>>>>>>>>>> 
>>>>>>>>>>> 4-6 will probably be an iterative process... let's start by getting the 
>>>>>>>>>>> structure out and dumping the current value?  
>>>>>>>>>>> 
>>>>>>>>>>> The code to refer to to understand the structure is src/mon/CreatingPGs.h 
>>>>>>>>>>> encode/decode methods.
>>>>>>>>>>> 
>>>>>>>>>>> sage
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 3 Oct 2018, at 16:46, Sage Weil <sage@newdream.net> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Oh... I think this is the problem:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
>>>>>>>>>>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
>>>>>>>>>>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
>>>>>>>>>>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You are in the midst of creating new pgs, and unfortunately pg create is 
>>>>>>>>>>>>> one of the last remaining places where the OSDs need to look at a full 
>>>>>>>>>>>>> history of map changes between then and the current map epoch.  In this 
>>>>>>>>>>>>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
>>>>>>>>>>>>> later.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What is this new pool for?  Is it still empty, and if so, can we delete 
>>>>>>>>>>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
>>>>>>>>>>>>> and restarting the OSDs will make these pg_creates go away.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> s
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It seems nothing has changed.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
>>>>>>>>>>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> You can also reduce the osd map updates by adding this to your ceph
>>>>>>>>>>>>>>> config file. "osd crush update on start = false". This should remove
>>>>>>>>>>>>>>> and update that is generated when osd starts.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>>>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush
>>>>>>>>>>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
>>>>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
>>>>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>>>>>>>>>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
>>>>>>>>>>>>>>> dispatch
>>>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>>>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
>>>>>>>>>>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
>>>>>>>>>>>>>>> "root=default"]} v 0) v1
>>>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
>>>>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>>>>>>>>>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
>>>>>>>>>>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
>>>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
>>>>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
>>>>>>>>>>>>>>> 'osd.47' initial_weight 3.6396 at location
>>>>>>>>>>>>>>> {host=SRV-SEKUARK8,root=default}
>>>>>>>>>>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
>>>>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
>>>>>>>>>>>>>>> <goktug.yildirim@gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Sage,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The problem started when 7 server crashed same time and they came back after ~5 minutes.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Two of our 3 mon services were restarted in this crash. Since mon services are enabled they should be started nearly at the same time. I dont know if this makes any difference but some of the guys on IRC told it is required that they start in order not at the same time. Otherwise it could break things badly.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have so many STUCK request and our cluster can not heal itself.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
>>>>>>>>>>>>>>>> I/O load on all of my OSD disks are at around %95 utilization and never ends. CPU and RAM usage are OK.
>>>>>>>>>>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the eliminate older PG's and leaving cluster with good ones.
>>>>>>>>>>>>>>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our problem.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data will be make us very happy right now.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
>>>>>>>>>>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is our ceph.conf:
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is pg dump:
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is iostat & perf top:
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This strace output of ceph-osd:
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is OSD log (default debug):
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is leader MON log (default debug):
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> These are OSDs failed to start. Total number is 58.
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>>>>>>>>>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
>>>>>>>>>>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch somehow??
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
>>>>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage@newdream.net> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
>>>>>>>>>>>>>>>> only use it in very specific circumstances when directed by a developer.
>>>>>>>>>>>>>>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
>>>>>>>>>>>>>>>> that point...you're seeing some sort of resource exhaustion.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The noup trick works when OSDs are way behind on maps and all need to
>>>>>>>>>>>>>>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
>>>>>>>>>>>>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch tha
>>>>>>>>>>>>>>>> t the mons have.  Were they really caught up when you unset noup?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I'm just catching up and haven't read the whole thread but I haven't seen
>>>>>>>>>>>>>>>> anything that explains why teh OSDs are dong lots of disk IO.  Catching up
>>>>>>>>>>>>>>>> on maps could explain it but not why they wouldn't peer once they were all
>>>>>>>>>>>>>>>> marked up...
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
>>>>>>>>>>>>>>>> Is that be usefull here? There is such a less information about it.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Goktug Yildirim <goktug.yildirim@gmail.com> şunları yazdı (2 Eki 2018 22:11):
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that that will do the proper sizing.
>>>>>>>>>>>>>>>> For the blustore cache size I have plenty of RAM. I will increase 8GB for each and decide a more calculated number    after cluster settles.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> For the osd map loading I’ve also figured it out. And it is in loop. For that reason I started cluster with noup flag and waited OSDs to reach the uptodate epoch number. After that I unset noup. But I did not pay attention to manager logs. Let me check it, thank you!
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I am not forcing jmellac or anything else really. I have a very standard installation and no tweaks or tunings. All we ask for the stability versus speed from the begining. And here we are :/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis@gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I can see some issues from the osd log file. You have an extremely low
>>>>>>>>>>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
>>>>>>>>>>>>>>>> recommend cranking up rocksdb cache size as much as possible. If you
>>>>>>>>>>>>>>>> have RAM you can also increase bluestores cache size for hdd. Default
>>>>>>>>>>>>>>>> is 1GB be as liberal as you can without getting OOM kills. You also
>>>>>>>>>>>>>>>> have lots of osd map loading and decoding in the log. Are you sure all
>>>>>>>>>>>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
>>>>>>>>>>>>>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
>>>>>>>>>>>>>>>> mimic.
>>>>>>>>>>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>>>>>>>>>>>>>>>> <goktug.yildirim@gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hello Darius,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for reply!
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does stucks and wait forever since OSD is unresponsive.
>>>>>>>>>>>>>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are certain that OSD responds again after its disk utilization stops.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> So we have a small test like that:
>>>>>>>>>>>>>>>> * Stop all OSDs (168 of them)
>>>>>>>>>>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that “ceph pg 67.54f query” works!
>>>>>>>>>>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2 starts %95 disk utilization. This takes 17 minutes to finish.
>>>>>>>>>>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
>>>>>>>>>>>>>>>> * If you happen to start 5 of them at the same all of the OSDs start high I/O again. And it takes 1 hour to finish.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> So in the light of these findings we flagged noup, started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All of 168 OSD started to make high I/O. And we thought that if we wait long enough it will finish & OSDs will be responsive again. After 24hours they did not because I/O did not finish or even slowed down.
>>>>>>>>>>>>>>>> One can think that is a lot of data there to scan. But it is just 33TB.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> So at short we dont know which PG is stuck so we can remove it.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> However we met an weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other is 500KB! So we decided to export all OSDs for backup. Then we will delete strange sized ones and start the cluster all over. Maybe then we could solve the stucked or unfound PGs as you advise.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Any thought would be greatly appreciated.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis@gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Currently you have 15 objects missing. I would recommend finding them
>>>>>>>>>>>>>>>> and making backups of them. Ditch all other osds that are failing to
>>>>>>>>>>>>>>>> start and concentrate on bringing online those that have missing
>>>>>>>>>>>>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
>>>>>>>>>>>>>>>> if it stabilises. If it stabilises leave these setting if not turn
>>>>>>>>>>>>>>>> them back on.
>>>>>>>>>>>>>>>> Now get some of the pg's that are blocked and querry the pgs to check
>>>>>>>>>>>>>>>> why they are blocked. Try removing as much blocks as possible and then
>>>>>>>>>>>>>>>> remove the norebalance/norecovery flags and see if it starts to fix
>>>>>>>>>>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
>>>>>>>>>>>>>>>> <morphinwithyou@gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
>>>>>>>>>>>>>>>> (as for Redhat).
>>>>>>>>>>>>>>>> So it could be best to checkout bluestore and rocksdb. There are some
>>>>>>>>>>>>>>>> tools to check health and also repair. But there are limited
>>>>>>>>>>>>>>>> documentation.
>>>>>>>>>>>>>>>> Anyone who has experince with it?
>>>>>>>>>>>>>>>> Anyone lead/help to a proper check would be great.
>>>>>>>>>>>>>>>> Goktug Yildirim <goktug.yildirim@gmail.com>, 1 Eki 2018 Pzt, 22:55
>>>>>>>>>>>>>>>> tarihinde şunu yazdı:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
>>>>>>>>>>>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is the one we are very sure of. But we are not sure of the reason.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Here is the latest ceph status:
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is the status after we started all of the OSDs 24 hours ago.
>>>>>>>>>>>>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Here is the debug=20 log of an OSD which is same for all others:
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>>>>>>>>>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This the full log the same OSD.
>>>>>>>>>>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Here is the strace of the same OSD process:
>>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Any help to figure out what is going on would be great.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>> Goktug Yildirim
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>>>>> ceph-users@lists.ceph.com
>>>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Mimic offline problem
       [not found]                                                                                               ` <71B2ECD8-EECF-48D5-844C-7C80EDD24028-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-10-05 14:45                                                                                                 ` Sage Weil
  0 siblings, 0 replies; 27+ messages in thread
From: Sage Weil @ 2018-10-05 14:45 UTC (permalink / raw)
  To: Goktug Yildirim; +Cc: ceph-devel, by morphin, ceph-users

[-- Attachment #1: Type: TEXT/PLAIN, Size: 30425 bytes --]

Quick update here:

The problem with the OSDs that are throwing rocksdb errors (missing SST 
files) is that ceph-kvstore-tool bluestore-kv ... repair was run on OSDs, 
and it looks like the rocksdb repair function actually broke the 
(non-broken) rocksdb instance.  I'm not quite sure why that is the 
case--seems like a pretty big problem for a repair to be unsafe--so that 
is something we need to follow up on with the rocksdb folks.
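
For clarity, the invocation in question has this general shape (the OSD path is illustrative):

  # runs rocksdb's repair against the OSD's embedded db -- per the above,
  # treat it as unsafe unless a developer explicitly asks for it
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 repair

A read-only "ceph-bluestore-tool fsck --path ..." pass is the safer first step when an OSD looks corrupted.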

The (possible) good news is that the structure of bluefs is such that 
it looks like we can limit replay of the internal journal and effectively 
roll back the changes made by repair.  That seems to have worked on one 
OSD at least; need to see if it works on others too.

However, even if all those are fixed, various other low-level 
ceph-objectstore-tool commands were run on the OSDs and I have a feeling 
those are going to prevent recovery.  :(

Moral of the story: don't run random low-level *-tool commands on your 
system if you don't know what they do or whether they are needed!

sage


On Thu, 4 Oct 2018, Goktug Yildirim wrote:

> These are the ceph-objectstore-tool logs for OSD.0.
> 
> https://paste.ubuntu.com/p/jNwf4DC46H/
> 
> Something is wrong, but we are not sure whether we are using the tool incorrectly or whether something is wrong with the OSD itself.
> 
> 
> > On 4 Oct 2018, at 06:17, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> > 
> > On Thu, 4 Oct 2018, Goktug Yildirim wrote:
> >> This is our cluster state right now. I can run "rbd list" and that's good! Thanks a lot Sage!!!
> >> ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/
> > 
> > Progress!  Not out of the woods yet, though...
> > 
> >> As you can see we have 2 unfound PGs since some of our OSDs can not start. 58 OSDs give different errors.
> >> How can I fix these OSDs? If I remember correctly it should not be too much trouble.
> >> 
> >> These are OSDs' failed logs.
> >> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >> https://paste.ubuntu.com/p/pkRdVjCH4D/
> > 
> > These are both failing in rocksdb code, with something like
> > 
> > Can't access /032949.sst: NotFound:
> > 
> > Can you check whether that .sst file actually exists?  Might be a 
> > weird path issue.
> > 
> >> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >> https://paste.ubuntu.com/p/xpJRK6YhRX/
> > 
> > These are failing in the rocksdb CheckConsistency code.  Not sure what to
> > make of that.
> > 
> >> https://paste.ubuntu.com/p/SY3576dNbJ/
> >> https://paste.ubuntu.com/p/smyT6Y976b/
> > 
> > These are failing in BlueStore code.  The ceph-bluestore-tool fsck may help
> > here, can you give it a shot?
> > 
> > sage
> > 
> > 
> >> 
> >>> On 3 Oct 2018, at 21:37, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>> 
> >>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> >>>> I'm so sorry about that I missed "out" parameter. My bad..
> >>>> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
> >>> 
> >>> Excellent, thanks.  That looks like it confirms the problem is that the
> >>> recovery tool didn't repopulate the creating pgs properly.
> >>> 
> >>> If you take that 30 byte file I sent earlier (as hex) and update the 
> >>> osdmap epoch to the latest on the mon, confirm it decodes and dumps 
> >>> properly, and then inject it on the 3 mons, that should get you past this 
> >>> hump (and hopefully back up!).
> >>> 
> >>> sage
> >>> 
> >>> 
> >>>> 
> >>>> Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote (3 Oct 2018 21:13):
> >>>> 
> >>>>> I bet the kvstore output it in a hexdump format?  There is another option to get the raw data iirc
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>> I changed the file name to make it clear.
> >>>>>> When I use your command with "+decode"  I'm getting an error like this:
> >>>>>> 
> >>>>>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
> >>>>>> error: buffer::malformed_input: void creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 2 < 111
> >>>>>> 
> >>>>>> My ceph version: 13.2.2
> >>>>>> 
> >>>>>> On Wed, 3 Oct 2018 at 20:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>>>>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> >>>>>>>> If I didn't do it wrong, I got the output as below.
> >>>>>>>> 
> >>>>>>>> ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating > dump
> >>>>>>>> 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: [default]
> >>>>>>>> 
> >>>>>>>> ceph-dencoder type creating_pgs_t import dump dump_json
> >>>>>>> 
> >>>>>>> Sorry, should be
> >>>>>>> 
> >>>>>>> ceph-dencoder type creating_pgs_t import dump decode dump_json
> >>>>>>> 
> >>>>>>> s
> >>>>>>> 
> >>>>>>>> {
> >>>>>>>>   "last_scan_epoch": 0,
> >>>>>>>>   "creating_pgs": [],
> >>>>>>>>   "queue": [],
> >>>>>>>>   "created_pools": []
> >>>>>>>> }
> >>>>>>>> 
> >>>>>>>> You can find the "dump" link below.
> >>>>>>>> 
> >>>>>>>> dump: https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote (3 Oct 2018 18:45):
> >>>>>>>> 
> >>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>>>>>>>> We are starting to work on it. First step is getting the structure out and dumping the current value as you say.
> >>>>>>>>>> 
> >>>>>>>>>> And you were correct we did not run force_create_pg.
> >>>>>>>>> 
> >>>>>>>>> Great.
> >>>>>>>>> 
> >>>>>>>>> So, eager to see what the current structure is... please attach once you 
> >>>>>>>>> have it.
> >>>>>>>>> 
> >>>>>>>>> The new replacement one should look like this (when hexdump -C'd):
> >>>>>>>>> 
> >>>>>>>>> 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
> >>>>>>>>> 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
> >>>>>>>>> 0000001e
> >>>>>>>>> 
> >>>>>>>>> ...except that from byte 6 you want to put in a recent OSDMap epoch, in 
> >>>>>>>>> hex, little endian (least significant byte first), in place of the 0x10 
> >>>>>>>>> that is there now.  It should dump like this:
> >>>>>>>>> 
> >>>>>>>>> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> >>>>>>>>> {
> >>>>>>>>>  "last_scan_epoch": 16,   <--- but with a recent epoch here
> >>>>>>>>>  "creating_pgs": [],
> >>>>>>>>>  "queue": [],
> >>>>>>>>>  "created_pools": [
> >>>>>>>>>      66
> >>>>>>>>>  ]
> >>>>>>>>> }
> >>>>>>>>> 
> >>>>>>>>> sage
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>> On 3 Oct 2018, at 17:52, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>>>>>>>>>> 
> >>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>>>>>>>>>> Sage,
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Pool 66 is the only pool it shows right now. This a pool created months ago.
> >>>>>>>>>>>> ceph osd lspools
> >>>>>>>>>>>> 66 mypool
> >>>>>>>>>>>> 
> >>>>>>>>>>>> As we recreated mon db from OSDs, the pools for MDS was unusable. So we deleted them.
> >>>>>>>>>>>> After we create another cephfs fs and pools we started MDS and it stucked on creation. So we stopped MDS and removed fs and fs pools. Right now we do not have MDS running nor we have cephfs related things.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> ceph fs dump
> >>>>>>>>>>>> dumped fsmap epoch 1 e1
> >>>>>>>>>>>> enable_multiple, ever_enabled_multiple: 0,0
> >>>>>>>>>>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >>>>>>>>>>>> legacy client fscid: -1
> >>>>>>>>>>>> 
> >>>>>>>>>>>> No filesystems configured
> >>>>>>>>>>>> 
> >>>>>>>>>>>> ceph fs ls
> >>>>>>>>>>>> No filesystems enabled
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Now pool 66 seems to only pool we have and it has been created months ago. Then I guess there is something hidden out there.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Is there any way to find and delete it?
> >>>>>>>>>>> 
> >>>>>>>>>>> Ok, I'm concerned that the creating pg is in there if this is an old 
> >>>>>>>>>>> pool... did you perhaps run force_create_pg at some point?  Assuming you 
> >>>>>>>>>>> didn't, I think this is a bug in the process for rebuilding the mon 
> >>>>>>>>>>> store.. one that doesn't normally come up because the impact is this 
> >>>>>>>>>>> osdmap scan that is cheap in our test scenarios but clearly not cheap for 
> >>>>>>>>>>> your aged cluster.
> >>>>>>>>>>> 
> >>>>>>>>>>> In any case, there is a way to clear those out of the mon, but it's a bit 
> >>>>>>>>>>> dicey. 
> >>>>>>>>>>> 
> >>>>>>>>>>> 1. stop all mons
> >>>>>>>>>>> 2. make a backup of all mons
> >>>>>>>>>>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
> >>>>>>>>>>> key=creating key on one of the mons
> >>>>>>>>>>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
> >>>>>>>>>>> 5. hex edit the structure to remove all of the creating pgs, and adds pool 
> >>>>>>>>>>> 66 to the created_pgs member.
> >>>>>>>>>>> 6. verify with ceph-dencoder dump that the edit was correct...
> >>>>>>>>>>> 7. inject the updated structure into all of the mons
> >>>>>>>>>>> 8. start all mons
> >>>>>>>>>>> 
> >>>>>>>>>>> 4-6 will probably be an iterative process... let's start by getting the 
> >>>>>>>>>>> structure out and dumping the current value?  
> >>>>>>>>>>> 
> >>>>>>>>>>> The code to refer to to understand the structure is src/mon/CreatingPGs.h 
> >>>>>>>>>>> encode/decode methods.
> >>>>>>>>>>> 
> >>>>>>>>>>> sage
> >>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>>> On 3 Oct 2018, at 16:46, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Oh... I think this is the problem:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
> >>>>>>>>>>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
> >>>>>>>>>>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> >>>>>>>>>>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> You are in the midst of creating new pgs, and unfortunately pg create is 
> >>>>>>>>>>>>> one of the last remaining places where the OSDs need to look at a full 
> >>>>>>>>>>>>> history of map changes between then and the current map epoch.  In this 
> >>>>>>>>>>>>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
> >>>>>>>>>>>>> later.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> What is this new pool for?  Is it still empty, and if so, can we delete 
> >>>>>>>>>>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
> >>>>>>>>>>>>> and restarting the OSDs will make these pg_creates go away.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> s
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> It seems nothing has changed.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ <https://paste.ubuntu.com/p/MtvTr5HYW4/>
> >>>>>>>>>>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ <https://paste.ubuntu.com/p/7Sx64xGzkR/>
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> You can also reduce the osd map updates by adding this to your ceph
> >>>>>>>>>>>>>>> config file. "osd crush update on start = false". This should remove
> >>>>>>>>>>>>>>> and update that is generated when osd starts.
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>>>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush
> >>>>>>>>>>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> >>>>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>>>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>>>>>>>>>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> >>>>>>>>>>>>>>> dispatch
> >>>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>>>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> >>>>>>>>>>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> >>>>>>>>>>>>>>> "root=default"]} v 0) v1
> >>>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>>>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>>>>>>>>>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> >>>>>>>>>>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> >>>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
> >>>>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> >>>>>>>>>>>>>>> 'osd.47' initial_weight 3.6396 at location
> >>>>>>>>>>>>>>> {host=SRV-SEKUARK8,root=default}
> >>>>>>>>>>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
> >>>>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> >>>>>>>>>>>>>>> enabled
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> >>>>>>>>>>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Hi Sage,
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> The problem started when 7 servers crashed at the same time and came back after ~5 minutes.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Two of our 3 mon services were restarted in this crash. Since the mon services are enabled, they would have started at nearly the same time. I don't know if this makes any difference, but some people on IRC said they are required to start in order, not at the same time; otherwise it could break things badly.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> After 9 days we still see 3400-3500 active+clean PGs. But in the end we have so many stuck requests and our cluster cannot heal itself.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> When we set the noup flag, the OSDs can catch up on epochs easily. But when we unset the flag we see many stuck requests and slow ops within 1 hour.
> >>>>>>>>>>>>>>>> I/O load on all of my OSD disks is at around 95% utilization and never ends. CPU and RAM usage are OK.
> >>>>>>>>>>>>>>>> OSDs get so stuck that we can't even run "ceph pg osd.0 query".
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> We also tried to change the RBD pool replication size from 2 to 1. Our goal was to eliminate the older PGs and leave the cluster with the good ones.
> >>>>>>>>>>>>>>>> With replication size=1 we saw "13% PGs not active". But it didn't solve our problem.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Of course we want to save 100% of the data. But we feel like even saving 50% of our data would make us very happy right now.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
> >>>>>>>>>>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This is our ceph.conf:
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This is pg dump:
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This is iostat & perf top:
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This strace output of ceph-osd:
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This is OSD log (default debug):
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This is leader MON log (default debug):
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> These are the OSDs that failed to start. The total number is 58.
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>>>>>>>>>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>>>>>>>>>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> As far as I understand, the OSD catches up with the mon epoch and then somehow exceeds it??
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org> wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should
> >>>>>>>>>>>>>>>> only use it in very specific circumstances when directed by a developer.
> >>>>>>>>>>>>>>>> In such cases it will allow a stuck PG to peer.  But you're not getting to
> >>>>>>>>>>>>>>>> that point...you're seeing some sort of resource exhaustion.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> The noup trick works when OSDs are way behind on maps and all need to
> >>>>>>>>>>>>>>>> catch up.  The way to tell if they are behind is by looking at the 'ceph
> >>>>>>>>>>>>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch
> >>>>>>>>>>>>>>>> that the mons have.  Were they really caught up when you unset noup?
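> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> (A quick way to make that comparison, as a sketch; osd.150 is just the example
> >>>>>>>>>>>>>>>> from the log excerpt quoted above:
> >>>>>>>>>>>>>>>>   ceph osd dump | head -1        # the mons' current osdmap epoch
> >>>>>>>>>>>>>>>>   ceph daemon osd.150 status     # on the OSD host: oldest_map / newest_map
> >>>>>>>>>>>>>>>> If newest_map is far behind the mons' epoch, that OSD is still catching up.)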
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> I'm just catching up and haven't read the whole thread but I haven't seen
> >>>>>>>>>>>>>>>> anything that explains why the OSDs are doing lots of disk IO.  Catching up
> >>>>>>>>>>>>>>>> on maps could explain it but not why they wouldn't peer once they were all
> >>>>>>>>>>>>>>>> marked up...
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> sage
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Has anyone heard about osd_find_best_info_ignore_history_les = true?
> >>>>>>>>>>>>>>>> Would it be useful here? There is very little information about it.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> şunları yazdı (2 Eki 2018 22:11):
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Indeed, I let ceph-disk decide the WAL and DB partitions after reading somewhere that it would do the proper sizing.
> >>>>>>>>>>>>>>>> For the bluestore cache size I have plenty of RAM. I will increase it to 8GB for each OSD and decide on a more calculated number after the cluster settles.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> I have also figured out the OSD map loading, and it is in a loop. For that reason I started the cluster with the noup flag and waited for the OSDs to reach the up-to-date epoch number. After that I unset noup. But I did not pay attention to the manager logs. Let me check them, thank you!
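> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> A minimal sketch of that flag sequence (the osd id is illustrative; compare each daemon's newest_map against the mons' current epoch before unsetting):
> >>>>>>>>>>>>>>>>   ceph osd set noup                 # booting OSDs stay marked down
> >>>>>>>>>>>>>>>>   ceph daemon osd.150 status        # on the OSD host; watch newest_map
> >>>>>>>>>>>>>>>>   ceph osd dump | head -1           # "epoch NNNNN" as seen by the mons
> >>>>>>>>>>>>>>>>   ceph osd unset noup               # once the epochs match, let them come up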
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> I am not forcing jemalloc or anything else, really. I have a very standard installation with no tweaks or tuning. All we asked for was stability over speed from the beginning. And here we are :/
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> I can see some issues in the OSD log file. You have extremely small
> >>>>>>>>>>>>>>>> DB and WAL partitions: only 1GB for the DB and 576MB for the WAL. I would
> >>>>>>>>>>>>>>>> recommend cranking up the rocksdb cache size as much as possible. If you
> >>>>>>>>>>>>>>>> have the RAM you can also increase bluestore's cache size for HDDs. The default
> >>>>>>>>>>>>>>>> is 1GB; be as liberal as you can without getting OOM kills. You also
> >>>>>>>>>>>>>>>> have lots of osd map loading and decoding in the log. Are you sure all
> >>>>>>>>>>>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
> >>>>>>>>>>>>>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
> >>>>>>>>>>>>>>>> mimic.
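> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> As a rough ceph.conf sketch of that tuning (sizes and the ratio are
> >>>>>>>>>>>>>>>> illustrative only; check RAM headroom per OSD before raising them):
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>   [osd]
> >>>>>>>>>>>>>>>>       # default is 1 GiB for HDD-backed bluestore OSDs
> >>>>>>>>>>>>>>>>       bluestore cache size hdd = 4294967296
> >>>>>>>>>>>>>>>>       # share of the bluestore cache handed to the rocksdb block cache
> >>>>>>>>>>>>>>>>       bluestore cache kv ratio = 0.5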
> >>>>>>>>>>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> >>>>>>>>>>>>>>>> <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Hello Darius,
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Thanks for reply!
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> The main problem is that we cannot query PGs. "ceph pg 67.54f query" gets stuck and waits forever since the OSD is unresponsive.
> >>>>>>>>>>>>>>>> We are certain that an OSD becomes unresponsive as soon as it is up, and that it responds again only after its disk utilization stops.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> So we ran a small test like this:
> >>>>>>>>>>>>>>>> * Stop all OSDs (168 of them).
> >>>>>>>>>>>>>>>> * Start OSD1. 95% OSD disk utilization starts immediately and takes 8 minutes to finish. Only after that does "ceph pg 67.54f query" work!
> >>>>>>>>>>>>>>>> * While OSD1 is "up", start OSD2. As soon as OSD2 starts, both OSD1 and OSD2 go to 95% disk utilization. This takes 17 minutes to finish.
> >>>>>>>>>>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 minutes to settle.
> >>>>>>>>>>>>>>>> * If you happen to start 5 of them at the same time, all of the OSDs start high I/O again, and it takes 1 hour to finish.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> In light of these findings we set the noup flag and started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All 168 OSDs started doing high I/O, and we thought that if we waited long enough it would finish and the OSDs would be responsive again. After 24 hours they were not, because the I/O did not finish or even slow down.
> >>>>>>>>>>>>>>>> One might think there is a lot of data to scan, but it is just 33TB.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> So, in short, we don't know which PG is stuck so that we can remove it.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> However, we ran into a weird thing half an hour ago. We exported the same PG from two different OSDs: one was 4.2GB and the other was 500KB! So we decided to export all PGs from all OSDs for backup. Then we will delete the strangely sized ones and start the cluster all over. Maybe then we can resolve the stuck or unfound PGs as you advise.
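> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> For reference, the per-PG export is along these lines (a sketch; the OSD has to be stopped first, and the osd id, PG id, and backup path are just examples from this thread):
> >>>>>>>>>>>>>>>>   systemctl stop ceph-osd@90
> >>>>>>>>>>>>>>>>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-90 \
> >>>>>>>>>>>>>>>>       --op export --pgid 67.54f --file /backup/osd90-pg67.54f.export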
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Any thought would be greatly appreciated.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daznis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Currently you have 15 objects missing. I would recommend finding them
> >>>>>>>>>>>>>>>> and making backups of them. Ditch all other osds that are failing to
> >>>>>>>>>>>>>>>> start and concentrate on bringing online those that have missing
> >>>>>>>>>>>>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
> >>>>>>>>>>>>>>>> if it stabilises. If it stabilises, leave these settings off; if not, turn
> >>>>>>>>>>>>>>>> them back on.
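> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> That is, something like the following, one flag at a time while watching "ceph -s" in between (re-apply with "ceph osd set ..." if things get worse):
> >>>>>>>>>>>>>>>>   ceph osd unset nodown
> >>>>>>>>>>>>>>>>   ceph osd unset noout
> >>>>>>>>>>>>>>>> 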
> >>>>>>>>>>>>>>>> Now get some of the PGs that are blocked and query them to check
> >>>>>>>>>>>>>>>> why they are blocked. Try removing as many blocks as possible and then
> >>>>>>>>>>>>>>>> remove the norebalance/norecover flags and see if it starts to fix
> >>>>>>>>>>>>>>>> itself.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >>>>>>>>>>>>>>>> <morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
> >>>>>>>>>>>>>>>> (as for Redhat).
> >>>>>>>>>>>>>>>> So it could be best to checkout bluestore and rocksdb. There are some
> >>>>>>>>>>>>>>>> tools to check health and also repair. But there are limited
> >>>>>>>>>>>>>>>> documentation.
> >>>>>>>>>>>>>>>> Anyone who has experince with it?
> >>>>>>>>>>>>>>>> Anyone lead/help to a proper check would be great.
> >>>>>>>>>>>>>>>> Goktug Yildirim <goktug.yildirim-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, 1 Eki 2018 Pzt, 22:55
> >>>>>>>>>>>>>>>> tarihinde şunu yazdı:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days since this cluster is offline. The long short story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is not to correct way I apology and please let me know.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> For the 6 days lots of thing happened and there were some outcomes about the problem. Some of them was misjudged and some of them are not looked deeper.
> >>>>>>>>>>>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs become unresponsive or very very less responsive. For example "ceph tell osd.x version” stucks like for ever.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> So due to unresponsive OSDs cluster does not settle. This is our problem!
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This is the one we are very sure of. But we are not sure of the reason.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Here is the latest ceph status:
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This is the status after we started all of the OSDs 24 hours ago.
> >>>>>>>>>>>>>>>> Some of the OSDs are not started. However it didnt make any difference when all of them was online.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Here is the debug=20 log of an OSD which is same for all others:
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>>>>>>>>>>>> As we figure out there is a loop pattern. I am sure it wont caught from eye.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> This the full log the same OSD.
> >>>>>>>>>>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Here is the strace of the same OSD process:
> >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do. I am sure we have done lots of mistakes to let this happening. And this situation may be a example for other user and could be a potential bug for ceph developer.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Any help to figure out what is going on would be great.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>>> Goktug Yildirim
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>>>>> ceph-users mailing list
> >>>>>>>>>>>>>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> >>>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>>> 
> >>>>>>>> 
> >>>>>>>> 
> >> 
> 
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2018-10-05 14:45 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <4C31C6EC-573D-42AD-BC6C-F7F103DBEC16@gmail.com>
     [not found] ` <4C31C6EC-573D-42AD-BC6C-F7F103DBEC16-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-10-02 14:13   ` Mimic offline problem by morphin
     [not found]     ` <CAE-AtHrjekCcLmO7MUB86XG7VH+bBt6yJH+ZRe7Wnh3tKB8JWw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-10-02 16:16       ` Darius Kasparavičius
     [not found]         ` <CANrNMwUaER4e97schGN+KnB_6aNF6SYpPEFvLD=qcpp1eeWEJg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-10-02 18:02           ` Goktug Yildirim
     [not found]             ` <44698713-3F6C-451D-B67B-2053D1B30557-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-10-02 19:51               ` Paul Emmerich
     [not found]                 ` <CAD9yTbHM6A0j4ghVzd6y=taoUaY-tZ+pfmhB=ZkVr+v-P-C7VQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-10-02 20:01                   ` Goktug Yildirim
2018-10-02 19:53               ` Darius Kasparavičius
     [not found]                 ` <CANrNMwW-RiVxtUnjTrZpiRXi6k06LxaeqRCqbXiuwwjTStWBhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-10-02 20:11                   ` Goktug Yildirim
     [not found]                     ` <F96A1781-7767-40A4-B814-C59C0CDD5E68-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-10-02 21:53                       ` Göktuğ Yıldırım
     [not found]                         ` <E516BA34-165C-4542-B53D-B022B913874A-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-10-03  3:14                           ` Sage Weil
     [not found]                             ` <alpine.DEB.2.11.1810030311230.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2018-10-03 12:16                               ` Goktug Yildirim
     [not found]                                 ` <683EC5EE-8112-4D9F-B4A8-DBCDE46A7AD8-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-10-03 12:27                                   ` Darius Kasparavičius
     [not found]                                     ` <CANrNMwVcE=Zyp6YbAk-1MB0zYYUt9_tqZSxkTBF5QVShjGqPZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-10-03 13:43                                       ` Goktug Yildirim
     [not found]                                         ` <F58B4646-D64C-469E-BA46-4ED6B5C96F8A-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-10-03 14:46                                           ` Sage Weil
     [not found]                                             ` <alpine.DEB.2.11.1810031443440.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2018-10-03 15:22                                               ` Goktug Yildirim
     [not found]                                                 ` <8C3EC489-68A9-4622-A283-CCAB9AE68D93-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-10-03 15:52                                                   ` Sage Weil
     [not found]                                                     ` <alpine.DEB.2.11.1810031543160.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2018-10-03 16:09                                                       ` Goktug Yildirim
     [not found]                                                         ` <C10AD823-76D4-4E2B-BE2E-BF434373C81B-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-10-03 16:45                                                           ` Sage Weil
     [not found]                                                             ` <7C03273B-DC38-40CC-99F4-AFECE672D80A@usishi.com>
     [not found]                                                               ` <7C03273B-DC38-40CC-99F4-AFECE672D80A-DtrUWsE0is3QT0dZR+AlfA@public.gmane.org>
2018-10-03 18:26                                                                 ` Sage Weil
     [not found]                                                                   ` <alpine.DEB.2.11.1810031826290.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2018-10-03 19:01                                                                     ` Goktug YILDIRIM
     [not found]                                                                       ` <CACz15iKdK47N0mBJrRyDeEx7BH-uRKmvZ+OJ6Cc6u6-qxTeA6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-10-03 19:13                                                                         ` Sage Weil
     [not found]                                                                           ` <f9cf0773-cce5-4989-8f78-de3a01080450-2ueSQiBKiTY7tOexoI0I+QC/G2K4zDHf@public.gmane.org>
2018-10-03 19:34                                                                             ` Göktuğ Yıldırım
     [not found]                                                                               ` <5FBAA21A-8B48-4B22-9FE8-0E0766C363D9-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-10-03 19:37                                                                                 ` Sage Weil
     [not found]                                                                                   ` <alpine.DEB.2.11.1810031936110.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2018-10-04  1:35                                                                                     ` Goktug Yildirim
     [not found]                                                                                       ` <53664F54-EB52-4E9B-A46C-26C8E27CFE24-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-10-04  4:17                                                                                         ` Sage Weil
     [not found]                                                                                           ` <alpine.DEB.2.11.1810040414290.31802-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2018-10-04  9:26                                                                                             ` Goktug Yildirim
     [not found]                                                                                               ` <71B2ECD8-EECF-48D5-844C-7C80EDD24028-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-10-05 14:45                                                                                                 ` Sage Weil
2018-10-03 19:43                                                                                 ` Göktuğ Yıldırım
