* mds: Doing fewer backtrace reads during rejoin (was: MDS flapping: how to increase MDS timeouts?)
@ 2017-01-30 14:30 John Spray
  2017-01-30 14:40 ` Sage Weil
  2017-01-30 21:07 ` Gregory Farnum
From: John Spray @ 2017-01-30 14:30 UTC (permalink / raw)
  To: Ceph Development, Gregory Farnum, Zheng Yan, Sage Weil; +Cc: Burkhard Linke

This case (see the forwarded message below) shows that our current
rejoin code handles situations with many capabilities quite badly -- I
think we should try to improve this soon.

One thought I have is to just throttle the number of open_inos that we
do, so that the cache gets populated with the already-hit dirfrags
before we try to load more backtraces. I've created a ticket for that
here: http://tracker.ceph.com/issues/18730 (should be pretty simple and
doable for luminous).  That would help in cases where many of the
affected inodes are in the same directory (which I expect is true of
all real workloads).
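
To make that concrete, here is a very rough standalone sketch of the
kind of throttle I mean -- the names and structure are invented for
illustration, this is not actual MDS code, just the shape of the idea
(cap the number of in-flight backtrace reads, queue the rest, and
release queued reads as completions come back):

#include <cstddef>
#include <deque>
#include <functional>
#include <iostream>

struct BacktraceThrottle {
  size_t max_in_flight;                       // illustrative limit, e.g. a new config option
  size_t in_flight = 0;
  std::deque<std::function<void()>> waiting;  // deferred backtrace reads

  explicit BacktraceThrottle(size_t limit) : max_in_flight(limit) {}

  // Called when we decide an inode needs a backtrace read.
  void submit(std::function<void()> do_read) {
    if (in_flight < max_in_flight) {
      ++in_flight;
      do_read();
    } else {
      waiting.push_back(std::move(do_read));  // defer until a slot frees up
    }
  }

  // Called from the read-completion callback.
  void finish_one() {
    --in_flight;
    if (!waiting.empty()) {
      auto next = std::move(waiting.front());
      waiting.pop_front();
      ++in_flight;
      next();
    }
  }
};

int main() {
  BacktraceThrottle throttle(2);
  for (int ino = 100; ino < 105; ++ino)
    throttle.submit([ino] { std::cout << "reading backtrace for ino " << ino << "\n"; });
  // Completions would normally arrive asynchronously; simulate two of them.
  throttle.finish_one();
  throttle.finish_one();
  return 0;
}

The point is just that while the first reads are in flight, the later
ones sit in the queue, and by the time they run many of them should hit
dirfrags that are already in cache.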

There are probably other, bigger changes we could make for this case,
such as using the path passed in cap_reconnect_t to be smarter, or
even adding a metadata pool structure that would provide super-fast
backtrace lookup for the N most recently touched inodes -- I'm not
saying we necessarily want to go that far!

John


---------- Forwarded message ----------
From: Burkhard Linke <Burkhard.Linke@computational.bio.uni-giessen.de>
Date: Mon, Jan 30, 2017 at 7:09 AM
Subject: Re: [ceph-users] MDS flapping: how to increase MDS timeouts?
To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>


Hi,



On 01/26/2017 03:34 PM, John Spray wrote:
>
> On Thu, Jan 26, 2017 at 8:18 AM, Burkhard Linke
> <Burkhard.Linke@computational.bio.uni-giessen.de> wrote:
>>
>> Hi,
>>
>>
>> we are running two MDS servers in an active/standby-replay setup. Recently we
>> had to disconnect the active MDS server, and the failover to the standby worked
>> as expected.
>>
>>
>> The filesystem currently contains over 5 million files, so reading all the
>> metadata information from the data pool took too long, since the information
>> was not available in the OSD page caches. The MDS was timed out by the mons,
>> and a failover switch to the former active MDS (which was available as a
>> standby again) happened. This MDS in turn had to read the metadata, again
>> running into a timeout, another failover, and so on. I resolved the situation
>> by disabling one of the MDSs, which kept the mons from failing the now solely
>> available MDS.
>
> The MDS does not re-read every inode on startup -- rather, it replays
> its journal (the overall number of files in your system does not
> factor into this).
>
>> So given a large filesystem, how do I prevent failover flapping between MDS
>> instances that are in the rejoin state and reading the inode information?
>
> The monitor's decision to fail an unresponsive MDS is based on the MDS
> not sending a beacon to the mon -- there is no limit on how long an
> MDS is allowed to stay in a given state (such as rejoin).
>
> So there are two things to investigate here:
>   * Why is the MDS taking so long to start?
>   * Why is the MDS failing to send beacons to the monitor while it is
> in whatever process that is taking it so long?


Under normal operation our system has about 4.5-4.9 million active
caps. Most of them (~4 million) are associated with the machine
running the nightly backups.

I assume that during the rejoin phase the MDS is renewing the
clients' caps. We see a massive amount of small I/O on the data pool
(up to 30,000-40,000 IOPS) during the rejoin phase. Does the MDS need
to access the inode information to renew a cap? That would explain the
high number of IOPS and why the rejoin phase can take up to 20
minutes.

I'm not sure about the second question, since the IOPS should not
prevent beacons from reaching the monitors. We will have to move the
MDS servers to different racks this week; I'll try to bump up the
debug level beforehand.


Regards,
Burkhard


* Re: mds: Doing fewer backtrace reads during rejoin (was: MDS flapping: how to increase MDS timeouts?)
From: Sage Weil @ 2017-01-30 14:40 UTC (permalink / raw)
  To: John Spray; +Cc: Ceph Development, Gregory Farnum, Zheng Yan, Burkhard Linke

On Mon, 30 Jan 2017, John Spray wrote:
> This case (see the forwarded message below) shows that our current
> rejoin code handles situations with many capabilities quite badly -- I
> think we should try to improve this soon.
> 
> One thought I have is to just throttle the number of open_inos that we
> do, so that the cache gets populated with the already-hit dirfrags
> before we try to load more backtraces. I've created a ticket for that
> here: http://tracker.ceph.com/issues/18730 (should be pretty simple and
> doable for luminous).  That would help in cases where many of the
> affected inodes are in the same directory (which I expect is true of
> all real workloads).

This sounds like the way to go.  At the very least it will throttle.

> There are probably other, bigger changes we could make for this case,
> such as using the path passed in cap_reconnect_t to be smarter, or
> even adding a metadata pool structure that would provide super-fast
> backtrace lookup for the N most recently touched inodes -- I'm not
> saying we necessarily want to go that far!

If we had a hint on the parent directory, we could track in-flight
open_inos and limit them per parent directory, since that is where the
duplicated/wasted work is generally coming from...
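
Roughly something like this standalone toy (the names are made up and
this isn't real MDS code, just the idea): keep a queue of pending inos
per parent, issue only the first fetch for each parent, and drain the
rest from the cache once the dirfrag comes in.

#include <cstdint>
#include <deque>
#include <iostream>
#include <unordered_map>

using ino_t64 = std::uint64_t;   // stand-in for an inode number type

struct PerParentLimiter {
  // parent dir ino -> child inos waiting on that parent's dirfrag
  std::unordered_map<ino_t64, std::deque<ino_t64>> pending;

  // Returns true if the caller should issue the fetch now; false means a
  // fetch for this parent is already in flight and we just queued behind it.
  bool request(ino_t64 parent, ino_t64 child) {
    auto& q = pending[parent];
    q.push_back(child);
    return q.size() == 1;
  }

  // Called once the dirfrag for 'parent' has been loaded: everything queued
  // on it can now be resolved from the cache instead of via backtraces.
  void dirfrag_loaded(ino_t64 parent) {
    for (ino_t64 child : pending[parent])
      std::cout << "resolve ino " << child << " from cached dirfrag " << parent << "\n";
    pending.erase(parent);
  }
};

int main() {
  PerParentLimiter limiter;
  // Three caps whose reconnect hints all point at the same parent directory.
  for (ino_t64 child : {1001, 1002, 1003})
    if (limiter.request(/*parent=*/42, child))
      std::cout << "issue a single fetch for dirfrag 42 (triggered by ino " << child << ")\n";
  limiter.dirfrag_loaded(42);
  return 0;
}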

sage

* Re: mds: Doing fewer backtrace reads during rejoin (was: MDS flapping: how to increase MDS timeouts?)
From: Gregory Farnum @ 2017-01-30 21:07 UTC (permalink / raw)
  To: John Spray; +Cc: Ceph Development, Zheng Yan, Sage Weil, Burkhard Linke

On Mon, Jan 30, 2017 at 6:30 AM, John Spray <jspray@redhat.com> wrote:
> This case (see the forwarded message below) shows that our current
> rejoin code handles situations with many capabilities quite badly -- I
> think we should try to improve this soon.
>
> One thought I have is to just throttle the number of open_inos that we
> do, so that the cache gets populated with the already-hit dirfrags
> before we try to load more backtraces. I've created a ticket for that
> here: http://tracker.ceph.com/issues/18730 (should be pretty simple and
> doable for luminous).  That would help in cases where many of the
> affected inodes are in the same directory (which I expect is true of
> all real workloads).

My concern here is that if we're in a case where the caps are so
scattered, just a straight throttle like that might slow us down even
more as we read in directories for a cap, then throw them out. :/

>
> There are probably other, bigger changes we could make for this case,
> such as using the path passed in cap_reconnect_t to be smarter, or
> even adding a metadata pool structure that would provide super-fast
> backtrace lookup for the N most recently touched inodes -- I'm not
> saying we necessarily want to go that far!

I don't think we want to be doing durable storage for something like
that any more than we already do. I'm a little surprised this isn't
handled by journaled open inodes -- are we simply dropping some of them
after a long enough period without activity, or are we only journaling
inode numbers?
The possibility of using the paths to try to aggregate into
directories makes a lot more sense to me. We ought to be able to set
up lists of caps as waiters on a directory being read and then attempt
to process them as stuff comes in, or else put them into a proper
lookup?
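
Something along these lines, roughly -- a toy standalone illustration
with made-up names, not the actual reconnect path: bucket the cap hints
by the directory component of the path from cap_reconnect_t, read each
directory once, and then drain the caps waiting on it.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct CapHint {
  std::uint64_t ino;
  std::string path;   // the path hint carried in the client's reconnect
};

static std::string parent_dir(const std::string& path) {
  auto pos = path.find_last_of('/');
  return pos == std::string::npos ? std::string("/") : path.substr(0, pos);
}

int main() {
  std::vector<CapHint> reconnects = {
    {101, "/backups/day1/a"}, {102, "/backups/day1/b"}, {103, "/etc/conf"},
  };

  // Caps become waiters on the directory that should contain them.
  std::map<std::string, std::vector<CapHint>> waiters;
  for (const auto& c : reconnects)
    waiters[parent_dir(c.path)].push_back(c);

  for (const auto& [dir, caps] : waiters) {
    std::cout << "read dir " << dir << " once\n";   // one read per directory
    for (const auto& c : caps)                      // process its waiters as it comes in
      std::cout << "  process cap on ino " << c.ino << "\n";
  }
  return 0;
}

Anything whose parent read doesn't turn it up would then fall back to
the normal lookup path.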
-Greg
