* Long peering - throttle at FileStore::queue_transactions
@ 2016-01-04 23:32 Guang Yang
  2016-01-05  1:17 ` Samuel Just
  2016-01-05  3:21 ` Sage Weil
  0 siblings, 2 replies; 5+ messages in thread
From: Guang Yang @ 2016-01-04 23:32 UTC (permalink / raw)
  To: ceph-devel, ceph-users; +Cc: sjust

Hi Cephers,
Happy New Year! I have a question regarding the long PG peering issue.

Over the last several days I have been looking into the *long peering*
problem we see when starting an OSD / OSD host. What I observed was that
the two peering worker threads were throttled (stuck) when trying to
queue new transactions (writing the pg log), so the peering process was
dramatically slowed down.
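
(For context: the throttle being hit here is presumably FileStore's op
queue throttle, governed by the settings below; the values shown are
only illustrative, check your release's defaults.)

  [osd]
  filestore queue max ops   = 50          ; max transactions queued but not yet applied
  filestore queue max bytes = 104857600   ; max queued-but-unapplied bytes (100 MB)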

The first question that came to me was: what were the transactions in
the queue? The major ones, as far as I could see, included:

- The osd_map and incremental osd_map transactions. This happens if the
OSD has been down for a while (in a large cluster), or when the cluster
was upgraded, which left the osd_map epoch the down OSD had far behind
the latest osd_map epoch. During boot, the OSD needs to persist all of
those osd_maps, generating lots of filestore transactions (linear in
the epoch gap).
> Since the PG was not involved in most of those epochs, could we only take and persist the osd_maps which matter to the PGs on the OSD?

- Lots of deletion transactions: as a PG boots, it needs to merge the
PG log from its peers, and for each deletion log entry it queues the
deletion transaction immediately.
> Could we delay queueing those transactions until all PGs on the host are peered?

Thanks,
Guang


* Re: Long peering - throttle at FileStore::queue_transactions
  2016-01-04 23:32 Long peering - throttle at FileStore::queue_transactions Guang Yang
@ 2016-01-05  1:17 ` Samuel Just
  2016-01-05  3:21 ` Sage Weil
  1 sibling, 0 replies; 5+ messages in thread
From: Samuel Just @ 2016-01-05  1:17 UTC (permalink / raw)
  To: Guang Yang; +Cc: ceph-devel, ceph-users

We need every OSDMap persisted before persisting later ones because we
rely on there being no holes for a bunch of reasons.

The deletion transactions are more interesting.  They're not part of the
boot process per se; these are deletions resulting from merging in a log
from a peer which logically removed an object.  It's more noticeable on
boot because all PGs will see these operations at once (if there are a
bunch of deletes happening).  Currently we need to process these
transactions before we can serve reads (that is, before we activate),
since we use the on-disk state (modulo the objectcontext locks) as
authoritative.  That transaction IIRC also contains the updated PGLog.
We can't avoid writing down the PGLog prior to activation, but we *can*
delay the deletes (and even batch/throttle them) if we do some work:
1) During activation, we need to maintain a set of to-be-deleted
objects.  For each of these objects, we need to populate the
objectcontext cache with an exists=false objectcontext so that we
don't erroneously read the deleted data.  Each of the entries in the
to-be-deleted object set would have a reference to the context to keep
it alive until the deletion is processed (see the sketch after this
list).
2) Any write operation which references one of these objects needs to
be preceded by a delete if one has not yet been queued (and the
to-be-deleted set updated appropriately).  The tricky part is that the
primary and replicas may have different objects in this set...  The
replica would have to insert deletes ahead of any subop (or the EC
equivalent) it gets from the primary.  For that to work, it needs to
have something like the obc cache.  I have a wip-replica-read branch
which refactors object locking to allow the replica to maintain locks
(to avoid replica reads conflicting with writes).  That machinery
would probably be the right place to put it.
3) We need to make sure that if a node restarts anywhere in this
process, it correctly repopulates the set of to-be-deleted entries.
We might consider a deleted-to version in the log?  Not sure about
this one, since it would be different on the replica and the primary.
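
For step 1, something like the following would do -- a purely
illustrative sketch, not the actual ReplicatedPG code (the struct and
helper names are made up):

  // Track objects whose on-disk delete we have deferred; pin an
  // exists=false object context for each so reads treat them as gone
  // before the delete transaction is actually applied.
  struct DeferredDelete {
    hobject_t oid;
    ObjectContextRef obc;     // holds exists=false, keeps the obc cached
    eversion_t at_version;    // log entry that logically deleted the object
  };
  map<hobject_t, DeferredDelete> to_be_deleted;

  // On activation, for each delete we choose to defer:
  //   obc = get_object_context(oid, true /* can create */);
  //   obc->obs.exists = false;
  //   to_be_deleted[oid] = DeferredDelete{oid, obc, entry.version};
  //
  // Before queueing any later write that touches oid:
  //   if (to_be_deleted.count(oid)) {
  //     t.remove(coll, oid);          // flush the deferred delete first
  //     to_be_deleted.erase(oid);
  //   }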

Anyway, it's actually more complicated than you'd expect and will
require more design (and probably depends on wip-replica-read
landing).
-Sam

On Mon, Jan 4, 2016 at 3:32 PM, Guang Yang <guangyy@gmail.com> wrote:
> Hi Cephers,
> Happy New Year! I have a question regarding the long PG peering issue.
>
> Over the last several days I have been looking into the *long peering*
> problem we see when starting an OSD / OSD host. What I observed was that
> the two peering worker threads were throttled (stuck) when trying to
> queue new transactions (writing the pg log), so the peering process was
> dramatically slowed down.
>
> The first question that came to me was: what were the transactions in
> the queue? The major ones, as far as I could see, included:
>
> - The osd_map and incremental osd_map transactions. This happens if the
> OSD has been down for a while (in a large cluster), or when the cluster
> was upgraded, which left the osd_map epoch the down OSD had far behind
> the latest osd_map epoch. During boot, the OSD needs to persist all of
> those osd_maps, generating lots of filestore transactions (linear in
> the epoch gap).
>> Since the PG was not involved in most of those epochs, could we only take and persist the osd_maps which matter to the PGs on the OSD?
>
> - Lots of deletion transactions: as a PG boots, it needs to merge the
> PG log from its peers, and for each deletion log entry it queues the
> deletion transaction immediately.
>> Could we delay queueing those transactions until all PGs on the host are peered?
>
> Thanks,
> Guang


* Re: Long peering - throttle at FileStore::queue_transactions
  2016-01-04 23:32 Long peering - throttle at FileStore::queue_transactions Guang Yang
  2016-01-05  1:17 ` Samuel Just
@ 2016-01-05  3:21 ` Sage Weil
  2016-01-05 22:33   ` Guang Yang
  1 sibling, 1 reply; 5+ messages in thread
From: Sage Weil @ 2016-01-05  3:21 UTC (permalink / raw)
  To: Guang Yang; +Cc: ceph-devel, ceph-users, sjust

On Mon, 4 Jan 2016, Guang Yang wrote:
> Hi Cephers,
> Happy New Year! I have a question regarding the long PG peering issue.
>
> Over the last several days I have been looking into the *long peering*
> problem we see when starting an OSD / OSD host. What I observed was that
> the two peering worker threads were throttled (stuck) when trying to
> queue new transactions (writing the pg log), so the peering process was
> dramatically slowed down.
>
> The first question that came to me was: what were the transactions in
> the queue? The major ones, as far as I could see, included:
>
> - The osd_map and incremental osd_map transactions. This happens if the
> OSD has been down for a while (in a large cluster), or when the cluster
> was upgraded, which left the osd_map epoch the down OSD had far behind
> the latest osd_map epoch. During boot, the OSD needs to persist all of
> those osd_maps, generating lots of filestore transactions (linear in
> the epoch gap).
> > Since the PG was not involved in most of those epochs, could we only take and persist the osd_maps which matter to the PGs on the OSD?

This part should happen before the OSD sends the MOSDBoot message, before 
anyone knows it exists.  There is a tunable threshold that controls how 
recent the map has to be before the OSD tries to boot.  If you're 
seeing this in the real world, we probably just need to adjust that value 
way down to something small(er).

sage



* Re: Long peering - throttle at FileStore::queue_transactions
  2016-01-05  3:21 ` Sage Weil
@ 2016-01-05 22:33   ` Guang Yang
  2016-01-06 14:09     ` Sage Weil
  0 siblings, 1 reply; 5+ messages in thread
From: Guang Yang @ 2016-01-05 22:33 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, ceph-users, Samuel Just

On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 4 Jan 2016, Guang Yang wrote:
>> Hi Cephers,
>> Happy New Year! I have a question regarding the long PG peering issue.
>>
>> Over the last several days I have been looking into the *long peering*
>> problem we see when starting an OSD / OSD host. What I observed was that
>> the two peering worker threads were throttled (stuck) when trying to
>> queue new transactions (writing the pg log), so the peering process was
>> dramatically slowed down.
>>
>> The first question that came to me was: what were the transactions in
>> the queue? The major ones, as far as I could see, included:
>>
>> - The osd_map and incremental osd_map transactions. This happens if the
>> OSD has been down for a while (in a large cluster), or when the cluster
>> was upgraded, which left the osd_map epoch the down OSD had far behind
>> the latest osd_map epoch. During boot, the OSD needs to persist all of
>> those osd_maps, generating lots of filestore transactions (linear in
>> the epoch gap).
>> > Since the PG was not involved in most of those epochs, could we only take and persist the osd_maps which matter to the PGs on the OSD?
>
> This part should happen before the OSD sends the MOSDBoot message, before
> anyone knows it exists.  There is a tunable threshold that controls how
> recent the map has to be before the OSD tries to boot.  If you're
> seeing this in the real world, we probably just need to adjust that value
> way down to something small(er).
It queues the transactions and then sends out the MOSDBoot message, so
there is still a chance that it could contend with the peering ops
(especially on large clusters where there is a lot of activity that
generates many osdmap epochs). Any chance we could change
*queue_transactions* to *apply_transactions*, so that we block there
waiting for the osdmap to be persisted? At least we might be able to
do that during OSD boot. The concern is that if the OSD is active,
apply_transaction would take longer while holding the osd_lock..
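
Roughly what I have in mind, as a sketch only (argument lists are
elided/approximate; the real call sites in OSD::handle_osd_map are
more involved):

  // today: async, so the maps may still sit in the FileStore queue
  // when we send MOSDBoot
  store->queue_transactions(osr, tls, onreadable, oncommit);

  // proposed (boot path only): block until the osdmaps are durable
  store->apply_transactions(osr, tls);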
I couldn't find such a tunable, could you elaborate? Thanks!
>
> sage
>


* Re: Long peering - throttle at FileStore::queue_transactions
  2016-01-05 22:33   ` Guang Yang
@ 2016-01-06 14:09     ` Sage Weil
  0 siblings, 0 replies; 5+ messages in thread
From: Sage Weil @ 2016-01-06 14:09 UTC (permalink / raw)
  To: Guang Yang; +Cc: ceph-devel, ceph-users, Samuel Just

On Tue, 5 Jan 2016, Guang Yang wrote:
> On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil <sage@newdream.net> wrote:
> > On Mon, 4 Jan 2016, Guang Yang wrote:
> >> Hi Cephers,
> >> Happy New Year! I have a question regarding the long PG peering issue.
> >>
> >> Over the last several days I have been looking into the *long peering*
> >> problem we see when starting an OSD / OSD host. What I observed was that
> >> the two peering worker threads were throttled (stuck) when trying to
> >> queue new transactions (writing the pg log), so the peering process was
> >> dramatically slowed down.
> >>
> >> The first question that came to me was: what were the transactions in
> >> the queue? The major ones, as far as I could see, included:
> >>
> >> - The osd_map and incremental osd_map transactions. This happens if the
> >> OSD has been down for a while (in a large cluster), or when the cluster
> >> was upgraded, which left the osd_map epoch the down OSD had far behind
> >> the latest osd_map epoch. During boot, the OSD needs to persist all of
> >> those osd_maps, generating lots of filestore transactions (linear in
> >> the epoch gap).
> >> > Since the PG was not involved in most of those epochs, could we only take and persist the osd_maps which matter to the PGs on the OSD?
> >
> > This part should happen before the OSD sends the MOSDBoot message, before
> > anyone knows it exists.  There is a tunable threshold that controls how
> > recent the map has to be before the OSD tries to boot.  If you're
> > seeing this in the real world, we probably just need to adjust that value
> > way down to something small(er).
> It queues the transactions and then sends out the MOSDBoot message, so
> there is still a chance that it could contend with the peering ops
> (especially on large clusters where there is a lot of activity that
> generates many osdmap epochs). Any chance we could change
> *queue_transactions* to *apply_transactions*, so that we block there
> waiting for the osdmap to be persisted? At least we might be able to
> do that during OSD boot. The concern is that if the OSD is active,
> apply_transaction would take longer while holding the osd_lock..
> I couldn't find such a tunable, could you elaborate? Thanks!

Yeah, that sounds like a good idea (and clearly safe).  Probably a simpler 
fix is to just call store->flush() or similar before sending the boot 
message?
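
That is, something along these lines in the boot path (a sketch only;
the exact function names and call site in OSD.cc may differ):

  // after queueing the osdmap transactions during preboot:
  store->flush();    // wait for everything queued so far to be durable
  _send_boot();      // only then advertise ourselves with MOSDBoot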

sage


