* bluestore locking
@ 2016-08-11 16:03 Sage Weil
  2016-08-11 16:51 ` Somnath Roy
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Sage Weil @ 2016-08-11 16:03 UTC (permalink / raw)
  To: Somnath.Roy; +Cc: ceph-devel

After our conversation this morning I went through the locking a bit more 
and I think we're in okay shape.  To summarize:

These 'forward lookup' structures are protected by coll->lock:

 Bnode::blob_map -> Blob                                    coll->lock
 Onode::blob_map -> Blob                                    coll->lock

These ones are protected by cache->lock:

 Collection::OnodeSpace::onode_map -> Onode (unordered_map)     cache->lock
 Blob::bc -> Buffer                                             cache->lock

The BnodeSet is a bit different because it is depopulated when the last 
onode ref goes away.  But it has its own lock:

 Collection::BnodeSet::uset -> Bnode (intrusive set)        BnodeSet::lock

Anyway, the point of this is that the cache trim() can do everything it 
needs with just cache->lock.  That means that during an update, we 
normally have coll->lock to protect the structures we're touching, and if 
we are adding onodes to OnodeSpace or BufferSpace we additionally take 
cache->lock for the appropriate cache fragment.
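
To make that concrete, here is a rough sketch of the update-path pattern.
The types below (Collection, Cache, Onode, add_onode) are simplified,
hypothetical stand-ins, not the real BlueStore classes:

  // Sketch only: simplified stand-ins for the types discussed above.
  #include <cstdint>
  #include <map>
  #include <memory>
  #include <mutex>
  #include <shared_mutex>
  #include <string>
  #include <unordered_map>

  struct Onode { std::string key; };
  using OnodeRef = std::shared_ptr<Onode>;

  struct Cache {
    std::mutex lock;                                      // cache->lock
    std::unordered_map<std::string, OnodeRef> onode_map;  // OnodeSpace stand-in
  };

  struct Collection {
    std::shared_mutex lock;            // coll->lock (an RWLock in the real code)
    Cache* cache;                      // cache shard this collection maps to
    std::map<uint64_t, int> blob_map;  // forward-lookup structure, coll->lock only
  };

  // Update path: coll->lock covers the forward-lookup structures we touch;
  // cache->lock is taken in addition only while inserting into the OnodeSpace.
  void add_onode(Collection& c, const std::string& key) {
    std::unique_lock<std::shared_mutex> l(c.lock);        // write side of coll->lock
    OnodeRef o = std::make_shared<Onode>(Onode{key});
    {
      std::lock_guard<std::mutex> cl(c.cache->lock);      // cache->lock for the shard
      c.cache->onode_map.emplace(key, o);
    }
    c.blob_map[0] = 1;   // further updates continue under coll->lock alone
  }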

We were getting tripped up from the blob_map iteration in _txc_state_proc 
because we were holding no locks at all (or, previously, a collection lock 
that may or may not be the right one).  Igor's PR fixes this by 
making Blob refcounted and keeping a list of these.  The finish_write() 
function takes cache->lock as needed.  Also, it's worth pointing out that 
the blobs that we're touching will all exist under an Onode that is in the 
onodes list, and it turns out that the trim() is already doing the right 
thing and not trimming Onodes that still have any refs.
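
Roughly, the shape of what the PR does, as I understand it (again a
simplified sketch with made-up names, not the actual code):

  // Sketch only: refcounted blobs collected by the txc and fixed up later
  // under nothing but cache->lock.
  #include <cstdint>
  #include <memory>
  #include <mutex>
  #include <vector>

  struct Cache { std::mutex lock; };

  struct Blob {
    // buffer-cache state hanging off the blob; protected by cache->lock
    void finish_write_locked(uint64_t /*seq*/) { /* move buffers out of WRITING */ }
  };
  using BlobRef = std::shared_ptr<Blob>;

  struct TransContext {
    // blobs this txc wrote to; the refs keep them alive even if a later
    // update drops them from an Onode's blob_map under coll->lock
    std::vector<BlobRef> blobs;
  };

  void txc_finish_writes(TransContext& txc, Cache& cache, uint64_t seq) {
    std::lock_guard<std::mutex> l(cache.lock);   // cache->lock only, no coll->lock
    for (auto& b : txc.blobs)
      b->finish_write_locked(seq);
  }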

Which leaves me a bit confused as to how we originally were 
crashing, because we were taking the first_collection lock.  My guess is 
that first_collection was not the right collection and a racing update was 
updating the Onode.  The refcounting alone sorts this out.  My other fix 
would have also resolved it by taking the correct collection's 
lock, I think.  Unless there is another locking problem I'm not seeing.. 
but I think what is in master now has it covered.

Igor, Somnath, does the current strategy make sense?

sage



* RE: bluestore locking
  2016-08-11 16:03 bluestore locking Sage Weil
@ 2016-08-11 16:51 ` Somnath Roy
  2016-08-11 16:58   ` Sage Weil
  2016-08-12 14:23 ` Igor Fedotov
  2016-08-12 14:25 ` Igor Fedotov
  2 siblings, 1 reply; 9+ messages in thread
From: Somnath Roy @ 2016-08-11 16:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

<<inline with [Somnath]

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Thursday, August 11, 2016 9:04 AM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: bluestore locking

After our conversation this morning I went through the locking a bit more and I think we're in okay shape.  To summarize:

These 'forward lookup' structures are protected by coll->lock:

 Bnode::blob_map -> Blob                                    coll->lock
 Onode::blob_map -> Blob                                    coll->lock

These ones are protected by cache->lock:

 Collection::OnodeSpace::onode_map -> Onode (unordered_map)     cache->lock
 Blob::bc -> Buffer                                             cache->lock

The BnodeSet is a bit different because it is depopulated when the last onode ref goes away.  But it has its own lock:

 Collection::BnodeSet::uset -> Bnode (intrusive set)        BnodeSet::lock

Anyway, the point of this is that the cache trim() can do everything it needs with just cache->lock.  That means that during an update, we normally have coll->lock to protect the structures we're touching, and if we are adding onodes to OnodeSpace or BufferSpace we additionally take
cache->lock for the appropriate cache fragment.

We were getting tripped up from the blob_map iteration in _txc_state_proc because we were holding no locks at all (or, previously, a collection lock that may or may not be the right one).  Igor's PR fixes this by making Blob refcounted and keeping a list of these.  The finish_write() function takes cache->lock as needed.  Also, it's worth pointing out that the blobs that we're touching will all exist under an Onode that is in the onodes list, and it turns out that the trim() is already doing the right thing and not trimming Onodes that still have any refs.

[Somnath] We are not seeing the crash in the current code because it is protected by the first_collection lock in _txc_state_*. But that patch opened up a deadlock like this:

Thread 1 -> coll->Wlock -> onode->flush() -> waiting for unfinished txc
Thread 2 -> _txc_finish_io -> _txc_state_* -> coll->Rlock

There is a small window because txcs are added to the o->flush_txns list after _txc_add releases coll->Wlock().

BTW, the crash is happening because the iterator got invalidated, not because a blob was deleted (though we may eventually hit that too). Making Blob refcounted does not solve the fact that the blob has been erased from the blob_map set. We need to protect that with a lock.
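
To illustrate what I mean about the iterator, a toy example (stand-in
types, not the real intrusive blob_map):

  // Sketch only: refcounting an element does not make concurrent
  // iteration over the container that holds it safe.
  #include <map>
  #include <memory>

  struct Blob {};
  using BlobRef = std::shared_ptr<Blob>;

  int main() {
    std::map<int, BlobRef> blob_map;          // stand-in for Onode::blob_map
    blob_map[1] = std::make_shared<Blob>();
    blob_map[2] = std::make_shared<Blob>();

    auto it = blob_map.find(1);               // "thread A" holds an iterator
    BlobRef pinned = it->second;              // the refcount keeps the Blob alive...

    blob_map.erase(1);                        // ...but "thread B" erasing the entry
                                              // still invalidates A's iterator;
                                              // ++it after this would be UB
    return 0;
  }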

Which leaves me a bit confused as to how we originally were crashing, because we were taking the first_collection lock.  My guess is that first_collection was not the right collection and a racing update was updating the Onode.  The refcounting alone sorts this out.  My other fix would have also resolved it by taking the correct collection's lock, I think.  Unless there is another locking problem I'm not seeing..
but I think what is in master now has it covered.

[Somnath] As I mentioned above, after taking the first_collection lock we were not crashing anymore; it was the deadlock we were hitting. So the fix would be to protect blob_map with a lock. If it is coll->lock we hit the deadlock; whether cache->lock works we still need to see. The PR I sent introduces a blob_map lock to fix this.

But Mark was hitting an onode corruption, and initially I thought it was because trim() is *not invoked* within coll->lock() from _osr_reap_done(). The fix in the PR seems to be working for him, and if so, trim() is what is deleting the onode and we don't have an explanation for how that happened. Is there any place in the code other than trim() that can delete an onode? I couldn't find any..

Igor, Somnath, does the current strategy make sense?

sage



* RE: bluestore locking
  2016-08-11 16:51 ` Somnath Roy
@ 2016-08-11 16:58   ` Sage Weil
  2016-08-11 17:00     ` Mark Nelson
  2016-08-11 17:22     ` Somnath Roy
  0 siblings, 2 replies; 9+ messages in thread
From: Sage Weil @ 2016-08-11 16:58 UTC (permalink / raw)
  To: Somnath Roy; +Cc: ceph-devel

On Thu, 11 Aug 2016, Somnath Roy wrote:
> <<inline with [Somnath]
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, August 11, 2016 9:04 AM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: bluestore locking
> 
> After our conversation this morning I went through the locking a bit more and I think we're in okay shape.  To summarize:
> 
> These 'forward lookup' structures are protected by coll->lock:
> 
>  Bnode::blob_map -> Blob                                    coll->lock
>  Onode::blob_map -> Blob                                    coll->lock
> 
> These ones are protected by cache->lock:
> 
>  Collection::OnodeSpace::onode_map -> Onode (unordered_map)     cache->lock
>  Blob::bc -> Buffer                                             cache->lock
> 
> The BnodeSet is a bit different because it is depopulated when the last onode ref goes away.  But it has its own lock:
> 
>  Collection::BnodeSet::uset -> Bnode (intrusive set)        BnodeSet::lock
> 
> Anyway, the point of this is that the cache trim() can do everything it 
> needs with just cache->lock.  That means that during an update, we 
> normally have coll->lock to protect the structures we're touching, and 
> if we are adding onodes to OnodeSpace or BufferSpace we additionally 
> take cache->lock for the appropriate cache fragment.
> 
> We were getting tripped up from the blob_map iteration in 
> _txc_state_proc because we were holding no locks at all (or, previously, 
> a collection lock that may or may not be the right one).  Igor's PR 
> fixes this by making Blob refcounted and keeping a list of these.  The 
> finish_write() function takes cache->lock as needed.  Also, it's worth 
> pointing out that the blobs that we're touching will all exist under an 
> Onode that is in the onodes list, and it turns out that the trim() is 
> already doing the right thing and not trimming Onodes that still have 
> any refs.
> 
> [Somnath] The crash we are not seeing in the current code as it is 
> protected by first_collection lock in _txc_state_*. But, that patch 
> opened up a deadlock like this.
> 
> Thread 1 -> coll->Wlock -> onode->flush() -> waiting for unfinished txc
> Thread 2 -> _txc_finish_io -> _txc_state_* -> coll->Rlock
> 
> There is a small window as txcs are added in the o->flush_txns list 
> after _txc_add release coll->Wlock()
> 
> BTW, the crash is happening as iterator got invalidated and not because 
> blob is deleted (which eventually may hit that). Making blob ref counted 
> will not be solving that the blob is been erased from the blob_map set. 
> We need to protect that with a lock.

Well, the blob refcounting fixes the deadlock certainly, bc we no longer 
need collection lock in that path.  So the question remaining is if 
there is a locking issue still...
 
> Which leaves me a bit confused as to how we originally were crashing, 
> because we were taking the first_collection lock.  My guess is that 
> first_collection was not the right collection and a racing update was 
> updating the Onode.  The refcounting alone sorts this out.  My other fix 
> would have also resolved it by taking the correct collection's lock, I 
> think.  Unless there is another locking problem I'm not seeing.. but I 
> think what is in master now has it covered.

And I think there isn't.  We no longer use an iterator (since we have the 
list of blobs to fix up), and the blob finish_write stuff is protected by 
cache->lock.  So I think this should resolve it.  But.. whatever workload 
you were using that triggered this before, perhaps you can run it 
again?

> [Somnath] As I mentioned above , after taking first_collection lock we 
> were not crashing anymore , it was the deadlock we were hitting. So, the 
> fix would be to protect blob_map with a lock. If it is coll->lock we are 
> hitting deadlock , if it is cache->lock() we need to see. The PR I sent 
> is introducing a blob_map lock and fixing this.
> 
> But, Mark was hitting a onode corruption and initially I thought it is 
> because trim() is *not invoked* within coll->lock() from 
> _osr_reap_done(). The fix in the PR seems to be working for him and if 
> so, trim() is the one is deleting that and we don't have explanation how 
> it happened. Is there any place in the code other than trim() can delete 
> onode ? I couldn't find any..

What is the workload here?  And can you retest with master now?  Trim 
shouldn't need any lock at all (it takes cache->lock on its own).  The 
open question is whether there are other paths that are touching 
cache->lock protected things that aren't taking cache->lock.  I don't 
think so, but verifying with the problematic workloads will give us more 
confidence...
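
For reference, the shape of trim() I have in mind -- a simplified sketch,
not the actual implementation:

  // Sketch only: trim takes cache->lock itself and never touches an Onode
  // that anything else still references.
  #include <list>
  #include <memory>
  #include <mutex>

  struct Onode {};
  using OnodeRef = std::shared_ptr<Onode>;

  struct Cache {
    std::mutex lock;                  // cache->lock
    std::list<OnodeRef> onode_lru;    // coldest entries at the front
    size_t max_onodes = 1024;

    void trim() {
      std::lock_guard<std::mutex> l(lock);  // taken here; callers hold nothing
      auto it = onode_lru.begin();
      while (onode_lru.size() > max_onodes && it != onode_lru.end()) {
        if (it->use_count() > 1) {          // the LRU holds one ref; more than one
          ++it;                             // means a txc etc. still references it: skip
        } else {
          it = onode_lru.erase(it);         // cold and unreferenced: drop it
        }
      }
    }
  };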

sage


* Re: bluestore locking
  2016-08-11 16:58   ` Sage Weil
@ 2016-08-11 17:00     ` Mark Nelson
  2016-08-11 17:22     ` Somnath Roy
  1 sibling, 0 replies; 9+ messages in thread
From: Mark Nelson @ 2016-08-11 17:00 UTC (permalink / raw)
  To: Sage Weil, Somnath Roy; +Cc: ceph-devel

On 08/11/2016 11:58 AM, Sage Weil wrote:
> On Thu, 11 Aug 2016, Somnath Roy wrote:
>> <<inline with [Somnath]
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sweil@redhat.com]
>> Sent: Thursday, August 11, 2016 9:04 AM
>> To: Somnath Roy
>> Cc: ceph-devel@vger.kernel.org
>> Subject: bluestore locking
>>
>> After our conversation this morning I went through the locking a bit more and I think we're in okay shape.  To summarize:
>>
>> These 'forward lookup' structures are protected by coll->lock:
>>
>>  Bnode::blob_map -> Blob                                    coll->lock
>>  Onode::blob_map -> Blob                                    coll->lock
>>
>> These ones are protected by cache->lock:
>>
>>  Collection::OnodeSpace::onode_map -> Onode (unordered_map)     cache->lock
>>  Blob::bc -> Buffer                                             cache->lock
>>
>> The BnodeSet is a bit different because it is depopulated when the last onode ref goes away.  But it has its own lock:
>>
>>  Collection::BnodeSet::uset -> Bnode (intrusive set)        BnodeSet::lock
>>
>> Anyway, the point of this is that the cache trim() can do everything it
>> needs with just cache->lock.  That means that during an update, we
>> normally have coll->lock to protect the structures we're touching, and
>> if we are adding onodes to OnodeSpace or BufferSpace we additionally
>> take cache->lock for the appropriate cache fragment.
>>
>> We were getting tripped up from the blob_map iteration in
>> _txc_state_proc because we were holding no locks at all (or, previously,
>> a collection lock that may or may not be the right one).  Igor's PR
>> fixes this by making Blob refcounted and keeping a list of these.  The
>> finish_write() function takes cache->lock as needed.  Also, it's worth
>> pointing out that the blobs that we're touching will all exist under an
>> Onode that is in the onodes list, and it turns out that the trim() is
>> already doing the right thing and not trimming Onodes that still have
>> any refs.
>>
>> [Somnath] The crash we are not seeing in the current code as it is
>> protected by first_collection lock in _txc_state_*. But, that patch
>> opened up a deadlock like this.
>>
>> Thread 1 -> coll->Wlock -> onode->flush() -> waiting for unfinished txc
>> Thread 2 -> _txc_finish_io -> _txc_state_* -> coll->Rlock
>>
>> There is a small window as txcs are added in the o->flush_txns list
>> after _txc_add release coll->Wlock()
>>
>> BTW, the crash is happening as iterator got invalidated and not because
>> blob is deleted (which eventually may hit that). Making blob ref counted
>> will not be solving that the blob is been erased from the blob_map set.
>> We need to protect that with a lock.
>
> Well, the blob refcounting fixes the deadlock certainly, bc we no longer
> need collection lock in that path.  So the question remaining is if
> there is a locking issue still...
>
>> Which leaves me a bit confused as to how we originally were crashing,
>> because we were taking the first_collection lock.  My guess is that
>> first_collection was not the right collection and a racing update was
>> updating the Onode.  The refcounting alone sorts this out.  My other fix
>> would have also resolved it by taking the correct collection's lock, I
>> think.  Unless there is another locking problem I'm not seeing.. but I
>> think what is in master now has it covered.
>
> And I think there isn't.  We no longer use an iterator (since we have the
> list of blobs to fix up), and the blob finish_write stuff is protected by
>> cache->lock.  So I think this should resolve it.  But.. whatever workload
> you were using that triggered this before, perhaps you can run it
> again?

The workload I've most commonly hit it with is 4k random writes. 
I've got master compiled, so I can run through the tests whenever we are 
ready.  It might take a little while.

>
>> [Somnath] As I mentioned above , after taking first_collection lock we
>> were not crashing anymore , it was the deadlock we were hitting. So, the
>> fix would be to protect blob_map with a lock. If it is coll->lock we are
>> hitting deadlock , if it is cache->lock() we need to see. The PR I sent
>> is introducing a blob_map lock and fixing this.
>>
>> But, Mark was hitting a onode corruption and initially I thought it is
>> because trim() is *not invoked* within coll->lock() from
>> _osr_reap_done(). The fix in the PR seems to be working for him and if
>> so, trim() is the one is deleting that and we don't have explanation how
>> it happened. Is there any place in the code other than trim() can delete
>> onode ? I couldn't find any..
>
> What is the workload here?  And can you retest with master now?  Trim
> shouldn't need any lock at all (it takes cache->lock on its own).  The
> open question is whether there are other paths that are touching
> cache->lock protected things that aren't taking cache->lock.  I don't
> think so, but verifying with the problematic workloads will give us more
> confidence...
>
> sage


* RE: bluestore locking
  2016-08-11 16:58   ` Sage Weil
  2016-08-11 17:00     ` Mark Nelson
@ 2016-08-11 17:22     ` Somnath Roy
  2016-08-11 19:25       ` Sage Weil
  1 sibling, 1 reply; 9+ messages in thread
From: Somnath Roy @ 2016-08-11 17:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Sage,
Probably I am missing something.
The Blob refcount will protect the blob from being deleted, but it will still be erased from blob_map, right ?
Won't any other thread iterating over blob_map then get an invalid iterator and possibly crash ?

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Thursday, August 11, 2016 9:58 AM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: RE: bluestore locking

On Thu, 11 Aug 2016, Somnath Roy wrote:
> <<inline with [Somnath]
>
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, August 11, 2016 9:04 AM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: bluestore locking
>
> After our conversation this morning I went through the locking a bit more and I think we're in okay shape.  To summarize:
>
> These 'forward lookup' structures are protected by coll->lock:
>
>  Bnode::blob_map -> Blob                                    coll->lock
>  Onode::blob_map -> Blob                                    coll->lock
>
> These ones are protected by cache->lock:
>
>  Collection::OnodeSpace::onode_map -> Onode (unordered_map)     cache->lock
>  Blob::bc -> Buffer                                             cache->lock
>
> The BnodeSet is a bit different because it is depopulated when the last onode ref goes away.  But it has its own lock:
>
>  Collection::BnodeSet::uset -> Bnode (intrusive set)        BnodeSet::lock
>
> Anyway, the point of this is that the cache trim() can do everything
> it needs with just cache->lock.  That means that during an update, we
> normally have coll->lock to protect the structures we're touching, and
> if we are adding onodes to OnodeSpace or BufferSpace we additionally
> take cache->lock for the appropriate cache fragment.
>
> We were getting tripped up from the blob_map iteration in
> _txc_state_proc because we were holding no locks at all (or,
> previously, a collection lock that may or may not be the right one).
> Igor's PR fixes this by making Blob refcounted and keeping a list of
> these.  The
> finish_write() function takes cache->lock as needed.  Also, it's worth
> pointing out that the blobs that we're touching will all exist under
> an Onode that is in the onodes list, and it turns out that the trim()
> is already doing the right thing and not trimming Onodes that still
> have any refs.
>
> [Somnath] The crash we are not seeing in the current code as it is
> protected by first_collection lock in _txc_state_*. But, that patch
> opened up a deadlock like this.
>
> Thread 1 -> coll->Wlock -> onode->flush() -> waiting for unfinished
> txc Thread 2 -> _txc_finish_io -> _txc_state_* -> coll->Rlock
>
> There is a small window as txcs are added in the o->flush_txns list
> after _txc_add release coll->Wlock()
>
> BTW, the crash is happening as iterator got invalidated and not
> because blob is deleted (which eventually may hit that). Making blob
> ref counted will not be solving that the blob is been erased from the blob_map set.
> We need to protect that with a lock.

Well, the blob refcounting fixes the deadlock certainly, bc we no longer need collection lock in that path.  So the question remaining is if there is a locking issue still...

> Which leaves me a bit confused as to how we originally were crashing,
> because we were taking the first_collection lock.  My guess is that
> first_collection was not the right collection and a racing update was
> updating the Onode.  The refcounting alone sorts this out.  My other
> fix would have also resolved it by taking the correct collection's
> lock, I think.  Unless there is another locking problem I'm not
> seeing.. but I think what is in master now has it covered.

And I think there isn't.  We no longer use an iterator (since we have the list of blobs to fix up), and the blob finish_write stuff is protected by
cache->lock.  So I think this should resolve it.  But.. whatever workload
you were using that triggered this before, perhaps you can run it again?

> [Somnath] As I mentioned above , after taking first_collection lock we
> were not crashing anymore , it was the deadlock we were hitting. So,
> the fix would be to protect blob_map with a lock. If it is coll->lock
> we are hitting deadlock , if it is cache->lock() we need to see. The
> PR I sent is introducing a blob_map lock and fixing this.
>
> But, Mark was hitting a onode corruption and initially I thought it is
> because trim() is *not invoked* within coll->lock() from
> _osr_reap_done(). The fix in the PR seems to be working for him and if
> so, trim() is the one is deleting that and we don't have explanation
> how it happened. Is there any place in the code other than trim() can
> delete onode ? I couldn't find any..

What is the workload here?  And can you retest with master now?  Trim shouldn't need any lock at all (it takes cache->lock on its own).  The open question is whether there are other paths that are touching
cache->lock protected things that aren't taking cache->lock.  I don't
think so, but verifying with the problematic workloads will give us more confidence...

sage


* RE: bluestore locking
  2016-08-11 17:22     ` Somnath Roy
@ 2016-08-11 19:25       ` Sage Weil
  2016-08-11 20:20         ` Somnath Roy
  0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2016-08-11 19:25 UTC (permalink / raw)
  To: Somnath Roy; +Cc: ceph-devel

On Thu, 11 Aug 2016, Somnath Roy wrote:
> Sage,
> Probably, I am missing something.
> Blob ref count will protect blob from being deleted. But, it will be 
> still erased from blob_map, right ?

The only two ways it would get removed from blob_map would be

 - Another thread is holding the collection lock and doing an update.  
This used to be a problem but isn't any more since we don't iterate over 
blob_map.. we have a list of BlobRefs.

 - We are tearing down the Onode because it is trimmed.  That won't 
happen here because any blobs in our txc blob list belong to onodes (or, 
indirectly, bnodes) referenced by the txc onodes list.
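
In other words, the ownership looks roughly like this (simplified
stand-in types, not the actual classes):

  // Sketch only: what the txc pins while it is in flight.
  #include <memory>
  #include <vector>

  struct Blob {};
  using BlobRef = std::shared_ptr<Blob>;

  struct Onode {
    std::vector<BlobRef> blob_map;   // stand-in for the intrusive blob_map
  };
  using OnodeRef = std::shared_ptr<Onode>;

  struct TransContext {
    std::vector<OnodeRef> onodes;  // pins the onodes, so trim() skips them
    std::vector<BlobRef>  blobs;   // pins the blobs, so they stay valid even if
                                   // an update drops them from a blob_map
  };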

sage


> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, August 11, 2016 9:58 AM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore locking
> 
> On Thu, 11 Aug 2016, Somnath Roy wrote:
> > <<inline with [Somnath]
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Thursday, August 11, 2016 9:04 AM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: bluestore locking
> >
> > After our conversation this morning I went through the locking a bit more and I think we're in okay shape.  To summarize:
> >
> > These 'forward lookup' structures are protected by coll->lock:
> >
> >  Bnode::blob_map -> Blob                                    coll->lock
> >  Onode::blob_map -> Blob                                    coll->lock
> >
> > These ones are protected by cache->lock:
> >
> >  Collection::OnodeSpace::onode_map -> Onode (unordered_map)     cache->lock
> >  Blob::bc -> Buffer                                             cache->lock
> >
> > The BnodeSet is a bit different because it is depopulated when the last onode ref goes away.  But it has its own lock:
> >
> >  Collection::BnodeSet::uset -> Bnode (intrusive set)        BnodeSet::lock
> >
> > Anyway, the point of this is that the cache trim() can do everything
> > it needs with just cache->lock.  That means that during an update, we
> > normally have coll->lock to protect the structures we're touching, and
> > if we are adding onodes to OnodeSpace or BufferSpace we additionally
> > take cache->lock for the appropriate cache fragment.
> >
> > We were getting tripped up from the blob_map iteration in
> > _txc_state_proc because we were holding no locks at all (or,
> > previously, a collection lock that may or may not be the right one).
> > Igor's PR fixes this by making Blob refcounted and keeping a list of
> > these.  The
> > finish_write() function takes cache->lock as needed.  Also, it's worth
> > pointing out that the blobs that we're touching will all exist under
> > an Onode that is in the onodes list, and it turns out that the trim()
> > is already doing the right thing and not trimming Onodes that still
> > have any refs.
> >
> > [Somnath] The crash we are not seeing in the current code as it is
> > protected by first_collection lock in _txc_state_*. But, that patch
> > opened up a deadlock like this.
> >
> > Thread 1 -> coll->Wlock -> onode->flush() -> waiting for unfinished
> > txc Thread 2 -> _txc_finish_io -> _txc_state_* -> coll->Rlock
> >
> > There is a small window as txcs are added in the o->flush_txns list
> > after _txc_add release coll->Wlock()
> >
> > BTW, the crash is happening as iterator got invalidated and not
> > because blob is deleted (which eventually may hit that). Making blob
> > ref counted will not be solving that the blob is been erased from the blob_map set.
> > We need to protect that with a lock.
> 
> Well, the blob refcounting fixes the deadlock certainly, bc we no longer need collection lock in that path.  So the question remaining is if there is a locking issue still...
> 
> > Which leaves me a bit confused as to how we originally were crashing,
> > because we were taking the first_collection lock.  My guess is that
> > first_collection was not the right collection and a racing update was
> > updating the Onode.  The refcounting alone sorts this out.  My other
> > fix would have also resolved it by taking the correct collection's
> > lock, I think.  Unless there is another locking problem I'm not
> > seeing.. but I think what is in master now has it covered.
> 
> And I think there isn't.  We no longer use an iterator (since we have the list of blobs to fix up), and the blob finish_write stuff is protected by
> cache->lock.  So I think this should resolve it.  But.. whatever workload
> you were using that triggered this before, perhaps you can run it again?
> 
> > [Somnath] As I mentioned above , after taking first_collection lock we
> > were not crashing anymore , it was the deadlock we were hitting. So,
> > the fix would be to protect blob_map with a lock. If it is coll->lock
> > we are hitting deadlock , if it is cache->lock() we need to see. The
> > PR I sent is introducing a blob_map lock and fixing this.
> >
> > But, Mark was hitting a onode corruption and initially I thought it is
> > because trim() is *not invoked* within coll->lock() from
> > _osr_reap_done(). The fix in the PR seems to be working for him and if
> > so, trim() is the one is deleting that and we don't have explanation
> > how it happened. Is there any place in the code other than trim() can
> > delete onode ? I couldn't find any..
> 
> What is the workload here?  And can you retest with master now?  Trim shouldn't need any lock at all (it takes cache->lock on its own).  The open question is whether there are other paths that are touching
> cache->lock protected things that aren't taking cache->lock.  I don't
> think so, but verifying with the problematic workloads will give us more confidence...
> 
> sage
> 
> 


* RE: bluestore locking
  2016-08-11 19:25       ` Sage Weil
@ 2016-08-11 20:20         ` Somnath Roy
  0 siblings, 0 replies; 9+ messages in thread
From: Somnath Roy @ 2016-08-11 20:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Got it, I didn't know we are no longer iterating over blob_map in the finish-io path.. I looked at the latest master and it seems fine..

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Thursday, August 11, 2016 12:26 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: RE: bluestore locking

On Thu, 11 Aug 2016, Somnath Roy wrote:
> Sage,
> Probably, I am missing something.
> Blob ref count will protect blob from being deleted. But, it will be 
> still erased from blob_map, right ?

The only two ways it would get removed from blob_map would be

 - Another thread is holding the collection lock and doing an update.  
This used to be a problem but isn't any more since we don't iterate over blob_map.. we have a list of BlobRefs.

 - We are tearing down the Onode because it is trimmed.  This would only happen if we trim, which won't happen because any blobs in our txc blob list will belong to onodes (or bnodes indirectly) referenced by the txc onodes list.

sage


> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, August 11, 2016 9:58 AM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore locking
> 
> On Thu, 11 Aug 2016, Somnath Roy wrote:
> > <<inline with [Somnath]
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Thursday, August 11, 2016 9:04 AM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: bluestore locking
> >
> > After our conversation this morning I went through the locking a bit more and I think we're in okay shape.  To summarize:
> >
> > These 'forward lookup' structures are protected by coll->lock:
> >
> >  Bnode::blob_map -> Blob                                    coll->lock
> >  Onode::blob_map -> Blob                                    coll->lock
> >
> > These ones are protected by cache->lock:
> >
> >  Collection::OnodeSpace::onode_map -> Onode (unordered_map)     cache->lock
> >  Blob::bc -> Buffer                                             cache->lock
> >
> > The BnodeSet is a bit different because it is depopulated when the last onode ref goes away.  But it has its own lock:
> >
> >  Collection::BnodeSet::uset -> Bnode (intrusive set)        BnodeSet::lock
> >
> > Anyway, the point of this is that the cache trim() can do everything 
> > it needs with just cache->lock.  That means that during an update, 
> > we normally have coll->lock to protect the structures we're 
> > touching, and if we are adding onodes to OnodeSpace or BufferSpace 
> > we additionally take cache->lock for the appropriate cache fragment.
> >
> > We were getting tripped up from the blob_map iteration in 
> > _txc_state_proc because we were holding no locks at all (or, 
> > previously, a collection lock that may or may not be the right one).
> > Igor's PR fixes this by making Blob refcounted and keeping a list of 
> > these.  The
> > finish_write() function takes cache->lock as needed.  Also, it's 
> > worth pointing out that the blobs that we're touching will all exist 
> > under an Onode that is in the onodes list, and it turns out that the 
> > trim() is already doing the right thing and not trimming Onodes that 
> > still have any refs.
> >
> > [Somnath] The crash we are not seeing in the current code as it is 
> > protected by first_collection lock in _txc_state_*. But, that patch 
> > opened up a deadlock like this.
> >
> > Thread 1 -> coll->Wlock -> onode->flush() -> waiting for unfinished 
> > txc Thread 2 -> _txc_finish_io -> _txc_state_* -> coll->Rlock
> >
> > There is a small window as txcs are added in the o->flush_txns list 
> > after _txc_add release coll->Wlock()
> >
> > BTW, the crash is happening as iterator got invalidated and not 
> > because blob is deleted (which eventually may hit that). Making blob 
> > ref counted will not be solving that the blob is been erased from the blob_map set.
> > We need to protect that with a lock.
> 
> Well, the blob refcounting fixes the deadlock certainly, bc we no longer need collection lock in that path.  So the question remaining is if there is a locking issue still...
> 
> > Which leaves me a bit confused as to how we originally were 
> > crashing, because we were taking the first_collection lock.  My 
> > guess is that first_collection was not the right collection and a 
> > racing update was updating the Onode.  The refcounting alone sorts 
> > this out.  My other fix would have also resolved it by taking the 
> > correct collection's lock, I think.  Unless there is another locking 
> > problem I'm not seeing.. but I think what is in master now has it covered.
> 
> And I think there isn't.  We no longer use an iterator (since we have 
> the list of blobs to fix up), and the blob finish_write stuff is 
> protected by cache->lock.  So I think this should resolve it.  But.. 
> whatever workload you were using that triggered this before, perhaps 
> you can run it again?
> 
> > [Somnath] As I mentioned above , after taking first_collection lock 
> > we were not crashing anymore , it was the deadlock we were hitting. 
> > So, the fix would be to protect blob_map with a lock. If it is 
> > coll->lock we are hitting deadlock , if it is cache->lock() we need 
> > to see. The PR I sent is introducing a blob_map lock and fixing this.
> >
> > But, Mark was hitting a onode corruption and initially I thought it 
> > is because trim() is *not invoked* within coll->lock() from 
> > _osr_reap_done(). The fix in the PR seems to be working for him and 
> > if so, trim() is the one is deleting that and we don't have 
> > explanation how it happened. Is there any place in the code other 
> > than trim() can delete onode ? I couldn't find any..
> 
> What is the workload here?  And can you retest with master now?  Trim 
> shouldn't need any lock at all (it takes cache->lock on its own).  The 
> open question is whether there are other paths that are touching
> cache->lock protected things that aren't taking cache->lock.  I don't
> think so, but verifying with the problematic workloads will give us more confidence...
> 
> sage
> 
> 


* Re: bluestore locking
  2016-08-11 16:03 bluestore locking Sage Weil
  2016-08-11 16:51 ` Somnath Roy
@ 2016-08-12 14:23 ` Igor Fedotov
  2016-08-12 14:25 ` Igor Fedotov
  2 siblings, 0 replies; 9+ messages in thread
From: Igor Fedotov @ 2016-08-12 14:23 UTC (permalink / raw)
  To: Sage Weil, Somnath.Roy; +Cc: ceph-devel

In general the approach looks good.

But I have some concerns about cache->lock use. Currently its usage is 
far from transparent and IMHO is potentially error-prone.

The major issue with it is that there is no single entity that provides 
protected access. We use it from OnodeSpace, BufferSpace, BlueStore, 
Cache class descendants...

Some of their methods are protected with this lock, some (public ones!) 
aren't. Sometimes there are comments that the corresponding method has to 
be invoked under the lock, sometimes not.

Even the variable name for the lock is too general, and it's hard to 
determine all the locations where it's used.

IMO this makes future maintenance and code evolution a difficult and 
highly error-prone task.

Probably we should consider some refactoring here to make all of this 
more manageable.
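
For example, something along these lines -- just an illustration of the
idea, with made-up names, not a concrete proposal for the actual
BlueStore classes:

  // Sketch only: a single owner of the cache lock that hands out access,
  // instead of many classes taking cache->lock ad hoc.
  #include <mutex>
  #include <string>
  #include <unordered_map>

  class OnodeCacheShard {
    std::mutex lock_;                                 // the one cache lock
    std::unordered_map<std::string, int> onode_map_;  // protected state
  public:
    // Every access goes through with_lock(); nothing else exposes the map,
    // so it is impossible to touch the protected structure unlocked.
    template <typename F>
    auto with_lock(F&& f) {
      std::lock_guard<std::mutex> l(lock_);
      return f(onode_map_);
    }
  };

  void example(OnodeCacheShard& shard) {
    shard.with_lock([](auto& m) { m["foo"] = 1; });
  }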


Another topic, unrelated to the above.

Don't we have a potential unprotected gap in _remove_collection?

It acquires coll_lock, then enumerates onode_map to check for object 
presence (that's protected by the cache lock), and then proceeds with the 
collection removal from coll_map.

What will happen if someone inserts an object into the collection after 
the enumeration but prior to the collection removal, e.g. by calling 
touch? Is this possible, or have I missed something?
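
Schematically, the window I mean looks like this (hypothetical,
simplified names; whether the insert is actually reachable at that point
in BlueStore is exactly my question):

  // Sketch only: the emptiness check and the removal are two separate
  // critical sections, so an insert could slip in between them.
  #include <mutex>
  #include <set>
  #include <string>
  #include <unordered_map>

  struct Collection {
    std::mutex cache_lock;             // stands in for cache->lock
    std::set<std::string> onode_map;   // objects in the collection
  };

  std::mutex coll_lock;                // stands in for the store-wide coll_lock
  std::unordered_map<std::string, Collection*> coll_map;

  bool remove_collection(const std::string& cid) {
    std::lock_guard<std::mutex> l(coll_lock);
    auto p = coll_map.find(cid);
    if (p == coll_map.end())
      return false;
    Collection* c = p->second;
    bool empty;
    {
      std::lock_guard<std::mutex> cl(c->cache_lock);
      empty = c->onode_map.empty();    // step 1: check for objects
    }
    // <-- window: a writer that already holds a ref to the collection (and
    //     so never takes coll_lock) could insert an onode right here,
    //     e.g. via touch
    if (!empty)
      return false;
    coll_map.erase(p);                 // step 2: drop the collection
    return true;
  }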


Thanks,

Igor


On 11.08.2016 19:03, Sage Weil wrote:
> After our conversation this morning I went through the locking a bit more
> and I think we're in okay shape.  To summarize:
>
> These 'forward lookup' structures are protected by coll->lock:
>
>   Bnode::blob_map -> Blob                                    coll->lock
>   Onode::blob_map -> Blob                                    coll->lock
>
> These ones are protected by cache->lock:
>
>   Collection::OnodeSpace::onode_map -> Onode (unordered_map)     cache->lock
>   Blob::bc -> Buffer                                             cache->lock
>
> The BnodeSet is a bit different because it is depopulated when the last
> onode ref goes away.  But it has its own lock:
>
>   Collection::BnodeSet::uset -> Bnode (intrusive set)        BnodeSet::lock
>
> Anyway, the point of this is that the cache trim() can do everything it
> needs with just cache->lock.  That means that during an update, we
> normally have coll->lock to protect the structures we're touching, and if
> we are adding onodes to OnodeSpace or BufferSpace we additionally take
> cache->lock for the appropriate cache fragment.
>
> We were getting tripped up from the blob_map iteration in _txc_state_proc
> because we were holding no locks at all (or, previously, a collection lock
> that may or may not be the right one).  Igor's PR fixes this by
> making Blob refcounted and keeping a list of these.  The finish_write()
> function takes cache->lock as needed.  Also, it's worth pointing out that
> the blobs that we're touching will all exist under an Onode that is in the
> onodes list, and it turns out that the trim() is already doing the right
> thing and not trimming Onodes that still have any refs.
>
> Which leaves me a bit confused as to how we originally were
> crashing, because we were taking the first_collection lock.  My guess is
> that first_collection was not the right collection and a racing update was
> updating the Onode.  The refcounting alone sorts this out.  My other fix
> would have also resolved it by taking the correct collection's
> lock, I think.  Unless there is another locking problem I'm not seeing..
> but I think what is in master now has it covered.
>
> Igor, Somnath, does the current strategy make sense?
>
> sage
>



