All of lore.kernel.org
* RE: BLueStore Deadlock
       [not found] <6AA21C22F0A5DA478922644AD2EC308C373B92BC@SHSMSX101.ccr.corp.intel.com>
@ 2016-07-28 15:45 ` Somnath Roy
  2016-07-28 22:21 ` Somnath Roy
  1 sibling, 0 replies; 7+ messages in thread
From: Somnath Roy @ 2016-07-28 15:45 UTC (permalink / raw)
  To: Ma, Jianpeng; +Cc: ceph-devel

Hi Jianpeng,
Are you trying with the latest master and still hitting the issue? (It seems so, but just confirming.)
The following scenario should not create a deadlock, for the reason below.

Onode->flush() waits on flush_lock, and _txc_finish() releases that lock before taking osr->qlock. Am I missing anything?
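
To make the ordering I mean concrete, here is a minimal standalone sketch (member and function names are illustrative, not the actual BlueStore code):

// Illustrative sketch of the ordering described above (simplified names,
// not the actual BlueStore code).
#include <condition_variable>
#include <mutex>
#include <set>

struct TransContext;

struct Onode {
  std::mutex flush_lock;
  std::condition_variable flush_cond;
  std::set<TransContext*> flush_txns;

  // Write path: wait until every in-flight txc touching this onode is done.
  void flush() {
    std::unique_lock<std::mutex> l(flush_lock);
    while (!flush_txns.empty())
      flush_cond.wait(l);          // flush_lock is released while waiting
  }
};

struct OpSequencer {
  std::mutex qlock;
};

// _txc_finish() drops flush_lock before it ever touches osr->qlock,
// so an onode->flush() waiter holds nothing that _txc_finish() needs.
void txc_finish(Onode& o, OpSequencer& osr, TransContext* txc) {
  {
    std::lock_guard<std::mutex> l(o.flush_lock);
    o.flush_txns.erase(txc);
    if (o.flush_txns.empty())
      o.flush_cond.notify_all();
  }                                 // flush_lock released here
  std::lock_guard<std::mutex> l(osr.qlock);
  // ... dequeue the txc from the sequencer queue ...
}

In other words, flush_lock versus osr->qlock alone should not be able to form a cycle.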

I got a deadlock in this path in one of my earlier changes in the following pull request (described in detail there) and it is fixed and merged.

https://github.com/ceph/ceph/pull/10220

If my theory is right, we may be hitting the deadlock for some other reason. It seems you are doing a WAL write; could you please describe the steps to reproduce?

Thanks & Regards
Somnath

From: Ma, Jianpeng [mailto:jianpeng.ma@intel.com]
Sent: Thursday, July 28, 2016 1:46 AM
To: Somnath Roy
Cc: ceph-devel; Ma, Jianpeng
Subject: BLueStore Deadlock

Hi Roy:
     When doing sequential writes with rbd+librbd, I hit a deadlock in BlueStore. It reproduces 100% (based on 98602ae6c67637dbadddd549bd9a0035e5a2717).
By adding debug messages I found that this bug was introduced by bf70bcb6c54e4d6404533bc91781a5ef77d62033.
Consider this case:

tp_osd_tp                       aio_complete_thread                kv_sync_thread
Rwlock(coll)                    txc_finish_io                      _txc_finish
do_write                        lock(osr->qlock)                   lock(osr->qlock)
do_read                         RLock(coll)                        need osr->qlock to continue
onode->flush()                  need coll readlock to continue
   need previous txc complete
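
If it helps, the same cycle can be reduced to a small standalone program (illustrative only, with assumed lock names; it is not BlueStore code and it will intentionally hang to show the cycle):

// Standalone illustration of the three-way wait cycle in the table above.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <shared_mutex>
#include <thread>

std::shared_mutex coll_lock;          // RWLock(coll)
std::mutex qlock;                     // osr->qlock
std::mutex flush_lock;
std::condition_variable flush_cond;
bool prev_txc_done = false;

int main() {
  using namespace std::chrono_literals;

  std::thread tp_osd_tp([] {
    std::unique_lock<std::shared_mutex> wl(coll_lock);   // Wlock(coll) in do_write
    std::this_thread::sleep_for(100ms);
    std::unique_lock<std::mutex> fl(flush_lock);         // onode->flush()
    flush_cond.wait(fl, [] { return prev_txc_done; });   // needs previous txc to finish
  });

  std::thread aio_complete_thread([] {
    std::this_thread::sleep_for(50ms);
    std::lock_guard<std::mutex> ql(qlock);                // txc_finish_io
    std::shared_lock<std::shared_mutex> rl(coll_lock);    // RLock(coll): blocked by tp_osd_tp
  });

  std::thread kv_sync_thread([] {
    std::this_thread::sleep_for(100ms);
    std::lock_guard<std::mutex> ql(qlock);                // _txc_finish: blocked by aio thread
    {
      std::lock_guard<std::mutex> fl(flush_lock);         // would complete the previous txc
      prev_txc_done = true;
    }
    flush_cond.notify_all();
  });

  tp_osd_tp.join();
  aio_complete_thread.join();
  kv_sync_thread.join();
  return 0;
}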

But currently I don't know how to fix this.

Thanks!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: BLueStore Deadlock
       [not found] <6AA21C22F0A5DA478922644AD2EC308C373B92BC@SHSMSX101.ccr.corp.intel.com>
  2016-07-28 15:45 ` BLueStore Deadlock Somnath Roy
@ 2016-07-28 22:21 ` Somnath Roy
  2016-07-29  1:24   ` Ma, Jianpeng
  1 sibling, 1 reply; 7+ messages in thread
From: Somnath Roy @ 2016-07-28 22:21 UTC (permalink / raw)
  To: Ma, Jianpeng; +Cc: ceph-devel

Jianpeng,
I thought through this and it seems there could be one possible deadlock scenario.

tp_osd_tp --> waiting on onode->flush() for previous txc to finish. Holding Wlock(coll)
aio_complete_thread --> waiting for RLock(coll)
No other thread will be blocked here.

We add the previous txc to the flush_txns list during _txc_write_nodes(), i.e. before aio_complete_thread calls _txc_state_proc(). So, if we get IO on the same collection within that window, it will wait on the unfinished txcs.
The solution to this could be the following:

root@emsnode5:~/ceph-master/src# git diff
diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
index e8548b1..575a234 100644
--- a/src/os/bluestore/BlueStore.cc
+++ b/src/os/bluestore/BlueStore.cc
@@ -4606,6 +4606,11 @@ void BlueStore::_txc_state_proc(TransContext *txc)
         (txc->first_collection)->lock.get_read();
       }
       for (auto& o : txc->onodes) {
+        {
+          std::lock_guard<std::mutex> l(o->flush_lock);
+          o->flush_txns.insert(txc);
+        }
+
         for (auto& p : o->blob_map.blob_map) {
            p.bc.finish_write(txc->seq);
        }
@@ -4733,8 +4738,8 @@ void BlueStore::_txc_write_nodes(TransContext *txc, KeyValueDB::Transaction t)
     dout(20) << "  onode " << (*p)->oid << " is " << bl.length() << dendl;
     t->set(PREFIX_OBJ, (*p)->key, bl);

-    std::lock_guard<std::mutex> l((*p)->flush_lock);
-    (*p)->flush_txns.insert(txc);
+    /*std::lock_guard<std::mutex> l((*p)->flush_lock);
+    (*p)->flush_txns.insert(txc);*/
   }


I am not able to reproduce this in my setup, so it would be helpful if you could make the above changes in your environment and see whether you are still hitting the issue.

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Thursday, July 28, 2016 8:45 AM
To: 'Ma, Jianpeng'
Cc: ceph-devel
Subject: RE: BLueStore Deadlock

Hi Jianpeng,
Are you trying with latest master and still hitting the issue (seems so but confirming) ?
The following scenario should not be creating deadlock because of the following reason.

Onode->flush() is waiting on flush_lock() and from _txc_finish() it is releasing that before taking osr->qlock(). Am I missing anything ?

I got a deadlock in this path in one of my earlier changes in the following pull request (described in detail there) and it is fixed and merged.

https://github.com/ceph/ceph/pull/10220

If my theory is right , we are hitting deadlock because of some other reason may be. It seems you are doing WAL write , could you please describe the steps to reproduce ?

Thanks & Regards
Somnath

From: Ma, Jianpeng [mailto:jianpeng.ma@intel.com]
Sent: Thursday, July 28, 2016 1:46 AM
To: Somnath Roy
Cc: ceph-devel; Ma, Jianpeng
Subject: BLueStore Deadlock

Hi Roy:
     When do seqwrite w/ rbd+librbd, I met deadlock for bluestore. It can reproduce 100%.(based on 98602ae6c67637dbadddd549bd9a0035e5a2717)
By add message and this found this bug caused by bf70bcb6c54e4d6404533bc91781a5ef77d62033.
Consider this case:

tp_osd_tp                       aio_complete_thread                kv_sync_thread
Rwlock(coll)                    txc_finish_io                      _txc_finish
do_write                        lock(osr->qlock)                   lock(osr->qlock)
do_read                         RLock(coll)                        need osr->qlock to continue
onode->flush()                  need coll readlock to continue
   need previous txc complete

But current I don't how to fix this.

Thanks!

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* RE: BLueStore Deadlock
  2016-07-28 22:21 ` Somnath Roy
@ 2016-07-29  1:24   ` Ma, Jianpeng
  2016-07-29  1:36     ` Ma, Jianpeng
  0 siblings, 1 reply; 7+ messages in thread
From: Ma, Jianpeng @ 2016-07-29  1:24 UTC (permalink / raw)
  To: Somnath Roy; +Cc: ceph-devel

Hi Roy:
    With your patch, there is still a deadlock.
By the way, what if we change BufferSpace::_add_buffer so that a buffer with the cache flag is pushed to the front of the cache, and one without the cache flag is only put at the back?
I think that way we could remove finish_write.
How about it?
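
A rough sketch of what I mean (hypothetical types and flag names; the real BufferSpace::_add_buffer takes different arguments):

// Hypothetical sketch of the suggestion, not the real BlueStore code.
#include <list>
#include <memory>

struct Buffer {
  unsigned flags = 0;
  static constexpr unsigned FLAG_NOCACHE = 1;
};

struct BufferSpace {
  std::list<std::shared_ptr<Buffer>> lru;   // front = hot end of the cache

  void _add_buffer(const std::shared_ptr<Buffer>& b) {
    if (b->flags & Buffer::FLAG_NOCACHE)
      lru.push_back(b);     // no cache flag: only park it at the cold end
    else
      lru.push_front(b);    // cache flag: push it to the front of the cache
  }
};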

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Friday, July 29, 2016 6:22 AM
To: Ma, Jianpeng <jianpeng.ma@intel.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Jianpeng,
I thought through this and it seems there could be one possible deadlock scenario.

tp_osd_tp --> waiting on onode->flush() for previous txc to finish. Holding Wlock(coll)
aio_complete_thread --> waiting for RLock(coll)
No other thread will be blocked here.

We do add previous txc in the flush_txns list during _txc_write_nodes() and before aio_complete_thread calling _txc_state_proc(). So, if within this time frame if we have IO on the same collection , it will be waiting on unfinished txcs.
The solution to this could be the following..

root@emsnode5:~/ceph-master/src# git diff
diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
index e8548b1..575a234 100644
--- a/src/os/bluestore/BlueStore.cc
+++ b/src/os/bluestore/BlueStore.cc
@@ -4606,6 +4606,11 @@ void BlueStore::_txc_state_proc(TransContext *txc)
         (txc->first_collection)->lock.get_read();
       }
       for (auto& o : txc->onodes) {
+        {
+          std::lock_guard<std::mutex> l(o->flush_lock);
+          o->flush_txns.insert(txc);
+        }
+
         for (auto& p : o->blob_map.blob_map) {
            p.bc.finish_write(txc->seq);
        }
@@ -4733,8 +4738,8 @@ void BlueStore::_txc_write_nodes(TransContext *txc, KeyValueDB::Transaction t)
     dout(20) << "  onode " << (*p)->oid << " is " << bl.length() << dendl;
     t->set(PREFIX_OBJ, (*p)->key, bl);

-    std::lock_guard<std::mutex> l((*p)->flush_lock);
-    (*p)->flush_txns.insert(txc);
+    /*std::lock_guard<std::mutex> l((*p)->flush_lock);
+    (*p)->flush_txns.insert(txc);*/
   }


I am not able to reproduce this in my setup , so, if you can do the above changes in your env and see if you are still hitting the issue, would be helpful.

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Thursday, July 28, 2016 8:45 AM
To: 'Ma, Jianpeng'
Cc: ceph-devel
Subject: RE: BLueStore Deadlock

Hi Jianpeng,
Are you trying with latest master and still hitting the issue (seems so but confirming) ?
The following scenario should not be creating deadlock because of the following reason.

Onode->flush() is waiting on flush_lock() and from _txc_finish() it is releasing that before taking osr->qlock(). Am I missing anything ?

I got a deadlock in this path in one of my earlier changes in the following pull request (described in detail there) and it is fixed and merged.

https://github.com/ceph/ceph/pull/10220

If my theory is right , we are hitting deadlock because of some other reason may be. It seems you are doing WAL write , could you please describe the steps to reproduce ?

Thanks & Regards
Somnath

From: Ma, Jianpeng [mailto:jianpeng.ma@intel.com]
Sent: Thursday, July 28, 2016 1:46 AM
To: Somnath Roy
Cc: ceph-devel; Ma, Jianpeng
Subject: BLueStore Deadlock

Hi Roy:
     When do seqwrite w/ rbd+librbd, I met deadlock for bluestore. It can reproduce 100%.(based on 98602ae6c67637dbadddd549bd9a0035e5a2717)
By add message and this found this bug caused by bf70bcb6c54e4d6404533bc91781a5ef77d62033.
Consider this case:

tp_osd_tp                       aio_complete_thread                kv_sync_thread
Rwlock(coll)                    txc_finish_io                      _txc_finish
do_write                        lock(osr->qlock)                   lock(osr->qlock)
do_read                         RLock(coll)                        need osr->qlock to continue
onode->flush()                  need coll readlock to continue
   need previous txc complete

But current I don't how to fix this.

Thanks!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: BLueStore Deadlock
  2016-07-29  1:24   ` Ma, Jianpeng
@ 2016-07-29  1:36     ` Ma, Jianpeng
  2016-08-03  1:35       ` Ma, Jianpeng
  0 siblings, 1 reply; 7+ messages in thread
From: Ma, Jianpeng @ 2016-07-29  1:36 UTC (permalink / raw)
  To: Ma, Jianpeng, Somnath Roy; +Cc: ceph-devel

Roy: my question is, why is there STATE_WRITING at all? For reads we don't bypass the cache, so why do we need STATE_WRITING?
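
For reference, this is roughly the lifecycle I am asking about (a simplified sketch with assumed fields, not the actual BlueStore structures): a write parks the buffer in STATE_WRITING and finish_write(seq) later flips it to STATE_CLEAN, while reads are served from the cache either way.

// Simplified sketch (assumed fields, not the actual BlueStore code).
#include <cstdint>
#include <list>

struct Buffer {
  enum State { STATE_CLEAN, STATE_WRITING };
  State state = STATE_WRITING;
  uint64_t seq = 0;              // sequence of the txc that wrote this buffer
};

struct BufferSpace {
  std::list<Buffer*> writing;    // buffers whose txc has not finished yet

  void finish_write(uint64_t seq) {
    // Mark every buffer written by txcs up to 'seq' clean and drop it
    // from the writing list; reads hit the cache in either state.
    for (auto it = writing.begin(); it != writing.end(); ) {
      if ((*it)->seq <= seq) {
        (*it)->state = Buffer::STATE_CLEAN;
        it = writing.erase(it);
      } else {
        ++it;
      }
    }
  }
};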

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
Sent: Friday, July 29, 2016 9:24 AM
To: Somnath Roy <Somnath.Roy@sandisk.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Hi Roy:
    W/ your patch, there still deadlock.
By the way, if we change the BufferSpace::_add_buffer, if w/ cache flush push it into cache front and if w/o cache flag only put it  at back.
I think use this way we can remove finish_write.
How about it?

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Friday, July 29, 2016 6:22 AM
To: Ma, Jianpeng <jianpeng.ma@intel.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Jianpeng,
I thought through this and it seems there could be one possible deadlock scenario.

tp_osd_tp --> waiting on onode->flush() for previous txc to finish. Holding Wlock(coll)
aio_complete_thread --> waiting for RLock(coll)
No other thread will be blocked here.

We do add previous txc in the flush_txns list during _txc_write_nodes() and before aio_complete_thread calling _txc_state_proc(). So, if within this time frame if we have IO on the same collection , it will be waiting on unfinished txcs.
The solution to this could be the following..

root@emsnode5:~/ceph-master/src# git diff
diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
index e8548b1..575a234 100644
--- a/src/os/bluestore/BlueStore.cc
+++ b/src/os/bluestore/BlueStore.cc
@@ -4606,6 +4606,11 @@ void BlueStore::_txc_state_proc(TransContext *txc)
         (txc->first_collection)->lock.get_read();
       }
       for (auto& o : txc->onodes) {
+        {
+          std::lock_guard<std::mutex> l(o->flush_lock);
+          o->flush_txns.insert(txc);
+        }
+
         for (auto& p : o->blob_map.blob_map) {
            p.bc.finish_write(txc->seq);
        }
@@ -4733,8 +4738,8 @@ void BlueStore::_txc_write_nodes(TransContext *txc, KeyValueDB::Transaction t)
     dout(20) << "  onode " << (*p)->oid << " is " << bl.length() << dendl;
     t->set(PREFIX_OBJ, (*p)->key, bl);

-    std::lock_guard<std::mutex> l((*p)->flush_lock);
-    (*p)->flush_txns.insert(txc);
+    /*std::lock_guard<std::mutex> l((*p)->flush_lock);
+    (*p)->flush_txns.insert(txc);*/
   }


I am not able to reproduce this in my setup , so, if you can do the above changes in your env and see if you are still hitting the issue, would be helpful.

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Thursday, July 28, 2016 8:45 AM
To: 'Ma, Jianpeng'
Cc: ceph-devel
Subject: RE: BLueStore Deadlock

Hi Jianpeng,
Are you trying with latest master and still hitting the issue (seems so but confirming) ?
The following scenario should not be creating deadlock because of the following reason.

Onode->flush() is waiting on flush_lock() and from _txc_finish() it is releasing that before taking osr->qlock(). Am I missing anything ?

I got a deadlock in this path in one of my earlier changes in the following pull request (described in detail there) and it is fixed and merged.

https://github.com/ceph/ceph/pull/10220

If my theory is right , we are hitting deadlock because of some other reason may be. It seems you are doing WAL write , could you please describe the steps to reproduce ?

Thanks & Regards
Somnath

From: Ma, Jianpeng [mailto:jianpeng.ma@intel.com]
Sent: Thursday, July 28, 2016 1:46 AM
To: Somnath Roy
Cc: ceph-devel; Ma, Jianpeng
Subject: BLueStore Deadlock

Hi Roy:
     When do seqwrite w/ rbd+librbd, I met deadlock for bluestore. It can reproduce 100%.(based on 98602ae6c67637dbadddd549bd9a0035e5a2717)
By add message and this found this bug caused by bf70bcb6c54e4d6404533bc91781a5ef77d62033.
Consider this case:

tp_osd_tp                       aio_complete_thread                kv_sync_thread
Rwlock(coll)                    txc_finish_io                      _txc_finish
do_write                        lock(osr->qlock)                   lock(osr->qlock)
do_read                         RLock(coll)                        need osr->qlock to continue
onode->flush()                  need coll readlock to continue
   need previous txc complete

But current I don't how to fix this.

Thanks!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: BLueStore Deadlock
  2016-07-29  1:36     ` Ma, Jianpeng
@ 2016-08-03  1:35       ` Ma, Jianpeng
  2016-08-05  2:14         ` Somnath Roy
  0 siblings, 1 reply; 7+ messages in thread
From: Ma, Jianpeng @ 2016-08-03  1:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Somnath Roy

Hi Sage:
    Why is there STATE_WRITING?
  In my opinion: a normal read does onode->flush(), so there is no need to wait for the IO to complete. The special case is a single transaction with two writes, where the later write needs to read back data from the earlier one.
  For a write, if it is a wal-write, the data is not yet on disk when finish_write runs. That is different from a non-wal-write.

Thanks! 
  

-----Original Message-----
From: Ma, Jianpeng 
Sent: Friday, July 29, 2016 9:36 AM
To: Ma, Jianpeng <jianpeng.ma@intel.com>; Somnath Roy <Somnath.Roy@sandisk.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Roy: my question is why is there STATE_WRITTING?  For read we don't by pass cache, so why need STATE_WRITTING?

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
Sent: Friday, July 29, 2016 9:24 AM
To: Somnath Roy <Somnath.Roy@sandisk.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Hi Roy:
    W/ your patch, there still deadlock.
By the way, if we change the BufferSpace::_add_buffer, if w/ cache flush push it into cache front and if w/o cache flag only put it  at back.
I think use this way we can remove finish_write.
How about it?

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Friday, July 29, 2016 6:22 AM
To: Ma, Jianpeng <jianpeng.ma@intel.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Jianpeng,
I thought through this and it seems there could be one possible deadlock scenario.

tp_osd_tp --> waiting on onode->flush() for previous txc to finish. Holding Wlock(coll)
aio_complete_thread --> waiting for RLock(coll)
No other thread will be blocked here.

We do add previous txc in the flush_txns list during _txc_write_nodes() and before aio_complete_thread calling _txc_state_proc(). So, if within this time frame if we have IO on the same collection , it will be waiting on unfinished txcs.
The solution to this could be the following..

root@emsnode5:~/ceph-master/src# git diff
diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
index e8548b1..575a234 100644
--- a/src/os/bluestore/BlueStore.cc
+++ b/src/os/bluestore/BlueStore.cc
@@ -4606,6 +4606,11 @@ void BlueStore::_txc_state_proc(TransContext *txc)
         (txc->first_collection)->lock.get_read();
       }
       for (auto& o : txc->onodes) {
+        {
+          std::lock_guard<std::mutex> l(o->flush_lock);
+          o->flush_txns.insert(txc);
+        }
+
         for (auto& p : o->blob_map.blob_map) {
            p.bc.finish_write(txc->seq);
        }
@@ -4733,8 +4738,8 @@ void BlueStore::_txc_write_nodes(TransContext *txc, KeyValueDB::Transaction t)
     dout(20) << "  onode " << (*p)->oid << " is " << bl.length() << dendl;
     t->set(PREFIX_OBJ, (*p)->key, bl);

-    std::lock_guard<std::mutex> l((*p)->flush_lock);
-    (*p)->flush_txns.insert(txc);
+    /*std::lock_guard<std::mutex> l((*p)->flush_lock);
+    (*p)->flush_txns.insert(txc);*/
   }


I am not able to reproduce this in my setup , so, if you can do the above changes in your env and see if you are still hitting the issue, would be helpful.

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Thursday, July 28, 2016 8:45 AM
To: 'Ma, Jianpeng'
Cc: ceph-devel
Subject: RE: BLueStore Deadlock

Hi Jianpeng,
Are you trying with latest master and still hitting the issue (seems so but confirming) ?
The following scenario should not be creating deadlock because of the following reason.

Onode->flush() is waiting on flush_lock() and from _txc_finish() it is releasing that before taking osr->qlock(). Am I missing anything ?

I got a deadlock in this path in one of my earlier changes in the following pull request (described in detail there) and it is fixed and merged.

https://github.com/ceph/ceph/pull/10220

If my theory is right , we are hitting deadlock because of some other reason may be. It seems you are doing WAL write , could you please describe the steps to reproduce ?

Thanks & Regards
Somnath

From: Ma, Jianpeng [mailto:jianpeng.ma@intel.com]
Sent: Thursday, July 28, 2016 1:46 AM
To: Somnath Roy
Cc: ceph-devel; Ma, Jianpeng
Subject: BLueStore Deadlock

Hi Roy:
     When do seqwrite w/ rbd+librbd, I met deadlock for bluestore. It can reproduce 100%.(based on 98602ae6c67637dbadddd549bd9a0035e5a2717)
By add message and this found this bug caused by bf70bcb6c54e4d6404533bc91781a5ef77d62033.
Consider this case:

tp_osd_tp                       aio_complete_thread                kv_sync_thread
Rwlock(coll)                    txc_finish_io                      _txc_finish
do_write                        lock(osr->qlock)                   lock(osr->qlock)
do_read                         RLock(coll)                        need osr->qlock to continue
onode->flush()                  need coll readlock to continue
   need previous txc complete

But current I don't how to fix this.

Thanks!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: BLueStore Deadlock
  2016-08-03  1:35       ` Ma, Jianpeng
@ 2016-08-05  2:14         ` Somnath Roy
  2016-08-05  7:36           ` Ma, Jianpeng
  0 siblings, 1 reply; 7+ messages in thread
From: Somnath Roy @ 2016-08-05  2:14 UTC (permalink / raw)
  To: Ma, Jianpeng, Sage Weil; +Cc: ceph-devel, Mark Nelson (mnelson@redhat.com)

Sorry for the delay.
Jianpeng,
Could you please try the following pull request in your setup and see if the deadlock is still happening?

https://github.com/ceph/ceph/pull/10578

Mark,
I think it also contains the probable fix for your crash in onode->flush(); could you please try it out?

Thanks & Regards
Somnath

-----Original Message-----
From: Ma, Jianpeng [mailto:jianpeng.ma@intel.com] 
Sent: Tuesday, August 02, 2016 6:35 PM
To: Sage Weil
Cc: ceph-devel; Somnath Roy
Subject: RE: BLueStore Deadlock

Hi Sage:
    Why is there STATE_WRITTING? 
  In my option: for normal read, it do onode->flush() , so no need wait io complete. The special case is one transaction which has two write and later write need read.
  For write, if it is wal-write, when do finish_write, the data don't locate into disk. This different with non-wal-write

Thanks! 
  

-----Original Message-----
From: Ma, Jianpeng 
Sent: Friday, July 29, 2016 9:36 AM
To: Ma, Jianpeng <jianpeng.ma@intel.com>; Somnath Roy <Somnath.Roy@sandisk.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Roy: my question is why is there STATE_WRITTING?  For read we don't by pass cache, so why need STATE_WRITTING?

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
Sent: Friday, July 29, 2016 9:24 AM
To: Somnath Roy <Somnath.Roy@sandisk.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Hi Roy:
    W/ your patch, there still deadlock.
By the way, if we change the BufferSpace::_add_buffer, if w/ cache flush push it into cache front and if w/o cache flag only put it  at back.
I think use this way we can remove finish_write.
How about it?

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Friday, July 29, 2016 6:22 AM
To: Ma, Jianpeng <jianpeng.ma@intel.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Jianpeng,
I thought through this and it seems there could be one possible deadlock scenario.

tp_osd_tp --> waiting on onode->flush() for previous txc to finish. Holding Wlock(coll)
aio_complete_thread --> waiting for RLock(coll)
No other thread will be blocked here.

We do add previous txc in the flush_txns list during _txc_write_nodes() and before aio_complete_thread calling _txc_state_proc(). So, if within this time frame if we have IO on the same collection , it will be waiting on unfinished txcs.
The solution to this could be the following..

root@emsnode5:~/ceph-master/src# git diff
diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
index e8548b1..575a234 100644
--- a/src/os/bluestore/BlueStore.cc
+++ b/src/os/bluestore/BlueStore.cc
@@ -4606,6 +4606,11 @@ void BlueStore::_txc_state_proc(TransContext *txc)
         (txc->first_collection)->lock.get_read();
       }
       for (auto& o : txc->onodes) {
+        {
+          std::lock_guard<std::mutex> l(o->flush_lock);
+          o->flush_txns.insert(txc);
+        }
+
         for (auto& p : o->blob_map.blob_map) {
            p.bc.finish_write(txc->seq);
        }
@@ -4733,8 +4738,8 @@ void BlueStore::_txc_write_nodes(TransContext *txc, KeyValueDB::Transaction t)
     dout(20) << "  onode " << (*p)->oid << " is " << bl.length() << dendl;
     t->set(PREFIX_OBJ, (*p)->key, bl);

-    std::lock_guard<std::mutex> l((*p)->flush_lock);
-    (*p)->flush_txns.insert(txc);
+    /*std::lock_guard<std::mutex> l((*p)->flush_lock);
+    (*p)->flush_txns.insert(txc);*/
   }


I am not able to reproduce this in my setup , so, if you can do the above changes in your env and see if you are still hitting the issue, would be helpful.

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Thursday, July 28, 2016 8:45 AM
To: 'Ma, Jianpeng'
Cc: ceph-devel
Subject: RE: BLueStore Deadlock

Hi Jianpeng,
Are you trying with latest master and still hitting the issue (seems so but confirming) ?
The following scenario should not be creating deadlock because of the following reason.

Onode->flush() is waiting on flush_lock() and from _txc_finish() it is releasing that before taking osr->qlock(). Am I missing anything ?

I got a deadlock in this path in one of my earlier changes in the following pull request (described in detail there) and it is fixed and merged.

https://github.com/ceph/ceph/pull/10220

If my theory is right , we are hitting deadlock because of some other reason may be. It seems you are doing WAL write , could you please describe the steps to reproduce ?

Thanks & Regards
Somnath

From: Ma, Jianpeng [mailto:jianpeng.ma@intel.com]
Sent: Thursday, July 28, 2016 1:46 AM
To: Somnath Roy
Cc: ceph-devel; Ma, Jianpeng
Subject: BLueStore Deadlock

Hi Roy:
     When do seqwrite w/ rbd+librbd, I met deadlock for bluestore. It can reproduce 100%.(based on 98602ae6c67637dbadddd549bd9a0035e5a2717)
By add message and this found this bug caused by bf70bcb6c54e4d6404533bc91781a5ef77d62033.
Consider this case:

tp_osd_tp                       aio_complete_thread                kv_sync_thread
Rwlock(coll)                    txc_finish_io                      _txc_finish
do_write                        lock(osr->qlock)                   lock(osr->qlock)
do_read                         RLock(coll)                        need osr->qlock to continue
onode->flush()                  need coll readlock to continue
   need previous txc complete

But current I don't how to fix this.

Thanks!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: BLueStore Deadlock
  2016-08-05  2:14         ` Somnath Roy
@ 2016-08-05  7:36           ` Ma, Jianpeng
  0 siblings, 0 replies; 7+ messages in thread
From: Ma, Jianpeng @ 2016-08-05  7:36 UTC (permalink / raw)
  To: Somnath Roy, Sage Weil; +Cc: ceph-devel, Mark Nelson (mnelson@redhat.com)

Hi Somnath: 
     This PR passes the test case; no deadlock occurs.

Thanks!
Jianpeng

-----Original Message-----
From: Somnath Roy [mailto:Somnath.Roy@sandisk.com] 
Sent: Friday, August 5, 2016 10:14 AM
To: Ma, Jianpeng <jianpeng.ma@intel.com>; Sage Weil <sweil@redhat.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>; Mark Nelson (mnelson@redhat.com) <mnelson@redhat.com>
Subject: RE: BLueStore Deadlock

Sorry for the delay ,
Jianpeng,
Could you please try the following pull request in your setup and see if the deadlock is still happening ?

https://github.com/ceph/ceph/pull/10578

Mark,
I think it has the probable fix for your crash in onode->flush() as well , could you please try this out ?

Thanks & Regards
Somnath

-----Original Message-----
From: Ma, Jianpeng [mailto:jianpeng.ma@intel.com] 
Sent: Tuesday, August 02, 2016 6:35 PM
To: Sage Weil
Cc: ceph-devel; Somnath Roy
Subject: RE: BLueStore Deadlock

Hi Sage:
    Why is there STATE_WRITTING? 
  In my option: for normal read, it do onode->flush() , so no need wait io complete. The special case is one transaction which has two write and later write need read.
  For write, if it is wal-write, when do finish_write, the data don't locate into disk. This different with non-wal-write

Thanks! 
  

-----Original Message-----
From: Ma, Jianpeng 
Sent: Friday, July 29, 2016 9:36 AM
To: Ma, Jianpeng <jianpeng.ma@intel.com>; Somnath Roy <Somnath.Roy@sandisk.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Roy: my question is why is there STATE_WRITTING?  For read we don't by pass cache, so why need STATE_WRITTING?

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
Sent: Friday, July 29, 2016 9:24 AM
To: Somnath Roy <Somnath.Roy@sandisk.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Hi Roy:
    W/ your patch, there still deadlock.
By the way, if we change the BufferSpace::_add_buffer, if w/ cache flush push it into cache front and if w/o cache flag only put it  at back.
I think use this way we can remove finish_write.
How about it?

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Friday, July 29, 2016 6:22 AM
To: Ma, Jianpeng <jianpeng.ma@intel.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: BLueStore Deadlock

Jianpeng,
I thought through this and it seems there could be one possible deadlock scenario.

tp_osd_tp --> waiting on onode->flush() for previous txc to finish. Holding Wlock(coll)
aio_complete_thread --> waiting for RLock(coll)
No other thread will be blocked here.

We do add previous txc in the flush_txns list during _txc_write_nodes() and before aio_complete_thread calling _txc_state_proc(). So, if within this time frame if we have IO on the same collection , it will be waiting on unfinished txcs.
The solution to this could be the following..

root@emsnode5:~/ceph-master/src# git diff
diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
index e8548b1..575a234 100644
--- a/src/os/bluestore/BlueStore.cc
+++ b/src/os/bluestore/BlueStore.cc
@@ -4606,6 +4606,11 @@ void BlueStore::_txc_state_proc(TransContext *txc)
         (txc->first_collection)->lock.get_read();
       }
       for (auto& o : txc->onodes) {
+        {
+          std::lock_guard<std::mutex> l(o->flush_lock);
+          o->flush_txns.insert(txc);
+        }
+
         for (auto& p : o->blob_map.blob_map) {
            p.bc.finish_write(txc->seq);
        }
@@ -4733,8 +4738,8 @@ void BlueStore::_txc_write_nodes(TransContext *txc, KeyValueDB::Transaction t)
     dout(20) << "  onode " << (*p)->oid << " is " << bl.length() << dendl;
     t->set(PREFIX_OBJ, (*p)->key, bl);

-    std::lock_guard<std::mutex> l((*p)->flush_lock);
-    (*p)->flush_txns.insert(txc);
+    /*std::lock_guard<std::mutex> l((*p)->flush_lock);
+    (*p)->flush_txns.insert(txc);*/
   }


I am not able to reproduce this in my setup , so, if you can do the above changes in your env and see if you are still hitting the issue, would be helpful.

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Thursday, July 28, 2016 8:45 AM
To: 'Ma, Jianpeng'
Cc: ceph-devel
Subject: RE: BLueStore Deadlock

Hi Jianpeng,
Are you trying with latest master and still hitting the issue (seems so but confirming) ?
The following scenario should not be creating deadlock because of the following reason.

Onode->flush() is waiting on flush_lock() and from _txc_finish() it is releasing that before taking osr->qlock(). Am I missing anything ?

I got a deadlock in this path in one of my earlier changes in the following pull request (described in detail there) and it is fixed and merged.

https://github.com/ceph/ceph/pull/10220

If my theory is right , we are hitting deadlock because of some other reason may be. It seems you are doing WAL write , could you please describe the steps to reproduce ?

Thanks & Regards
Somnath

From: Ma, Jianpeng [mailto:jianpeng.ma@intel.com]
Sent: Thursday, July 28, 2016 1:46 AM
To: Somnath Roy
Cc: ceph-devel; Ma, Jianpeng
Subject: BLueStore Deadlock

Hi Roy:
     When do seqwrite w/ rbd+librbd, I met deadlock for bluestore. It can reproduce 100%.(based on 98602ae6c67637dbadddd549bd9a0035e5a2717)
By add message and this found this bug caused by bf70bcb6c54e4d6404533bc91781a5ef77d62033.
Consider this case:

tp_osd_tp                       aio_complete_thread                kv_sync_thread
Rwlock(coll)                    txc_finish_io                      _txc_finish
do_write                        lock(osr->qlock)                   lock(osr->qlock)
do_read                         RLock(coll)                        need osr->qlock to continue
onode->flush()                  need coll readlock to continue
   need previous txc complete

But current I don't how to fix this.

Thanks!

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-08-05  7:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <6AA21C22F0A5DA478922644AD2EC308C373B92BC@SHSMSX101.ccr.corp.intel.com>
2016-07-28 15:45 ` BLueStore Deadlock Somnath Roy
2016-07-28 22:21 ` Somnath Roy
2016-07-29  1:24   ` Ma, Jianpeng
2016-07-29  1:36     ` Ma, Jianpeng
2016-08-03  1:35       ` Ma, Jianpeng
2016-08-05  2:14         ` Somnath Roy
2016-08-05  7:36           ` Ma, Jianpeng
