* [PATCH 00/16] Various fixes for mds
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage

Hi,

This series of patches fixes various problems I encountered when
running two MDSes with thrash_exports enabled. The first three patches
are general MDS fixes; the rest fix bugs that only happen in a
multi-MDS setup.

I tested these patches using two MDSes with thrash_exports enabled and
a single kernel client. They survived the fsstress test for hours, and
the MDSes handled about 500,000 requests.

These patches are also in:
  git://github.com/ukernel/ceph.git master

Regards
Yan, Zheng



* [PATCH 01/16] mds: don't expire log segment before it's fully flushed
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

Expiring a log segment before it is fully flushed may cause various
issues during log replay.
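
For illustration, here is a minimal standalone sketch of the invariant
this patch enforces (toy types, not the real MDLog/Journaler API): a
segment may only be expired once its end offset is at or below the
journal's write-safe position.

  #include <cstdint>
  #include <list>

  struct Segment { uint64_t offset, end; };

  // Trim only segments whose journal range is already safely flushed;
  // stop at the first segment extending past the write-safe position.
  void trim(std::list<Segment> &segments, uint64_t write_safe_pos) {
    while (!segments.empty()) {
      const Segment &s = segments.front();
      if (s.end > write_safe_pos)
        break;                 // not fully flushed yet, cannot expire
      segments.pop_front();    // safe to expire
    }
  }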

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/MDLog.cc | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/mds/MDLog.cc b/src/mds/MDLog.cc
index cac5615..b02c181 100644
--- a/src/mds/MDLog.cc
+++ b/src/mds/MDLog.cc
@@ -330,6 +330,11 @@ void MDLog::trim(int m)
     assert(ls);
     p++;
     
+    if (ls->end > journaler->get_write_safe_pos()) {
+      dout(5) << "trim segment " << ls->offset << ", not fully flushed yet, safe "
+	      << journaler->get_write_safe_pos() << " < end " << ls->end << dendl;
+      break;
+    }
     if (expiring_segments.count(ls)) {
       dout(5) << "trim already expiring segment " << ls->offset << ", " << ls->num_events << " events" << dendl;
     } else if (expired_segments.count(ls)) {
@@ -412,9 +417,6 @@ void MDLog::_expired(LogSegment *ls)
 
   if (!capped && ls == get_current_segment()) {
     dout(5) << "_expired not expiring " << ls->offset << ", last one and !capped" << dendl;
-  } else if (ls->end > journaler->get_write_safe_pos()) {
-    dout(5) << "_expired not expiring " << ls->offset << ", not fully flushed yet, safe "
-	    << journaler->get_write_safe_pos() << " < end " << ls->end << dendl;
   } else {
     // expired.
     expired_segments.insert(ls);
-- 
1.7.11.7



* [PATCH 02/16] mds: fix anchor table update
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

The reference count of an anchor table entry that corresponds to a
directory is the number of anchored inodes under that directory. But
when updating the anchor trace for a directory inode, the code
increases/decreases its new/old ancestors' anchor table entries'
reference counts by only one.

This patch probably resolves BUG #1850.
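
As an illustration, here is a toy refcount table (hypothetical, far
simpler than AnchorServer) showing why the inode's full reference
count has to move between ancestor chains, not a single reference:

  #include <map>
  #include <utility>

  // ino -> (parent ino, refcount); parent 0 terminates the chain
  std::map<int, std::pair<int,int>> table;

  void inc(int ino, int ref) {   // add 'ref' along the ancestor chain
    while (ino) {
      table[ino].second += ref;
      ino = table[ino].first;
    }
  }

  void dec(int ino, int ref) {   // drop 'ref' along the ancestor chain
    while (ino) {
      int parent = table[ino].first;
      if ((table[ino].second -= ref) == 0)
        table.erase(ino);
      ino = parent;
    }
  }

  void update_trace(int ino) {
    int ref = table[ino].second; // may be > 1 for a directory
    dec(ino, ref);               // detach from the old ancestors
    // ... install the new trace here ...
    inc(ino, ref);               // attach to the new ancestors
  }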

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/AnchorServer.cc | 13 +++++++------
 src/mds/AnchorServer.h  |  4 ++--
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/src/mds/AnchorServer.cc b/src/mds/AnchorServer.cc
index 731ec96..cd2e625 100644
--- a/src/mds/AnchorServer.cc
+++ b/src/mds/AnchorServer.cc
@@ -73,7 +73,7 @@ bool AnchorServer::add(inodeno_t ino, inodeno_t dirino, __u32 dn_hash,
   return true;
 }
 
-void AnchorServer::inc(inodeno_t ino)
+void AnchorServer::inc(inodeno_t ino, int ref)
 {
   dout(7) << "inc " << ino << dendl;
 
@@ -81,7 +81,7 @@ void AnchorServer::inc(inodeno_t ino)
 
   while (1) {
     Anchor &anchor = anchor_map[ino];
-    anchor.nref++;
+    anchor.nref += ref;
     anchor.updated = version;
       
     dout(10) << "inc now " << anchor << dendl;
@@ -92,14 +92,14 @@ void AnchorServer::inc(inodeno_t ino)
   }
 }
 
-void AnchorServer::dec(inodeno_t ino) 
+void AnchorServer::dec(inodeno_t ino, int ref)
 {
   dout(7) << "dec " << ino << dendl;
   assert(anchor_map.count(ino));
 
   while (true) {
     Anchor &anchor = anchor_map[ino];
-    anchor.nref--;
+    anchor.nref -= ref;
     anchor.updated = version;
 
     if (anchor.nref == 0) {
@@ -186,13 +186,14 @@ void AnchorServer::_commit(version_t tid)
     if (anchor_map.count(ino)) {
       dout(7) << "commit " << tid << " update " << ino << dendl;
 
+      int ref = anchor_map[ino].nref;
       // remove old
-      dec(ino);
+      dec(ino, ref);
       
       // add new
       for (unsigned i=0; i<trace.size(); i++) 
 	add(trace[i].ino, trace[i].dirino, trace[i].dn_hash, true);
-      inc(ino);
+      inc(ino, ref);
     } else {
       dout(7) << "commit " << tid << " update " << ino << " -- DNE" << dendl;
     }
diff --git a/src/mds/AnchorServer.h b/src/mds/AnchorServer.h
index ca70aa2..aa5588a 100644
--- a/src/mds/AnchorServer.h
+++ b/src/mds/AnchorServer.h
@@ -50,8 +50,8 @@ class AnchorServer : public MDSTableServer {
   }
 
   bool add(inodeno_t ino, inodeno_t dirino, __u32 dn_hash, bool replace);
-  void inc(inodeno_t ino);
-  void dec(inodeno_t ino);
+  void inc(inodeno_t ino, int ref=1);
+  void dec(inodeno_t ino, int ref=1);
 
   void dump();
 
-- 
1.7.11.7



* [PATCH 03/16] mds: don't add not issued caps when confirming cap receipt
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

There is a message ordering race in the cephfs kernel client. We
compose cap messages while i_ceph_lock is held, but when adding
messages to the output queue, the kernel releases i_ceph_lock and
acquires a mutex. So it is possible for cap messages to be sent out
of order. If the kernel client sends a cap update and then a cap
release, but the two messages reach the MDS out of order, the update
message will re-add the released caps. This patch adds code to check
whether caps were actually issued when confirming cap receipt.
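
The guard amounts to masking the client-reported caps with what the
MDS believes is issued; a standalone sketch with made-up cap bits (not
the real ceph cap flags):

  #include <cstdio>

  enum { CAP_RD = 1, CAP_WR = 2, CAP_BUF = 4 };

  // Trust only bits we still consider issued, so a reordered (stale)
  // update cannot re-add caps the client has already released.
  int confirm_caps(int reported, int issued) {
    if (reported & ~issued)
      std::printf("ignoring not-issued caps: 0x%x\n", reported & ~issued);
    return reported & issued;
  }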

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/Locker.cc | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index 9d91748..c29ac34 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -2287,11 +2287,17 @@ void Locker::handle_client_caps(MClientCaps *m)
  
     // head inode, and cap
     MClientCaps *ack = 0;
+
+    int caps = m->get_caps();
+    if (caps & ~cap->issued()) {
+      dout(10) << " confirming not issued caps " << ccap_string(caps & ~cap->issued()) << dendl;
+      caps &= cap->issued();
+    }
     
-    cap->confirm_receipt(m->get_seq(), m->get_caps());
+    cap->confirm_receipt(m->get_seq(), caps);
     dout(10) << " follows " << follows
 	     << " retains " << ccap_string(m->get_caps())
-	     << " dirty " << ccap_string(m->get_caps())
+	     << " dirty " << ccap_string(m->get_dirty())
 	     << " on " << *in << dendl;
 
 
@@ -2424,6 +2430,10 @@ void Locker::process_request_cap_release(MDRequest *mdr, client_t client, const
     return;
   }
     
+  if (caps & ~cap->issued()) {
+    dout(10) << " confirming not issued caps " << ccap_string(caps & ~cap->issued()) << dendl;
+    caps &= cap->issued();
+  }
   cap->confirm_receipt(seq, caps);
   adjust_cap_wanted(cap, wanted, issue_seq);
   
-- 
1.7.11.7



* [PATCH 04/16] mds: allow try_eval to eval unstable locks in freezing object
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

Unstable locks hold auth_pins on the object, which prevents a freezing
object from becoming frozen and then unfreezing. So try_eval() should
not wait for freezing objects.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/Locker.cc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index c29ac34..7d55940 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -845,8 +845,8 @@ void Locker::try_eval(MDSCacheObject *p, int mask)
     return;
   }
 
-  if (p->is_auth() && !p->can_auth_pin()) {
-    dout(7) << "try_eval can't auth_pin, waiting on " << *p << dendl;
+  if (p->is_auth() && p->is_frozen()) {
+    dout(7) << "try_eval frozen, waiting on " << *p << dendl;
     p->add_waiter(MDSCacheObject::WAIT_UNFREEZE, new C_Locker_Eval(this, p, mask));
     return;
   }
-- 
1.7.11.7



* [PATCH 05/16] mds: Don't acquire replica object's versionlock
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

Both CInode's and CDentry's versionlocks are of type LocalLock.
Acquiring a LocalLock on a replica object is useless and problematic.
For example, if two requests try to acquire a replica object's
versionlock, the first request succeeds and the second is added to
the wait queue. Later, when the first request finishes,
MDCache::request_drop_foreign_locks() finds that the lock's parent is
non-auth and skips waking requests in the wait queue, so the second
request hangs.
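
Sketched standalone (toy types, not the Locker API), the rule is
simply to skip local version locks whose parent we are not
authoritative for:

  struct Object    { bool auth; };
  struct LocalLock { Object *parent; };

  // On a replica the versionlock's waiters are never woken once the
  // first holder finishes, so taking it can only cause hangs.
  bool should_take_versionlock(const LocalLock &l) {
    return l.parent->auth;
  }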

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/Locker.cc |  7 +++++++
 src/mds/Server.cc | 25 ++++++++++---------------
 2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index 7d55940..b67b3ee 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -196,6 +196,8 @@ bool Locker::acquire_locks(MDRequest *mdr,
     // augment xlock with a versionlock?
     if ((*p)->get_type() == CEPH_LOCK_DN) {
       CDentry *dn = (CDentry*)(*p)->get_parent();
+      if (!dn->is_auth())
+	continue;
 
       if (xlocks.count(&dn->versionlock))
 	continue;  // we're xlocking the versionlock too; don't wrlock it!
@@ -213,6 +215,8 @@ bool Locker::acquire_locks(MDRequest *mdr,
     if ((*p)->get_type() > CEPH_LOCK_IVERSION) {
       // inode version lock?
       CInode *in = (CInode*)(*p)->get_parent();
+      if (!in->is_auth())
+	continue;
       if (mdr->is_master()) {
 	// master.  wrlock versionlock so we can pipeline inode updates to journal.
 	wrlocks.insert(&in->versionlock);
@@ -3920,6 +3924,7 @@ void Locker::local_wrlock_grab(LocalLock *lock, Mutation *mut)
   dout(7) << "local_wrlock_grab  on " << *lock
 	  << " on " << *lock->get_parent() << dendl;  
   
+  assert(lock->get_parent()->is_auth());
   assert(lock->can_wrlock());
   assert(!mut->wrlocks.count(lock));
   lock->get_wrlock(mut->get_client());
@@ -3932,6 +3937,7 @@ bool Locker::local_wrlock_start(LocalLock *lock, MDRequest *mut)
   dout(7) << "local_wrlock_start  on " << *lock
 	  << " on " << *lock->get_parent() << dendl;  
   
+  assert(lock->get_parent()->is_auth());
   if (lock->can_wrlock()) {
     assert(!mut->wrlocks.count(lock));
     lock->get_wrlock(mut->get_client());
@@ -3963,6 +3969,7 @@ bool Locker::local_xlock_start(LocalLock *lock, MDRequest *mut)
   dout(7) << "local_xlock_start  on " << *lock
 	  << " on " << *lock->get_parent() << dendl;  
   
+  assert(lock->get_parent()->is_auth());
   if (!lock->can_xlock_local()) {
     lock->add_waiter(SimpleLock::WAIT_WR|SimpleLock::WAIT_STABLE, new C_MDS_RetryRequest(mdcache, mut));
     return false;
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 4642a13..f8d6666 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -5204,25 +5204,20 @@ void Server::handle_client_rename(MDRequest *mdr)
     wrlocks.insert(&straydn->get_dir()->inode->nestlock);
   }
 
-  // xlock versionlock on srci if remote?
-  //  this ensures it gets safely remotely auth_pinned, avoiding deadlock;
-  //  strictly speaking, having the slave node freeze the inode is 
-  //  otherwise sufficient for avoiding conflicts with inode locks, etc.
-  if (!srcdn->is_auth() && srcdnl->is_primary())  // xlock versionlock on srci if there are any witnesses
-    xlocks.insert(&srci->versionlock);
-
   // xlock versionlock on dentries if there are witnesses.
   //  replicas can't see projected dentry linkages, and will get
   //  confused if we try to pipeline things.
   if (!witnesses.empty()) {
-    if (srcdn->is_projected())
-      xlocks.insert(&srcdn->versionlock);
-    if (destdn->is_projected())
-      xlocks.insert(&destdn->versionlock);
-    // also take rdlock on all ancestor dentries for destdn.  this ensures that the
-    // destdn can be traversed to by the witnesses.
-    for (int i=0; i<(int)desttrace.size(); i++) 
-      xlocks.insert(&desttrace[i]->versionlock);
+    // take xlock on all projected ancestor dentries for srcdn and destdn.
+    // this ensures the srcdn and destdn can be traversed to by the witnesses.
+    for (int i= 0; i<(int)srctrace.size(); i++) {
+      if (srctrace[i]->is_auth() && srctrace[i]->is_projected())
+	  xlocks.insert(&srctrace[i]->versionlock);
+    }
+    for (int i=0; i<(int)desttrace.size(); i++) {
+      if (desttrace[i]->is_auth() && desttrace[i]->is_projected())
+	  xlocks.insert(&desttrace[i]->versionlock);
+    }
   }
 
   // we need to update srci's ctime.  xlock its least contended lock to do that...
-- 
1.7.11.7



* [PATCH 06/16] mds: clear lock flushed if replica is waiting for AC_LOCKFLUSHED
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

This way eval_gather() will not skip calling scatter_writebehind();
otherwise the replica lock may stay in the flushing state forever.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/Locker.cc | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index b67b3ee..c500be5 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -4404,8 +4404,12 @@ void Locker::handle_file_lock(ScatterLock *lock, MLock *m)
     if (lock->get_state() == LOCK_MIX_LOCK ||
 	lock->get_state() == LOCK_MIX_LOCK2 ||
 	lock->get_state() == LOCK_MIX_EXCL ||
-	lock->get_state() == LOCK_MIX_TSYN)
+	lock->get_state() == LOCK_MIX_TSYN) {
       lock->decode_locked_state(m->get_data());
+      // replica is waiting for AC_LOCKFLUSHED, eval_gather() should not
+      // delay calling scatter_writebehind().
+      lock->clear_flushed();
+    }
 
     if (lock->is_gathering()) {
       dout(7) << "handle_file_lock " << *in << " from " << from
-- 
1.7.11.7



* [PATCH 07/16] mds: call eval() after caps are exported
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

For an inode whose authority has just changed, the new auth MDS may
want to change one of the inode's locks from the 'sync' state to the
'lock' state before caps are exported. The lock on the replica can be
stuck in the 'sync->lock' state because client caps prevent it from
transitioning to the 'lock' state. So we should call eval() after
clearing the client caps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/Migrator.cc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 9829d68..41d97e9 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -1052,6 +1052,7 @@ void Migrator::finish_export_inode_caps(CInode *in)
     mds->send_message_client_counted(m, it->first);
   }
   in->clear_client_caps_after_export();
+  mds->locker->eval(in, CEPH_CAP_LOCKS);
 }
 
 void Migrator::finish_export_inode(CInode *in, utime_t now, list<Context*>& finished)
-- 
1.7.11.7



* [PATCH 08/16] mds: don't forward client request from MDS
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

Forwarding a client request that came from an MDS triggers an
assertion in MDS::forward_message_mds(). An MDS only sends client
requests for stray migration/reintegration, so it is safe to drop
them.
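
A minimal sketch of the dispatch decision (hypothetical names, not the
MDCache API):

  enum class Source { CLIENT, MDS };
  struct Request { Source source; };

  // Forward only requests that originated from a real client. An
  // MDS-originated request (stray migration/reintegration) is dropped
  // and cleaned up, since forwarding it would trip the assertion.
  bool should_forward(const Request &r) {
    return r.source == Source::CLIENT;
  }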

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/MDCache.cc | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 5027469..88a248b 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -7475,13 +7475,17 @@ void MDCache::request_finish(MDRequest *mdr)
 
 void MDCache::request_forward(MDRequest *mdr, int who, int port)
 {
-  dout(7) << "request_forward " << *mdr << " to mds." << who << " req " << *mdr << dendl;
-  
-  mds->forward_message_mds(mdr->client_request, who);  
-  mdr->client_request = 0;
+  if (mdr->client_request->get_source().is_client()) {
+    dout(7) << "request_forward " << *mdr << " to mds." << who << " req "
+            << *mdr->client_request << dendl;
+    mds->forward_message_mds(mdr->client_request, who);
+    mdr->client_request = 0;
+    if (mds->logger) mds->logger->inc(l_mds_fw);
+  } else {
+    dout(7) << "request_forward drop " << *mdr << " req " << *mdr->client_request
+            << " was from mds" << dendl;
+  }
   request_cleanup(mdr);
-
-  if (mds->logger) mds->logger->inc(l_mds_fw);
 }
 
 
-- 
1.7.11.7



* [PATCH 09/16] mds: check parent inode's versionlock when propagating rstats
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

To propagate rstats one level up, we need to lock both the parent
inode's nestlock and its versionlock.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/MDCache.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 88a248b..690c2ed 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -1917,7 +1917,8 @@ void MDCache::predirty_journal_parents(Mutation *mut, EMetaBlob *blob,
     // rstat
     if (!primary_dn) {
       // don't update parent this pass
-    } else if (!linkunlink && !parent->inode->nestlock.can_wrlock(-1)) {
+    } else if (!linkunlink && !(parent->inode->nestlock.can_wrlock(-1) &&
+			        parent->inode->versionlock.can_wrlock())) {
       dout(20) << " unwritable parent nestlock " << parent->inode->nestlock
 	       << ", marking dirty rstat on " << *cur << dendl;
       cur->mark_dirty_rstat();
-- 
1.7.11.7



* [PATCH 10/16] mds: drop locks if requiring auth pinning new objects.
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

Locker::acquire_locks() skips auth pinning a replica object if we only
request a rdlock and the lock is read-lockable. To get all locks, we
may call Locker::acquire_locks() several times, and locks on replica
objects may stop being read-lockable between calls. So it is possible
that we need to auth pin new objects after some locks are already
taken.
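
A rough standalone sketch of the retry discipline (toy types, not the
real acquire_locks()): drop every lock already held before blocking on
a remote auth pin, so the retried request cannot deadlock.

  #include <vector>

  struct Object  { bool auth = false, pinned = false; };
  struct Request {
    std::vector<Object*> need;    // objects the request must auth pin
    std::vector<Object*> locks;   // locks currently held
  };

  bool remote_auth_pin(Object *o) { o->pinned = true; return true; }  // stub

  bool acquire(Request &mdr) {
    for (Object *o : mdr.need) {
      if (!o->auth && !o->pinned) {
        mdr.locks.clear();        // drop held locks before waiting
        if (!remote_auth_pin(o))
          return false;           // caller re-enters acquire() later
      }
    }
    return true;                  // all pins held; now take the locks
  }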

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/Locker.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index c500be5..63f8311 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -286,11 +286,12 @@ bool Locker::acquire_locks(MDRequest *mdr,
       continue;
     
     if (!object->is_auth()) {
+      if (!mdr->locks.empty())
+	mds->locker->drop_locks(mdr);
       if (object->is_ambiguous_auth()) {
 	// wait
 	dout(10) << " ambiguous auth, waiting to authpin " << *object << dendl;
 	object->add_waiter(MDSCacheObject::WAIT_SINGLEAUTH, new C_MDS_RetryRequest(mdcache, mdr));
-	mds->locker->drop_locks(mdr);
 	mdr->drop_local_auth_pins();
 	return false;
       }
-- 
1.7.11.7



* [PATCH 11/16] mds: consider revoking caps in imported caps as issued
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

The clients may have already sent the cap release message to the
exporting MDS, in which case the importing MDS would wait for the
release message forever. Considering revoking caps as issued avoids
this issue.
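
The merge boils down to one bitmask decision; a standalone sketch with
toy cap bits:

  // Merging an imported cap into a local one: if the import was still
  // revoking bits (issued beyond pending), keep those bits issued so
  // the importer doesn't wait for a release the client already sent
  // to the exporting MDS.
  int merge_issued(int other_issued, int other_pending, int my_pending) {
    int newpending = other_pending | my_pending;
    if (other_issued & ~newpending)
      return other_issued | newpending;  // still revoking: keep issued
    return newpending;
  }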

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/Capability.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/mds/Capability.h b/src/mds/Capability.h
index f374328..6fe67f4 100644
--- a/src/mds/Capability.h
+++ b/src/mds/Capability.h
@@ -297,7 +297,8 @@ public:
     int newpending = other.pending | pending();
     if (other.issued & ~newpending)
       issue(other.issued | newpending);
-    issue(newpending);
+    else
+      issue(newpending);
     last_issue_stamp = other.last_issue_stamp;
 
     client_follows = other.client_follows;
@@ -311,7 +312,8 @@ public:
     int newpending = pending();
     if (otherissued & ~newpending)
       issue(otherissued | newpending);
-    issue(newpending);
+    else
+      issue(newpending);
 
     // wanted
     _wanted = _wanted | otherwanted;
-- 
1.7.11.7



* [PATCH 12/16] mds: fix open_remote_inode race
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

discover_ino() may return -ENOENT if it races with other FS
activities, so use C_MDC_RetryOpenRemoteIno instead of
C_MDC_OpenRemoteIno as the onfinish callback.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/MDCache.cc | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 690c2ed..48ccc7e 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -7122,7 +7122,8 @@ void MDCache::open_remote_ino_2(inodeno_t ino,
   if (!dir && in->is_auth()) {
     if (in->is_frozen_dir()) {
       dout(7) << "traverse: " << *in << " is frozen_dir, waiting" << dendl;
-      in->parent->dir->add_waiter(CDir::WAIT_UNFREEZE, onfinish);
+      in->parent->dir->add_waiter(CDir::WAIT_UNFREEZE,
+				  new C_MDC_RetryOpenRemoteIno(this, ino, onfinish));
       return;
     }
     dir = in->get_or_open_dirfrag(this, frag);
@@ -7157,7 +7158,7 @@ void MDCache::open_remote_ino_2(inodeno_t ino,
     dout(10) << "have remote dirfrag " << *dir << ", discovering " 
 	     << anchortrace[i].ino << dendl;
     discover_ino(dir, anchortrace[i].ino, 
-		 new C_MDC_OpenRemoteIno(this, ino, anchortrace, onfinish));
+		 new C_MDC_RetryOpenRemoteIno(this, ino, onfinish));
   }
 }
 
-- 
1.7.11.7



* [PATCH 13/16] mds: fix assertion in handle_cache_expire
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

During export, it is possible to receive cache expire messages in the
DISCOVERING, FREEZING, and PREPPING states.
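
A condensed sketch of the relaxed assertion (toy enum in protocol
order, mirroring but not copying Migrator's states):

  #include <cassert>

  enum ExportState { DISCOVERING, FREEZING, PREPPING, WARNING, EXPORTING };

  // Any state at or before PREPPING may legitimately see cache-expire
  // messages; at WARNING, only peers not yet warned may send them.
  void check_expire_ok(bool auth_exporting, ExportState s, bool warned) {
    assert(!auth_exporting || s <= PREPPING || (s == WARNING && !warned));
  }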

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/MDCache.cc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 48ccc7e..5dcb3e3 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -5928,6 +5928,7 @@ void MDCache::handle_cache_expire(MCacheExpire *m)
 	continue;
       }
       assert(!(parent_dir->is_auth() && parent_dir->is_exporting()) ||
+	     migrator->get_export_state(parent_dir) <= Migrator::EXPORT_PREPPING ||
              (migrator->get_export_state(parent_dir) == Migrator::EXPORT_WARNING &&
                  !migrator->export_has_warned(parent_dir, from)));
 
-- 
1.7.11.7



* [PATCH 14/16] mds: fix freeze inode deadlock
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

CInode::freeze_inode() is used in the case of cross-authority rename.
Server::handle_slave_rename_prep() calls it to wait for all other
operations on the source inode to complete. This happens after all
locks for the rename operation are acquired. But to acquire locks, we
need to auth pin the locks' parent objects first. So there is an ABBA
deadlock if someone auth pins the source inode after the locks for
the rename are acquired and before Server::handle_slave_rename_prep()
is called. The fix is to freeze and auth pin the source inode at the
same time.

This patch introduces CInode::freeze_auth_pin(), which waits for all
other MDRequests to release their auth pins, then changes the inode to
the FROZENAUTHPIN state; this state prevents other MDRequests from
getting new auth pins.
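
A condensed standalone sketch of the new gate (toy flags, not CInode):

  struct Inode {
    bool freezing = false, frozen = false, frozen_authpin = false;
    int  auth_pins = 0;

    // FROZENAUTHPIN blocks new auth pins just like FREEZING/FROZEN.
    bool can_auth_pin() const {
      return !freezing && !frozen && !frozen_authpin;
    }
  };

  // freeze_auth_pin(): once every other request has unpinned (the
  // caller keeps exactly one pin), flip to FROZENAUTHPIN so no new
  // auth pin can appear before handle_slave_rename_prep() runs.
  bool freeze_auth_pin(Inode &in) {
    if (in.auth_pins != 1)
      return false;              // wait for other requests to unpin
    in.frozen_authpin = true;
    return true;
  }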

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/CInode.cc               | 36 +++++++++++++++++++++++++++----
 src/mds/CInode.h                |  5 +++++
 src/mds/Locker.cc               |  7 ++++--
 src/mds/Locker.h                |  3 ++-
 src/mds/Mutation.cc             | 31 +++++++++++++++++++++++++++
 src/mds/Mutation.h              |  6 ++++++
 src/mds/Server.cc               | 47 ++++++++++++++++++++++++++++++-----------
 src/mds/mdstypes.h              |  7 ++++++
 src/messages/MMDSSlaveRequest.h |  1 +
 9 files changed, 124 insertions(+), 19 deletions(-)

diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index c129308..af70b68 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -130,6 +130,7 @@ ostream& operator<<(ostream& out, CInode& in)
   if (in.state_test(CInode::STATE_DIRTYPARENT)) out << " dirtyparent";
   if (in.is_freezing_inode()) out << " FREEZING=" << in.auth_pin_freeze_allowance;
   if (in.is_frozen_inode()) out << " FROZEN";
+  if (in.is_frozen_auth_pin()) out << " FROZEN_AUTHPIN";
 
   inode_t *pi = in.get_projected_inode();
   if (pi->is_truncating())
@@ -1862,7 +1863,8 @@ void CInode::add_waiter(uint64_t tag, Context *c)
   // wait on the directory?
   //  make sure its not the inode that is explicitly ambiguous|freezing|frozen
   if (((tag & WAIT_SINGLEAUTH) && !state_test(STATE_AMBIGUOUSAUTH)) ||
-      ((tag & WAIT_UNFREEZE) && !is_frozen_inode() && !is_freezing_inode())) {
+      ((tag & WAIT_UNFREEZE) &&
+       !is_frozen_inode() && !is_freezing_inode() && !is_frozen_auth_pin())) {
     dout(15) << "passing waiter up tree" << dendl;
     parent->dir->add_waiter(tag, c);
     return;
@@ -1885,8 +1887,10 @@ bool CInode::freeze_inode(int auth_pin_allowance)
 
   dout(10) << "freeze_inode - frozen" << dendl;
   assert(auth_pins == auth_pin_allowance);
-  get(PIN_FROZEN);
-  state_set(STATE_FROZEN);
+  if (!state_test(STATE_FROZEN)) {
+    get(PIN_FROZEN);
+    state_set(STATE_FROZEN);
+  }
   return true;
 }
 
@@ -1904,10 +1908,34 @@ void CInode::unfreeze_inode(list<Context*>& finished)
   take_waiting(WAIT_UNFREEZE, finished);
 }
 
+void CInode::unfreeze_inode()
+{
+    list<Context*> finished;
+    unfreeze_inode(finished);
+    mdcache->mds->queue_waiters(finished);
+}
+
+void CInode::freeze_auth_pin()
+{
+  assert(state_test(CInode::STATE_FROZEN));
+  state_set(CInode::STATE_FROZENAUTHPIN);
+}
+
+void CInode::unfreeze_auth_pin()
+{
+  assert(state_test(CInode::STATE_FROZENAUTHPIN));
+  state_clear(CInode::STATE_FROZENAUTHPIN);
+  if (!state_test(STATE_FREEZING|STATE_FROZEN)) {
+    list<Context*> finished;
+    take_waiting(WAIT_UNFREEZE, finished);
+    mdcache->mds->queue_waiters(finished);
+  }
+}
 
 // auth_pins
 bool CInode::can_auth_pin() {
-  if (is_freezing_inode() || is_frozen_inode()) return false;
+  if (is_freezing_inode() || is_frozen_inode() || is_frozen_auth_pin())
+    return false;
   if (parent)
     return parent->can_auth_pin();
   return true;
diff --git a/src/mds/CInode.h b/src/mds/CInode.h
index b76b524..e43ecf5 100644
--- a/src/mds/CInode.h
+++ b/src/mds/CInode.h
@@ -181,6 +181,7 @@ public:
   static const int STATE_DIRTYPARENT =  (1<<14);
   static const int STATE_DIRTYRSTAT =  (1<<15);
   static const int STATE_STRAYPINNED = (1<<16);
+  static const int STATE_FROZENAUTHPIN = (1<<17);
 
   // -- waiters --
   static const uint64_t WAIT_DIR         = (1<<0);
@@ -856,6 +857,7 @@ public:
   // -- freeze --
   bool is_freezing_inode() { return state_test(STATE_FREEZING); }
   bool is_frozen_inode() { return state_test(STATE_FROZEN); }
+  bool is_frozen_auth_pin() { return state_test(STATE_FROZENAUTHPIN); }
   bool is_frozen();
   bool is_frozen_dir();
   bool is_freezing();
@@ -864,7 +866,10 @@ public:
    * auth_pins it is itself holding/responsible for. */
   bool freeze_inode(int auth_pin_allowance=0);
   void unfreeze_inode(list<Context*>& finished);
+  void unfreeze_inode();
 
+  void freeze_auth_pin();
+  void unfreeze_auth_pin();
 
   // -- reference counting --
   void bad_put(int by) {
diff --git a/src/mds/Locker.cc b/src/mds/Locker.cc
index 63f8311..ee4799e 100644
--- a/src/mds/Locker.cc
+++ b/src/mds/Locker.cc
@@ -174,7 +174,8 @@ bool Locker::acquire_locks(MDRequest *mdr,
 			   set<SimpleLock*> &rdlocks,
 			   set<SimpleLock*> &wrlocks,
 			   set<SimpleLock*> &xlocks,
-			   map<SimpleLock*,int> *remote_wrlocks)
+			   map<SimpleLock*,int> *remote_wrlocks,
+			   CInode *auth_pin_freeze)
 {
   if (mdr->done_locking &&
       !mdr->is_slave()) {  // not on slaves!  master requests locks piecemeal.
@@ -336,7 +337,9 @@ bool Locker::acquire_locks(MDRequest *mdr,
 	dout(10) << " req remote auth_pin of " << **q << dendl;
 	MDSCacheObjectInfo info;
 	(*q)->set_object_info(info);
-	req->get_authpins().push_back(info);      
+	req->get_authpins().push_back(info);
+	if (*q == auth_pin_freeze)
+	  (*q)->set_object_info(req->get_authpin_freeze());
 	mdr->pin(*q);
       }
       mds->send_message_mds(req, p->first);
diff --git a/src/mds/Locker.h b/src/mds/Locker.h
index a1cf59e..b3b9919 100644
--- a/src/mds/Locker.h
+++ b/src/mds/Locker.h
@@ -88,7 +88,8 @@ public:
 		     set<SimpleLock*> &rdlocks,
 		     set<SimpleLock*> &wrlocks,
 		     set<SimpleLock*> &xlocks,
-		     map<SimpleLock*,int> *remote_wrlocks=NULL);
+		     map<SimpleLock*,int> *remote_wrlocks=NULL,
+		     CInode *auth_pin_freeze=NULL);
 
   void cancel_locking(Mutation *mut, set<CInode*> *pneed_issue);
   void drop_locks(Mutation *mut, set<CInode*> *pneed_issue=0);
diff --git a/src/mds/Mutation.cc b/src/mds/Mutation.cc
index 6321ffc..a9c3513 100644
--- a/src/mds/Mutation.cc
+++ b/src/mds/Mutation.cc
@@ -82,8 +82,39 @@ void Mutation::auth_unpin(MDSCacheObject *object)
   auth_pins.erase(object);
 }
 
+bool Mutation::freeze_auth_pin(CInode *inode)
+{
+  assert(!auth_pin_freeze || auth_pin_freeze == inode);
+  auth_pin_freeze = inode;
+  auth_pin(inode);
+  if (!inode->freeze_inode(1))
+    return false;
+
+  inode->freeze_auth_pin();
+  inode->unfreeze_inode();
+  return true;
+}
+
+void Mutation::unfreeze_auth_pin(CInode *inode)
+{
+  assert(auth_pin_freeze == inode);
+  assert(is_auth_pinned(inode));
+  if (inode->is_frozen_auth_pin())
+    inode->unfreeze_auth_pin();
+  else
+    inode->unfreeze_inode();
+  auth_pin_freeze = NULL;
+}
+
+bool Mutation::can_auth_pin(MDSCacheObject *object)
+{
+  return object->can_auth_pin() || (is_auth_pinned(object) && object == auth_pin_freeze);
+}
+
 void Mutation::drop_local_auth_pins()
 {
+  if (auth_pin_freeze)
+    unfreeze_auth_pin(auth_pin_freeze);
   for (set<MDSCacheObject*>::iterator it = auth_pins.begin();
        it != auth_pins.end();
        it++) {
diff --git a/src/mds/Mutation.h b/src/mds/Mutation.h
index ca5ee96..d05d0ee 100644
--- a/src/mds/Mutation.h
+++ b/src/mds/Mutation.h
@@ -50,6 +50,7 @@ struct Mutation {
   // auth pins
   set< MDSCacheObject* > remote_auth_pins;
   set< MDSCacheObject* > auth_pins;
+  CInode *auth_pin_freeze;
   
   // held locks
   set< SimpleLock* > rdlocks;  // always local.
@@ -80,12 +81,14 @@ struct Mutation {
     : attempt(0),
       ls(0),
       slave_to_mds(-1),
+      auth_pin_freeze(NULL),
       locking(NULL),
       done_locking(false), committing(false), aborted(false) { }
   Mutation(metareqid_t ri, __u32 att=0, int slave_to=-1)
     : reqid(ri), attempt(att),
       ls(0),
       slave_to_mds(slave_to), 
+      auth_pin_freeze(NULL),
       locking(NULL),
       done_locking(false), committing(false), aborted(false) { }
   virtual ~Mutation() {
@@ -119,6 +122,9 @@ struct Mutation {
   bool is_auth_pinned(MDSCacheObject *object);
   void auth_pin(MDSCacheObject *object);
   void auth_unpin(MDSCacheObject *object);
+  bool freeze_auth_pin(CInode *inode);
+  void unfreeze_auth_pin(CInode *inode);
+  bool can_auth_pin(MDSCacheObject *object);
   void drop_local_auth_pins();
   void add_projected_inode(CInode *in);
   void pop_and_dirty_projected_inodes();
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index f8d6666..5cee949 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -1486,6 +1486,7 @@ void Server::handle_slave_auth_pin(MDRequest *mdr)
 
   // build list of objects
   list<MDSCacheObject*> objects;
+  CInode *auth_pin_freeze = NULL;
   bool fail = false;
 
   for (vector<MDSCacheObjectInfo>::iterator p = mdr->slave_request->get_authpins().begin();
@@ -1499,6 +1500,8 @@ void Server::handle_slave_auth_pin(MDRequest *mdr)
     }
 
     objects.push_back(object);
+    if (*p == mdr->slave_request->get_authpin_freeze())
+      auth_pin_freeze = dynamic_cast<CInode*>(object);
   }
   
   // can we auth pin them?
@@ -1511,8 +1514,7 @@ void Server::handle_slave_auth_pin(MDRequest *mdr)
 	fail = true;
 	break;
       }
-      if (!mdr->is_auth_pinned(*p) &&
-	  !(*p)->can_auth_pin()) {
+      if (!mdr->can_auth_pin(*p)) {
 	// wait
 	dout(10) << " waiting for authpinnable on " << **p << dendl;
 	(*p)->add_waiter(CDir::WAIT_UNFREEZE, new C_MDS_RetryRequest(mdcache, mdr));
@@ -1526,6 +1528,22 @@ void Server::handle_slave_auth_pin(MDRequest *mdr)
   if (fail) {
     mdr->drop_local_auth_pins();  // just in case
   } else {
+    /* handle_slave_rename_prep() call freeze_inode() to wait for all other operations
+     * on the source inode to complete. This happens after all locks for the rename
+     * operation are acquired. But to acquire locks, we need auth pin locks' parent
+     * objects first. So there is an ABBA deadlock if someone auth pins the source inode
+     * after locks are acquired and before Server::handle_slave_rename_prep() is called.
+     * The solution is freeze the inode and prevent other MDRequests from getting new
+     * auth pins.
+     */
+    if (auth_pin_freeze) {
+      dout(10) << " freezing auth pin on " << *auth_pin_freeze << dendl;
+      if (!mdr->freeze_auth_pin(auth_pin_freeze)) {
+	auth_pin_freeze->add_waiter(CInode::WAIT_FROZEN, new C_MDS_RetryRequest(mdcache, mdr));
+	mds->mdlog->flush();
+	return;
+      }
+    }
     for (list<MDSCacheObject*>::iterator p = objects.begin();
 	 p != objects.end();
 	 ++p) {
@@ -1922,7 +1940,8 @@ CInode* Server::rdlock_path_pin_ref(MDRequest *mdr, int n,
     //   do NOT proceed if freezing, as cap release may defer in that case, and
     //   we could deadlock when we try to lock @ref.
     // if we're already auth_pinned, continue; the release has already been processed.
-    if (ref->is_frozen() || (ref->is_freezing() && !mdr->is_auth_pinned(ref))) {
+    if (ref->is_frozen() || ref->is_frozen_auth_pin() ||
+	(ref->is_freezing() && !mdr->is_auth_pinned(ref))) {
       dout(7) << "waiting for !frozen/authpinnable on " << *ref << dendl;
       ref->add_waiter(CInode::WAIT_UNFREEZE, new C_MDS_RetryRequest(mdcache, mdr));
       /* If we have any auth pins, this will deadlock.
@@ -5241,7 +5260,9 @@ void Server::handle_client_rename(MDRequest *mdr)
   // take any locks needed for anchor creation/verification
   mds->mdcache->anchor_create_prep_locks(mdr, srci, rdlocks, xlocks);
 
-  if (!mds->locker->acquire_locks(mdr, rdlocks, wrlocks, xlocks, &remote_wrlocks))
+  CInode *auth_pin_freeze = !srcdn->is_auth() && srcdnl->is_primary() ? srci : NULL;
+  if (!mds->locker->acquire_locks(mdr, rdlocks, wrlocks, xlocks,
+				  &remote_wrlocks, auth_pin_freeze))
     return;
 
   if (oldin &&
@@ -5991,9 +6012,7 @@ void Server::handle_slave_rename_prep(MDRequest *mdr)
 
   // am i srcdn auth?
   if (srcdn->is_auth()) {
-    if (srcdnl->is_primary() && 
-	!srcdnl->get_inode()->is_freezing_inode() &&
-	!srcdnl->get_inode()->is_frozen_inode()) {
+    if (srcdnl->is_primary()) {
       // set ambiguous auth for srci
       /*
        * NOTE: we don't worry about ambiguous cache expire as we do
@@ -6010,7 +6029,13 @@ void Server::handle_slave_rename_prep(MDRequest *mdr)
       int allowance = 2; // 1 for the mdr auth_pin, 1 for the link lock
       allowance += srcdnl->get_inode()->is_dir(); // for the snap lock
       dout(10) << " freezing srci " << *srcdnl->get_inode() << " with allowance " << allowance << dendl;
-      if (!srcdnl->get_inode()->freeze_inode(allowance)) {
+      bool frozen_inode = srcdnl->get_inode()->freeze_inode(allowance);
+
+      // unfreeze auth pin after freezing the inode to avoid queueing waiters
+      if (srcdnl->get_inode()->is_frozen_auth_pin())
+	mdr->unfreeze_auth_pin(srcdnl->get_inode());
+
+      if (!frozen_inode) {
 	srcdnl->get_inode()->add_waiter(CInode::WAIT_FROZEN, new C_MDS_RetryRequest(mdcache, mdr));
 	return;
       }
@@ -6178,8 +6203,7 @@ void Server::_commit_slave_rename(MDRequest *mdr, int r,
       destdnl->get_inode()->take_waiting(CInode::WAIT_SINGLEAUTH, finished);
 
       // unfreeze
-      assert(destdnl->get_inode()->is_frozen_inode() ||
-             destdnl->get_inode()->is_freezing_inode());
+      assert(destdnl->get_inode()->is_frozen_inode());
       destdnl->get_inode()->unfreeze_inode(finished);
 
       mds->queue_waiters(finished);
@@ -6202,8 +6226,7 @@ void Server::_commit_slave_rename(MDRequest *mdr, int r,
       destdnl->get_inode()->take_waiting(CInode::WAIT_SINGLEAUTH, finished);
       
       // unfreeze
-      assert(destdnl->get_inode()->is_frozen_inode() ||
-	     destdnl->get_inode()->is_freezing_inode());
+      assert(destdnl->get_inode()->is_frozen_inode());
       destdnl->get_inode()->unfreeze_inode(finished);
       
       mds->queue_waiters(finished);
diff --git a/src/mds/mdstypes.h b/src/mds/mdstypes.h
index db4dbf1..22e754e 100644
--- a/src/mds/mdstypes.h
+++ b/src/mds/mdstypes.h
@@ -1250,6 +1250,13 @@ public:
   }
 };
 
+inline bool operator==(const MDSCacheObjectInfo& l, const MDSCacheObjectInfo& r) {
+  if (l.ino || r.ino)
+    return l.ino == r.ino && l.snapid == r.snapid;
+  else
+    return l.dirfrag == r.dirfrag && l.dname == r.dname;
+}
+
 WRITE_CLASS_ENCODER(MDSCacheObjectInfo)
 
 
diff --git a/src/messages/MMDSSlaveRequest.h b/src/messages/MMDSSlaveRequest.h
index 4f2bb59..03ec582 100644
--- a/src/messages/MMDSSlaveRequest.h
+++ b/src/messages/MMDSSlaveRequest.h
@@ -112,6 +112,7 @@ public:
 
   int get_lock_type() { return lock_type; }
   MDSCacheObjectInfo &get_object_info() { return object_info; }
+  MDSCacheObjectInfo &get_authpin_freeze() { return object_info; }
 
   vector<MDSCacheObjectInfo>& get_authpins() { return authpins; }
 
-- 
1.7.11.7



* [PATCH 15/16] mds: allow open_remote_ino() to open xlocked dentry
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

discover_ino() has a parameter want_xlocked, which indicates whether
the remote discover handler may proceed when an xlocked dentry is
encountered. open_remote_ino() uses discover_ino() to find non-auth
inodes, but always sets 'want_xlocked' to false. This may cause
deadlock in some corner cases. For example:

We rename an inode's primary dentry to one of its remote dentries and
send a slave request to one witness MDS, but before the slave request
reaches the witness MDS, the inode is trimmed from the witness MDS's
cache. When the slave request arrives, open_remote_ino() is called
while traversing the destpath. open_remote_ino() calls discover_ino()
with 'want_xlocked=false' to find the inode. discover_ino() sends an
MDiscover message to the inode's authority MDS. The handler of the
MDiscover message finds that the inode's primary dentry is xlocked,
and it sleeps.

The fix is to add a parameter 'want_xlocked' to open_remote_ino() and
make open_remote_ino() pass the parameter to discover_ino().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/MDCache.cc | 51 ++++++++++++++++++++++++++++-----------------------
 src/mds/MDCache.h  |  8 ++++----
 src/mds/Server.cc  |  7 +++++++
 3 files changed, 39 insertions(+), 27 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 5dcb3e3..9cddfeb 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -6731,7 +6731,8 @@ int MDCache::path_traverse(MDRequest *mdr, Message *req, Context *fin,     // wh
 	} else {
           dout(7) << "remote link to " << dnl->get_remote_ino() << ", which i don't have" << dendl;
 	  assert(mdr);  // we shouldn't hit non-primary dentries doing a non-mdr traversal!
-          open_remote_ino(dnl->get_remote_ino(), _get_waiter(mdr, req, fin));
+          open_remote_ino(dnl->get_remote_ino(), _get_waiter(mdr, req, fin),
+			  (null_okay && depth == path.depth() - 1));
 	  if (mds->logger) mds->logger->inc(l_mds_trino);
           return 1;
         }        
@@ -7017,12 +7018,13 @@ CInode *MDCache::get_dentry_inode(CDentry *dn, MDRequest *mdr, bool projected)
 class C_MDC_RetryOpenRemoteIno : public Context {
   MDCache *mdcache;
   inodeno_t ino;
+  bool want_xlocked;
   Context *onfinish;
 public:
-  C_MDC_RetryOpenRemoteIno(MDCache *mdc, inodeno_t i, Context *c) :
-    mdcache(mdc), ino(i), onfinish(c) {}
+  C_MDC_RetryOpenRemoteIno(MDCache *mdc, inodeno_t i, Context *c, bool wx) :
+    mdcache(mdc), ino(i), want_xlocked(wx), onfinish(c) {}
   void finish(int r) {
-    mdcache->open_remote_ino(ino, onfinish);
+    mdcache->open_remote_ino(ino, onfinish, want_xlocked);
   }
 };
 
@@ -7032,19 +7034,20 @@ class C_MDC_OpenRemoteIno : public Context {
   inodeno_t ino;
   inodeno_t hadino;
   version_t hadv;
+  bool want_xlocked;
   Context *onfinish;
 public:
   vector<Anchor> anchortrace;
 
-  C_MDC_OpenRemoteIno(MDCache *mdc, inodeno_t i, inodeno_t hi, version_t hv, Context *c) :
-    mdcache(mdc), ino(i), hadino(hi), hadv(hv), onfinish(c) {}
-  C_MDC_OpenRemoteIno(MDCache *mdc, inodeno_t i, vector<Anchor>& at, Context *c) :
-    mdcache(mdc), ino(i), hadino(0), hadv(0), onfinish(c), anchortrace(at) {}
+  C_MDC_OpenRemoteIno(MDCache *mdc, inodeno_t i, bool wx, inodeno_t hi, version_t hv, Context *c) :
+    mdcache(mdc), ino(i), hadino(hi), hadv(hv), want_xlocked(wx), onfinish(c) {}
+  C_MDC_OpenRemoteIno(MDCache *mdc, inodeno_t i, bool wx, vector<Anchor>& at, Context *c) :
+    mdcache(mdc), ino(i), hadino(0), hadv(0), want_xlocked(wx), onfinish(c), anchortrace(at) {}
 
   void finish(int r) {
     assert(r == 0);
     if (r == 0)
-      mdcache->open_remote_ino_2(ino, anchortrace, hadino, hadv, onfinish);
+      mdcache->open_remote_ino_2(ino, anchortrace, want_xlocked, hadino, hadv, onfinish);
     else {
       onfinish->finish(r);
       delete onfinish;
@@ -7052,18 +7055,18 @@ public:
   }
 };
 
-void MDCache::open_remote_ino(inodeno_t ino, Context *onfinish, inodeno_t hadino, version_t hadv)
+void MDCache::open_remote_ino(inodeno_t ino, Context *onfinish, bool want_xlocked,
+			      inodeno_t hadino, version_t hadv)
 {
-  dout(7) << "open_remote_ino on " << ino << dendl;
+  dout(7) << "open_remote_ino on " << ino << (want_xlocked ? " want_xlocked":"") << dendl;
   
-  C_MDC_OpenRemoteIno *c = new C_MDC_OpenRemoteIno(this, ino, hadino, hadv, onfinish);
+  C_MDC_OpenRemoteIno *c = new C_MDC_OpenRemoteIno(this, ino, want_xlocked,
+						   hadino, hadv, onfinish);
   mds->anchorclient->lookup(ino, c->anchortrace, c);
 }
 
-void MDCache::open_remote_ino_2(inodeno_t ino,
-                                vector<Anchor>& anchortrace,
-				inodeno_t hadino, version_t hadv,
-                                Context *onfinish)
+void MDCache::open_remote_ino_2(inodeno_t ino, vector<Anchor>& anchortrace, bool want_xlocked,
+				inodeno_t hadino, version_t hadv, Context *onfinish)
 {
   dout(7) << "open_remote_ino_2 on " << ino
 	  << ", trace depth is " << anchortrace.size() << dendl;
@@ -7106,7 +7109,7 @@ void MDCache::open_remote_ino_2(inodeno_t ino,
 
   if (!in->dirfragtree.contains(frag)) {
     dout(10) << "frag " << frag << " not valid, requerying anchortable" << dendl;
-    open_remote_ino(ino, onfinish);
+    open_remote_ino(ino, onfinish, want_xlocked);
     return;
   }
 
@@ -7116,7 +7119,7 @@ void MDCache::open_remote_ino_2(inodeno_t ino,
     dout(10) << "opening remote dirfrag " << frag << " under " << *in << dendl;
     /* we re-query the anchortable just to avoid a fragtree update race */
     open_remote_dirfrag(in, frag,
-			new C_MDC_RetryOpenRemoteIno(this, ino, onfinish));
+			new C_MDC_RetryOpenRemoteIno(this, ino, onfinish, want_xlocked));
     return;
   }
 
@@ -7124,7 +7127,7 @@ void MDCache::open_remote_ino_2(inodeno_t ino,
     if (in->is_frozen_dir()) {
       dout(7) << "traverse: " << *in << " is frozen_dir, waiting" << dendl;
       in->parent->dir->add_waiter(CDir::WAIT_UNFREEZE,
-				  new C_MDC_RetryOpenRemoteIno(this, ino, onfinish));
+				  new C_MDC_RetryOpenRemoteIno(this, ino, onfinish, want_xlocked));
       return;
     }
     dir = in->get_or_open_dirfrag(this, frag);
@@ -7146,20 +7149,22 @@ void MDCache::open_remote_ino_2(inodeno_t ino,
 		 << " in complete dir " << *dir
 		 << ", requerying anchortable"
 		 << dendl;
-	open_remote_ino(ino, onfinish, anchortrace[i].ino, anchortrace[i].updated);
+	open_remote_ino(ino, onfinish, want_xlocked,
+		        anchortrace[i].ino, anchortrace[i].updated);
       }
     } else {
       dout(10) << "need ino " << anchortrace[i].ino
 	       << ", fetching incomplete dir " << *dir
 	       << dendl;
-      dir->fetch(new C_MDC_OpenRemoteIno(this, ino, anchortrace, onfinish));
+      dir->fetch(new C_MDC_OpenRemoteIno(this, ino, want_xlocked, anchortrace, onfinish));
     }
   } else {
     // hmm, discover.
     dout(10) << "have remote dirfrag " << *dir << ", discovering " 
 	     << anchortrace[i].ino << dendl;
-    discover_ino(dir, anchortrace[i].ino, 
-		 new C_MDC_RetryOpenRemoteIno(this, ino, onfinish));
+    discover_ino(dir, anchortrace[i].ino,
+	         new C_MDC_RetryOpenRemoteIno(this, ino, onfinish, want_xlocked),
+		 (want_xlocked && i == anchortrace.size() - 1));
   }
 }
 
diff --git a/src/mds/MDCache.h b/src/mds/MDCache.h
index 64290aa..31c7467 100644
--- a/src/mds/MDCache.h
+++ b/src/mds/MDCache.h
@@ -701,11 +701,11 @@ public:
 
   void open_remote_dirfrag(CInode *diri, frag_t fg, Context *fin);
   CInode *get_dentry_inode(CDentry *dn, MDRequest *mdr, bool projected=false);
-  void open_remote_ino(inodeno_t ino, Context *fin, inodeno_t hadino=0, version_t hadv=0);
+  void open_remote_ino(inodeno_t ino, Context *fin, bool want_xlocked=false,
+		       inodeno_t hadino=0, version_t hadv=0);
   void open_remote_ino_2(inodeno_t ino,
-                         vector<Anchor>& anchortrace,
-			 inodeno_t hadino, version_t hadv,
-                         Context *onfinish);
+			 vector<Anchor>& anchortrace, bool want_xlocked,
+			 inodeno_t hadino, version_t hadv, Context *onfinish);
   void open_remote_dentry(CDentry *dn, bool projected, Context *fin);
   void _open_remote_dentry_finish(int r, CDentry *dn, bool projected, Context *fin);
 
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 5cee949..ec0d5d5 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -5237,6 +5237,13 @@ void Server::handle_client_rename(MDRequest *mdr)
       if (desttrace[i]->is_auth() && desttrace[i]->is_projected())
 	  xlocks.insert(&desttrace[i]->versionlock);
     }
+    // xlock srci and oldin's primary dentries, so witnesses can call
+    // open_remote_ino() with 'want_xlocked=true' when the srcdn or destdn
+    // is traversed.
+    if (srcdnl->is_remote())
+      xlocks.insert(&srci->get_projected_parent_dn()->lock);
+    if (destdnl->is_remote())
+      xlocks.insert(&oldin->get_projected_parent_dn()->lock);
   }
 
   // we need to update srci's ctime.  xlock its least contended lock to do that...
-- 
1.7.11.7



* [PATCH 16/16] mds: use rdlock_try() when checking NULL dentry
From: Yan, Zheng @ 2012-11-19  2:43 UTC
  To: ceph-devel, sage; +Cc: Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

Use rdlock_try() instead of can_read() when path_traverse() encounters
a NULL dentry. This can partly avoid waiting forever for the dentry to
become readable when the dentry is a replica.

Strictly speaking, using rdlock_try() is still not enough, because the
auth MDS may drop the REQRDLOCK message in some cases.
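
A loose sketch of the difference (toy lock, hypothetical helper):
can_read() is a pure test, while rdlock_try() also prods the auth MDS
so a replica's unreadable lock can make progress.

  struct Lock {
    bool readable = false;
    bool auth     = false;   // is this MDS the lock's authority?
  };

  void request_rdlock_from_auth(Lock &) { /* send REQRDLOCK (stubbed) */ }

  bool rdlock_try(Lock &l) {
    if (l.readable)
      return true;
    if (!l.auth)
      request_rdlock_from_auth(l);  // ask the auth MDS to unlock reads
    return false;                   // caller queues a WAIT_RD waiter
  }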

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/MDCache.cc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 9cddfeb..7152b5e 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -6707,11 +6707,11 @@ int MDCache::path_traverse(MDRequest *mdr, Message *req, Context *fin,     // wh
     
     // can we conclude ENOENT?
     if (dnl && dnl->is_null()) {
-      if (dn->lock.can_read(client)) {
-        dout(12) << "traverse: miss on null+readable dentry " << path[depth] << " " << *dn << dendl;
+      if (mds->locker->rdlock_try(&dn->lock, client, NULL)) {
+        dout(10) << "traverse: miss on null+readable dentry " << path[depth] << " " << *dn << dendl;
         return -ENOENT;
       } else {
-        dout(12) << "miss on dentry " << *dn << ", can't read due to lock" << dendl;
+        dout(10) << "miss on dentry " << *dn << ", can't read due to lock" << dendl;
         dn->lock.add_waiter(SimpleLock::WAIT_RD, _get_waiter(mdr, req, fin));
         return 1;
       }
-- 
1.7.11.7


