* [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
@ 2012-02-01 15:54 Jim Schutt
  2012-02-01 15:54 ` [RFC PATCH 1/6] msgr: print message sequence number and tid when receiving message envelope Jim Schutt
                   ` (6 more replies)
  0 siblings, 7 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-01 15:54 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

Hi,

FWIW, I've been trying to understand op delays under very heavy write
load, and have been working a little with the policy throttler in hopes of
using throttling delays to help track down which ops were backing up.
Without much success, unfortunately.

When I saw the wip-osd-op-tracking branch, I wondered if any of this
stuff might be helpful.  Here it is, just in case.

-- Jim

Jim Schutt (6):
  msgr: print message sequence number and tid when receiving message
    envelope
  common/Throttle: track sleep/wake sequences in Throttle, report them
    for policy throttler
  common/Throttle: throttle in FIFO order
  common/Throttle: FIFO throttler doesn't need to signal waiters when
    max changes
  common/Throttle: make get() report number of waiters on entry/exit
  msg: log Message interactions with throttler

 src/common/Throttle.h      |   75 +++++++++++++++++++++++++++++++-------------
 src/msg/Message.h          |   71 +++++++++++++++++++++++++++++++++++------
 src/msg/SimpleMessenger.cc |   22 +++++++++---
 3 files changed, 129 insertions(+), 39 deletions(-)




* [RFC PATCH 1/6] msgr: print message sequence number and tid when receiving message envelope
  2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
@ 2012-02-01 15:54 ` Jim Schutt
  2012-02-01 15:54 ` [RFC PATCH 2/6] common/Throttle: track sleep/wake sequences in Throttle, report them for policy throttler Jim Schutt
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-01 15:54 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

This simplifies post-processing logs to discover how long messages
are delayed waiting in the policy throttler.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 src/msg/SimpleMessenger.cc |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/src/msg/SimpleMessenger.cc b/src/msg/SimpleMessenger.cc
index 246fa3e..952ed7c 100644
--- a/src/msg/SimpleMessenger.cc
+++ b/src/msg/SimpleMessenger.cc
@@ -1873,6 +1873,8 @@ int SimpleMessenger::Pipe::read_message(Message **pm)
 
   ldout(msgr->cct,20) << "reader got envelope type=" << header.type
            << " src " << entity_name_t(header.src)
+	   << " seq=" << header.seq
+	   << " tid=" << header.tid
            << " front=" << header.front_len
 	   << " data=" << header.data_len
 	   << " off " << header.data_off
-- 
1.7.1




* [RFC PATCH 2/6] common/Throttle: track sleep/wake sequences in Throttle, report them for policy throttler
  2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
  2012-02-01 15:54 ` [RFC PATCH 1/6] msgr: print message sequence number and tid when receiving message envelope Jim Schutt
@ 2012-02-01 15:54 ` Jim Schutt
  2012-02-01 15:54 ` [RFC PATCH 3/6] common/Throttle: throttle in FIFO order Jim Schutt
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-01 15:54 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

This simplifies post-processing logs to discover to what extent message
wait time in the policy throttler is due to thread wakeup order.

Use get()/wait() arguments rather than an accessor function to minimize
confusion from inconsistent reporting caused by racing on the Throttle mutex.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 src/common/Throttle.h      |   24 ++++++++++++++++++++----
 src/msg/SimpleMessenger.cc |    6 +++++-
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/src/common/Throttle.h b/src/common/Throttle.h
index f13fde0..10560bf 100644
--- a/src/common/Throttle.h
+++ b/src/common/Throttle.h
@@ -6,11 +6,12 @@
 
 class Throttle {
   int64_t count, max, waiting;
+  uint64_t sseq, wseq;
   Mutex lock;
   Cond cond;
   
 public:
-  Throttle(int64_t m = 0) : count(0), max(m), waiting(0),
+  Throttle(int64_t m = 0) : count(0), max(m), waiting(0), sseq(0), wseq(0),
 			  lock("Throttle::lock") {
     assert(m >= 0);
   }
@@ -52,13 +53,21 @@ public:
 
   int64_t get_max() { return max; }
 
-  bool wait(int64_t m = 0) {
+  bool wait(int64_t m = 0,
+	    uint64_t *sleep_seq = NULL, uint64_t *wake_seq = NULL) {
     Mutex::Locker l(lock);
+    sseq++;
+    if (sleep_seq)
+      *sleep_seq = sseq;
     if (m) {
       assert(m > 0);
       _reset_max(m);
     }
-    return _wait(0);
+    bool r = _wait(0);
+    wseq++;
+    if (wake_seq)
+      *wake_seq = wseq;
+    return r;
   }
 
   int64_t take(int64_t c = 1) {
@@ -68,15 +77,22 @@ public:
     return count;
   }
 
-  void get(int64_t c = 1, int64_t m = 0) {
+  void get(int64_t c = 1, int64_t m = 0,
+	   uint64_t *sleep_seq = NULL, uint64_t *wake_seq = NULL) {
     assert(c >= 0);
     Mutex::Locker l(lock);
+    sseq++;
+    if (sleep_seq)
+      *sleep_seq = sseq;
     if (m) {
       assert(m > 0);
       _reset_max(m);
     }
     _wait(c);
     count += c;
+    wseq++;
+    if (wake_seq)
+      *wake_seq = wseq;
   }
 
   /* Returns true if it successfully got the requested amount,
diff --git a/src/msg/SimpleMessenger.cc b/src/msg/SimpleMessenger.cc
index 952ed7c..259d3b7 100644
--- a/src/msg/SimpleMessenger.cc
+++ b/src/msg/SimpleMessenger.cc
@@ -1898,7 +1898,11 @@ int SimpleMessenger::Pipe::read_message(Message **pm)
       ldout(msgr->cct,10) << "reader wants " << message_size << " from policy throttler "
 	       << policy.throttler->get_current() << "/"
 	       << policy.throttler->get_max() << dendl;
-      policy.throttler->get(message_size);
+      uint64_t sseq, wseq;
+      policy.throttler->get(message_size, 0, &sseq, &wseq);
+      ldout(msgr->cct,10) << "reader got " << message_size << " from policy throttler "
+	     <<  policy.throttler->get_current() << "/" << policy.throttler->get_max()
+	     << " " << sseq << "/" << wseq << dendl;
     }
 
     // throttle total bytes waiting for dispatch.  do this _after_ the
-- 
1.7.1




* [RFC PATCH 3/6] common/Throttle: throttle in FIFO order
  2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
  2012-02-01 15:54 ` [RFC PATCH 1/6] msgr: print message sequence number and tid when receiving message envelope Jim Schutt
  2012-02-01 15:54 ` [RFC PATCH 2/6] common/Throttle: track sleep/wake sequences in Throttle, report them for policy throttler Jim Schutt
@ 2012-02-01 15:54 ` Jim Schutt
  2012-02-02 17:53   ` Gregory Farnum
  2012-02-01 15:54 ` [RFC PATCH 4/6] common/Throttle: FIFO throttler doesn't need to signal waiters when max changes Jim Schutt
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 47+ messages in thread
From: Jim Schutt @ 2012-02-01 15:54 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

Under heavy write load from many clients, many reader threads will
be waiting in the policy throttler, all on a single condition variable.
When a wakeup is signalled, any of those threads may receive the
signal.  This increases the variance in the message processing
latency, and in extreme cases can significantly delay a message.

This patch causes threads to exit a throttler in the same order
they entered.
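
A minimal standalone sketch of the same FIFO idea, using plain C++11
primitives instead of Ceph's Mutex/Cond (class and member names here are
illustrative only, not the patch itself):

  // FIFO throttle sketch: each waiter gets its own condition variable,
  // queued in arrival order; only the front waiter is ever woken.
  #include <condition_variable>
  #include <cstdint>
  #include <list>
  #include <mutex>

  class FifoThrottleSketch {
    int64_t count = 0, max;
    std::mutex lock;
    std::list<std::condition_variable*> waiters;   // oldest waiter at the front

    bool should_wait(int64_t c) const {
      return max &&
        ((c < max && count + c > max) ||   // normally stay under max
         (c >= max && count > max));       // except for large c
    }

  public:
    explicit FifoThrottleSketch(int64_t m) : max(m) {}

    void get(int64_t c) {
      std::unique_lock<std::mutex> l(lock);
      if (should_wait(c) || !waiters.empty()) {    // always queue behind older waiters
        std::condition_variable cv;
        waiters.push_back(&cv);
        // Sleep until this thread is the oldest waiter and there is room.
        cv.wait(l, [&] { return &cv == waiters.front() && !should_wait(c); });
        waiters.pop_front();
        if (!waiters.empty())
          waiters.front()->notify_one();           // wake the next-oldest waiter
      }
      count += c;
    }

    void put(int64_t c) {
      std::lock_guard<std::mutex> l(lock);
      count -= c;
      if (!waiters.empty())
        waiters.front()->notify_one();             // FIFO: only the oldest is woken
    }
  };

With a single shared Cond any waiter can win the race for a signal; with one
Cond per waiter only the front of the queue is ever signalled, which is what
bounds the tail of the wait-time distribution.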

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 src/common/Throttle.h |   42 ++++++++++++++++++++++++++++--------------
 1 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/src/common/Throttle.h b/src/common/Throttle.h
index 10560bf..ca72060 100644
--- a/src/common/Throttle.h
+++ b/src/common/Throttle.h
@@ -3,23 +3,31 @@
 
 #include "Mutex.h"
 #include "Cond.h"
+#include <list>
 
 class Throttle {
-  int64_t count, max, waiting;
+  int64_t count, max;
   uint64_t sseq, wseq;
   Mutex lock;
-  Cond cond;
+  list<Cond*> cond;
   
 public:
-  Throttle(int64_t m = 0) : count(0), max(m), waiting(0), sseq(0), wseq(0),
+  Throttle(int64_t m = 0) : count(0), max(m), sseq(0), wseq(0),
 			  lock("Throttle::lock") {
     assert(m >= 0);
   }
+  ~Throttle() {
+    while (!cond.empty()) {
+      Cond *cv = cond.front();
+      delete cv;
+      cond.pop_front();
+    }
+  }
 
 private:
   void _reset_max(int64_t m) {
-    if (m < max)
-      cond.SignalOne();
+    if (m < max && !cond.empty())
+      cond.front()->SignalOne();
     max = m;
   }
   bool _should_wait(int64_t c) {
@@ -28,19 +36,24 @@ private:
       ((c < max && count + c > max) ||   // normally stay under max
        (c >= max && count > max));       // except for large c
   }
+
   bool _wait(int64_t c) {
     bool waited = false;
-    if (_should_wait(c)) {
-      waiting += c;
+    if (_should_wait(c) || !cond.empty()) { // always wait behind other waiters.
+      Cond *cv = new Cond;
+      cond.push_back(cv);
       do {
+        if (cv != cond.front())
+          cond.front()->SignalOne();  // wake up the oldest.
 	waited = true;
-	cond.Wait(lock);
-      } while (_should_wait(c));
-      waiting -= c;
+        cv->Wait(lock);
+      } while (_should_wait(c) || cv != cond.front());
+      delete cv;
+      cond.pop_front();
 
       // wake up the next guy
-      if (waiting)
-	cond.SignalOne();
+      if (!cond.empty())
+        cond.front()->SignalOne();
     }
     return waited;
   }
@@ -101,7 +114,7 @@ public:
   bool get_or_fail(int64_t c = 1) {
     assert (c >= 0);
     Mutex::Locker l(lock);
-    if (_should_wait(c)) return false;
+    if (_should_wait(c) || !cond.empty()) return false;
     count += c;
     return true;
   }
@@ -110,7 +123,8 @@ public:
     assert(c >= 0);
     Mutex::Locker l(lock);
     if (c) {
-      cond.SignalOne();
+      if (!cond.empty())
+        cond.front()->SignalOne();
       count -= c;
       assert(count >= 0); //if count goes negative, we failed somewhere!
     }
-- 
1.7.1




* [RFC PATCH 4/6] common/Throttle: FIFO throttler doesn't need to signal waiters when max changes
  2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
                   ` (2 preceding siblings ...)
  2012-02-01 15:54 ` [RFC PATCH 3/6] common/Throttle: throttle in FIFO order Jim Schutt
@ 2012-02-01 15:54 ` Jim Schutt
  2012-02-01 15:54 ` [RFC PATCH 5/6] common/Throttle: make get() report number of waiters on entry/exit Jim Schutt
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-01 15:54 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

Only wait() and get() change the throttle max.

If there are no current waiters, the thread calling wait()/get()
doesn't need to signal itself.

If there are current waiters, the thread calling wait()/get() will
signal the oldest waiter in _wait() before sleeping, so the oldest
waiter always sees the new throttle limit immediately.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 src/common/Throttle.h |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/src/common/Throttle.h b/src/common/Throttle.h
index ca72060..2f74f4a 100644
--- a/src/common/Throttle.h
+++ b/src/common/Throttle.h
@@ -25,11 +25,6 @@ public:
   }
 
 private:
-  void _reset_max(int64_t m) {
-    if (m < max && !cond.empty())
-      cond.front()->SignalOne();
-    max = m;
-  }
   bool _should_wait(int64_t c) {
     return
       max &&
@@ -74,7 +69,7 @@ public:
       *sleep_seq = sseq;
     if (m) {
       assert(m > 0);
-      _reset_max(m);
+      max = m;
     }
     bool r = _wait(0);
     wseq++;
@@ -99,7 +94,7 @@ public:
       *sleep_seq = sseq;
     if (m) {
       assert(m > 0);
-      _reset_max(m);
+      max = m;
     }
     _wait(c);
     count += c;
-- 
1.7.1




* [RFC PATCH 5/6] common/Throttle: make get() report number of waiters on entry/exit
  2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
                   ` (3 preceding siblings ...)
  2012-02-01 15:54 ` [RFC PATCH 4/6] common/Throttle: FIFO throttler doesn't need to signal waiters when max changes Jim Schutt
@ 2012-02-01 15:54 ` Jim Schutt
  2012-02-01 15:54 ` [RFC PATCH 6/6] msg: log Message interactions with throttler Jim Schutt
  2012-02-01 22:33 ` [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Gregory Farnum
  6 siblings, 0 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-01 15:54 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

Use get() arguments rather than an accessor function to minimize confusion
from inconsistent reporting caused by racing on the Throttle mutex.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 src/common/Throttle.h      |    7 ++++++-
 src/msg/SimpleMessenger.cc |    5 +++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/src/common/Throttle.h b/src/common/Throttle.h
index 2f74f4a..dd74730 100644
--- a/src/common/Throttle.h
+++ b/src/common/Throttle.h
@@ -86,12 +86,15 @@ public:
   }
 
   void get(int64_t c = 1, int64_t m = 0,
-	   uint64_t *sleep_seq = NULL, uint64_t *wake_seq = NULL) {
+	   uint64_t *sleep_seq = NULL, uint64_t *wake_seq = NULL,
+	   int *sleep_waiters = NULL, int *wake_waiters = NULL) {
     assert(c >= 0);
     Mutex::Locker l(lock);
     sseq++;
     if (sleep_seq)
       *sleep_seq = sseq;
+    if (sleep_waiters)
+      *sleep_waiters = cond.size();
     if (m) {
       assert(m > 0);
       max = m;
@@ -101,6 +104,8 @@ public:
     wseq++;
     if (wake_seq)
       *wake_seq = wseq;
+    if (wake_waiters)
+      *wake_waiters = cond.size();
   }
 
   /* Returns true if it successfully got the requested amount,
diff --git a/src/msg/SimpleMessenger.cc b/src/msg/SimpleMessenger.cc
index 259d3b7..3167749 100644
--- a/src/msg/SimpleMessenger.cc
+++ b/src/msg/SimpleMessenger.cc
@@ -1899,10 +1899,11 @@ int SimpleMessenger::Pipe::read_message(Message **pm)
 	       << policy.throttler->get_current() << "/"
 	       << policy.throttler->get_max() << dendl;
       uint64_t sseq, wseq;
-      policy.throttler->get(message_size, 0, &sseq, &wseq);
+      int swait, wwait;
+      policy.throttler->get(message_size, 0, &sseq, &wseq, &swait, &wwait);
       ldout(msgr->cct,10) << "reader got " << message_size << " from policy throttler "
 	     <<  policy.throttler->get_current() << "/" << policy.throttler->get_max()
-	     << " " << sseq << "/" << wseq << dendl;
+	     << " seq " << sseq << "/" << wseq  << " waiters " << swait << "/" << wwait << dendl;
     }
 
     // throttle total bytes waiting for dispatch.  do this _after_ the
-- 
1.7.1




* [RFC PATCH 6/6] msg: log Message interactions with throttler
  2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
                   ` (4 preceding siblings ...)
  2012-02-01 15:54 ` [RFC PATCH 5/6] common/Throttle: make get() report number of waiters on entry/exit Jim Schutt
@ 2012-02-01 15:54 ` Jim Schutt
  2012-02-01 22:33 ` [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Gregory Farnum
  6 siblings, 0 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-01 15:54 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

Also, fix a logging race on the policy throttler count in SimpleMessenger.cc,
so that throttler logging is always consistent.

Finally, log the message source on policy throttler messages, to make it
simpler to extract all the throttler log entries for a single message.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 src/common/Throttle.h      |    7 ++--
 src/msg/Message.h          |   71 +++++++++++++++++++++++++++++++++++++-------
 src/msg/SimpleMessenger.cc |   19 +++++++-----
 3 files changed, 75 insertions(+), 22 deletions(-)

diff --git a/src/common/Throttle.h b/src/common/Throttle.h
index dd74730..56ecb08 100644
--- a/src/common/Throttle.h
+++ b/src/common/Throttle.h
@@ -85,9 +85,9 @@ public:
     return count;
   }
 
-  void get(int64_t c = 1, int64_t m = 0,
-	   uint64_t *sleep_seq = NULL, uint64_t *wake_seq = NULL,
-	   int *sleep_waiters = NULL, int *wake_waiters = NULL) {
+  int64_t get(int64_t c = 1, int64_t m = 0,
+	      uint64_t *sleep_seq = NULL, uint64_t *wake_seq = NULL,
+	      int *sleep_waiters = NULL, int *wake_waiters = NULL) {
     assert(c >= 0);
     Mutex::Locker l(lock);
     sseq++;
@@ -106,6 +106,7 @@ public:
       *wake_seq = wseq;
     if (wake_waiters)
       *wake_waiters = cond.size();
+    return count;
   }
 
   /* Returns true if it successfully got the requested amount,
diff --git a/src/msg/Message.h b/src/msg/Message.h
index f37a884..fdcb930 100644
--- a/src/msg/Message.h
+++ b/src/msg/Message.h
@@ -313,8 +313,12 @@ protected:
     assert(nref.read() == 0);
     if (connection)
       connection->put();
-    if (throttler)
-      throttler->put(payload.length() + middle.length() + data.length());
+    if (throttler) {
+      unsigned dlen = payload.length() + middle.length() + data.length();
+      int64_t tcnt = throttler->put(dlen);
+      generic_dout(1) << "~Message() on " << this << " returned " << dlen
+		      << " to throttler " << tcnt << "/" << throttler->get_max() << dendl;
+    }
   }
 public:
   Connection *get_connection() { return connection; }
@@ -342,39 +346,84 @@ public:
    */
 
   void clear_payload() {
-    if (throttler) throttler->put(payload.length() + middle.length());
+    if (throttler) {
+      unsigned dlen = payload.length() + middle.length();
+      int64_t tcnt = throttler->put(dlen);
+      generic_dout(1) << "clear_payload() on " << this << " returned " << dlen
+		      << " to throttler " << tcnt << "/" << throttler->get_max() << dendl;
+    }
     payload.clear();
     middle.clear();
   }
   void clear_data() {
-    if (throttler) throttler->put(data.length());
+    if (throttler) {
+      unsigned dlen = data.length();
+      int64_t tcnt = throttler->put(dlen);
+      generic_dout(1) << "clear_data() on " << this << " returned " << dlen
+		      << " to throttler " << tcnt << "/" << throttler->get_max() << dendl;
+    }
     data.clear();
   }
 
   bool empty_payload() { return payload.length() == 0; }
   bufferlist& get_payload() { return payload; }
   void set_payload(bufferlist& bl) {
-    if (throttler) throttler->put(payload.length());
+    if (throttler) {
+      unsigned dlen = payload.length();
+      int64_t tcnt = throttler->put(dlen);
+      generic_dout(1) << "set_payload() on " << this << " returned " << dlen
+		      << " to throttler " << tcnt << "/" << throttler->get_max() << dendl;
+    }
     payload.claim(bl);
-    if (throttler) throttler->take(payload.length());
+    if (throttler) {
+      unsigned dlen = payload.length();
+      int64_t tcnt = throttler->take(dlen);
+      generic_dout(1) << "set_payload() on " << this << " took " << dlen
+		      << " from throttler " << tcnt << "/" << throttler->get_max() << dendl;
+    }
   }
 
   void set_middle(bufferlist& bl) {
-    if (throttler) throttler->put(payload.length());
+    if (throttler) {
+      unsigned dlen = payload.length();
+      int64_t tcnt = throttler->put(dlen);
+      generic_dout(1) << "set_middle() on " << this << " returned " << dlen
+		      << " to throttler " << tcnt << "/" << throttler->get_max() << dendl;
+    }
     middle.claim(bl);
-    if (throttler) throttler->take(payload.length());
+    if (throttler) {
+      unsigned dlen = payload.length();
+      int64_t tcnt = throttler->take(dlen);
+      generic_dout(1) << "set_middle() on " << this << " took " << dlen
+		      << " from throttler " << tcnt << "/" << throttler->get_max() << dendl;
+    }
   }
   bufferlist& get_middle() { return middle; }
 
   void set_data(const bufferlist &d) {
-    if (throttler) throttler->put(data.length());
+    if (throttler) {
+      unsigned dlen = data.length();
+      int64_t tcnt = throttler->put(dlen);
+      generic_dout(1) << "set_data() on " << this << " returned " << dlen
+		      << " to throttler " << tcnt << "/" << throttler->get_max() << dendl;
+    }
     data = d;
-    if (throttler) throttler->take(data.length());
+    if (throttler) {
+      unsigned dlen = data.length();
+      int64_t tcnt = throttler->take(dlen);
+      generic_dout(1) << "set_data() on " << this << " took " << dlen
+		      << " from throttler " << tcnt << "/" << throttler->get_max() << dendl;
+    }
   }
 
   bufferlist& get_data() { return data; }
   void claim_data(bufferlist& bl) {
-    if (throttler) throttler->put(data.length());
+    if (throttler) {
+      unsigned dlen = data.length();
+      int64_t tcnt = throttler->put(dlen);
+      generic_dout(1) << "claim_data() on " << this << " returned " << dlen
+		      << " to throttler " << tcnt << "/" << throttler->get_max() << dendl;
+    }
     bl.claim(data);
   }
   off_t get_data_len() { return data.length(); }
diff --git a/src/msg/SimpleMessenger.cc b/src/msg/SimpleMessenger.cc
index 3167749..d7bbe7b 100644
--- a/src/msg/SimpleMessenger.cc
+++ b/src/msg/SimpleMessenger.cc
@@ -1897,13 +1897,15 @@ int SimpleMessenger::Pipe::read_message(Message **pm)
     if (policy.throttler) {
       ldout(msgr->cct,10) << "reader wants " << message_size << " from policy throttler "
 	       << policy.throttler->get_current() << "/"
-	       << policy.throttler->get_max() << dendl;
+	       << policy.throttler->get_max() 
+	       << " for src " << entity_name_t(header.src) << " tid=" << header.tid << dendl;
       uint64_t sseq, wseq;
       int swait, wwait;
-      policy.throttler->get(message_size, 0, &sseq, &wseq, &swait, &wwait);
+      int64_t tcnt = policy.throttler->get(message_size, 0, &sseq, &wseq, &swait, &wwait);
       ldout(msgr->cct,10) << "reader got " << message_size << " from policy throttler "
-	     <<  policy.throttler->get_current() << "/" << policy.throttler->get_max()
-	     << " seq " << sseq << "/" << wseq  << " waiters " << swait << "/" << wwait << dendl;
+	     << tcnt << "/" << policy.throttler->get_max()
+	     << " seq " << sseq << "/" << wseq  << " waiters " << swait << "/" << wwait
+	     << " for src " << entity_name_t(header.src) << " tid=" << header.tid << dendl;
     }
 
     // throttle total bytes waiting for dispatch.  do this _after_ the
@@ -2028,10 +2030,11 @@ int SimpleMessenger::Pipe::read_message(Message **pm)
   // release bytes reserved from the throttlers on failure
   if (message_size) {
     if (policy.throttler) {
-      ldout(msgr->cct,10) << "reader releasing " << message_size << " to policy throttler "
-	       << policy.throttler->get_current() << "/"
-	       << policy.throttler->get_max() << dendl;
-      policy.throttler->put(message_size);
+      int64_t tcnt = policy.throttler->put(message_size);
+      ldout(msgr->cct,10) << "reader returned " << message_size << " to policy throttler "
+	       << tcnt << "/" << policy.throttler->get_max()
+     	       << " for src " << entity_name_t(header.src) << " tid=" << header.tid << dendl;
+
     }
 
     msgr->dispatch_throttle_release(message_size);
-- 
1.7.1




* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
                   ` (5 preceding siblings ...)
  2012-02-01 15:54 ` [RFC PATCH 6/6] msg: log Message interactions with throttler Jim Schutt
@ 2012-02-01 22:33 ` Gregory Farnum
  2012-02-02 15:38   ` Jim Schutt
       [not found]   ` <4F29CDAA.408@sandia.gov>
  6 siblings, 2 replies; 47+ messages in thread
From: Gregory Farnum @ 2012-02-01 22:33 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Wed, Feb 1, 2012 at 7:54 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> Hi,
>
> FWIW, I've been trying to understand op delays under very heavy write
> load, and have been working a little with the policy throttler in hopes of
> using throttling delays to help track down which ops were backing up.
> Without much success, unfortunately.
>
> When I saw the wip-osd-op-tracking branch, I wondered if any of this
> stuff might be helpful.  Here it is, just in case.

In general these patches are dumping information to the logs, and part
of the wip-osd-op-tracking branch is actually keeping track of most of
the message queueing wait times as part of the message itself
(although not the information about number of waiters and sleep/wake
seqs). I'm inclined to prefer that approach to log dumping.
Are there any patches you recommend for merging? I'm a little curious
about the ordered wakeup one — do you have data about when that's a
problem?
-Greg


* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-01 22:33 ` [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Gregory Farnum
@ 2012-02-02 15:38   ` Jim Schutt
       [not found]   ` <4F29CDAA.408@sandia.gov>
  1 sibling, 0 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-02 15:38 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

(resent because I forgot the list on my original reply)

On 02/01/2012 03:33 PM, Gregory Farnum wrote:
> On Wed, Feb 1, 2012 at 7:54 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>> Hi,
>>
>> FWIW, I've been trying to understand op delays under very heavy write
>> load, and have been working a little with the policy throttler in hopes of
>> using throttling delays to help track down which ops were backing up.
>> Without much success, unfortunately.
>>
>> When I saw the wip-osd-op-tracking branch, I wondered if any of this
>> stuff might be helpful.  Here it is, just in case.
>
> In general these patches are dumping information to the logs, and part
> of the wip-osd-op-tracking branch is actually keeping track of most of
> the message queueing wait times as part of the message itself
> (although not the information about number of waiters and sleep/wake
> seqs). I'm inclined to prefer that approach to log dumping.

I agree - I've just been using log dumping because I can extract
any relationships I can write a perl script to find :)  So far,
not too helpful.

> Are there any patches you recommend for merging? I'm a little curious
> about the ordered wakeup one — do you have data about when that's a
> problem?

I've been trying to push the client:osd ratio, and in my testbed
I can run up to 166 linux clients. Right now I'm running them
against 48 OSDs.  The clients are 1 Gb/s ethernet, and the OSDs
have a 10 Gb/s ethernet for clients and another for the cluster.

During sustained write loads I see a factor of 10 oscillation
in aggregate throughput, and during that time I see clients
stuck in the policy throttler for hundreds of seconds, and I
see a number of waiters equal to
   (number of clients) - (throttler limit) / (msg size)
If I do a histogram of throttler wait times I see a handful of
messages that wait for an extra couple hundred seconds
without the ordered wakeup.

I'm not sure what this will look like if my throughput
variations can be fixed.  But, for our HPC loads I expect
we'll often see periods where offered load is much higher
than the aggregate bandwidth of any system we can afford to
build, so ordered wakeup may be useful in such cases for
client fairness.

So I'd recommend the ordered wakeup patch if you don't
see any downsides.

Sorry for the noise on the others - mostly I just wanted
to share the sort of things I've been looking at.  I'll
be learning to use your new stuff soon...

-- Jim

> -Greg
>
>




* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
       [not found]       ` <4F2AABF5.6050803@sandia.gov>
@ 2012-02-02 17:52         ` Gregory Farnum
  2012-02-02 19:06           ` [EXTERNAL] " Jim Schutt
  2012-02-24 15:38           ` Jim Schutt
  0 siblings, 2 replies; 47+ messages in thread
From: Gregory Farnum @ 2012-02-02 17:52 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
> per OSD.  During a test I watch both OSD servers with both
> vmstat and iostat.
>
> During a "good" period, vmstat says the server is sustaining > 2 GB/s
> for multiple tens of seconds.  Since I use replication factor 2, that
> means that server is sustaining > 500 MB/s aggregate client throughput,
> right?  During such a period vmstat also reports ~10% CPU idle.
>
> During a "bad" period, vmstat says the server is doing ~200 MB/s,
> with lots of idle cycles.  It is during these periods that
> messages stuck in the policy throttler build up such long
> wait times.  Sometimes I see really bad periods with aggregate
> throughput per server < 100 MB/s.
>
> The typical pattern I see is that a run starts with tens of seconds
> of aggregate throughput > 2 GB/s.  Then it drops and bounces around
> 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
> it ramps back up near 2 GB/s again.

Hmm. 100MB/s is awfully low for this theory, but have you tried to
correlate the drops in throughput with the OSD journals running out of
space? I assume from your setup that they're sharing the disk with the
store (although it works either way), and your description makes me
think that throughput is initially constrained by sequential journal
writes but then the journal runs out of space and the OSD has to wait
for the main store to catch up (with random IO), and that sends the IO
patterns all to hell. (If you can say that random 4MB IOs are
hellish.)
I'm also curious about memory usage as a possible explanation for the
more dramatic drops.
-Greg


* Re: [RFC PATCH 3/6] common/Throttle: throttle in FIFO order
  2012-02-01 15:54 ` [RFC PATCH 3/6] common/Throttle: throttle in FIFO order Jim Schutt
@ 2012-02-02 17:53   ` Gregory Farnum
  2012-02-02 18:31     ` Jim Schutt
  0 siblings, 1 reply; 47+ messages in thread
From: Gregory Farnum @ 2012-02-02 17:53 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

I went to merge this but then had a question on part of it (below).

On Wed, Feb 1, 2012 at 7:54 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> Under heavy write load from many clients, many reader threads will
> be waiting in the policy throttler, all on a single condition variable.
> When a wakeup is signalled, any of those threads may receive the
> signal.  This increases the variance in the message processing
> latency, and in extreme cases can significantly delay a message.
>
> This patch causes threads to exit a throttler in the same order
> they entered.
>
> Signed-off-by: Jim Schutt <jaschut@sandia.gov>
> ---
>  src/common/Throttle.h |   42 ++++++++++++++++++++++++++++--------------
>  1 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/src/common/Throttle.h b/src/common/Throttle.h
> index 10560bf..ca72060 100644
> --- a/src/common/Throttle.h
> +++ b/src/common/Throttle.h
> @@ -3,23 +3,31 @@
>
>  #include "Mutex.h"
>  #include "Cond.h"
> +#include <list>
>
>  class Throttle {
> -  int64_t count, max, waiting;
> +  int64_t count, max;
>   uint64_t sseq, wseq;
>   Mutex lock;
> -  Cond cond;
> +  list<Cond*> cond;
>
>  public:
> -  Throttle(int64_t m = 0) : count(0), max(m), waiting(0), sseq(0), wseq(0),
> +  Throttle(int64_t m = 0) : count(0), max(m), sseq(0), wseq(0),
>                          lock("Throttle::lock") {
>     assert(m >= 0);
>   }
> +  ~Throttle() {
> +    while (!cond.empty()) {
> +      Cond *cv = cond.front();
> +      delete cv;
> +      cond.pop_front();
> +    }
> +  }
>
>  private:
>   void _reset_max(int64_t m) {
> -    if (m < max)
> -      cond.SignalOne();
> +    if (m < max && !cond.empty())
> +      cond.front()->SignalOne();
>     max = m;
>   }
>   bool _should_wait(int64_t c) {
> @@ -28,19 +36,24 @@ private:
>       ((c < max && count + c > max) ||   // normally stay under max
>        (c >= max && count > max));       // except for large c
>   }
> +
>   bool _wait(int64_t c) {
>     bool waited = false;
> -    if (_should_wait(c)) {
> -      waiting += c;
> +    if (_should_wait(c) || !cond.empty()) { // always wait behind other waiters.
> +      Cond *cv = new Cond;
> +      cond.push_back(cv);
>       do {
> +        if (cv != cond.front())
> +          cond.front()->SignalOne();  // wake up the oldest.

What's this extra wakeup for? Unless I'm missing something it's always
going to be gratuitous. :/

>        waited = true;
> -       cond.Wait(lock);
> -      } while (_should_wait(c));
> -      waiting -= c;
> +        cv->Wait(lock);
> +      } while (_should_wait(c) || cv != cond.front());
> +      delete cv;
> +      cond.pop_front();
>
>       // wake up the next guy
> -      if (waiting)
> -       cond.SignalOne();
> +      if (!cond.empty())
> +        cond.front()->SignalOne();
>     }
>     return waited;
>   }
> @@ -101,7 +114,7 @@ public:
>   bool get_or_fail(int64_t c = 1) {
>     assert (c >= 0);
>     Mutex::Locker l(lock);
> -    if (_should_wait(c)) return false;
> +    if (_should_wait(c) || !cond.empty()) return false;
>     count += c;
>     return true;
>   }
> @@ -110,7 +123,8 @@ public:
>     assert(c >= 0);
>     Mutex::Locker l(lock);
>     if (c) {
> -      cond.SignalOne();
> +      if (!cond.empty())
> +        cond.front()->SignalOne();
>       count -= c;
>       assert(count >= 0); //if count goes negative, we failed somewhere!
>     }
> --
> 1.7.1
>
>


* Re: [RFC PATCH 3/6] common/Throttle: throttle in FIFO order
  2012-02-02 17:53   ` Gregory Farnum
@ 2012-02-02 18:31     ` Jim Schutt
  2012-02-02 19:01       ` Gregory Farnum
  0 siblings, 1 reply; 47+ messages in thread
From: Jim Schutt @ 2012-02-02 18:31 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On 02/02/2012 10:53 AM, Gregory Farnum wrote:
> I went to merge this but then had a question on part of it (below).
>
> On Wed, Feb 1, 2012 at 7:54 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>> Under heavy write load from many clients, many reader threads will
>> be waiting in the policy throttler, all on a single condition variable.
>> When a wakeup is signalled, any of those threads may receive the
>> signal.  This increases the variance in the message processing
>> latency, and in extreme cases can significantly delay a message.
>>
>> This patch causes threads to exit a throttler in the same order
>> they entered.
>>
>> Signed-off-by: Jim Schutt<jaschut@sandia.gov>
>> ---
>>   src/common/Throttle.h |   42 ++++++++++++++++++++++++++++--------------
>>   1 files changed, 28 insertions(+), 14 deletions(-)
>>
>> diff --git a/src/common/Throttle.h b/src/common/Throttle.h
>> index 10560bf..ca72060 100644
>> --- a/src/common/Throttle.h
>> +++ b/src/common/Throttle.h
>> @@ -3,23 +3,31 @@
>>
>>   #include "Mutex.h"
>>   #include "Cond.h"
>> +#include<list>
>>
>>   class Throttle {
>> -  int64_t count, max, waiting;
>> +  int64_t count, max;
>>    uint64_t sseq, wseq;
>>    Mutex lock;
>> -  Cond cond;
>> +  list<Cond*>  cond;
>>
>>   public:
>> -  Throttle(int64_t m = 0) : count(0), max(m), waiting(0), sseq(0), wseq(0),
>> +  Throttle(int64_t m = 0) : count(0), max(m), sseq(0), wseq(0),
>>                           lock("Throttle::lock") {
>>      assert(m>= 0);
>>    }
>> +  ~Throttle() {
>> +    while (!cond.empty()) {
>> +      Cond *cv = cond.front();
>> +      delete cv;
>> +      cond.pop_front();
>> +    }
>> +  }
>>
>>   private:
>>    void _reset_max(int64_t m) {
>> -    if (m<  max)
>> -      cond.SignalOne();
>> +    if (m<  max&&  !cond.empty())
>> +      cond.front()->SignalOne();
>>      max = m;
>>    }
>>    bool _should_wait(int64_t c) {
>> @@ -28,19 +36,24 @@ private:
>>        ((c<  max&&  count + c>  max) ||   // normally stay under max
>>         (c>= max&&  count>  max));       // except for large c
>>    }
>> +
>>    bool _wait(int64_t c) {
>>      bool waited = false;
>> -    if (_should_wait(c)) {
>> -      waiting += c;
>> +    if (_should_wait(c) || !cond.empty()) { // always wait behind other waiters.
>> +      Cond *cv = new Cond;
>> +      cond.push_back(cv);
>>        do {
>> +        if (cv != cond.front())
>> +          cond.front()->SignalOne();  // wake up the oldest.
>
> What's this extra wakeup for? Unless I'm missing something it's always
> going to be gratuitous. :/

I think it was a poorly thought-out attempt at
defensive programming.  Now that I'm thinking about
it harder, I agree it is gratuitous.

Thanks -- Jim

>
>>         waited = true;
>> -       cond.Wait(lock);
>> -      } while (_should_wait(c));
>> -      waiting -= c;
>> +        cv->Wait(lock);
>> +      } while (_should_wait(c) || cv != cond.front());
>> +      delete cv;
>> +      cond.pop_front();
>>
>>        // wake up the next guy
>> -      if (waiting)
>> -       cond.SignalOne();
>> +      if (!cond.empty())
>> +        cond.front()->SignalOne();
>>      }
>>      return waited;
>>    }
>> @@ -101,7 +114,7 @@ public:
>>    bool get_or_fail(int64_t c = 1) {
>>      assert (c>= 0);
>>      Mutex::Locker l(lock);
>> -    if (_should_wait(c)) return false;
>> +    if (_should_wait(c) || !cond.empty()) return false;
>>      count += c;
>>      return true;
>>    }
>> @@ -110,7 +123,8 @@ public:
>>      assert(c>= 0);
>>      Mutex::Locker l(lock);
>>      if (c) {
>> -      cond.SignalOne();
>> +      if (!cond.empty())
>> +        cond.front()->SignalOne();
>>        count -= c;
>>        assert(count>= 0); //if count goes negative, we failed somewhere!
>>      }
>> --
>> 1.7.1
>>
>>
>
>




* Re: [RFC PATCH 3/6] common/Throttle: throttle in FIFO order
  2012-02-02 18:31     ` Jim Schutt
@ 2012-02-02 19:01       ` Gregory Farnum
  0 siblings, 0 replies; 47+ messages in thread
From: Gregory Farnum @ 2012-02-02 19:01 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Thu, Feb 2, 2012 at 10:31 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> On 02/02/2012 10:53 AM, Gregory Farnum wrote:
>>
>> I went to merge this but then had a question on part of it (below).
>>
>> On Wed, Feb 1, 2012 at 7:54 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>>>
>>> Under heavy write load from many clients, many reader threads will
>>> be waiting in the policy throttler, all on a single condition variable.
>>> When a wakeup is signalled, any of those threads may receive the
>>> signal.  This increases the variance in the message processing
>>> latency, and in extreme cases can significantly delay a message.
>>>
>>> This patch causes threads to exit a throttler in the same order
>>> they entered.
>>>
>>> Signed-off-by: Jim Schutt<jaschut@sandia.gov>
>>> ---
>>>  src/common/Throttle.h |   42 ++++++++++++++++++++++++++++--------------
>>>  1 files changed, 28 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/src/common/Throttle.h b/src/common/Throttle.h
>>> index 10560bf..ca72060 100644
>>> --- a/src/common/Throttle.h
>>> +++ b/src/common/Throttle.h
>>> @@ -3,23 +3,31 @@
>>>
>>>  #include "Mutex.h"
>>>  #include "Cond.h"
>>> +#include<list>
>>>
>>>  class Throttle {
>>> -  int64_t count, max, waiting;
>>> +  int64_t count, max;
>>>   uint64_t sseq, wseq;
>>>   Mutex lock;
>>> -  Cond cond;
>>> +  list<Cond*>  cond;
>>>
>>>  public:
>>> -  Throttle(int64_t m = 0) : count(0), max(m), waiting(0), sseq(0),
>>> wseq(0),
>>> +  Throttle(int64_t m = 0) : count(0), max(m), sseq(0), wseq(0),
>>>                          lock("Throttle::lock") {
>>>     assert(m>= 0);
>>>   }
>>> +  ~Throttle() {
>>> +    while (!cond.empty()) {
>>> +      Cond *cv = cond.front();
>>> +      delete cv;
>>> +      cond.pop_front();
>>> +    }
>>> +  }
>>>
>>>  private:
>>>   void _reset_max(int64_t m) {
>>> -    if (m<  max)
>>> -      cond.SignalOne();
>>> +    if (m<  max&&  !cond.empty())
>>>
>>> +      cond.front()->SignalOne();
>>>     max = m;
>>>   }
>>>   bool _should_wait(int64_t c) {
>>> @@ -28,19 +36,24 @@ private:
>>>       ((c<  max&&  count + c>  max) ||   // normally stay under max
>>>        (c>= max&&  count>  max));       // except for large c
>>>
>>>   }
>>> +
>>>   bool _wait(int64_t c) {
>>>     bool waited = false;
>>> -    if (_should_wait(c)) {
>>> -      waiting += c;
>>> +    if (_should_wait(c) || !cond.empty()) { // always wait behind other
>>> waiters.
>>> +      Cond *cv = new Cond;
>>> +      cond.push_back(cv);
>>>       do {
>>> +        if (cv != cond.front())
>>> +          cond.front()->SignalOne();  // wake up the oldest.
>>
>>
>> What's this extra wakeup for? Unless I'm missing something it's always
>> going to be gratuitous. :/
>
>
> I think it was a poorly thought-out attempt at
> defensive programming.  Now that I'm thinking about
> it harder, I agree it is gratuitous.
>
> Thanks -- Jim

Awesome. Applied to master in commit:83432af2adce75676b734d2b99dd88372ede833a.
-Greg


* Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-02 17:52         ` Gregory Farnum
@ 2012-02-02 19:06           ` Jim Schutt
  2012-02-02 19:15             ` Sage Weil
  2012-02-02 19:32             ` Gregory Farnum
  2012-02-24 15:38           ` Jim Schutt
  1 sibling, 2 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-02 19:06 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On 02/02/2012 10:52 AM, Gregory Farnum wrote:
> On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>> I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
>> per OSD.  During a test I watch both OSD servers with both
>> vmstat and iostat.
>>
>> During a "good" period, vmstat says the server is sustaining>  2 GB/s
>> for multiple tens of seconds.  Since I use replication factor 2, that
>> means that server is sustaining>  500 MB/s aggregate client throughput,
>> right?  During such a period vmstat also reports ~10% CPU idle.
>>
>> During a "bad" period, vmstat says the server is doing ~200 MB/s,
>> with lots of idle cycles.  It is during these periods that
>> messages stuck in the policy throttler build up such long
>> wait times.  Sometimes I see really bad periods with aggregate
>> throughput per server<  100 MB/s.
>>
>> The typical pattern I see is that a run starts with tens of seconds
>> of aggregate throughput>  2 GB/s.  Then it drops and bounces around
>> 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
>> it ramps back up near 2 GB/s again.
>
> Hmm. 100MB/s is awfully low for this theory, but have you tried to
> correlate the drops in throughput with the OSD journals running out of
> space?

A spot check of logs from my last run doesn't seem to have any
"journal throttle: waited for" messages during a slowdown.
Is that what you mean?

During the fast part of a run I see lots of journal messages
with this pattern:

2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 22 ops and 90041106 bytes
2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 21 ops and 85845571 bytes

which I think means my journal is doing 50 MB/s, right?

> I assume from your setup that they're sharing the disk with the
> store (although it works either way),

I've got a 4 GB journal partition on the outer tracks of the disk.

> and your description makes me
> think that throughput is initially constrained by sequential journal
> writes but then the journal runs out of space and the OSD has to wait
> for the main store to catch up (with random IO), and that sends the IO
> patterns all to hell. (If you can say that random 4MB IOs are
> hellish.)

iostat 1 during the fast part of a run shows both journal and data
partitions running at 45-50 MB/s.  During the slow part of a run
they both show similar but low data rates.

> I'm also curious about memory usage as a possible explanation for the
> more dramatic drops.

My OSD servers have 48 GB memory.  During a run I rarely see less than
24 GB used by the page cache, with the rest mostly used by anonymous memory.
I don't run with any swap.

So far I'm looking at two behaviours I've noticed that seem anomalous to me.

One is that I instrumented ms_dispatch(), and I see it take
a half-second or more several hundred times, out of several
thousand messages.  Is that expected?
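
For illustration only, a wall-clock wrapper of the sort described might look
like the following (timed_dispatch and the 0.5 s threshold are hypothetical,
not the actual instrumentation):

  #include <chrono>
  #include <iostream>
  #include <thread>

  // Wrap a dispatch-style call and report it if it runs longer than expected.
  template <typename Fn>
  void timed_dispatch(const char *what, Fn fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn();                                    // the real dispatch work
    double dt = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
    if (dt > 0.5)                            // flag anything over half a second
      std::cerr << what << " took " << dt << " s\n";
  }

  int main() {
    timed_dispatch("ms_dispatch", [] {
      std::this_thread::sleep_for(std::chrono::milliseconds(600));  // stand-in work
    });
  }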

Another is that once a message receive starts, I see ~50 messages
that take tens of seconds to receive, when the nominal receive time is
a half-second or less.  I'm in the process of tooling up to collect
tcpdump data on all my clients to try to catch what is going on with that.
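
For illustration, such a capture could be something along these lines (the
interface name, snap length, and the default 6800+ OSD port range are
assumptions, not the actual command used):

  tcpdump -i eth0 -s 128 -w ceph-trace.pcap 'tcp portrange 6800-6900'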

Any other ideas on what to look for would be greatly appreciated.

-- Jim

> -Greg
>
>




* Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-02 19:06           ` [EXTERNAL] " Jim Schutt
@ 2012-02-02 19:15             ` Sage Weil
  2012-02-02 19:33               ` Jim Schutt
  2012-02-02 19:32             ` Gregory Farnum
  1 sibling, 1 reply; 47+ messages in thread
From: Sage Weil @ 2012-02-02 19:15 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Gregory Farnum, ceph-devel

On Thu, 2 Feb 2012, Jim Schutt wrote:
> On 02/02/2012 10:52 AM, Gregory Farnum wrote:
> > On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
> > > I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
> > > per OSD.  During a test I watch both OSD servers with both
> > > vmstat and iostat.
> > > 
> > > During a "good" period, vmstat says the server is sustaining>  2 GB/s
> > > for multiple tens of seconds.  Since I use replication factor 2, that
> > > means that server is sustaining>  500 MB/s aggregate client throughput,
> > > right?  During such a period vmstat also reports ~10% CPU idle.
> > > 
> > > During a "bad" period, vmstat says the server is doing ~200 MB/s,
> > > with lots of idle cycles.  It is during these periods that
> > > messages stuck in the policy throttler build up such long
> > > wait times.  Sometimes I see really bad periods with aggregate
> > > throughput per server<  100 MB/s.
> > > 
> > > The typical pattern I see is that a run starts with tens of seconds
> > > of aggregate throughput>  2 GB/s.  Then it drops and bounces around
> > > 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
> > > it ramps back up near 2 GB/s again.
> > 
> > Hmm. 100MB/s is awfully low for this theory, but have you tried to
> > correlate the drops in throughput with the OSD journals running out of
> > space?
> 
> A spot check of logs from my last run doesn't seem to have any
> "journal throttle: waited for" messages during a slowdown.
> Is that what you mean?
> 
> During the fast part of a run I see lots of journal messages
> with this pattern:
> 
> 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
> and 50346596 bytes, now 22 ops and 90041106 bytes
> 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
> 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
> and 50346596 bytes, now 21 ops and 85845571 bytes

It occurs to me that part of the problem may be the current sync 
io behavior in the journal.  It ends up doing really big writes, which 
makes things bursty, and will get stuff blocked up behind the throttler.  
You might try making 'journal max write bytes' smaller?  Hmm, although 
it's currently 10MB, which isn't too bad.  So unless you've changed it 
from the default, that's probably not it.
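
For illustration, a ceph.conf fragment for such an experiment might look like
this (the 5 MB value is an arbitrary example, not a recommendation):

  [osd]
      journal max write bytes = 5242880   # e.g. 5 MB instead of the 10 MB default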

> which I think means my journal is doing 50 MB/s, right?
> 
> > I assume from your setup that they're sharing the disk with the
> > store (although it works either way),
> 
> I've got a 4 GB journal partition on the outer tracks of the disk.

This is on the same disk as the osd data?  As an experiement, you could 
try putting the journal on a separate disk (half disks for journals, half 
for data).  That's obviously not what you want in the real world, but it 
would be interesting to see if contention for the spindle is responsible 
for this.

> So far I'm looking at two behaviours I've noticed that seem anomalous to me.
> 
> One is that I instrumented ms_dispatch(), and I see it take
> a half-second or more several hundred times, out of several
> thousand messages.  Is that expected?

I don't think so, but it could happen if there's cpu/memory contention, or in 
the case of osdmap updates where we block on io in that thread.

> Another is that once a message receive starts, I see ~50 messages
> that take tens of seconds to receive, when the nominal receive time is
> a half-second or less.  I'm in the process of tooling up to collect
> tcpdump data on all my clients to try to catch what is going on with that.
> 
> Any other ideas on what to look for would be greatly appreciated.

I'd rule out the journal+data on same disk as the source of pain first.  
If that's what's going on, we can take a closer look at specifically how 
to make it behave better!

sage


* Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-02 19:06           ` [EXTERNAL] " Jim Schutt
  2012-02-02 19:15             ` Sage Weil
@ 2012-02-02 19:32             ` Gregory Farnum
  2012-02-02 20:22               ` Jim Schutt
  1 sibling, 1 reply; 47+ messages in thread
From: Gregory Farnum @ 2012-02-02 19:32 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Thu, Feb 2, 2012 at 11:06 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> On 02/02/2012 10:52 AM, Gregory Farnum wrote:
>>
>> On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>>> The typical pattern I see is that a run starts with tens of seconds
>>> of aggregate throughput>  2 GB/s.  Then it drops and bounces around
>>> 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
>>> it ramps back up near 2 GB/s again.
>>
>>
>> Hmm. 100MB/s is awfully low for this theory, but have you tried to
>> correlate the drops in throughput with the OSD journals running out of
>> space?
>
>
> A spot check of logs from my last run doesn't seem to have any
> "journal throttle: waited for" messages during a slowdown.
> Is that what you mean?

I'd expect to see those, yes, but I actually meant the on-disk journal
itself getting full. I believe that should result in output like:
    write_thread_entry full, going to sleep (waiting for commit)
...although I now notice that's a much higher log level (20) than the
other messages (1/5).
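
For illustration, catching that line would mean raising the journal debug
level, e.g. something like (assuming the usual debug option naming):

  [osd]
      debug journal = 20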

> During the fast part of a run I see lots of journal messages
> with this pattern:
>
> 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
> and 50346596 bytes, now 22 ops and 90041106 bytes
> 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
> 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
> 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
> and 50346596 bytes, now 21 ops and 85845571 bytes
>
> which I think means my journal is doing 50 MB/s, right?

Generally, yes — although that'll also pop up if the store manages to
commit faster than the journal (unlikely). :)

>> and your description makes me
>> think that throughput is initially constrained by sequential journal
>> writes but then the journal runs out of space and the OSD has to wait
>> for the main store to catch up (with random IO), and that sends the IO
>> patterns all to hell. (If you can say that random 4MB IOs are
>> hellish.)
>
>
> iostat 1 during the fast part of a run shows both journal and data
> partitions running at 45-50 MB/s.  During the slow part of a run
> they both show similar but low data rates.

All right. That's actually not that surprising; random 4MB writes are
pretty nice to a modern drive.

>> I'm also curious about memory usage as a possible explanation for the
>> more dramatic drops.
>
> My OSD servers have 48 GB memory.  During a run I rarely see less than
> 24 GB used by the page cache, with the rest mostly used by anonymous memory.
> I don't run with any swap.
>
> So far I'm looking at two behaviours I've noticed that seem anomalous to me.
>
> One is that I instrumented ms_dispatch(), and I see it take
> a half-second or more several hundred times, out of several
> thousand messages.  Is that expected?

How did you instrument it? If you wrapped the whole function it's
possible that those longer runs are actually chewing through several
messages that had to get waitlisted for some reason previously.
(That's the call to do_waiters().)

> Another is that once a message receive starts, I see ~50 messages
> that take tens of seconds to receive, when the nominal receive time is
> a half-second or less.  I'm in the process of tooling up to collect
> tcpdump data on all my clients to try to catch what is going on with that.

Again, how are you instrumenting that?

-Greg

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-02 19:15             ` Sage Weil
@ 2012-02-02 19:33               ` Jim Schutt
  0 siblings, 0 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-02 19:33 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel

On 02/02/2012 12:15 PM, Sage Weil wrote:
> On Thu, 2 Feb 2012, Jim Schutt wrote:
>> On 02/02/2012 10:52 AM, Gregory Farnum wrote:
>>> On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt<jaschut@sandia.gov>   wrote:
>>>> I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
>>>> per OSD.  During a test I watch both OSD servers with both
>>>> vmstat and iostat.
>>>>
>>>> During a "good" period, vmstat says the server is sustaining>   2 GB/s
>>>> for multiple tens of seconds.  Since I use replication factor 2, that
>>>> means that server is sustaining>   500 MB/s aggregate client throughput,
>>>> right?  During such a period vmstat also reports ~10% CPU idle.
>>>>
>>>> During a "bad" period, vmstat says the server is doing ~200 MB/s,
>>>> with lots of idle cycles.  It is during these periods that
>>>> messages stuck in the policy throttler build up such long
>>>> wait times.  Sometimes I see really bad periods with aggregate
>>>> throughput per server<   100 MB/s.
>>>>
>>>> The typical pattern I see is that a run starts with tens of seconds
>>>> of aggregate throughput>   2 GB/s.  Then it drops and bounces around
>>>> 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
>>>> it ramps back up near 2 GB/s again.
>>>
>>> Hmm. 100MB/s is awfully low for this theory, but have you tried to
>>> correlate the drops in throughput with the OSD journals running out of
>>> space?
>>
>> A spot check of logs from my last run doesn't seem to have any
>> "journal throttle: waited for" messages during a slowdown.
>> Is that what you mean?
>>
>> During the fast part of a run I see lots of journal messages
>> with this pattern:
>>
>> 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
>> and 50346596 bytes, now 22 ops and 90041106 bytes
>> 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
>> 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
>> and 50346596 bytes, now 21 ops and 85845571 bytes
>
> It occurs to me that part of the problem may be the current sync
> io behavior in the journal.  It ends up doing really big writes, which
> makes things bursty, and will get stuff blocked up behind the throttler.
> You might try making 'journal max write bytes' smaller?  Hmm, although
> it's currently 10MB, which isn't too bad.  So unless you've changed it
> from the default, that's probably not it.

I have changed it recently, but I was seeing this type of
behaviour before making that change.

FWIW, here's my current non-standard tunings.  I'm using
these because the standard ones work even worse for me
on this test case:

	osd op threads = 48
	filestore queue max ops = 16
	osd client message size cap = 50000000
	client oc max dirty =         50000000
	journal max write bytes =     50000000
	ms dispatch throttle bytes =  66666666
	client oc size =              100000000
	journal queue max bytes =     125000000
	filestore queue max bytes =   125000000
	objector inflight op bytes =  200000000


FWIW, turning down "filestore queue max ops" and turning up "osd op threads"
made the biggest positive impact on my performance levels on this test.

With default values for those I was seeing much worse stalling behaviour.

>
>> which I think means my journal is doing 50 MB/s, right?
>>
>>> I assume from your setup that they're sharing the disk with the
>>> store (although it works either way),
>>
>> I've got a 4 GB journal partition on the outer tracks of the disk.
>
> This is on the same disk as the osd data?  As an experiment, you could
> try putting the journal on a separate disk (half disks for journals, half
> for data).  That's obviously not what you want in the real world, but it
> would be interesting to see if contention for the spindle is responsible
> for this.
>
>> So far I'm looking at two behaviours I've noticed that seem anomalous to me.
>>
>> One is that I instrumented ms_dispatch(), and I see it take
>> a half-second or more several hundred times, out of several
>> thousand messages.  Is that expected?
>
> I don't think so, but it could happen if there's cpu/memory contention, or in
> the case of osdmap updates where we block on io in that thread.

Hmmm.  Would those go into the filestore op queue?

I didn't think to check ms_dispatch() ET until after I had tuned
down "filestore queue max ops".  Is it worth checking this?

>
>> Another is that once a message receive starts, I see ~50 messages
>> that take tens of seconds to receive, when the nominal receive time is
>> a half-second or less.  I'm in the process of tooling up to collect
>> tcpdump data on all my clients to try to catch what is going on with that.
>>
>> Any other ideas on what to look for would be greatly appreciated.
>
> I'd rule out the journal+data on same disk as the source of pain first.
> If that's what's going on, we can take a closer look at specifically how
> to make it behave better!

OK, I'll try that next.

Thanks -- Jim

>
> sage
>
>



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-02 19:32             ` Gregory Farnum
@ 2012-02-02 20:22               ` Jim Schutt
  2012-02-02 20:31                 ` Jim Schutt
  2012-02-03  0:28                 ` [EXTERNAL] " Gregory Farnum
  0 siblings, 2 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-02 20:22 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On 02/02/2012 12:32 PM, Gregory Farnum wrote:
> On Thu, Feb 2, 2012 at 11:06 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>> On 02/02/2012 10:52 AM, Gregory Farnum wrote:
>>>
>>> On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt<jaschut@sandia.gov>    wrote:
>>>> The typical pattern I see is that a run starts with tens of seconds
>>>> of aggregate throughput>    2 GB/s.  Then it drops and bounces around
>>>> 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
>>>> it ramps back up near 2 GB/s again.
>>>
>>>
>>> Hmm. 100MB/s is awfully low for this theory, but have you tried to
>>> correlate the drops in throughput with the OSD journals running out of
>>> space?
>>
>>
>> A spot check of logs from my last run doesn't seem to have any
>> "journal throttle: waited for" messages during a slowdown.
>> Is that what you mean?
>
> I'd expect to see those, yes, but I actually meant the on-disk journal
> itself getting full. I believe that should result in output like:
>      write_thread_entry full, going to sleep (waiting for commit)
> ...although I now notice that's a much higher log level (20) than the
> other messages (1/5).

So I've been running OSDs with
	debug osd = 20
	debug journal = 20   ; local journaling
	debug filestore = 20 ; local object storage
	debug objector = 20
	debug ms = 20
	debug = 1

I found 0 instances of "waiting for commit" in all my OSD logs for my last run.

So I never waited on the journal?

>
>> During the fast part of a run I see lots of journal messages
>> with this pattern:
>>
>> 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
>> and 50346596 bytes, now 22 ops and 90041106 bytes
>> 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
>> 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
>> 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
>> and 50346596 bytes, now 21 ops and 85845571 bytes
>>
>> which I think means my journal is doing 50 MB/s, right?
>
> Generally, yes — although that'll also pop up if the store manages to
> commit faster than the journal (unlikely). :)
>
>>> and your description makes me
>>> think that throughput is initially constrained by sequential journal
>>> writes but then the journal runs out of space and the OSD has to wait
>>> for the main store to catch up (with random IO), and that sends the IO
>>> patterns all to hell. (If you can say that random 4MB IOs are
>>> hellish.)
>>
>>
>> iostat 1 during the fast part of a run shows both journal and data
>> partitions running at 45-50 MB/s.  During the slow part of a run
>> they both show similar but low data rates.
>
> All right. That's actually not that surprising; random 4MB writes are
> pretty nice to a modern drive.
>
>>> I'm also curious about memory usage as a possible explanation for the
>>> more dramatic drops.
>>
>> My OSD servers have 48 GB memory.  During a run I rarely see less than
>> 24 GB used by the page cache, with the rest mostly used by anonymous memory.
>> I don't run with any swap.
>>
>> So far I'm looking at two behaviours I've noticed that seem anomalous to me.
>>
>> One is that I instrumented ms_dispatch(), and I see it take
>> a half-second or more several hundred times, out of several
>> thousand messages.  Is that expected?
>
> How did you instrument it? If you wrapped the whole function it's
> possible that those longer runs are actually chewing through several
> messages that had to get waitlisted for some reason previously.
> (That's the call to do_waiters().)

Yep, I wrapped the whole function, and also instrumented taking osd_lock
while I was there.  About half the time that ms_dispatch() takes more than
0.5 seconds, taking osd_lock is responsible for the delay.  There's two
dispatch threads, one for ops and one for rep_ops, right?  So one's
waiting on the other?

>
>> Another is that once a message receive starts, I see ~50 messages
>> that take tens of seconds to receive, when the nominal receive time is
>> a half-second or less.  I'm in the process of tooling up to collect
>> tcpdump data on all my clients to try to catch what is going on with that.
>
> Again, how are you instrumenting that?

I post-process the logs, looking at the time difference between
"reader got .* policy throttler" and "reader got .* osd_op(client".

When I find a candidate message, I grep the log for just that reader thread,
and see, e.g., this:

osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/50000000 seq 828/828 waiters 157/149 for src client.4301 tid=247
osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler 41944358/66666666
osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got front 128
osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172

Note the ~2 minute delay (and ~300,000 lines of logging) between the first and second reads.

During that time 129 sockets were processed - what makes sd=215 special?
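
(For reference, a count like that can be pulled out by tallying the
distinct sd= values logged in that window; a rough sketch, run over the
line range between the two reads:)

import re, sys

# Count how many distinct pipes (sd=N) logged anything between two line
# numbers of the osd log.
log, first, last = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
sds = set()
for n, line in enumerate(open(log), 1):
    if first <= n <= last:
        m = re.search(r'\bsd=(\d+)', line)
        if m:
            sds.add(m.group(1))
print(len(sds), 'distinct sockets')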

I've added tracepoints in my client kernel try_write(), and nothing seems
unusual (that's with running the patch to ceph_write_space() I posted earlier):

      kworker/0:2-1790  [000]  1543.200887: ceph_try_write_msg_done: peer osd0 tid 179 seq 3 sent 4194304
      kworker/0:2-1790  [000]  1543.200901: ceph_prepare_write_msg: peer osd0 tid 207 seq 4 sent 0
      kworker/0:2-1790  [000]  1543.200904: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 0
      kworker/0:2-1790  [000]  1543.203475: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 123951
      kworker/0:2-1790  [000]  1543.206069: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 251375
      kworker/0:2-1790  [000]  1543.208505: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 378799
      kworker/0:2-1790  [000]  1543.210898: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 506223
      kworker/0:2-1790  [000]  1543.213354: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 633647
      kworker/0:2-1790  [000]  1543.215095: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 761071
      kworker/0:2-1790  [000]  1543.217636: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 856639
      kworker/0:2-1790  [000]  1543.221925: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 984063
      kworker/0:2-1790  [000]  1543.225468: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 1111487
      kworker/0:2-1790  [000]  1543.228113: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 1238911
      kworker/0:2-1790  [000]  1543.231166: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 1366335
      kworker/0:2-1790  [000]  1543.236256: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 1493759
      kworker/0:2-1790  [000]  1569.020329: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 1621183
      kworker/0:2-1790  [000]  1569.022522: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 1748607
      kworker/0:2-1790  [000]  1569.024716: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 1876031
      kworker/0:2-1790  [000]  1569.027872: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 2003455
      kworker/0:2-1790  [000]  1569.030603: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 2130879
      kworker/0:2-1790  [000]  1569.034906: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 2258303
      kworker/0:2-1790  [000]  1569.037398: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 2385727
      kworker/0:2-1790  [000]  1569.040094: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 2513151
      kworker/0:2-1790  [000]  1569.042541: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 2640575
      kworker/0:2-1790  [000]  1569.045323: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 2767999
      kworker/0:2-1790  [000]  1569.048565: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 2895423
      kworker/0:2-1790  [000]  1569.051410: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 3022847
      kworker/0:2-1790  [000]  1569.053606: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 3150271
      kworker/0:2-1790  [000]  1569.055802: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 3277695
      kworker/0:2-1790  [000]  1569.058857: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 3405119
      kworker/0:2-1790  [000]  1569.061298: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 3532543
      kworker/0:2-1790  [000]  1569.063692: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 3659967
      kworker/0:2-1790  [000]  1569.069931: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 3787391
      kworker/0:2-1790  [000]  1569.072926: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 3914815
      kworker/0:2-1790  [000]  1569.075909: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 4042239
      kworker/0:2-1790  [000]  1569.078603: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 4169663
      kworker/0:2-1790  [000]  1569.078613: ceph_try_write_msg: peer osd0 tid 207 seq 4 sent 4194304
      kworker/0:2-1790  [000]  1569.078614: ceph_try_write_msg_done: peer osd0 tid 207 seq 4 sent 4194304

There's a 25 second gap at 1543.236256, but nothing like the
100 second gap in the reader.
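
Gaps like that can be picked out of the trace mechanically with
something like the sketch below; it assumes the tracepoint format shown
above and uses an arbitrary 5 s threshold:

import re, sys

# Report big gaps between successive write tracepoints for the same
# message (peer + tid) in the ftrace output.
pat = re.compile(r'\s(\d+\.\d+): ceph_(?:prepare|try)_write_msg\w*: peer (\S+) tid (\d+)')
last = {}
for line in open(sys.argv[1]):
    m = pat.search(line)
    if not m:
        continue
    t, key = float(m.group(1)), (m.group(2), m.group(3))
    if key in last and t - last[key] > 5.0:
        print(key, 'gap of %.1f s ending at %s' % (t - last[key], m.group(1)))
    last[key] = t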

Hence, tcpdump seems like a good idea?

-- Jim

>
> -Greg
>
>



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-02 20:22               ` Jim Schutt
@ 2012-02-02 20:31                 ` Jim Schutt
  2012-02-03  0:28                 ` [EXTERNAL] " Gregory Farnum
  1 sibling, 0 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-02 20:31 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On 02/02/2012 01:22 PM, Jim Schutt wrote:
> On 02/02/2012 12:32 PM, Gregory Farnum wrote:
>> On Thu, Feb 2, 2012 at 11:06 AM, Jim Schutt<jaschut@sandia.gov> wrote:
>>> On 02/02/2012 10:52 AM, Gregory Farnum wrote:
>>>>
>>>> On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt<jaschut@sandia.gov> wrote:
>>>>> The typical pattern I see is that a run starts with tens of seconds
>>>>> of aggregate throughput> 2 GB/s. Then it drops and bounces around
>>>>> 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then
>>>>> it ramps back up near 2 GB/s again.
>>>>
>>>>
>>>> Hmm. 100MB/s is awfully low for this theory, but have you tried to
>>>> correlate the drops in throughput with the OSD journals running out of
>>>> space?
>>>
>>>
>>> A spot check of logs from my last run doesn't seem to have any
>>> "journal throttle: waited for" messages during a slowdown.
>>> Is that what you mean?
>>
>> I'd expect to see those, yes, but I actually meant the on-disk journal
>> itself getting full. I believe that should result in output like:
>> write_thread_entry full, going to sleep (waiting for commit)
>> ...although I now notice that's a much higher log level (20) than the
>> other messages (1/5).
>
> So I've been running OSDs with
> debug osd = 20
> debug journal = 20 ; local journaling
> debug filestore = 20 ; local object storage
> debug objector = 20
> debug ms = 20
> debug = 1
>
> I found 0 instances of "waiting for commit" in all my OSD logs for my last run.
>
> So I never waited on the journal?
>
>>
>>> During the fast part of a run I see lots of journal messages
>>> with this pattern:
>>>
>>> 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
>>> and 50346596 bytes, now 22 ops and 90041106 bytes
>>> 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
>>> 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
>>> and 50346596 bytes, now 21 ops and 85845571 bytes
>>>
>>> which I think means my journal is doing 50 MB/s, right?
>>
>> Generally, yes — although that'll also pop up if the store manages to
>> commit faster than the journal (unlikely). :)
>>
>>>> and your description makes me
>>>> think that throughput is initially constrained by sequential journal
>>>> writes but then the journal runs out of space and the OSD has to wait
>>>> for the main store to catch up (with random IO), and that sends the IO
>>>> patterns all to hell. (If you can say that random 4MB IOs are
>>>> hellish.)
>>>
>>>
>>> iostat 1 during the fast part of a run shows both journal and data
>>> partitions running at 45-50 MB/s. During the slow part of a run
>>> they both show similar but low data rates.
>>
>> All right. That's actually not that surprising; random 4MB writes are
>> pretty nice to a modern drive.
>>
>>>> I'm also curious about memory usage as a possible explanation for the
>>>> more dramatic drops.
>>>
>>> My OSD servers have 48 GB memory. During a run I rarely see less than
>>> 24 GB used by the page cache, with the rest mostly used by anonymous memory.
>>> I don't run with any swap.
>>>
>>> So far I'm looking at two behaviours I've noticed that seem anomalous to me.
>>>
>>> One is that I instrumented ms_dispatch(), and I see it take
>>> a half-second or more several hundred times, out of several
>>> thousand messages. Is that expected?
>>
>> How did you instrument it? If you wrapped the whole function it's
>> possible that those longer runs are actually chewing through several
>> messages that had to get waitlisted for some reason previously.
>> (That's the call to do_waiters().)
>
> Yep, I wrapped the whole function, and also instrumented taking osd_lock
> while I was there. About half the time that ms_dispatch() takes more than
> 0.5 seconds, taking osd_lock is responsible for the delay. There's two
> dispatch threads, one for ops and one for rep_ops, right? So one's
> waiting on the other?
>
>>
>>> Another is that once a message receive starts, I see ~50 messages
>>> that take tens of seconds to receive, when the nominal receive time is
>>> a half-second or less. I'm in the process of tooling up to collect
>>> tcpdump data on all my clients to try to catch what is going on with that.
>>
>> Again, how are you instrumenting that?
>
> I post-process the logs, looking at the time difference between
> "reader got .* policy throttler" and "reader got .* osd_op(client".
>
> When I find a candidate message, I grep the log for just that reader thread,
> and see, e.g., this:
>
> osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/50000000 seq 828/828 waiters 157/149 for src client.4301 tid=247
> osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler 41944358/66666666
> osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got front 128
> osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
> osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
> osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
> osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172
>
> Note the ~2 minute delay (and ~300,000 lines of logging) between the first and second reads.
>
> During that time 129 sockets were processed - what makes sd=215 special?
>
> I've added tracepoints in my client kernel try_write(), and nothing seems
> unusual (that's with running the patch to ceph_write_space() I posted earlier):
>
[snip wrong trace output]

      kworker/0:2-1790  [000]  1569.078614: ceph_try_write_msg_done: peer osd0 tid 207 seq 4 sent 4194304
      kworker/0:2-1790  [000]  1569.078618: ceph_prepare_write_msg: peer osd0 tid 247 seq 5 sent 0
      kworker/0:2-1790  [000]  1569.078621: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 0
      kworker/0:2-1790  [000]  1569.281943: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 102588
      kworker/0:2-1790  [000]  1569.299540: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 230012
      kworker/0:2-1790  [000]  1569.303088: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 357436
      kworker/0:2-1790  [000]  1569.305580: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 453004
      kworker/0:2-1790  [000]  1569.308217: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 580428
      kworker/0:2-1790  [000]  1569.310914: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 707852
      kworker/0:2-1790  [000]  1569.313742: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 835276
      kworker/0:2-1790  [000]  1569.316653: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 962700
      kworker/0:2-1790  [000]  1569.319203: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1090124
      kworker/0:2-1790  [000]  1569.323786: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1217548
      kworker/0:2-1790  [000]  1569.325982: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1344972
      kworker/0:2-1790  [000]  1569.328577: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1472396
      kworker/0:2-1790  [000]  1573.844837: ceph_handle_reply_msg: peer osd0 tid 179 result 0 flags 0x00000025 (req ffff88018aa2c400)
      kworker/0:2-1790  [000]  1573.845487: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1599820
     flush-ceph-1-11910 [000]  1576.891480: ceph_async_writepages_req: tid 377 osd0 ops 1 0x2201/0x0000/0x0000 pages 1024
      kworker/0:2-1790  [000]  1602.214574: ceph_handle_reply_msg: peer osd0 tid 207 result 0 flags 0x00000025 (req ffff88018aa7d400)
      kworker/0:2-1790  [000]  1602.215230: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1663532
     flush-ceph-1-11910 [000]  1636.926835: ceph_async_writepages_req: tid 410 osd0 ops 1 0x2201/0x0000/0x0000 pages 1024
      kworker/0:2-1790  [000]  1775.409415: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1663532
      kworker/0:2-1790  [000]  1775.411643: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1790956
      kworker/0:2-1790  [000]  1775.414125: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 1918380
      kworker/0:2-1790  [000]  1775.416520: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2045804
      kworker/0:2-1790  [000]  1775.419163: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2173228
      kworker/0:2-1790  [000]  1775.421620: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2300652
      kworker/0:2-1790  [000]  1775.423868: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2428076
      kworker/0:2-1790  [000]  1775.426260: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2555500
      kworker/0:2-1790  [000]  1775.426297: ceph_try_write_msg: peer osd0 tid 247 seq 5 sent 2670592
      kworker/0:2-1790  [000]  1775.426298: ceph_try_write_msg_done: peer osd0 tid 247 seq 5 sent 2670592

There's a 170 second gap at 1602.215230 - why?

-- Jim

>
> Hence, tcpdump seems like a good idea?
>
> -- Jim
>
>>
>> -Greg
>>
>>
>



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-02 20:22               ` Jim Schutt
  2012-02-02 20:31                 ` Jim Schutt
@ 2012-02-03  0:28                 ` Gregory Farnum
  2012-02-03 16:17                   ` Jim Schutt
  1 sibling, 1 reply; 47+ messages in thread
From: Gregory Farnum @ 2012-02-03  0:28 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt <jaschut@sandia.gov> wrote:
> I found 0 instances of "waiting for commit" in all my OSD logs for my last
> run.
>
> So I never waited on the journal?

Looks like it. Interesting.


>>> So far I'm looking at two behaviours I've noticed that seem anomalous to
>>> me.
>>>
>>> One is that I instrumented ms_dispatch(), and I see it take
>>> a half-second or more several hundred times, out of several
>>> thousand messages.  Is that expected?
>>
>>
>> How did you instrument it? If you wrapped the whole function it's
>> possible that those longer runs are actually chewing through several
>> messages that had to get waitlisted for some reason previously.
>> (That's the call to do_waiters().)
>
>
> Yep, I wrapped the whole function, and also instrumented taking osd_lock
> while I was there.  About half the time that ms_dispatch() takes more than
> 0.5 seconds, taking osd_lock is responsible for the delay.  There's two
> dispatch threads, one for ops and one for rep_ops, right?  So one's
> waiting on the other?

There's just one main dispatcher; no split for the ops and rep_ops.
The reason for that "dispatch_running" is that if there are requests
waiting then the tick() function will run through them if the
messenger dispatch thread is currently idle.
But it is possible for the Messenger to try and dispatch, and for that
to be blocked while some amount of (usually trivial) work is being
done by a different thread, yes. I don't think we've ever observed it
being a problem for anything other than updating OSD maps, though...


>>> Another is that once a message receive starts, I see ~50 messages
>>> that take tens of seconds to receive, when the nominal receive time is
>>> a half-second or less.  I'm in the process of tooling up to collect
>>> tcpdump data on all my clients to try to catch what is going on with
>>> that.
>>
>>
>> Again, how are you instrumenting that?
>
>
> I post-process the logs, looking at the time difference between
> "reader got .* policy throttler" and "reader got .* osd_op(client".

I guess the logging output must have changed a bit at some point (or
was that one of your patches?). master has "reader wants" not "reader
got" for the policy throttler. (Just got a little confused when
checking the code.)

> When I find a candidate message, I grep the log for just that reader thread,
> and see, e.g., this:
>
> osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/50000000
> seq 828/828 waiters 157/149 for src client.4301 tid=247
> osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
> 41944358/66666666
> osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader got front 128
> osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
> osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
> osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
> osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 --
> 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172
>
> Note the ~2 minute delay (and ~300,000 lines of logging) between the first
> and second reads.
>
> During that time 129 sockets were processed - what makes sd=215 special?

Hrm. Well, you can try turning up the messenger debugging to 30 and
taking advantage of the "reader reading" "reader read" pair right
around tcp_read_nonblocking.
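
Something along these lines could pair those two messages up per reader
thread (a sketch; it assumes the messages literally contain "reader
reading" / "reader read" and the timestamp/thread-id layout from the
log excerpts above):

import re, sys
from datetime import datetime

# Pair each "reader reading" with the next "reader read" from the same
# thread and print the slow ones, to see which socket reads stall.
pat = re.compile(r'^(\S+ \S+) (\S+) .*reader read(ing)?\b')

def ts(s):
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')

reading = {}
for line in open(sys.argv[1]):
    m = pat.match(line)
    if not m:
        continue
    when, tid, is_start = ts(m.group(1)), m.group(2), bool(m.group(3))
    if is_start:
        reading[tid] = when
    elif tid in reading:
        delta = (when - reading.pop(tid)).total_seconds()
        if delta > 1.0:
            print(tid, '%.3f s' % delta)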

> I've added tracepoints in my client kernel try_write(), and nothing seems
> unusual (that's with running the patch to ceph_write_space() I posted
> earlier):
>
>     kworker/0:2-1790  [000]  1543.200887: ceph_try_write_msg_done: peer osd0
> tid 179 seq 3 sent 4194304
>     kworker/0:2-1790  [000]  1543.200901: ceph_prepare_write_msg: peer osd0
> tid 207 seq 4 sent 0
*snip*
>     kworker/0:2-1790  [000]  1569.078614: ceph_try_write_msg_done: peer osd0
> tid 207 seq 4 sent 4194304
>
> There's a 25 second gap at 1543.236256, but nothing like the
> 100 second gap in the reader.
>
> Hence, tcpdump seems like a good idea?

You do bring us interesting problems! Let us know what info you come up with.

Oh, and I keep forgetting to ask: what does the write workload look
like? At first I assumed this was a CephFS workload, but given that
you're changing max message sizes and have half-second writes you're
probably doing something else?
-Greg

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-03  0:28                 ` [EXTERNAL] " Gregory Farnum
@ 2012-02-03 16:17                   ` Jim Schutt
  2012-02-03 17:06                     ` Gregory Farnum
  2012-02-03 17:07                     ` Sage Weil
  0 siblings, 2 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-03 16:17 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On 02/02/2012 05:28 PM, Gregory Farnum wrote:
> On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt<jaschut@sandia.gov>  wrote:
>> I found 0 instances of "waiting for commit" in all my OSD logs for my last
>> run.
>>
>> So I never waited on the journal?
>
> Looks like it. Interesting.
>
>
>>>> So far I'm looking at two behaviours I've noticed that seem anomalous to
>>>> me.
>>>>
>>>> One is that I instrumented ms_dispatch(), and I see it take
>>>> a half-second or more several hundred times, out of several
>>>> thousand messages.  Is that expected?
>>>
>>>
>>> How did you instrument it? If you wrapped the whole function it's
>>> possible that those longer runs are actually chewing through several
>>> messages that had to get waitlisted for some reason previously.
>>> (That's the call to do_waiters().)
>>
>>
>> Yep, I wrapped the whole function, and also instrumented taking osd_lock
>> while I was there.  About half the time that ms_dispatch() takes more than
>> 0.5 seconds, taking osd_lock is responsible for the delay.  There's two
>> dispatch threads, one for ops and one for rep_ops, right?  So one's
>> waiting on the other?
>
> There's just one main dispatcher; no split for the ops and rep_ops .
> The reason for that "dispatch_running" is that if there are requests
> waiting then the tick() function will run through them if the
> messenger dispatch thread is currently idle.
> But it is possible for the Messenger to try and dispatch, and for that
> to be blocked while some amount of (usually trivial) work is being
> done by a different thread, yes. I don't think we've ever observed it
> being a problem for anything other than updating OSD maps, though...

Ah, OK.

I guess I was confused by my log output, e.g.:

osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680

I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads.

I need to go study that code some more....

>
>
>>>> Another is that once a message receive starts, I see ~50 messages
>>>> that take tens of seconds to receive, when the nominal receive time is
>>>> a half-second or less.  I'm in the process of tooling up to collect
>>>> tcpdump data on all my clients to try to catch what is going on with
>>>> that.
>>>
>>>
>>> Again, how are you instrumenting that?
>>
>>
>> I post-process the logs, looking at the time difference between
>> "reader got .* policy throttler" and "reader got .* osd_op(client".
>
> I guess the logging output must have changed a bit at some point (or
> was that one of your patches?). master has "reader wants" not "reader
> got" for the policy throttler. (Just got a little confused when
> checking the code.)

Yep, I added an extra message to make post-processing logs easier, sorry.

>
>> When I find a candidate message, I grep the log for just that reader thread,
>> and see, e.g., this:
>>
>> osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/50000000
>> seq 828/828 waiters 157/149 for src client.4301 tid=247
>> osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
>> 41944358/66666666
>> osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader got front 128
>> osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
>> osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
>> osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
>> osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 --
>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172
>>
>> Note the ~2 minute delay (and ~300,000 lines of logging) between the first
>> and second reads.
>>
>> During that time 129 sockets were processed - what makes sd=215 special?
>
> Hrm. Well, you can try turning up the messenger debugging to 30 and
> taking advantage of the "reader reading" "reader read" pair right
> around tcp_read_nonblocking.

OK, I'll give that a try as well, thanks.
>
>> I've added tracepoints in my client kernel try_write(), and nothing seems
>> unusual (that's with running the patch to ceph_write_space() I posted
>> earlier):
>>
>>      kworker/0:2-1790  [000]  1543.200887: ceph_try_write_msg_done: peer osd0
>> tid 179 seq 3 sent 4194304
>>      kworker/0:2-1790  [000]  1543.200901: ceph_prepare_write_msg: peer osd0
>> tid 207 seq 4 sent 0
> *snip*
>>      kworker/0:2-1790  [000]  1569.078614: ceph_try_write_msg_done: peer osd0
>> tid 207 seq 4 sent 4194304
>>
>> There's a 25 second gap at 1543.236256, but nothing like the
>> 100 second gap in the reader.
>>
>> Hence, tcpdump seems like a good idea?
>
> You do bring us interesting problems! Let us know what info you come up with.
>
> Oh, and I keep forgetting to ask: what does the write workload look
> like? At first I assumed this was a CephFS workload, but given that
> you're changing max message sizes and have half-second writes you're
> probably doing something else?

I'm just using "pdsh -f <number of clients> -w <client list>"
to start up a "dd conv=fdatasync" on each client, roughly
simultaneously.

I think the short messages are coming from writeback control.
I've got the writeback tracepoints enabled, and most of the time
I see things like this:

tc85.trace.log:166469:    flush-ceph-1-11910 [000]  1787.028175: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296424688 age=30 index=513024 to_write=1024 wrote=1024
tc85.trace.log:166474:    flush-ceph-1-11910 [000]  1787.028889: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296424688 age=30 index=514048 to_write=1024 wrote=1024

But occasionally I see this sort of thing:

tc85.trace.log:22410:    flush-ceph-1-11910 [001]  1546.957999: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=0 index=295936 to_write=11264 wrote=11264
tc85.trace.log:29383:    flush-ceph-1-11910 [001]  1547.327652: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=0 index=307200 to_write=11264 wrote=11264
tc85.trace.log:37048:    flush-ceph-1-11910 [001]  1548.861577: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=2 index=316416 to_write=9216 wrote=9216
tc85.trace.log:42864:    flush-ceph-1-11910 [000]  1550.023496: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=3 index=323584 to_write=7168 wrote=7168
tc85.trace.log:47626:    flush-ceph-1-11910 [000]  1550.976374: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=4 index=329728 to_write=6144 wrote=6144
tc85.trace.log:51607:    flush-ceph-1-11910 [001]  1551.781108: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=334848 to_write=5120 wrote=5120
tc85.trace.log:51998:    flush-ceph-1-11910 [001]  1551.860104: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=339968 to_write=5120 wrote=5120
tc85.trace.log:52018:    flush-ceph-1-11910 [001]  1551.863599: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=345088 to_write=5120 wrote=5120
tc85.trace.log:52034:    flush-ceph-1-11910 [001]  1551.866372: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=350208 to_write=5120 wrote=5120
tc85.trace.log:52044:    flush-ceph-1-11910 [001]  1551.866767: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=0 to_write=5120 wrote=648
tc85.trace.log:69705:    flush-ceph-1-11910 [000]  1576.878034: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=30 index=352256 to_write=1024 wrote=1400
tc85.trace.log:69830:    flush-ceph-1-11910 [000]  1576.892907: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296214116 age=30 index=0 to_write=1024 wrote=576
tc85.trace.log:81609:    flush-ceph-1-11910 [001]  1606.907407: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296244466 age=30 index=378880 to_write=1024 wrote=1472
tc85.trace.log:81678:    flush-ceph-1-11910 [001]  1606.916107: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296244466 age=30 index=0 to_write=1024 wrote=831
tc85.trace.log:96729:    flush-ceph-1-11910 [000]  1636.918264: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296274503 age=30 index=393216 to_write=1024 wrote=1217
tc85.trace.log:96839:    flush-ceph-1-11910 [000]  1636.931363: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296274503 age=30 index=0 to_write=1024 wrote=933
tc85.trace.log:111179:    flush-ceph-1-11910 [001]  1666.932329: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296304541 age=30 index=415744 to_write=1024 wrote=1115
tc85.trace.log:111298:    flush-ceph-1-11910 [001]  1666.945162: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296304541 age=30 index=0 to_write=1024 wrote=941

I eventually want to understand what is happening here.....
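
To make that pattern easier to see, something like this can summarize
the writeback_single_inode events per inode (a sketch assuming the
trace layout shown above):

import re, sys

# Print the time between successive writeback_single_inode events for
# each inode, along with how many pages each one wrote; the long, small
# flushes stand out immediately.
pat = re.compile(r'\s(\d+\.\d+): writeback_single_inode: .*ino=(\d+) .*wrote=(\d+)')
last = {}
for line in open(sys.argv[1]):
    m = pat.search(line)
    if not m:
        continue
    t, ino, wrote = float(m.group(1)), m.group(2), int(m.group(3))
    gap = t - last[ino] if ino in last else 0.0
    last[ino] = t
    print('%s +%8.3fs wrote=%d' % (ino, gap, wrote))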

BTW, should I post my ceph client tracepoint patches?  I ask because
it's not clear to me they would be useful to anyone but me.

-- Jim

> -Greg
>
>



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-03 16:17                   ` Jim Schutt
@ 2012-02-03 17:06                     ` Gregory Farnum
  2012-02-03 23:33                       ` Jim Schutt
  2012-02-03 17:07                     ` Sage Weil
  1 sibling, 1 reply; 47+ messages in thread
From: Gregory Farnum @ 2012-02-03 17:06 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Feb 3, 2012, at 8:18 AM, Jim Schutt <jaschut@sandia.gov> wrote:

> On 02/02/2012 05:28 PM, Gregory Farnum wrote:
>> On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt<jaschut@sandia.gov>  wrote:
>>> I found 0 instances of "waiting for commit" in all my OSD logs for my last
>>> run.
>>>
>>> So I never waited on the journal?
>>
>> Looks like it. Interesting.
>>
>>
>>>>> So far I'm looking at two behaviours I've noticed that seem anomalous to
>>>>> me.
>>>>>
>>>>> One is that I instrumented ms_dispatch(), and I see it take
>>>>> a half-second or more several hundred times, out of several
>>>>> thousand messages.  Is that expected?
>>>>
>>>>
>>>> How did you instrument it? If you wrapped the whole function it's
>>>> possible that those longer runs are actually chewing through several
>>>> messages that had to get waitlisted for some reason previously.
>>>> (That's the call to do_waiters().)
>>>
>>>
>>> Yep, I wrapped the whole function, and also instrumented taking osd_lock
>>> while I was there.  About half the time that ms_dispatch() takes more than
>>> 0.5 seconds, taking osd_lock is responsible for the delay.  There's two
>>> dispatch threads, one for ops and one for rep_ops, right?  So one's
>>> waiting on the other?
>>
>> There's just one main dispatcher; no split for the ops and rep_ops .
>> The reason for that "dispatch_running" is that if there are requests
>> waiting then the tick() function will run through them if the
>> messenger dispatch thread is currently idle.
>> But it is possible for the Messenger to try and dispatch, and for that
>> to be blocked while some amount of (usually trivial) work is being
>> done by a different thread, yes. I don't think we've ever observed it
>> being a problem for anything other than updating OSD maps, though...
>
> Ah, OK.
>
> I guess I was confused by my log output, e.g.:

D'oh. Sorry, you confused me with your reference to repops, which
aren't special-cased or anything. But there are two messengers on the
OSD, each with their own dispatch thread. One of those messengers is
for clients and one is for other OSDs.

And now that you point that out, I wonder if the problem is lack of
Cond signaling in ms_dispatch. I'm on my phone right now but I believe
there's a chunk of commented-out code (why commented instead of
deleted? I don't know) that we want to uncomment for reasons that will
become clear when you look at it. :)
Try that and see how many of your problems disappear?


>
> osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
> osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680
>
> I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads.
>
> I need to go study that code some more....
>
>>
>>
>>>>> Another is that once a message receive starts, I see ~50 messages
>>>>> that take tens of seconds to receive, when the nominal receive time is
>>>>> a half-second or less.  I'm in the process of tooling up to collect
>>>>> tcpdump data on all my clients to try to catch what is going on with
>>>>> that.
>>>>
>>>>
>>>> Again, how are you instrumenting that?
>>>
>>>
>>> I post-process the logs, looking at the time difference between
>>> "reader got .* policy throttler" and "reader got .* osd_op(client".
>>
>> I guess the logging output must have changed a bit at some point (or
>> was that one of your patches?). master has "reader wants" not "reader
>> got" for the policy throttler. (Just got a little confused when
>> checking the code.)
>
> Yep, I added an extra message to make post-processing logs easier, sorry.
>
>>
>>> When I find a candidate message, I grep the log for just that reader thread,
>>> and see, e.g., this:
>>>
>>> osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
>>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>>> pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/50000000
>>> seq 828/828 waiters 157/149 for src client.4301 tid=247
>>> osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
>>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>>> pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
>>> 41944358/66666666
>>> osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
>>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>>> pgs=49 cs=1 l=1).reader got front 128
>>> osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
>>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>>> pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
>>> osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
>>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>>> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
>>> osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
>>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>>> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
>>> osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 --
>>> 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
>>> pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172
>>>
>>> Note the ~2 minute delay (and ~300,000 lines of logging) between the first
>>> and second reads.
>>>
>>> During that time 129 sockets were processed - what makes sd=215 special?
>>
>> Hrm. Well, you can try turning up the messenger debugging to 30 and
>> taking advantage of the "reader reading" "reader read" pair right
>> around tcp_read_nonblocking.
>
> OK, I'll give that a try as well, thanks.
>>
>>> I've added tracepoints in my client kernel try_write(), and nothing seems
>>> unusual (that's with running the patch to ceph_write_space() I posted
>>> earlier):
>>>
>>>     kworker/0:2-1790  [000]  1543.200887: ceph_try_write_msg_done: peer osd0
>>> tid 179 seq 3 sent 4194304
>>>     kworker/0:2-1790  [000]  1543.200901: ceph_prepare_write_msg: peer osd0
>>> tid 207 seq 4 sent 0
>> *snip*
>>>     kworker/0:2-1790  [000]  1569.078614: ceph_try_write_msg_done: peer osd0
>>> tid 207 seq 4 sent 4194304
>>>
>>> There's a 25 second gap at 1543.236256, but nothing like the
>>> 100 second gap in the reader.
>>>
>>> Hence, tcpdump seems like a good idea?
>>
>> You do bring us interesting problems! Let us know what info you come up with.
>>
>> Oh, and I keep forgetting to ask: what does the write workload look
>> like? At first I assumed this was a CephFS workload, but given that
>> you're changing max message sizes and have half-second writes you're
>> probably doing something else?
>
> I'm just using "pdsh -f <number of clients> -w <client list>"
> to start up a "dd conv=fdatasync" on each client, roughly
> simultaneously.
>
> I think the short messages are coming from writeback control.
> I've got the writeback tracepoints enabled, and most of the time
> I see things like this:
>
> tc85.trace.log:166469:    flush-ceph-1-11910 [000]  1787.028175: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296424688 age=30 index=513024 to_write=1024 wrote=1024
> tc85.trace.log:166474:    flush-ceph-1-11910 [000]  1787.028889: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296424688 age=30 index=514048 to_write=1024 wrote=1024
>
> But occasionally I see this sort of thing:
>
> tc85.trace.log:22410:    flush-ceph-1-11910 [001]  1546.957999: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=0 index=295936 to_write=11264 wrote=11264
> tc85.trace.log:29383:    flush-ceph-1-11910 [001]  1547.327652: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=0 index=307200 to_write=11264 wrote=11264
> tc85.trace.log:37048:    flush-ceph-1-11910 [001]  1548.861577: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=2 index=316416 to_write=9216 wrote=9216
> tc85.trace.log:42864:    flush-ceph-1-11910 [000]  1550.023496: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=3 index=323584 to_write=7168 wrote=7168
> tc85.trace.log:47626:    flush-ceph-1-11910 [000]  1550.976374: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=4 index=329728 to_write=6144 wrote=6144
> tc85.trace.log:51607:    flush-ceph-1-11910 [001]  1551.781108: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=334848 to_write=5120 wrote=5120
> tc85.trace.log:51998:    flush-ceph-1-11910 [001]  1551.860104: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=339968 to_write=5120 wrote=5120
> tc85.trace.log:52018:    flush-ceph-1-11910 [001]  1551.863599: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=345088 to_write=5120 wrote=5120
> tc85.trace.log:52034:    flush-ceph-1-11910 [001]  1551.866372: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=350208 to_write=5120 wrote=5120
> tc85.trace.log:52044:    flush-ceph-1-11910 [001]  1551.866767: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=5 index=0 to_write=5120 wrote=648
> tc85.trace.log:69705:    flush-ceph-1-11910 [000]  1576.878034: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296214116 age=30 index=352256 to_write=1024 wrote=1400
> tc85.trace.log:69830:    flush-ceph-1-11910 [000]  1576.892907: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296214116 age=30 index=0 to_write=1024 wrote=576
> tc85.trace.log:81609:    flush-ceph-1-11910 [001]  1606.907407: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296244466 age=30 index=378880 to_write=1024 wrote=1472
> tc85.trace.log:81678:    flush-ceph-1-11910 [001]  1606.916107: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296244466 age=30 index=0 to_write=1024 wrote=831
> tc85.trace.log:96729:    flush-ceph-1-11910 [000]  1636.918264: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296274503 age=30 index=393216 to_write=1024 wrote=1217
> tc85.trace.log:96839:    flush-ceph-1-11910 [000]  1636.931363: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296274503 age=30 index=0 to_write=1024 wrote=933
> tc85.trace.log:111179:    flush-ceph-1-11910 [001]  1666.932329: writeback_single_inode: bdi ceph-1: ino=1099511712863 state=I_DIRTY_PAGES dirtied_when=4296304541 age=30 index=415744 to_write=1024 wrote=1115
> tc85.trace.log:111298:    flush-ceph-1-11910 [001]  1666.945162: writeback_single_inode: bdi ceph-1: ino=1099511712863 state= dirtied_when=4296304541 age=30 index=0 to_write=1024 wrote=941
>
> I eventually want to understand what is happening here.....
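For reference, the writeback_single_inode events above are standard
kernel tracepoints, so they can be captured with something like this
(a sketch, run as root, assuming debugfs is mounted at /sys/kernel/debug):

    echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_single_inode/enable
    cat /sys/kernel/debug/tracing/trace_pipe > tc85.trace.log
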
>
> BTW, should I post my ceph client tracepoint patches?  I ask because
> it's not clear to me they would be useful to anyone but me.
>
> -- Jim
>
>> -Greg
>>
>>
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-03 16:17                   ` Jim Schutt
  2012-02-03 17:06                     ` Gregory Farnum
@ 2012-02-03 17:07                     ` Sage Weil
  1 sibling, 0 replies; 47+ messages in thread
From: Sage Weil @ 2012-02-03 17:07 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Gregory Farnum, ceph-devel

On Fri, 3 Feb 2012, Jim Schutt wrote:
> On 02/02/2012 05:28 PM, Gregory Farnum wrote:
> > On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt<jaschut@sandia.gov>  wrote:
> > > I found 0 instances of "waiting for commit" in all my OSD logs for my last
> > > run.
> > > 
> > > So I never waited on the journal?
> > 
> > Looks like it. Interesting.
> > 
> > 
> > > > > So far I'm looking at two behaviours I've noticed that seem anomalous
> > > > > to
> > > > > me.
> > > > > 
> > > > > One is that I instrumented ms_dispatch(), and I see it take
> > > > > a half-second or more several hundred times, out of several
> > > > > thousand messages.  Is that expected?
> > > > 
> > > > 
> > > > How did you instrument it? If you wrapped the whole function it's
> > > > possible that those longer runs are actually chewing through several
> > > > messages that had to get waitlisted for some reason previously.
> > > > (That's the call to do_waiters().)
> > > 
> > > 
> > > Yep, I wrapped the whole function, and also instrumented taking osd_lock
> > > while I was there.  About half the time that ms_dispatch() takes more than
> > > 0.5 seconds, taking osd_lock is responsible for the delay.  There's two
> > > dispatch threads, one for ops and one for rep_ops, right?  So one's
> > > waiting on the other?
> > 
> > There's just one main dispatcher; no split for the ops and rep_ops.
> > The reason for that "dispatch_running" is that if there are requests
> > waiting then the tick() function will run through them if the
> > messenger dispatch thread is currently idle.
> > But it is possible for the Messenger to try and dispatch, and for that
> > to be blocked while some amount of (usually trivial) work is being
> > done by a different thread, yes. I don't think we've ever observed it
> > being a problem for anything other than updating OSD maps, though...
> 
> Ah, OK.
> 
> I guess I was confused by my log output, e.g.:
> 
> osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
> osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680
> 
> I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads.
> 
> I need to go study that code some more....

Oh... they are separate threads.  In the OSD's case two different 
messengers (the public and cluster ones) are wired up to the same 
dispatcher (OSD::ms_dispatch).  MOSDOps come in on the public thread, 
MOSDSubOps on the cluster one, but they're both fed to the same function.  
That's why there's some funkiness going on in, say, handle_osd_map().

But like Greg said, I don't think we've seen any significant latencies 
there except from map processing.  If you have a log, that would be 
interesting to look at!

sage


> 
> > 
> > 
> > > > > Another is that once a message receive starts, I see ~50 messages
> > > > > that take tens of seconds to receive, when the nominal receive time is
> > > > > a half-second or less.  I'm in the process of tooling up to collect
> > > > > tcpdump data on all my clients to try to catch what is going on with
> > > > > that.
> > > > 
> > > > 
> > > > Again, how are you instrumenting that?
> > > 
> > > 
> > > I post-process the logs, looking at the time difference between
> > > "reader got .* policy throttler" and "reader got .* osd_op(client".
> > 
> > I guess the logging output must have changed a bit at some point (or
> > was that one of your patches?). master has "reader wants" not "reader
> > got" for the policy throttler. (Just got a little confused when
> > checking the code.)
> 
> Yep, I added an extra message to make post-processing logs easier, sorry.
> 
> > 
> > > When I find a candidate message, I grep the log for just that reader
> > > thread,
> > > and see, e.g., this:
> > > 
> > > osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 -- 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/50000000 seq 828/828 waiters 157/149 for src client.4301 tid=247
> > > osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 -- 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler 41944358/66666666
> > > osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 -- 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got front 128
> > > osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 -- 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
> > > osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 -- 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
> > > osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 -- 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
> > > osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 -- 172.17.131.32:6800/14974>>  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172
> > > 
> > > Note the ~2 minute delay (and ~300,000 lines of logging) between the first
> > > and second reads.
> > > 
> > > During that time 129 sockets were processed - what makes sd=215 special?
> > 
> > Hrm. Well, you can try turning up the messenger debugging to 30 and
> > taking advantage of the "reader reading" "reader read" pair right
> > around tcp_read_nonblocking.
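For example, something like this in ceph.conf on the OSD nodes
(a sketch; 30 is just "very verbose" for the messenger subsystem):

    [osd]
        debug ms = 30
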
> 
> OK, I'll give that a try as well, thanks.
> > 
> > > I've added tracepoints in my client kernel try_write(), and nothing seems
> > > unusual (that's with running the patch to ceph_write_space() I posted
> > > earlier):
> > > 
> > >      kworker/0:2-1790  [000]  1543.200887: ceph_try_write_msg_done: peer osd0 tid 179 seq 3 sent 4194304
> > >      kworker/0:2-1790  [000]  1543.200901: ceph_prepare_write_msg: peer osd0 tid 207 seq 4 sent 0
> > *snip*
> > >      kworker/0:2-1790  [000]  1569.078614: ceph_try_write_msg_done: peer osd0 tid 207 seq 4 sent 4194304
> > > 
> > > There's a 25 second gap at 1543.236256, but nothing like the
> > > 100 second gap in the reader.
> > > 
> > > Hence, tcpdump seems like a good idea?
> > 
> > You do bring us interesting problems! Let us know what info you come up
> > with.
> > 
> > Oh, and I keep forgetting to ask: what does the write workload look
> > like? At first I assumed this was a CephFS workload, but given that
> > you're changing max message sizes and have half-second writes you're
> > probably doing something else?
> 
> I'm just using "pdsh -f <number of clients> -w <client list>"
> to start up a "dd conv=fdatasync" on each client, roughly
> simultaneously.
> 
> I think the short messages are coming from writeback control.
> I've got the writeback tracepoints enabled, and most of the time
> I see things like this:
> 
> *snip*
> 
> But occasionally I see this sort of thing:
> 
> *snip*
> 
> I eventually want to understand what is happening here.....
> 
> BTW, should I post my ceph client tracepoint patches?  I ask because
> it's not clear to me they would be useful to anyone but me.
> 
> -- Jim
> 
> > -Greg
> > 
> > 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-03 17:06                     ` Gregory Farnum
@ 2012-02-03 23:33                       ` Jim Schutt
       [not found]                         ` <CAC-hyiHSNv_VgLcyVCrJ66HxTGFNBONrmmBddJk5326dLTKgkw@mail.gmail.com>
  0 siblings, 1 reply; 47+ messages in thread
From: Jim Schutt @ 2012-02-03 23:33 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On 02/03/2012 10:06 AM, Gregory Farnum wrote:
> On Feb 3, 2012, at 8:18 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>
>> On 02/02/2012 05:28 PM, Gregory Farnum wrote:
>>> On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt<jaschut@sandia.gov>   wrote:
>>>> I found 0 instances of "waiting for commit" in all my OSD logs for my last
>>>> run.
>>>>
>>>> So I never waited on the journal?
>>>
>>> Looks like it. Interesting.
>>>
>>>
>>>>>> So far I'm looking at two behaviours I've noticed that seem anomalous to
>>>>>> me.
>>>>>>
>>>>>> One is that I instrumented ms_dispatch(), and I see it take
>>>>>> a half-second or more several hundred times, out of several
>>>>>> thousand messages.  Is that expected?
>>>>>
>>>>>
>>>>> How did you instrument it? If you wrapped the whole function it's
>>>>> possible that those longer runs are actually chewing through several
>>>>> messages that had to get waitlisted for some reason previously.
>>>>> (That's the call to do_waiters().)
>>>>
>>>>
>>>> Yep, I wrapped the whole function, and also instrumented taking osd_lock
>>>> while I was there.  About half the time that ms_dispatch() takes more than
>>>> 0.5 seconds, taking osd_lock is responsible for the delay.  There's two
>>>> dispatch threads, one for ops and one for rep_ops, right?  So one's
>>>> waiting on the other?
>>>
>>> There's just one main dispatcher; no split for the ops and rep_ops .
>>> The reason for that "dispatch_running" is that if there are requests
>>> waiting then the tick() function will run through them if the
>>> messenger dispatch thread is currently idle.
>>> But it is possible for the Messenger to try and dispatch, and for that
>>> to be blocked while some amount of (usually trivial) work is being
>>> done by a different thread, yes. I don't think we've ever observed it
>>> being a problem for anything other than updating OSD maps, though...
>>
>> Ah, OK.
>>
>> I guess I was confused by my log output, e.g.:
>
> D'oh. Sorry, you confused me with your reference to repops, which
> aren't special-cased or anything. But there are two messengers on the
> OSD, each with their own dispatch thread. One of those messengers is
> for clients and one is for other OSDs.
>
> And now that you point that out, I wonder if the problem is lack of
> Cond signaling in ms_dispatch. I'm on my phone right now but I believe
> there's a chunk of commented-out code (why commented instead of
> deleted? I don't know) that we want to uncomment for reasons that will
> become clear when you look at it. :)
> Try that and see how many of your problems disappear?
>

So I cherry-picked Sage's commit 7641a0e171f onto the code
I've been running (1fe75ee6419 + some debug stuff), and saw
no obvious difference in behaviour.

I also tested Sage's suggestion of separating journals and
data, by putting two journal partitions on half my disks,
and two data partitions on the other half.  I made the data
partitions relatively small (~200 GiB each on a 1 TiB drive)
to minimize the effect of inner vs. outer tracks.
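Roughly along these lines (a sketch; device names and sizes here are
illustrative):

    # half the drives: two journal partitions each
    parted -s /dev/sdc mklabel gpt \
      mkpart journal0 1MiB 10GiB \
      mkpart journal1 10GiB 20GiB

    # the other half: two ~200 GiB data partitions each
    parted -s /dev/sdd mklabel gpt \
      mkpart data0 1MiB 200GiB \
      mkpart data1 200GiB 400GiB
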

That didn't seem to help either.

Still looking -- Jim


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
       [not found]                         ` <CAC-hyiHSNv_VgLcyVCrJ66HxTGFNBONrmmBddJk5326dLTKgkw@mail.gmail.com>
@ 2012-02-04  0:04                           ` Yehuda Sadeh Weinraub
  2012-02-06 16:20                           ` Jim Schutt
  1 sibling, 0 replies; 47+ messages in thread
From: Yehuda Sadeh Weinraub @ 2012-02-04  0:04 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Gregory Farnum, ceph-devel

(resending to list)

On Fri, Feb 3, 2012 at 3:33 PM, Jim Schutt <jaschut@sandia.gov> wrote:
>
> On 02/03/2012 10:06 AM, Gregory Farnum wrote:
>>
>> On Feb 3, 2012, at 8:18 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>>
>>> On 02/02/2012 05:28 PM, Gregory Farnum wrote:
>>>>
>>>> On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt<jaschut@sandia.gov>   wrote:
>>>>>
>>>>> I found 0 instances of "waiting for commit" in all my OSD logs for my last
>>>>> run.
>>>>>
>>>>> So I never waited on the journal?
>>>>
>>>>
>>>> Looks like it. Interesting.
>>>>
>>>>
>>>>>>> So far I'm looking at two behaviours I've noticed that seem anomalous to
>>>>>>> me.
>>>>>>>
>>>>>>> One is that I instrumented ms_dispatch(), and I see it take
>>>>>>> a half-second or more several hundred times, out of several
>>>>>>> thousand messages.  Is that expected?
>>>>>>
>>>>>>
>>>>>>
>>>>>> How did you instrument it? If you wrapped the whole function it's
>>>>>> possible that those longer runs are actually chewing through several
>>>>>> messages that had to get waitlisted for some reason previously.
>>>>>> (That's the call to do_waiters().)
>>>>>
>>>>>
>>>>>
>>>>> Yep, I wrapped the whole function, and also instrumented taking osd_lock
>>>>> while I was there.  About half the time that ms_dispatch() takes more than
>>>>> 0.5 seconds, taking osd_lock is responsible for the delay.  There's two
>>>>> dispatch threads, one for ops and one for rep_ops, right?  So one's
>>>>> waiting on the other?
>>>>
>>>>
>>>> There's just one main dispatcher; no split for the ops and rep_ops .
>>>> The reason for that "dispatch_running" is that if there are requests
>>>> waiting then the tick() function will run through them if the
>>>> messenger dispatch thread is currently idle.
>>>> But it is possible for the Messenger to try and dispatch, and for that
>>>> to be blocked while some amount of (usually trivial) work is being
>>>> done by a different thread, yes. I don't think we've ever observed it
>>>> being a problem for anything other than updating OSD maps, though...
>>>
>>>
>>> Ah, OK.
>>>
>>> I guess I was confused by my log output, e.g.:
>>
>>
>> D'oh. Sorry, you confused me with your reference to repops, which
>> aren't special-cased or anything. But there are two messengers on the
>> OSD, each with their own dispatch thread. One of those messengers is
>> for clients and one is for other OSDs.
>>
>> And now that you point that out, I wonder if the problem is lack of
>> Cond signaling in ms_dispatch. I'm on my phone right now but I believe
>> there's a chunk of commented-out code (why commented instead of
>> deleted? I don't know) that we want to uncomment for reasons that will
>> become clear when you look at it. :)
>> Try that and see how many of your problems disappear?
>>
>
> So I cherry-picked Sage's commit 7641a0e171f onto the code
> I've been running (1fe75ee6419 + some debug stuff), and saw
> no obvious difference in behaviour.
>
> I also tested Sage's suggestion of separating journals and
> data, by putting two journal partitions on half my disks,
> and two data partitions on the other half.  I made the data
> partitions relatively small (~200 GiB each on a 1 TiB drive)
> to minimize the effect of inner vs. outer tracks.
>
> That didn't seem to help either.
>

You can try running 'iostat -t -kx -d 1' on the osds, and see whether
%util reaches 100%, and when it does, whether it's due to the number of
io operations thrashing the disks or due to a high volume of data.
FWIW, you may try setting 'filestore flusher = false', and setting
'/proc/sys/vm/dirty_background_ratio' to a small number (e.g., 1M).
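Concretely, something like this (a sketch; note that
dirty_background_ratio takes a percentage, while dirty_background_bytes
is the byte-granularity knob):

    # watch per-device utilization on the OSD nodes
    iostat -t -kx -d 1

    # in ceph.conf on the OSDs
    [osd]
        filestore flusher = false

    # shrink the kernel's background writeback threshold
    echo 1 > /proc/sys/vm/dirty_background_ratio
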

Yehuda

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
       [not found]                         ` <CAC-hyiHSNv_VgLcyVCrJ66HxTGFNBONrmmBddJk5326dLTKgkw@mail.gmail.com>
  2012-02-04  0:04                           ` Yehuda Sadeh Weinraub
@ 2012-02-06 16:20                           ` Jim Schutt
  2012-02-06 17:22                             ` Yehuda Sadeh Weinraub
  1 sibling, 1 reply; 47+ messages in thread
From: Jim Schutt @ 2012-02-06 16:20 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub; +Cc: Gregory Farnum, ceph-devel

On 02/03/2012 05:03 PM, Yehuda Sadeh Weinraub wrote:
> On Fri, Feb 3, 2012 at 3:33 PM, Jim Schutt<jaschut@sandia.gov>  wrote:

>
> You can try running 'iostat -t -kx -d 1' on the osds, and see whether %util
> reaches 100%, and when it happens whether it's due to number of io
> operations that are thrashing, or whether it's due to high amount of data.
> FWIW, you may try setting  'filestore flusher = false', and set
> /proc/sys/vm/dirty_background_ratio' to a small number (e.g., 1M).

Here's some iostat data from early in a run, when things are
running well:


02/02/2012 09:14:13 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           23.24    0.00   61.99    7.38    0.00    7.38

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00    0.00  206.00     0.00   101.57  1009.79    54.80  251.27   4.86 100.10
sdd               0.00     0.00    0.00  202.00     0.00    98.10   994.61    27.85  132.42   4.96 100.10
sde               0.00     4.00    0.00  212.00     0.00   105.09  1015.25    96.06  588.43   4.72 100.10
sdh               0.00     0.00    0.00  200.00     0.00    97.11   994.40    69.77  535.01   5.00 100.10
sdg               0.00     2.00    0.00  221.00     0.00   109.59  1015.60    82.05  298.71   4.53 100.10
sda               0.00     1.00    0.00  212.00     0.00    83.93   810.75    18.26   84.82   4.68  99.30
sdf               0.00     0.00    0.00  208.00     0.00   102.55  1009.73    77.23  383.19   4.50  93.70
sdb               0.00     0.00    0.00  205.00     0.00    98.66   985.68    19.97  133.98   4.84  99.20
sdj               0.00     0.00    0.00  202.00     0.00    99.59  1009.66    69.97  257.47   4.95 100.00
sdk               0.00     0.00    0.00  204.00     0.00    98.10   984.86    20.83  100.34   4.87  99.30
sdm               0.00     0.00    0.00  216.00     0.00   106.55  1010.22    77.73  268.67   4.63 100.00
sdn               0.00     0.00    0.00  205.00     0.00    98.60   985.05    19.33   95.88   4.81  98.60
sdo               0.00     0.00    0.00  232.00     0.00   106.25   937.93    23.26   82.19   4.29  99.50
sdl               0.00     0.00    0.00  181.00     0.00    85.12   963.09    24.73  131.71   4.80  86.80
sdp               0.00     4.00    0.00  207.00     0.00    87.41   864.77    37.01  111.13   4.49  93.00
sdi               0.00     0.00    0.00  208.00     0.00   103.04  1014.54    72.30  263.72   4.70  97.70
sdr               0.00     0.00    0.00  191.00     0.00    76.75   822.95    11.51   83.69   4.59  87.60
sds               0.00     0.00    0.00  209.00     0.00   101.91   998.58    49.95  278.08   4.70  98.20
sdt               0.00     0.00    0.00  209.00     0.00    99.57   975.69    27.31  157.44   4.79 100.10
sdu               0.00     0.00    0.00  216.00     0.00   107.09  1015.41    79.82  345.88   4.63 100.10
sdw               0.00     0.00    0.00  208.00     0.00   103.09  1015.00    74.55  308.15   4.81 100.10
sdv               0.00     0.00    0.00  201.00     0.00    98.05   999.08    76.87  265.88   4.98 100.10
sdx               0.00     0.00    0.00  202.00     0.00   100.50  1018.93   110.40  327.68   4.96 100.10
sdq               0.00     0.00    0.00  228.00     0.00   112.59  1011.30    54.84  281.04   4.39 100.10

02/02/2012 09:14:14 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           22.11    0.00   54.03   15.38    0.00    8.48

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00    0.00  233.00     0.00    99.68   876.15    95.98  384.42   4.29 100.00
sdd               0.00     0.00    0.00  205.00     0.00    96.64   965.46    20.37  108.51   4.84  99.30
sde               0.00     0.00    0.00  225.00     0.00    99.54   906.03    92.38  420.67   4.44 100.00
sdh               0.00     0.00    0.00  198.00     0.00    97.05  1003.84    79.39  410.56   5.05 100.00
sdg               0.00     0.00    0.00  245.00     0.00   108.38   905.99    84.40  385.47   4.08 100.00
sda               0.00     4.00    0.00  220.00     0.00    96.23   895.78    63.24  294.59   4.44  97.60
sdf               0.00     0.00    0.00  216.00     0.00   107.09  1015.41    87.67  399.14   4.57  98.80
sdb               0.00     0.00    0.00  156.00     0.00    72.05   945.95    11.61   58.94   4.84  75.50
sdj               0.00     0.00    0.00  199.00     0.00    95.41   981.95    56.28  366.11   4.84  96.40
sdk               0.00     0.00    0.00  206.00     0.00   100.14   995.57    54.69  241.41   4.86 100.10
sdm               0.00     0.00    0.00  200.00     0.00    99.09  1014.72    79.51  506.47   4.74  94.70
sdn               0.00     0.00    0.00  191.00     0.00    91.29   978.81    26.82  128.39   5.18  98.90
sdo               0.00     0.00    0.00  234.00     0.00   106.75   934.32    49.82  231.07   4.27 100.00
sdl               0.00     0.00    0.00  214.00     0.00   103.62   991.70    33.03  168.13   4.62  98.80
sdp               0.00     0.00    0.00  219.00     0.00   106.08   992.00    64.69  328.92   4.57 100.00
sdi               0.00     0.00    0.00  210.00     0.00   104.09  1015.09   100.98  421.01   4.76 100.00
sdr               0.00     0.00    0.00  180.00     0.00    81.66   929.07    10.31   63.59   5.12  92.20
sds               0.00     0.00    0.00  201.00     0.00    95.15   969.47    32.60  144.16   4.98 100.00
sdt               0.00     0.00    0.00  198.00     0.00    95.72   990.10    33.26  155.98   4.84  95.90
sdu               0.00     0.00    0.00  219.00     0.00   108.59  1015.53    66.10  347.91   4.57 100.00
sdw               0.00     0.00    0.00  204.00     0.00   100.75  1011.41    81.20  456.47   4.80  98.00
sdv               0.00     0.00    0.00  197.00     0.00    96.09   998.90    44.19  284.65   5.08 100.00
sdx               0.00     0.00    0.00  211.00     0.00   104.19  1011.26    84.87  542.85   4.69  99.00
sdq               0.00     0.00    0.00  216.00     0.00   105.10   996.52    36.63  134.40   4.63 100.00


This is later in the same run, when things are not going as well:

02/02/2012 09:21:52 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            5.13    0.00   13.31    8.52    0.00   73.04

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00    0.00   36.00     0.00    16.02   911.11     1.43   39.72   5.64  20.30
sdd               0.00     0.00    0.00   18.00     0.00     8.01   911.11     0.85   47.28   6.39  11.50
sde               0.00     0.00    0.00    4.00     0.00     0.01     6.00     0.08   20.00  13.00   5.20
sdh               0.00     0.00    0.00   20.00     0.00     8.01   820.40     0.65   32.40   5.30  10.60
sdg               0.00     0.00    0.00   19.00     0.00     8.01   863.58     0.60   31.63   4.63   8.80
sda               0.00     0.00    0.00   82.00     0.00    36.04   900.10     3.13   37.05   5.15  42.20
sdf               0.00     0.00    0.00   18.00     0.00     8.01   911.11     0.80   44.22   6.39  11.50
sdb               0.00     8.00    0.00   42.00     0.00     1.75    85.52     0.14    3.43   1.40   5.90
sdj               0.00    16.00    0.00  103.00     0.00    25.64   509.83     2.21   21.36   3.65  37.60
sdk               0.00    14.00    0.00  152.00     0.00    47.93   645.79     3.96   27.31   4.12  62.60
sdm               0.00     0.00    0.00   21.00     0.00     9.39   915.81     0.94   44.57   5.71  12.00
sdn               0.00    34.00    0.00  197.00     0.00    64.61   671.72    28.66   85.62   4.02  79.10
sdo               0.00     0.00    0.00   92.00     0.00    42.54   946.87     6.22   55.58   4.85  44.60
sdl               0.00     0.00    0.00    6.00     0.00     2.01   685.33     0.09   59.67   6.33   3.80
sdp               0.00    10.00    0.00   58.00     0.00     9.56   337.52     1.20   20.60   3.05  17.70
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdr               0.00     0.00    0.00   37.00     0.00    16.02   886.92     1.19   32.27   5.11  18.90
sds               0.00    18.00    0.00  115.00     0.00    26.54   472.70     4.03   25.94   3.20  36.80
sdt               0.00     0.00    0.00  131.00     0.00    60.05   938.87     6.13   46.33   5.11  67.00
sdu               0.00    12.00    0.00  119.00     0.00    31.40   540.44     2.93   24.65   3.05  36.30
sdw               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdv               0.00     4.00    0.00   63.00     0.00     9.46   307.68     0.83   14.32   2.38  15.00
sdx               0.00     0.00    0.00   35.00     0.00    15.51   907.66     0.79   28.20   4.89  17.10
sdq               0.00     0.00    0.00   37.00     0.00    16.02   886.70     1.52   41.00   5.86  21.70

02/02/2012 09:21:53 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            3.74    0.00    8.75    6.60    0.00   80.90

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00   18.00     0.00     8.01   911.11     0.88   48.94   6.83  12.30
sda               0.00     0.00    0.00   45.00     0.00     7.38   335.64     0.54   18.87   1.78   8.00
sdf               0.00     0.00    0.00   18.00     0.00     8.01   911.11     0.93   51.44   6.78  12.20
sdb               0.00     0.00    0.00    5.00     0.00     0.74   302.40     0.05   10.20   8.20   4.10
sdj               0.00     0.00    0.00   72.00     0.00    32.03   911.11     2.51   34.99   5.01  36.10
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdm               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdn               0.00     0.00    0.00  123.00     0.00    52.60   875.84    13.83  209.72   4.84  59.50
sdo               0.00     0.00    0.00   13.00     0.00     5.52   868.92     0.30  108.31   4.69   6.10
sdl               0.00     0.00    0.00   27.00     0.00    12.47   945.78     1.33   47.15   6.59  17.80
sdp               0.00     0.00    0.00   11.00     0.00     4.50   838.55     0.51   14.09   5.09   5.60
sdi               0.00     0.00    0.00   19.00     0.00     8.01   863.58     0.72   38.05   5.74  10.90
sdr               0.00     0.00    0.00   18.00     0.00     8.01   911.11     0.69   38.33   5.89  10.60
sds               0.00     0.00    0.00   56.00     0.00    19.66   718.86     1.31   39.16   5.11  28.60
sdt               0.00     0.00    0.00  161.00     0.00    72.57   923.18     6.97   37.39   5.07  81.70
sdu               0.00     0.00    0.00   66.00     0.00    30.02   931.64     2.77   39.85   5.09  33.60
sdw               0.00     0.00    0.00   20.00     0.00     8.51   871.60     1.47   27.80   4.85   9.70
sdv               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdx               0.00     0.00    0.00   36.00     0.00    16.02   911.11     1.37   38.08   5.72  20.60
sdq               0.00     0.00    0.00   44.00     0.00    19.46   906.00     1.15   26.02   4.50  19.80

And finally, this is still later, near the end of the run, when things have recovered
somewhat:

02/02/2012 09:22:34 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           15.25    0.00   52.27   20.88    0.00   11.60

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     1.00    0.00  217.00     0.00    95.20   898.51    84.43  413.56   4.60  99.90
sdd               0.00     0.00    0.00   40.00     0.00    16.86   863.00     1.59   28.45   5.55  22.20
sde               0.00     0.00    0.00  206.00     0.00    99.27   986.95    89.64  452.92   4.85  99.90
sdh               0.00     0.00    0.00   51.00     0.00    22.53   904.63     2.02   35.45   5.47  27.90
sdg               0.00     0.00    0.00  230.00     0.00   112.49  1001.63    92.87  283.01   4.33  99.60
sda               0.00     0.00    0.00  215.00     0.00   106.10  1010.68    94.45  253.40   4.65  99.90
sdf               0.00     0.00    0.00   73.00     0.00    32.04   898.74     2.20   30.08   5.11  37.30
sdb               0.00     0.00    0.00   92.00     0.00    40.05   891.48     2.55   27.70   4.85  44.60
sdj               0.00    44.00    0.00  280.00     0.00    91.61   670.03   109.32  314.59   3.57  99.90
sdk               0.00     1.00    0.00  210.00     0.00   100.63   981.41    97.79  419.98   4.76  99.90
sdm               0.00    42.00    0.00  282.00     0.00   100.27   728.23    92.86  285.38   3.54  99.90
sdn               0.00     0.00    0.00  213.00     0.00   100.81   969.31    41.62  301.33   4.67  99.40
sdo               0.00    39.00    0.00  306.00     0.00   102.84   688.29    82.44  279.69   3.26  99.70
sdl               0.00     0.00    0.00  219.00     0.00   104.16   974.06    83.05  421.80   4.56  99.90
sdp               0.00    46.00    0.00  277.00     0.00    97.01   717.23   106.44  324.31   3.61  99.90
sdi               0.00     0.00    0.00   56.00     0.00    24.03   878.86     1.73   30.91   5.05  28.30
sdr               0.00    34.00    0.00  266.00     0.00    97.66   751.91    63.86  304.39   3.76 100.00
sds               0.00    18.00    0.00   67.00     0.00    17.41   532.18     1.68   25.03   3.79  25.40
sdt               0.00     0.00    0.00  130.00     0.00    64.01  1008.37    56.33  166.52   4.99  64.90
sdu               0.00     0.00    0.00  197.00     0.00    95.02   987.82    44.70  282.45   4.95  97.60
sdw               0.00     0.00    0.00  207.00     0.00    93.39   923.98    90.21  448.08   4.83  99.90
sdv               0.00     0.00    0.00  204.00     0.00   100.52  1009.14    84.16  425.70   4.85  98.90
sdx               0.00     0.00    0.00  203.00     0.00    88.75   895.33    87.10  475.92   4.92  99.90
sdq               0.00     0.00    0.00   18.00     0.00     8.01   911.11     0.52   28.83   4.83   8.70

02/02/2012 09:22:35 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           14.63    0.00   50.99   22.22    0.00   12.16

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00    0.00  209.00     0.00    99.54   975.35    84.02  409.76   4.78  99.90
sdd               0.00     0.00    0.00   13.00     0.00     5.50   867.08     0.34   57.31   6.23   8.10
sde               0.00     0.00    0.00  204.00     0.00    98.12   985.06    87.28  418.62   4.88  99.50
sdh               0.00     0.00    0.00   78.00     0.00    34.12   895.79     2.15   30.26   5.37  41.90
sdg               0.00     0.00    0.00  226.00     0.00   108.48   983.04    93.54  336.46   4.42  99.80
sda               0.00     0.00    0.00  219.00     0.00   108.07  1010.63    80.90  510.96   4.53  99.20
sdf               0.00     6.00    0.00   81.00     0.00    21.20   535.90     1.99   24.47   3.59  29.10
sdb               0.00     0.00    0.00   71.00     0.00    32.03   923.94     2.46   34.63   4.65  33.00
sdj               0.00     0.00    0.00  192.00     0.00    83.87   894.62    83.33  459.53   5.21 100.10
sdk               0.00    41.00    0.00  285.00     0.00    94.12   676.32   104.34  310.17   3.51 100.10
sdm               0.00     0.00    0.00  202.00     0.00    90.44   916.91    86.45  506.52   4.96 100.10
sdn               0.00     0.00    0.00  208.00     0.00   101.48   999.23    87.79  323.35   4.79  99.70
sdo               0.00     1.00    0.00  228.00     0.00   108.63   975.75    89.79  327.24   4.38  99.80
sdl               0.00    28.00    0.00  270.00     0.00    97.64   740.65    52.06  281.67   3.54  95.60
sdp               0.00     0.00    0.00  195.00     0.00    85.65   899.57    92.28  453.54   5.14 100.20
sdi               0.00    14.00    0.00   31.00     0.00     9.02   595.61     0.96   30.94   4.77  14.80
sdr               0.00     0.00    0.00  192.00     0.00    83.11   886.46    14.22  142.39   5.06  97.10
sds               0.00     0.00    0.00   18.00     0.00     8.01   911.11     0.73   40.39   5.89  10.60
sdt               0.00     0.00    0.00  201.00     0.00    98.66  1005.29    65.87  425.37   4.89  98.30
sdu               0.00     0.00    0.00  209.00     0.00   103.01  1009.38    87.49  285.51   4.74  99.10
sdw               0.00     0.00    0.00  204.00     0.00    96.74   971.22    82.66  410.50   4.89  99.70
sdv               0.00     0.00    0.00  198.00     0.00    96.61   999.23    83.39  420.17   5.03  99.50
sdx               0.00     0.00    0.00  204.00     0.00    98.79   991.80    86.54  428.67   4.90 100.00
sdq               0.00     0.00    0.00   36.00     0.00    16.02   911.11     0.88   24.33   4.44  16.00


The above suggests to me that the slowdown is a result
of requests not getting submitted at the same rate as
when things are running well.

-- Jim

>
> Yehuda
>



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-06 16:20                           ` Jim Schutt
@ 2012-02-06 17:22                             ` Yehuda Sadeh Weinraub
  2012-02-06 18:20                               ` Jim Schutt
  0 siblings, 1 reply; 47+ messages in thread
From: Yehuda Sadeh Weinraub @ 2012-02-06 17:22 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Gregory Farnum, ceph-devel

On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> On 02/03/2012 05:03 PM, Yehuda Sadeh Weinraub wrote:
>>
>> On Fri, Feb 3, 2012 at 3:33 PM, Jim Schutt<jaschut@sandia.gov>  wrote:
>
>
>>
>> You can try running 'iostat -t -kx -d 1' on the osds, and see whether
>> %util
>> reaches 100%, and when it happens whether it's due to number of io
>> operations that are thrashing, or whether it's due to high amount of data.
>> FWIW, you may try setting  'filestore flusher = false', and set
>> /proc/sys/vm/dirty_background_ratio' to a small number (e.g., 1M).
>
>
> Here's some iostat data from early in a run, when things are
> running well:
>
>
> *snip*
>
>
> This is later in the same run, when things are not going as well:
>
> *snip*
>
> And finally, this is still later, near the end of the run, when things have
> recovered
> somewhat:
>
> *snip*
>
>
> The above suggests to me that the slowdown is a result
> of requests not getting submitted at the same rate as
> when things are running well.
>

Yeah, it really looks like that. My suggestions wouldn't help there.

I do see that when things go well the number of writes per device is
capped at ~200 writes per second and the throughput per device is
~100MB/sec. Is 100MB/sec the expected device throughput?
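(For scale: avgrq-sz in the iostat output above is roughly 900-1000
sectors, i.e. about 0.5 MB per request, so ~200 writes/s works out to
roughly 100 MB/s -- the two figures are consistent with each other.)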

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-06 17:22                             ` Yehuda Sadeh Weinraub
@ 2012-02-06 18:20                               ` Jim Schutt
  2012-02-06 18:35                                 ` Gregory Farnum
  0 siblings, 1 reply; 47+ messages in thread
From: Jim Schutt @ 2012-02-06 18:20 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub; +Cc: Gregory Farnum, ceph-devel

On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:
> On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt<jaschut@sandia.gov>  wrote:

>>
>> The above suggests to me that the slowdown is a result
>> of requests not getting submitted at the same rate as
>> when things are running well.
>>
>
> Yeah, it really looks like that. My suggestions wouldn't help there.
>
> I do see that when things go well the number of writes per device is
> capped at ~200 writes per second and the throughput per device is
> ~100MB/sec. Is 100MB/sec the expected device throughput?

Pretty much, at least for the outer tracks on a drive.  I've seen
~108 MB/s with dd to a block device.  Also, I've got 8 drives per
SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
disk subsystem is any sort of significant bottleneck.

-- Jim



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-06 18:20                               ` Jim Schutt
@ 2012-02-06 18:35                                 ` Gregory Farnum
  2012-02-09 20:53                                   ` Jim Schutt
  0 siblings, 1 reply; 47+ messages in thread
From: Gregory Farnum @ 2012-02-06 18:35 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Yehuda Sadeh Weinraub, ceph-devel

On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:
>>
>> On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>
>
>>>
>>> The above suggests to me that the slowdown is a result
>>> of requests not getting submitted at the same rate as
>>> when things are running well.
>>>
>>
>> Yeah, it really looks like that. My suggestions wouldn't help there.
>>
>> I do see that when things go well the number of writes per device is
>> capped at ~200 writes per second and the throughput per device is
>> ~100MB/sec. Is 100MB/sec the expected device throughput?
>
>
> Pretty much, at least for the outer tracks on a drive.  I've seen
> ~108 MB/s with dd to a block device.  Also, I've got 8 drives per
> SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
> disk subsystem is any sort of significant bottleneck.

Well, you might try changing your throttling settings on the OSDs.
ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
and is used to limit the amount of client data in memory (i.e., messages
are included in this throttler for their entire lifetime, not just
while waiting for dispatch).
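As a purely illustrative sketch (the option names are the ones above;
the values are made up for the example, not tuning advice), such
overrides would normally go in the [osd] section of ceph.conf:

  [osd]
      ; default 100 MB (100<<20): message bytes allowed to wait for dispatch
      ms dispatch throttle bytes = 209715200
      ; default 500 MB (500<<20): client data held in memory over a message's lifetime
      osd max client bytes = 1048576000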

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-06 18:35                                 ` Gregory Farnum
@ 2012-02-09 20:53                                   ` Jim Schutt
  2012-02-09 22:40                                     ` sridhar basam
  0 siblings, 1 reply; 47+ messages in thread
From: Jim Schutt @ 2012-02-09 20:53 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Yehuda Sadeh Weinraub, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 8281 bytes --]

On 02/06/2012 11:35 AM, Gregory Farnum wrote:
> On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>> On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:
>>>
>>> On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt<jaschut@sandia.gov>    wrote:
>>
>>
>>>>
>>>> The above suggests to me that the slowdown is a result
>>>> of requests not getting submitted at the same rate as
>>>> when things are running well.
>>>>
>>>
>>> Yeah, it really looks like that. My suggestions wouldn't help there.
>>>
>>> I do see that when things go well the number of writes per device is
>>> capped at ~200 writes per second and the throughput per device is
>>> ~100MB/sec. Is 100MB/sec the expected device throughput?
>>
>>
>> Pretty much, at least for the outer tracks on a drive.  I've seen
>> ~108 MB/s with dd to a block device.  Also, I've got 8 drives per
>> SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
>> disk subsystem is any sort of significant bottleneck.
>
> Well, you might try changing your throttling settings on the OSDs.
> ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
> throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
> and is used to limit the amount of client data in memory (ie; messages
> are included in this throttler for their entire lifetime, not just
> while waiting for dispatch).
>
>

I've made a little progress isolating this.

"osd client message size cap =  5000000" makes the stall
completely reproducible (which also means I can reproduce it
on two different network types, ethernet and IPoIB), and I
am able to generate graphs of throttled/receive/process time
for each request received by an OSD (see attached SVG plot).

Such plots suggest to me that my problem is caused by stalled
receives.  Using debug ms = 30 on my OSDs turns up instances
of this:

osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/25000000
osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader got front 128
osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader read 1369231 of 4194304
osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader read 1448 of 2825073

which I take to mean that the reader thread sat in poll() for 56 secs, in
this case.
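
(A rough sketch of how such gaps can be pulled out of a log with plain
POSIX awk -- it assumes the raw log lines start with the date and time
fields as above, ignores midnight rollover, and in practice you'd also
grep for a single pipe, e.g. 'sd=173', so gaps from different
connections don't get mixed together:)

  grep '\.reader ' osd.0.log | awk '
      { split($2, t, /[:.]/)
        now = t[1]*3600 + t[2]*60 + t[3] + ("0." t[4])
        if (prev && now - prev > 10)
            printf "%.1f sec gap before: %s\n", now - prev, $0
        prev = now }'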

I was able to correlate such stalls with tcpdump output collected on
clients.  Here's an example from another run:

15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.706937 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0
15:09:37.706950 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23854553, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr 1096266], length 31856
15:09:37.706995 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23854553:23886409, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr 1096266], length 31856
15:09:37.929946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1100961 ecr 1096266], length 1448
15:09:38.376961 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1101408 ecr 1096266], length 1448
15:09:39.270947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1102302 ecr 1096266], length 1448
15:09:41.056943 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1104088 ecr 1096266], length 1448
15:09:44.632946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1107664 ecr 1096266], length 1448
15:09:51.784947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1114816 ecr 1096266], length 1448
15:10:06.088945 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1129120 ecr 1096266], length 1448
15:10:34.728951 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1157760 ecr 1096266], length 1448
15:11:31.944946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1096266], length 1448
15:11:31.945075 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976], length 0
15:11:31.945091 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23886409:23889305, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 2896
15:11:31.945178 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976,nop,nop,sack 1 {23886409:23887857}], length 0
15:11:31.945199 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23824145:23825593, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945207 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23825593:23827041, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945214 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23827041:23828489, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448

So in this case the client retransmitted for ~2 minutes with no response from
the OSD.  Note that during this time the client was talking to other OSDs on the
same server.
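
(Incidentally, the spacing of those retransmits -- roughly 0.45, 0.9,
1.8, 3.6, 7, 14, 29, then 57 seconds -- doubles each time, which is just
TCP's exponential backoff on an unacknowledged segment; it doesn't by
itself explain why the receiver stayed silent.)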

I want to try turning off GSO/GRO on my interfaces, but then
I think I need to post to netdev...

-- Jim

[-- Attachment #2: osd.0.msg-et.svg.bz2 --]
[-- Type: application/x-bzip, Size: 22502 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-09 20:53                                   ` Jim Schutt
@ 2012-02-09 22:40                                     ` sridhar basam
  2012-02-09 23:15                                       ` Jim Schutt
  0 siblings, 1 reply; 47+ messages in thread
From: sridhar basam @ 2012-02-09 22:40 UTC (permalink / raw)
  To: ceph-devel

On Thu, Feb 9, 2012 at 3:53 PM, Jim Schutt <jaschut@sandia.gov> wrote:
> On 02/06/2012 11:35 AM, Gregory Farnum wrote:
>>
>> On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>>>
>>> On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:
>>>>
>>>>
>>>> On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt<jaschut@sandia.gov>    wrote:
>>>
>>>
>>>
>>>>>
>>>>> The above suggests to me that the slowdown is a result
>>>>> of requests not getting submitted at the same rate as
>>>>> when things are running well.
>>>>>
>>>>
>>>> Yeah, it really looks like that. My suggestions wouldn't help there.
>>>>
>>>> I do see that when things go well the number of writes per device is
>>>> capped at ~200 writes per second and the throughput per device is
>>>> ~100MB/sec. Is 100MB/sec the expected device throughput?
>>>
>>>
>>>
>>> Pretty much, at least for the outer tracks on a drive.  I've seen
>>> ~108 MB/s with dd to a block device.  Also, I've got 8 drives per
>>> SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
>>> disk subsystem is any sort of significant bottleneck.
>>
>>
>> Well, you might try changing your throttling settings on the OSDs.
>> ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
>> throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
>> and is used to limit the amount of client data in memory (ie; messages
>> are included in this throttler for their entire lifetime, not just
>> while waiting for dispatch).
>>
>>
>
> I've made a little progress isolating this.
>
> "osd client message size cap =  5000000" makes the stall
> completely reproducible (which also means I can reproduce
> on two different network types, ethernet and IPoIB.), and I
> am able to generate graphs of throttled/receive/process time
> for each request received by an OSD (see attached SVG plot).
>
> Such plots suggest to me my problem is caused by stalled
> receives.  Using debug ms = 30 on my OSDs turns up instances
> of this:
>
> osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/25000000
> osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader got front 128
> osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
> osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
> osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader read 1369231 of 4194304
> osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
> osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 --
> 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
> pgs=7 cs=1 l=1).reader read 1448 of 2825073
>
> which I take to mean that the reader thread sat in poll() for 56 secs, in
> this case.
>
> I was able to correlate such stalls with tcpdump output collected on
> clients.  Here's an example from another run:
>
> 15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
> 15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
> 1096144], length 31856
> 15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
> 1096144], length 31856
> 15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
> 15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
> 1096184], length 31856
> 15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq
> 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
> 1096184], length 31856
> 15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
> 15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
> 1096224], length 31856
> 15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
> 1096224], length 31856
> 15:09:37.706937 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0
> 15:09:37.706950 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23854553, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr
> 1096266], length 31856
> 15:09:37.706995 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq
> 23854553:23886409, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr
> 1096266], length 31856
> 15:09:37.929946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1100961 ecr
> 1096266], length 1448
> 15:09:38.376961 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1101408 ecr
> 1096266], length 1448
> 15:09:39.270947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1102302 ecr
> 1096266], length 1448
> 15:09:41.056943 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1104088 ecr
> 1096266], length 1448
> 15:09:44.632946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1107664 ecr
> 1096266], length 1448
> 15:09:51.784947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1114816 ecr
> 1096266], length 1448
> 15:10:06.088945 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1129120 ecr
> 1096266], length 1448
> 15:10:34.728951 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1157760 ecr
> 1096266], length 1448
> 15:11:31.944946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1096266], length 1448
> 15:11:31.945075 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976], length 0
> 15:11:31.945091 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23886409:23889305, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 2896
> 15:11:31.945178 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23824145, win 65535, options [nop,nop,TS val 1210496 ecr
> 1214976,nop,nop,sack 1 {23886409:23887857}], length 0
> 15:11:31.945199 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23824145:23825593, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945207 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23825593:23827041, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945214 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23827041:23828489, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
>
> So in this case the client retransmitted for ~2 minutes with no response
> from
> the OSD.  Note that during this time the client was talking to other OSDs on
> the
> same server.
>
> I want to try turning off GSO/GRO on my interfaces, but then
> I think I need to post to netdev...
>
> -- Jim

The network trace output looks weird: it means either that all of the
packets between 23822697:23886409 were lost, or that there is a bug in
the networking stack. The application should have no effect on the acks
that should have been generated. Even if you assume one or more of the
frames on the wire between 23822697:23886409 were somehow lost, you
would have had to see some sort of duplicate acks with sack segments.

Is this a bunch of bare-metal servers, or are these virtual? If you
could tap the network just upstream of the OSD servers, it would help
narrow down where to look. You could also just try turning off GRO/GSO,
as you suggest, to see if it makes a difference.
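
For reference, a sketch of how that experiment is usually run -- the
interface name below is just a placeholder for whatever actually
carries the OSD traffic:

  # show the current offload settings
  ethtool -k eth2
  # disable the send/receive offloads for the test
  ethtool -K eth2 gso off gro off tso off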

 Sridhar

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-09 22:40                                     ` sridhar basam
@ 2012-02-09 23:15                                       ` Jim Schutt
  2012-02-10  0:34                                         ` Tommi Virtanen
  2012-02-10  1:26                                         ` sridhar basam
  0 siblings, 2 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-09 23:15 UTC (permalink / raw)
  To: sridhar basam; +Cc: ceph-devel

On 02/09/2012 03:40 PM, sridhar basam wrote:
> On Thu, Feb 9, 2012 at 3:53 PM, Jim Schutt<jaschut@sandia.gov>  wrote:
>> On 02/06/2012 11:35 AM, Gregory Farnum wrote:
>>>
>>> On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt<jaschut@sandia.gov>    wrote:
>>>>
>>>> On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:
>>>>>
>>>>>
>>>>> On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt<jaschut@sandia.gov>      wrote:
>>>>
>>>>
>>>>
>>>>>>
>>>>>> The above suggests to me that the slowdown is a result
>>>>>> of requests not getting submitted at the same rate as
>>>>>> when things are running well.
>>>>>>
>>>>>
>>>>> Yeah, it really looks like that. My suggestions wouldn't help there.
>>>>>
>>>>> I do see that when things go well the number of writes per device is
>>>>> capped at ~200 writes per second and the throughput per device is
>>>>> ~100MB/sec. Is 100MB/sec the expected device throughput?
>>>>
>>>>
>>>>
>>>> Pretty much, at least for the outer tracks on a drive.  I've seen
>>>> ~108 MB/s with dd to a block device.  Also, I've got 8 drives per
>>>> SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
>>>> disk subsystem is any sort of significant bottleneck.
>>>
>>>
>>> Well, you might try changing your throttling settings on the OSDs.
>>> ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
>>> throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
>>> and is used to limit the amount of client data in memory (ie; messages
>>> are included in this throttler for their entire lifetime, not just
>>> while waiting for dispatch).
>>>
>>>
>>
>> I've made a little progress isolating this.
>>
>> "osd client message size cap =  5000000" makes the stall
>> completely reproducible (which also means I can reproduce
>> on two different network types, ethernet and IPoIB.), and I
>> am able to generate graphs of throttled/receive/process time
>> for each request received by an OSD (see attached SVG plot).
>>
>> Such plots suggest to me my problem is caused by stalled
>> receives.  Using debug ms = 30 on my OSDs turns up instances
>> of this:
>>
>> osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 --
>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
>> pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/25000000
>> osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 --
>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
>> pgs=7 cs=1 l=1).reader got front 128
>> osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 --
>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
>> pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
>> osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 --
>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
>> pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
>> osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 --
>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
>> pgs=7 cs=1 l=1).reader read 1369231 of 4194304
>> osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 --
>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
>> pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
>> osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 --
>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173
>> pgs=7 cs=1 l=1).reader read 1448 of 2825073
>>
>> which I take to mean that the reader thread sat in poll() for 56 secs, in
>> this case.
>>
>> I was able to correlate such stalls with tcpdump output collected on
>> clients.  Here's an example from another run:
>>
>> 15:09:37.584600 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.], ack
>> 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
>> 15:09:37.584613 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
>> 1096144], length 31856
>> 15:09:37.584655 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
>> 1096144], length 31856
>> 15:09:37.624476 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.], ack
>> 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
>> 15:09:37.624489 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
>> 1096184], length 31856
>> 15:09:37.624532 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [P.], seq
>> 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
>> 1096184], length 31856
>> 15:09:37.664454 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.], ack
>> 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
>> 15:09:37.664468 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
>> 1096224], length 31856
>> 15:09:37.664506 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
>> 1096224], length 31856
>> 15:09:37.706937 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.], ack
>> 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0
>> 15:09:37.706950 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23822697:23854553, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr
>> 1096266], length 31856
>> 15:09:37.706995 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [P.], seq
>> 23854553:23886409, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr
>> 1096266], length 31856
>> 15:09:37.929946 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1100961 ecr
>> 1096266], length 1448
>> 15:09:38.376961 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1101408 ecr
>> 1096266], length 1448
>> 15:09:39.270947 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1102302 ecr
>> 1096266], length 1448
>> 15:09:41.056943 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1104088 ecr
>> 1096266], length 1448
>> 15:09:44.632946 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1107664 ecr
>> 1096266], length 1448
>> 15:09:51.784947 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1114816 ecr
>> 1096266], length 1448
>> 15:10:06.088945 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1129120 ecr
>> 1096266], length 1448
>> 15:10:34.728951 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1157760 ecr
>> 1096266], length 1448
>> 15:11:31.944946 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
>> 1096266], length 1448
>> 15:11:31.945075 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.], ack
>> 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976], length 0
>> 15:11:31.945091 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23886409:23889305, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
>> 1210496], length 2896
>> 15:11:31.945178 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.], ack
>> 23824145, win 65535, options [nop,nop,TS val 1210496 ecr
>> 1214976,nop,nop,sack 1 {23886409:23887857}], length 0
>> 15:11:31.945199 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23824145:23825593, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
>> 1210496], length 1448
>> 15:11:31.945207 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23825593:23827041, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
>> 1210496], length 1448
>> 15:11:31.945214 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.], seq
>> 23827041:23828489, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
>> 1210496], length 1448
>>
>> So in this case the client retransmitted for ~2 minutes with no response
>> from
>> the OSD.  Note that during this time the client was talking to other OSDs on
>> the
>> same server.
>>
>> I want to try turning off GSO/GRO on my interfaces, but then
>> I think I need to post to netdev...
>>
>> -- Jim
>
> The network trace output looks weird, it either means all of the
> packets between 23822697:23886409 were lost or a bug in the networking
> stack.

I suspect a bug in the stack, as at an application level I get
the same sort of stalls whether I use IP over ethernet or IPoIB.
I need to get traces for both cases to prove that it is the same
stall...

> The application should have no effect on the acks that should
> have been generated. Even if you assume one or more of the frames on
> the wire between 23822697:23886409 were somehow lost, you would have
> had to see some sort of duplicate acks with sack segments.

See the full trace of the recovery below....
>
> Is this a bunch of bare metal servers or are these virtual?

Bare metal.

> If you
> could tap the network just upstream of the OSD servers, it would help
> narrow down where to look at. You could also just turning off GRO/GSO,
> as you suggest, to see if it makes a difference.

Turning off GRO/GSO made no difference to application level behavior.
Tomorrow I'll collect traces to see what happened.

Thanks for taking a look.

-- Jim

Here's the trace:

15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.706937 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0
15:09:37.706950 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23854553, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr 1096266], length 31856
15:09:37.706995 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23854553:23886409, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr 1096266], length 31856
15:09:37.929946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1100961 ecr 1096266], length 1448
15:09:38.376961 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1101408 ecr 1096266], length 1448
15:09:39.270947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1102302 ecr 1096266], length 1448
15:09:41.056943 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1104088 ecr 1096266], length 1448
15:09:44.632946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1107664 ecr 1096266], length 1448
15:09:51.784947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1114816 ecr 1096266], length 1448
15:10:06.088945 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1129120 ecr 1096266], length 1448
15:10:34.728951 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1157760 ecr 1096266], length 1448
15:11:31.944946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1096266], length 1448
15:11:31.945075 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976], length 0
15:11:31.945091 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23886409:23889305, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 2896
15:11:31.945178 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976,nop,nop,sack 1 {23886409:23887857}], length 0
15:11:31.945199 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23824145:23825593, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945207 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23825593:23827041, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945214 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23827041:23828489, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945225 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945325 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23828489, win 65535, options [nop,nop,TS val 1210496 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945338 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23828489:23829937, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945346 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23829937:23831385, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945352 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23831385:23832833, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945475 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23831385, win 65535, options [nop,nop,TS val 1210496 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945485 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23832833:23834281, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945491 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23834281:23835729, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945499 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23835729:23837177, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945508 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23832833, win 65535, options [nop,nop,TS val 1210496 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945515 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23837177:23838625, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945522 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23838625:23840073, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945582 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23834281, win 65535, options [nop,nop,TS val 1210496 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945592 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23840073:23841521, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945601 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23841521:23842969, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210496], length 1448
15:11:31.945611 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23837177, win 65535, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945618 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23842969:23844417, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945624 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23844417:23845865, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945631 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23845865:23847313, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945639 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23838625, win 65535, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945646 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23847313:23848761, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945653 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23848761:23850209, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945661 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23840073, win 65160, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945667 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23850209:23851657, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945674 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23851657:23853105, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945733 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23842969, win 62264, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945743 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23853105:23854553, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945750 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23854553:23856001, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945756 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23856001:23857449, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945765 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23844417, win 60816, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945772 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23857449:23858897, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945779 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23858897:23860345, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945787 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23848761, win 56472, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945795 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23860345:23861793, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945802 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23861793:23863241, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945808 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23863241:23864689, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945815 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23850209, win 55024, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945822 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23864689:23866137, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945828 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23866137:23867585, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945837 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23851657, win 53576, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945844 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23867585:23869033, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945850 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23869033:23870481, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945858 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23853105, win 52128, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945865 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23870481:23871929, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945872 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23871929:23873377, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945880 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23854553, win 50680, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945894 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23873377:23874825, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945901 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23874825:23876273, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945909 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23856001, win 49232, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945915 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23876273:23877721, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945922 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23877721:23879169, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr 1210497], length 1448
15:11:31.945931 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23857449, win 47784, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945937 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23879169:23880617, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 1448
15:11:31.945943 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23880617:23882065, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 1448
15:11:31.945951 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23858897, win 46336, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945956 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23882065:23883513, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 1448
15:11:31.945963 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23883513:23884961, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 1448
15:11:31.945971 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23860345, win 44888, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945977 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23884961:23886409, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 1448
15:11:31.945984 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23889305:23890753, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 1448
15:11:31.945991 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23861793, win 43440, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.945999 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23890753:23893649, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 2896
15:11:31.946009 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23863241, win 41992, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946017 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23893649:23896545, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 2896
15:11:31.946025 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23864689, win 40544, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946032 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23896545:23899441, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 2896
15:11:31.946040 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23866137, win 39096, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946046 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23899441:23902337, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 2896
15:11:31.946056 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23867585, win 37648, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946063 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23902337:23905233, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 2896
15:11:31.946076 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23869033, win 36200, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946082 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23870481, win 34752, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946089 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23871929, win 33304, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946095 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23876273, win 28960, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946101 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23879169, win 26064, options [nop,nop,TS val 1210497 ecr 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946106 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23880617, win 24616, options [nop,nop,TS val 1210497 ecr 1214977,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946113 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23882065, win 23168, options [nop,nop,TS val 1210497 ecr 1214977,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946118 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23883513, win 21720, options [nop,nop,TS val 1210497 ecr 1214977,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946123 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23884961, win 20272, options [nop,nop,TS val 1210497 ecr 1214977,nop,nop,sack 1 {23886409:23889305}], length 0
15:11:31.946207 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23893649, win 11584, options [nop,nop,TS val 1210497 ecr 1214977], length 0
15:11:31.946217 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23900889, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
15:11:31.946225 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23903785, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
15:11:31.946230 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23905233:23909577, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 4344
15:11:31.946425 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23906681, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
15:11:31.946434 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23909577:23912473, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 2896
15:11:31.946444 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23909577, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
15:11:31.946449 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23912473:23915369, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 2896
15:11:31.946676 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23912473, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
15:11:31.946684 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23915369:23918265, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 2896
15:11:31.946694 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23915369, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
15:11:31.946700 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23918265:23921161, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210497], length 2896
15:11:31.946855 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23921161, win 5792, options [nop,nop,TS val 1210498 ecr 1214977], length 0
15:11:31.946864 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23921161:23926953, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr 1210498], length 5792
15:11:31.947007 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23926953, win 5792, options [nop,nop,TS val 1210498 ecr 1214977], length 0

>
>   Sridhar



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-09 23:15                                       ` Jim Schutt
@ 2012-02-10  0:34                                         ` Tommi Virtanen
  2012-02-10  1:26                                         ` sridhar basam
  1 sibling, 0 replies; 47+ messages in thread
From: Tommi Virtanen @ 2012-02-10  0:34 UTC (permalink / raw)
  To: Jim Schutt; +Cc: sridhar basam, ceph-devel

On Thu, Feb 9, 2012 at 15:15, Jim Schutt <jaschut@sandia.gov> wrote:
> I suspect a bug in the stack, as at an application level I get
> the same sort of stalls whether I use IP over ethernet or IPoIB.
> I need to get traces for both cases to prove that it is the same
> stall...

Hi. I just wanted to confirm that what your tcpdump shows is packet
loss between the client and the OSD. If this weren't packet loss, you'd
expect the TCP window size to drop to 0 -- not remain at 64k as it is
in your dump -- and you'd expect to see ACKs that don't advance the
sequence number. Something like this:

16:25:16.914407 IP 127.0.0.1.60336 > 127.0.0.1.9999: Flags [P.], seq
90150:93094, ack 1, win 257, options [nop,nop,TS val 3732293 ecr
3732270], length 2944
16:25:16.914416 IP 127.0.0.1.9999 > 127.0.0.1.60336: Flags [.], ack
93094, win 0, options [nop,nop,TS val 3732293 ecr 3732293], length 0
16:25:17.144409 IP 127.0.0.1.60336 > 127.0.0.1.9999: Flags [.], ack 1,
win 257, options [nop,nop,TS val 3732316 ecr 3732293], length 0
16:25:17.144421 IP 127.0.0.1.9999 > 127.0.0.1.60336: Flags [.], ack
93094, win 0, options [nop,nop,TS val 3732316 ecr 3732293], length 0
16:25:17.604409 IP 127.0.0.1.60336 > 127.0.0.1.9999: Flags [.], ack 1,
win 257, options [nop,nop,TS val 3732362 ecr 3732316], length 0
16:25:17.604419 IP 127.0.0.1.9999 > 127.0.0.1.60336: Flags [.], ack
93094, win 0, options [nop,nop,TS val 3732362 ecr 3732293], length 0
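
If it helps with post-processing, here's a rough sketch of how raw
tcpdump text like the traces in this thread could be scanned to
separate the two cases -- the same bytes retransmitted over and over
(likely loss) versus the peer advertising a zero window (flow
control). It expects tcpdump text output, one packet per line, as its
first argument -- that format is an assumption on my part, not taken
from your captures:

#!/usr/bin/env python
# Rough sketch: classify stalls in a tcpdump text capture as likely
# packet loss (same data segment sent repeatedly) or receiver flow
# control (peer advertising a zero window).
import collections
import re
import sys

line_re = re.compile(
    r'^(?P<ts>\S+) IP (?P<src>\S+) > (?P<dst>\S+): '
    r'Flags \[[^\]]*\], (?:seq (?P<lo>\d+):(?P<hi>\d+), )?'
    r'ack \d+, win (?P<win>\d+)')

sent = collections.defaultdict(list)     # (src, dst, seq range) -> timestamps
zero_win = collections.defaultdict(int)  # (src, dst) -> zero-window acks seen

for line in open(sys.argv[1]):
    m = line_re.match(line)
    if not m:
        continue
    if m.group('lo'):                    # a data segment
        key = (m.group('src'), m.group('dst'), m.group('lo'), m.group('hi'))
        sent[key].append(m.group('ts'))
    elif m.group('win') == '0':          # a pure window-closed ack
        zero_win[(m.group('src'), m.group('dst'))] += 1

for (src, dst, lo, hi), times in sent.items():
    if len(times) > 1:
        print('likely loss: %s > %s seq %s:%s sent %d times (%s .. %s)'
              % (src, dst, lo, hi, len(times), times[0], times[-1]))
for (src, dst), n in zero_win.items():
    print('flow control: %s advertised a zero window to %s %d times'
          % (src, dst, n))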

As pointed out by Sridhar, various TCP offload mechanisms (and
firewalling!) may make your tcpdump not see the underlying reality.

You might also actually be losing packets, and the OSD settings
might, perhaps, influence the performance of the machine enough to
make it lose packets -- though that sounds a bit far-fetched.

You might also be suffering from a Path MTU Discovery black hole, and
need the osd size cap to get full-frame packets out. I see your
tcpdump indicated jumbo frames (at least until the TSO engine!), which
might be its own source of pain.
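
A quick check for a PMTU black hole is a don't-fragment ping at full
jumbo size from the client to the OSD host -- 8972 here assumes a
9000-byte MTU minus 28 bytes of IP/ICMP header, and the address is
just the OSD from your trace:

  ping -M do -s 8972 -c 3 172.17.131.32

If the big ping times out with no "Frag needed" error while a small
ping works, that's the black-hole signature.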

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-09 23:15                                       ` Jim Schutt
  2012-02-10  0:34                                         ` Tommi Virtanen
@ 2012-02-10  1:26                                         ` sridhar basam
  2012-02-10 15:32                                           ` [EXTERNAL] " Jim Schutt
  1 sibling, 1 reply; 47+ messages in thread
From: sridhar basam @ 2012-02-10  1:26 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Thu, Feb 9, 2012 at 6:15 PM, Jim Schutt <jaschut@sandia.gov> wrote:
> On 02/09/2012 03:40 PM, sridhar basam wrote:
>>
>> On Thu, Feb 9, 2012 at 3:53 PM, Jim Schutt<jaschut@sandia.gov>  wrote:
>>>
>>> On 02/06/2012 11:35 AM, Gregory Farnum wrote:
>>>>
>>>>
>>>> On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt<jaschut@sandia.gov>
>>>>  wrote:
>>>>>
>>>>>
>>>>> On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt<jaschut@sandia.gov>
>>>>>>  wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>> The above suggests to me that the slowdown is a result
>>>>>>> of requests not getting submitted at the same rate as
>>>>>>> when things are running well.
>>>>>>>
>>>>>>
>>>>>> Yeah, it really looks like that. My suggestions wouldn't help there.
>>>>>>
>>>>>> I do see that when things go well the number of writes per device is
>>>>>> capped at ~200 writes per second and the throughput per device is
>>>>>> ~100MB/sec. Is 100MB/sec the expected device throughput?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Pretty much, at least for the outer tracks on a drive.  I've seen
>>>>> ~108 MB/s with dd to a block device.  Also, I've got 8 drives per
>>>>> SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
>>>>> disk subsystem is any sort of significant bottleneck.
>>>>
>>>>
>>>>
>>>> Well, you might try changing your throttling settings on the OSDs.
>>>> ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
>>>> throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
>>>> and is used to limit the amount of client data in memory (ie; messages
>>>> are included in this throttler for their entire lifetime, not just
>>>> while waiting for dispatch).
>>>>
>>>>
>>>
>>> I've made a little progress isolating this.
>>>
>>> "osd client message size cap =  5000000" makes the stall
>>> completely reproducible (which also means I can reproduce
>>> on two different network types, ethernet and IPoIB.), and I
>>> am able to generate graphs of throttled/receive/process time
>>> for each request received by an OSD (see attached SVG plot).
>>>
>>> Such plots suggest to me my problem is caused by stalled
>>> receives.  Using debug ms = 30 on my OSDs turns up instances
>>> of this:
>>>
>>> osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 --
>>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000
>>> sd=173
>>> pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/25000000
>>> osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 --
>>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000
>>> sd=173
>>> pgs=7 cs=1 l=1).reader got front 128
>>> osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 --
>>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000
>>> sd=173
>>> pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
>>> osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 --
>>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000
>>> sd=173
>>> pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
>>> osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 --
>>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000
>>> sd=173
>>> pgs=7 cs=1 l=1).reader read 1369231 of 4194304
>>> osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 --
>>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000
>>> sd=173
>>> pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
>>> osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 --
>>> 172.17.131.32:6800/15199>>  172.17.135.85:0/2712733083 pipe(0x2ef0000
>>> sd=173
>>> pgs=7 cs=1 l=1).reader read 1448 of 2825073
>>>
>>> which I take to mean that the reader thread sat in poll() for 56 secs, in
>>> this case.
>>>
>>> I was able to correlate such stalls with tcpdump output collected on
>>> clients.  Here's an example from another run:
>>>
>>> 15:09:37.584600 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.],
>>> ack
>>> 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length
>>> 0
>>> 15:09:37.584613 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615
>>> ecr
>>> 1096144], length 31856
>>> 15:09:37.584655 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615
>>> ecr
>>> 1096144], length 31856
>>> 15:09:37.624476 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.],
>>> ack
>>> 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length
>>> 0
>>> 15:09:37.624489 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655
>>> ecr
>>> 1096184], length 31856
>>> 15:09:37.624532 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [P.],
>>> seq
>>> 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655
>>> ecr
>>> 1096184], length 31856
>>> 15:09:37.664454 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.],
>>> ack
>>> 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length
>>> 0
>>> 15:09:37.664468 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695
>>> ecr
>>> 1096224], length 31856
>>> 15:09:37.664506 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695
>>> ecr
>>> 1096224], length 31856
>>> 15:09:37.706937 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.],
>>> ack
>>> 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length
>>> 0
>>> 15:09:37.706950 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23822697:23854553, ack 1218, win 20904, options [nop,nop,TS val 1100738
>>> ecr
>>> 1096266], length 31856
>>> 15:09:37.706995 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [P.],
>>> seq
>>> 23854553:23886409, ack 1218, win 20904, options [nop,nop,TS val 1100738
>>> ecr
>>> 1096266], length 31856
>>> 15:09:37.929946 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1100961
>>> ecr
>>> 1096266], length 1448
>>> 15:09:38.376961 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1101408
>>> ecr
>>> 1096266], length 1448
>>> 15:09:39.270947 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1102302
>>> ecr
>>> 1096266], length 1448
>>> 15:09:41.056943 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1104088
>>> ecr
>>> 1096266], length 1448
>>> 15:09:44.632946 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1107664
>>> ecr
>>> 1096266], length 1448
>>> 15:09:51.784947 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1114816
>>> ecr
>>> 1096266], length 1448
>>> 15:10:06.088945 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1129120
>>> ecr
>>> 1096266], length 1448
>>> 15:10:34.728951 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1157760
>>> ecr
>>> 1096266], length 1448
>>> 15:11:31.944946 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1214976
>>> ecr
>>> 1096266], length 1448
>>> 15:11:31.945075 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.],
>>> ack
>>> 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976], length
>>> 0
>>> 15:11:31.945091 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23886409:23889305, ack 1218, win 20904, options [nop,nop,TS val 1214976
>>> ecr
>>> 1210496], length 2896
>>> 15:11:31.945178 IP 172.17.131.32.6808>  172.17.135.7.37045: Flags [.],
>>> ack
>>> 23824145, win 65535, options [nop,nop,TS val 1210496 ecr
>>> 1214976,nop,nop,sack 1 {23886409:23887857}], length 0
>>> 15:11:31.945199 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23824145:23825593, ack 1218, win 20904, options [nop,nop,TS val 1214976
>>> ecr
>>> 1210496], length 1448
>>> 15:11:31.945207 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23825593:23827041, ack 1218, win 20904, options [nop,nop,TS val 1214976
>>> ecr
>>> 1210496], length 1448
>>> 15:11:31.945214 IP 172.17.135.7.37045>  172.17.131.32.6808: Flags [.],
>>> seq
>>> 23827041:23828489, ack 1218, win 20904, options [nop,nop,TS val 1214976
>>> ecr
>>> 1210496], length 1448
>>>
>>> So in this case the client retransmitted for ~2 minutes with no response
>>> from
>>> the OSD.  Note that during this time the client was talking to other OSDs
>>> on
>>> the
>>> same server.
>>>
>>> I want to try turning off GSO/GRO on my interfaces, but then
>>> I think I need to post to netdev...
>>>
>>> -- Jim
>>
>>
>> The network trace output looks weird, it either means all of the
>> packets between 23822697:23886409 were lost or a bug in the networking
>> stack.
>
>
> I suspect a bug in the stack, as at an application level I get
> the same sort of stalls whether I use IP over ethernet or IPoIB.
> I need to get traces for both cases to prove that it is the same
> stall...
>
>
>> The application should have no effect on the acks that should
>> have been generated. Even if you assume one or more of the frames on
>> the wire between 23822697:23886409 were somehow lost, you would have
>> had to see some sort of duplicate acks with sack segments.
>
>
> See the full trace of the recovery below....
>
>>
>> Is this a bunch of bare metal servers or are these virtual?
>
>
> Bare metal.
>
>
>> If you
>> could tap the network just upstream of the OSD servers, it would help
>> narrow down where to look at. You could also just turning off GRO/GSO,
>> as you suggest, to see if it makes a difference.
>
>
> Turning off GRO/GSO made no difference to application level behavior.
> Tomorrow I'll collect traces to see what happened.
>
> Thanks for taking a look.
>
> -- Jim
>
> Here's the trace:

Do you mind capturing to a pcap file and providing that? It makes it
easier to analyse things. If not, I understand. If you can do the
capture on both ends, do it with a snaplen of 68 so that you get all
of the headers and there shouldn't be too much payload information in
the file.
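
Something along these lines on the client should do it (the interface
name is just a placeholder; the filter matches the OSD address and
port from your traces):

  tcpdump -s 68 -i eth0 -w client-osd.pcap host 172.17.131.32 and port 6808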

I will take a look at the additional output and see if anything pops
out. I am assuming the below output was immediately after what you
posted in your earlier email.


 Sridhar

>
>
> 15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
> 15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
> 1096144], length 31856
> 15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
> 1096144], length 31856
> 15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
> 15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
> 1096184], length 31856
> 15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq
> 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
> 1096184], length 31856
> 15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
> 15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
> 1096224], length 31856
> 15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
> 1096224], length 31856
> 15:09:37.706937 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0
> 15:09:37.706950 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23854553, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr
> 1096266], length 31856
> 15:09:37.706995 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq
> 23854553:23886409, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr
> 1096266], length 31856
> 15:09:37.929946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1100961 ecr
> 1096266], length 1448
> 15:09:38.376961 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1101408 ecr
> 1096266], length 1448
> 15:09:39.270947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1102302 ecr
> 1096266], length 1448
> 15:09:41.056943 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1104088 ecr
> 1096266], length 1448
> 15:09:44.632946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1107664 ecr
> 1096266], length 1448
> 15:09:51.784947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1114816 ecr
> 1096266], length 1448
> 15:10:06.088945 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1129120 ecr
> 1096266], length 1448
> 15:10:34.728951 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1157760 ecr
> 1096266], length 1448
> 15:11:31.944946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1096266], length 1448
> 15:11:31.945075 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976], length 0
> 15:11:31.945091 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23886409:23889305, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 2896
> 15:11:31.945178 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23824145, win 65535, options [nop,nop,TS val 1210496 ecr
> 1214976,nop,nop,sack 1 {23886409:23887857}], length 0
> 15:11:31.945199 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23824145:23825593, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945207 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23825593:23827041, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945214 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23827041:23828489, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945225 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23824145, win 65535, options [nop,nop,TS val 1210496 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945325 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23828489, win 65535, options [nop,nop,TS val 1210496 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945338 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23828489:23829937, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945346 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23829937:23831385, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945352 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23831385:23832833, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945475 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23831385, win 65535, options [nop,nop,TS val 1210496 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945485 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23832833:23834281, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945491 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23834281:23835729, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945499 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23835729:23837177, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945508 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23832833, win 65535, options [nop,nop,TS val 1210496 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945515 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23837177:23838625, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945522 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23838625:23840073, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945582 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23834281, win 65535, options [nop,nop,TS val 1210496 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945592 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23840073:23841521, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945601 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23841521:23842969, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210496], length 1448
> 15:11:31.945611 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23837177, win 65535, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945618 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23842969:23844417, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945624 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23844417:23845865, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945631 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23845865:23847313, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945639 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23838625, win 65535, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945646 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23847313:23848761, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945653 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23848761:23850209, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945661 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23840073, win 65160, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945667 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23850209:23851657, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945674 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23851657:23853105, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945733 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23842969, win 62264, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945743 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23853105:23854553, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945750 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23854553:23856001, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945756 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23856001:23857449, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945765 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23844417, win 60816, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945772 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23857449:23858897, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945779 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23858897:23860345, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945787 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23848761, win 56472, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945795 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23860345:23861793, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945802 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23861793:23863241, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945808 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23863241:23864689, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945815 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23850209, win 55024, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945822 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23864689:23866137, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945828 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23866137:23867585, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945837 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23851657, win 53576, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945844 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23867585:23869033, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945850 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23869033:23870481, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945858 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23853105, win 52128, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945865 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23870481:23871929, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945872 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23871929:23873377, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945880 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23854553, win 50680, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945894 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23873377:23874825, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945901 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23874825:23876273, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945909 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23856001, win 49232, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945915 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23876273:23877721, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945922 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23877721:23879169, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
> 1210497], length 1448
> 15:11:31.945931 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23857449, win 47784, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945937 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23879169:23880617, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 1448
> 15:11:31.945943 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23880617:23882065, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 1448
> 15:11:31.945951 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23858897, win 46336, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945956 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23882065:23883513, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 1448
> 15:11:31.945963 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23883513:23884961, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 1448
> 15:11:31.945971 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23860345, win 44888, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945977 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq
> 23884961:23886409, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 1448
> 15:11:31.945984 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23889305:23890753, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 1448
> 15:11:31.945991 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23861793, win 43440, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.945999 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23890753:23893649, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 2896
> 15:11:31.946009 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23863241, win 41992, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946017 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23893649:23896545, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 2896
> 15:11:31.946025 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23864689, win 40544, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946032 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23896545:23899441, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 2896
> 15:11:31.946040 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23866137, win 39096, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946046 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23899441:23902337, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 2896
> 15:11:31.946056 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23867585, win 37648, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946063 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23902337:23905233, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 2896
> 15:11:31.946076 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23869033, win 36200, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946082 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23870481, win 34752, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946089 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23871929, win 33304, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946095 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23876273, win 28960, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946101 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23879169, win 26064, options [nop,nop,TS val 1210497 ecr
> 1214976,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946106 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23880617, win 24616, options [nop,nop,TS val 1210497 ecr
> 1214977,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946113 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23882065, win 23168, options [nop,nop,TS val 1210497 ecr
> 1214977,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946118 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23883513, win 21720, options [nop,nop,TS val 1210497 ecr
> 1214977,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946123 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23884961, win 20272, options [nop,nop,TS val 1210497 ecr
> 1214977,nop,nop,sack 1 {23886409:23889305}], length 0
> 15:11:31.946207 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23893649, win 11584, options [nop,nop,TS val 1210497 ecr 1214977], length 0
> 15:11:31.946217 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23900889, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
> 15:11:31.946225 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23903785, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
> 15:11:31.946230 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23905233:23909577, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 4344
> 15:11:31.946425 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23906681, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
> 15:11:31.946434 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23909577:23912473, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 2896
> 15:11:31.946444 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23909577, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
> 15:11:31.946449 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23912473:23915369, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 2896
> 15:11:31.946676 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23912473, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
> 15:11:31.946684 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23915369:23918265, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 2896
> 15:11:31.946694 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23915369, win 5792, options [nop,nop,TS val 1210497 ecr 1214977], length 0
> 15:11:31.946700 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23918265:23921161, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210497], length 2896
> 15:11:31.946855 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23921161, win 5792, options [nop,nop,TS val 1210498 ecr 1214977], length 0
> 15:11:31.946864 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
> 23921161:23926953, ack 1218, win 20904, options [nop,nop,TS val 1214977 ecr
> 1210498], length 5792
> 15:11:31.947007 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
> 23926953, win 5792, options [nop,nop,TS val 1210498 ecr 1214977], length 0
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-10  1:26                                         ` sridhar basam
@ 2012-02-10 15:32                                           ` Jim Schutt
  2012-02-10 17:13                                             ` sridhar basam
  0 siblings, 1 reply; 47+ messages in thread
From: Jim Schutt @ 2012-02-10 15:32 UTC (permalink / raw)
  To: sridhar basam; +Cc: ceph-devel

On 02/09/2012 06:26 PM, sridhar basam wrote:
> Do you mind capturing to a pcap file and providing that. Makes it
> easier to analyse things. If not, i understand. If you can make do the
> capture on both ends, do it with a snaplen of 68 so that you get all
> of the headers and there shouldn't be too much payload information in
> the file.

I've got a pcap file for this run for this client, with snaplen 128
(I thought I might need to look for ceph message headers).  It's 13 MB
compressed.  How shall I get it to you?

In the meantime, I'll try to capture this from both sides.

>
> I will take a look at the additional output and see if anything pops
> out. I am assuming the below output was immediately after what you
> posted in your earlier email.

Yes.

Thanks -- Jim

>
>
>   Sridhar
>



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-10 15:32                                           ` [EXTERNAL] " Jim Schutt
@ 2012-02-10 17:13                                             ` sridhar basam
  2012-02-10 23:09                                               ` Jim Schutt
  0 siblings, 1 reply; 47+ messages in thread
From: sridhar basam @ 2012-02-10 17:13 UTC (permalink / raw)
  To: Jim Schutt, ceph-devel

On Fri, Feb 10, 2012 at 10:32 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> On 02/09/2012 06:26 PM, sridhar basam wrote:
>>
>> Do you mind capturing to a pcap file and providing that. Makes it
>> easier to analyse things. If not, i understand. If you can make do the
>> capture on both ends, do it with a snaplen of 68 so that you get all
>> of the headers and there shouldn't be too much payload information in
>> the file.
>
>
> I've got a pcap file for this run for this client, with snaplen 128
> (I thought I might need to look for ceph message headers).  It's 13 MB
> compressed.  How shall I get it to you?
>

Can I grab it off some webserver you control? Or you could temporarily
drop it into docs.google.com and add access for my email account.

> In the meantime, I'll try to capture this from both sides.
>
>
>>
>> I will take a look at the additional output and see if anything pops
>> out. I am assuming the below output was immediately after what you
>> posted in your earlier email.

I don't see anything out of the ordinary once things recover; the
sender even starts to do TSO after a short while.

 Sridhar


>
>
> Yes.
>
> Thanks -- Jim
>
>
>>
>>
>>  Sridhar
>>
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-10 17:13                                             ` sridhar basam
@ 2012-02-10 23:09                                               ` Jim Schutt
  2012-02-11  0:05                                                 ` sridhar basam
  0 siblings, 1 reply; 47+ messages in thread
From: Jim Schutt @ 2012-02-10 23:09 UTC (permalink / raw)
  To: sridhar basam; +Cc: ceph-devel, netdev


[ added Cc:netdev
   See http://www.spinics.net/lists/ceph-devel/msg04804.html
   for the start of the thread.
   -- Jim
]

On 02/10/2012 10:13 AM, sridhar basam wrote:
> On Fri, Feb 10, 2012 at 10:32 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>> On 02/09/2012 06:26 PM, sridhar basam wrote:
>>>
>>> Do you mind capturing to a pcap file and providing that. Makes it
>>> easier to analyse things. If not, i understand. If you can make do the
>>> capture on both ends, do it with a snaplen of 68 so that you get all
>>> of the headers and there shouldn't be too much payload information in
>>> the file.
>>
>>
>> I've got a pcap file for this run for this client, with snaplen 128
>> (I thought I might need to look for ceph message headers).  It's 13 MB
>> compressed.  How shall I get it to you?
>>
>
> Can i grab it off some webserver you control? Or you can temporarily
> drop it into docs.google.com and add accesss from my email account?

I tabled this for the moment while I worked on collecting
packet traces from both ends.  But you'll probably want
to see the pcap files for what I'm about to show.  Also,
I think I need to add netdev to this discussion ...

>
>> In the meantime, I'll try to capture this from both sides.

Here's another example, captured from both sides, with
TSO/GSO/GRO all off, snaplen 68.
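
For reference, offloads like these are typically toggled per interface
with ethtool, along these lines (eth0 is just a placeholder for the
actual device):

  ethtool -K eth0 tso off gso off gro off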

This was captured from the client side.  Same pattern, in
that the client sends many retransmits over a period of
a couple minutes.  It's different in that the client
seems to give up and reconnect ...

11:57:35.984024 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18109890, win 5792, options [nop,nop,TS[|tcp]>
11:57:35.984032 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18112786:18114234, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:35.984038 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18114234:18115682, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:35.984120 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18112786, win 5792, options [nop,nop,TS[|tcp]>
11:57:35.984129 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18115682:18117130, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:35.984135 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18117130:18118578, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:35.984143 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18115682, win 5792, options [nop,nop,TS[|tcp]>
11:57:35.984148 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18118578:18120026, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:35.984153 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:35.984270 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18118578, win 5792, options [nop,nop,TS[|tcp]>
11:57:35.984278 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18121474:18122922, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:35.984283 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18122922:18124370, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:35.984420 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18120026, win 5792, options [nop,nop,TS[|tcp]>
11:57:35.984428 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18124370:18125818, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.184945 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.587936 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:37.393937 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:39.003937 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:42.227933 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:48.675931 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:58:01.555935 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:58:27.347945 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:59:18.867935 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
12:00:22.673029 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [P.], seq 863:1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp]>
12:00:22.712933 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:01:02.035951 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:03:02.355941 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:05:02.675947 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:07:02.995943 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:09:03.315942 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:11:03.635948 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:13:03.961655 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [S], seq 1551304795, win 14600, options [mss 1460,[|tcp]>
12:13:03.961722 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [S.], seq 2095554319, ack 1551304796, win 14480, options [mss 1460,[|tcp]>
12:13:03.961732 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack 1, win 14600, options [nop,nop,TS[|tcp]>
12:13:03.961822 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [P.], seq 1:201, ack 1, win 14600, options [nop,nop,TS[|tcp]>
12:13:03.961874 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [.], ack 201, win 15544, options [nop,nop,TS[|tcp]>
12:13:03.962070 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [P.], seq 1:10, ack 201, win 15544, options [nop,nop,TS[|tcp]>
12:13:03.962077 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack 10, win 14600, options [nop,nop,TS[|tcp]>
12:13:03.962370 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [P.], seq 10:282, ack 201, win 15544, options [nop,nop,TS[|tcp]>
12:13:03.962377 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack 282, win 15544, options [nop,nop,TS[|tcp]>
12:13:03.962819 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [F.], seq 1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp]>
12:13:03.962828 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [R], seq 2449506432, win 0, length 0

Here's the same thing, captured from the server side:

11:57:36.012908 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18109890, win 5792, options [nop,nop,TS[|tcp]>
11:57:36.012967 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18109890:18111338, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.012977 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18111338:18112786, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.013020 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18112786, win 5792, options [nop,nop,TS[|tcp]>
11:57:36.013036 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18112786:18114234, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.013039 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18114234:18115682, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.013041 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18115682, win 5792, options [nop,nop,TS[|tcp]>
11:57:36.013123 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18115682:18117130, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.013129 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18117130:18118578, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.013155 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18118578, win 5792, options [nop,nop,TS[|tcp]>
11:57:36.013163 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18118578:18120026, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.013171 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.013261 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18121474:18122922, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.013281 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18120026, win 5792, options [nop,nop,TS[|tcp]>
11:57:36.013288 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18122922:18124370, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.013410 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18124370:18125818, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.213941 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:36.617001 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:37.422996 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:39.033018 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:42.257206 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:57:48.705321 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:58:01.585648 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:58:27.378231 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
11:59:18.899063 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
12:00:22.704018 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [P.], seq 863:1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp]>
12:00:22.744053 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:01:02.067040 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:03:02.386981 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:05:02.705227 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:07:03.021427 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:09:03.332661 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:11:03.642409 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
12:13:03.963373 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [S], seq 1551304795, win 14600, options [mss 1460,[|tcp]>
12:13:03.963389 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [S.], seq 2095554319, ack 1551304796, win 14480, options [mss 1460,[|tcp]>
12:13:03.963446 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack 1, win 14600, options [nop,nop,TS[|tcp]>
12:13:03.963540 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [P.], seq 1:201, ack 1, win 14600, options [nop,nop,TS[|tcp]>
12:13:03.963547 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [.], ack 201, win 15544, options [nop,nop,TS[|tcp]>
12:13:03.963700 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [P.], seq 1:10, ack 201, win 15544, options [nop,nop,TS[|tcp]>
12:13:03.963794 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack 10, win 14600, options [nop,nop,TS[|tcp]>
12:13:03.964024 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [P.], seq 10:282, ack 201, win 15544, options [nop,nop,TS[|tcp]>
12:13:03.964091 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack 282, win 15544, options [nop,nop,TS[|tcp]>
12:13:03.964438 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [F.], seq 1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp]>
12:13:03.964542 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [R], seq 2449506432, win 0, length 0

So if I'm reading this right, the client and server agree that
the server ACKed 18120026 twice.  The client and server also
agree that the client retransmitted 18120026:18121474 nine times
from 11:57:36.213941 through 11:59:18.899063 (server clock).

But the server never ACKed that packet.  Too busy?
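
The "never ACKed" part can be checked mechanically against the
server-side text capture: look for any ACK from 172.17.131.32.6807
back to 172.17.135.4.57589 that covers the right edge of the
retransmitted segment, 18121474, before the reconnect.  A rough
sketch, assuming tcpdump text with one packet per line as above:

import re
import sys

ack_re = re.compile(r'^(\S+) IP 172\.17\.131\.32\.6807 > '
                    r'172\.17\.135\.4\.57589: .* ack (\d+), win \d+')
seq_end = 18121474   # right edge of the stalled segment

for line in open(sys.argv[1]):
    m = ack_re.match(line)
    if m and int(m.group(2)) >= seq_end:
        print('first covering ack at %s (ack %s)' % (m.group(1), m.group(2)))
        break
else:
    print('no ack ever covered %d' % seq_end)

It finds nothing in the trace above -- even the FIN at 12:13:03 still
carries ack 18120026.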

I was collecting vmstat data during the run; here's the important bits:

Fri Feb 10 11:56:51 MST 2012
vmstat -w 8 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
13 10          0     250272        944   37859080    0    0     7  5346 1098  444   2  5  92  1  0
88  8          0     260472        944   36728776    0    0     0 1329838 257602 68861  19 73   5  4  0
100 10          0     241952        944   36066536    0    0     0 1635891 340724 85570  22 68   6  4  0
105  9          0     250288        944   34750820    0    0     0 1584816 433223 111462  21 73   4  3  0
126  3          0     259908        944   33841696    0    0     0 749648 225707 86716   9 83   4  3  0
157  2          0     245032        944   31572536    0    0     0 736841 252406 99083   9 81   5  5  0
45 17          0     246720        944   28877640    0    0     1 755085 282177 116551   8 77   9  5  0
27  5          0     260556        944   27322948    0    0     0 553263 232682 132427   7 68  19  6  0
  4  0          0     256552        944   26507508    0    0     0 271822 133540 113952   5 15  75  5  0
  4  3          0     235236        944   26308308    0    0     0 181450 96027 101017   4 10  82  4  0
  4  2          0     225372        944   26072048    0    0     0 200145 97401 100146   4 11  80  5  0
  7  1          0     250940        944   25974752    0    0     0 92943 64015 78035   3  7  87  2  0
  2  1          0     261712        944   25886872    0    0     0 152351 80963 99512   4  9  84  4  0
  4  1          0     265056        944   25850216    0    0     0 92452 60790 75949   3  7  87  2  0
  0  0          0     269164        944   25857592    0    0     0 87396 52994 67057   3  7  88  3  0
  6  2          0     263672        944   25846192    0    0     0 110817 67707 75849   3  8  86  3  0
Fri Feb 10 11:58:51 MST 2012
vmstat -w 8 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  1  2          0     260620        944   25838996    0    0     6  7199  240  795   3  8  88  1  0
  1  0          0     262124        944   25838936    0    0     0 108113 64640 75751   3  7  87  3  0
  1  0          0     258700        944   25850148    0    0     0 94477 57952 68787   3  7  88  3  0
  0  1          0     270144        944   25794068    0    0     0 92113 54901 73329   3  6  88  2  0
  2  2          0     272036        944   25768044    0    0     0 57449 45552 61373   3  5  90  2  0
  1  1          0     270024        944   25832600    0    0     0 47651 44594 60577   2  5  91  1  0
  1  0          0     280648        944   25862304    0    0     1 54773 42668 58636   2  6  90  2  0
  1  1          0     272132        944   25848136    0    0     0 41938 42310 57425   3  6  91  1  0
  2  0          0     291272        944   25806644    0    0     1 41896 42259 58833   2  5  91  1  0
  0  0          0     289392        944   25804128    0    0     0 32031 36699 51119   2  5  92  1  0
  2  1          0     288420        944   25824956    0    0     0 42997 40542 55109   2  5  91  1  0
  2  0          0     289076        944   25832792    0    0     0 31843 36438 49974   2  4  92  1  0
  1  1          0     294600        944   25795512    0    0     0 35685 39307 56293   2  5  92  1  0
  3  1          0     268708        944   25937656    0    0     0 148219 79498 87394   4  8  85  3  0
  2  0          0     300100        944   25928888    0    0     0 87999 59708 73501   3  6  88  2  0
  1  0          0     279988        944   25966636    0    0     0 71014 52225 69119   3  6  90  2  0

So the server might have been busy when 18120026:18121474 was first
received, but it was nearly idle for several of the retransmits.

What am I missing?

>>
>>
>>>
>>> I will take a look at the additional output and see if anything pops
>>> out. I am assuming the below output was immediately after what you
>>> posted in your earlier email.
>
> i don't see anything out of the ordinary once things recover, the
> sender even starts to do TSO after a short while.

That's what I thought as well.

Thanks -- Jim

>
>   Sridhar
>
>
>>
>>
>> Yes.
>>
>> Thanks -- Jim
>>
>>
>>>
>>>
>>>   Sridhar
>>>
>>
>>
>
>



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-10 23:09                                               ` Jim Schutt
@ 2012-02-11  0:05                                                 ` sridhar basam
  2012-02-13 15:26                                                   ` Jim Schutt
  0 siblings, 1 reply; 47+ messages in thread
From: sridhar basam @ 2012-02-11  0:05 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel, netdev

On Fri, Feb 10, 2012 at 6:09 PM, Jim Schutt <jaschut@sandia.gov> wrote:
>
> [ added Cc:netdev
>  See http://www.spinics.net/lists/ceph-devel/msg04804.html
>  for the start of the thread.
>  -- Jim
> ]
>
>
> On 02/10/2012 10:13 AM, sridhar basam wrote:
>>
>> On Fri, Feb 10, 2012 at 10:32 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>>>
>>> On 02/09/2012 06:26 PM, sridhar basam wrote:
>>>>
>>>>
>>>> Do you mind capturing to a pcap file and providing that? Makes it
>>>> easier to analyse things. If not, I understand. If you can do the
>>>> capture on both ends, do it with a snaplen of 68 so that you get all
>>>> of the headers and there shouldn't be too much payload information in
>>>> the file.
>>>
>>>
>>>
>>> I've got a pcap file for this run for this client, with snaplen 128
>>> (I thought I might need to look for ceph message headers).  It's 13 MB
>>> compressed.  How shall I get it to you?
>>>
>>
>> Can I grab it off some webserver you control? Or you can temporarily
>> drop it into docs.google.com and add access from my email account?
>
>
> I tabled this for the moment while I worked on collecting
> packet traces from both ends.  But you'll probably want
> to see the pcap files for what I'm about to show.  Also,
> I think I need to add netdev to this discussion ...
>
>
>>
>>> In the meantime, I'll try to capture this from both sides.
>
>
> Here's another example, captured from both sides, with
> TSO/GSO/GRO all off, snaplen 68.
>
> This was captured from the client side.  Same pattern, in
> that the client sends many retransmits over a period of
> a couple minutes.  It's different in that the client
> seems to give up and reconnect ...
>
> 11:57:35.984024 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack
> 18109890, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:35.984032 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18112786:18114234, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984038 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18114234:18115682, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984120 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack
> 18112786, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:35.984129 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18115682:18117130, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984135 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18117130:18118578, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984143 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack
> 18115682, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:35.984148 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18118578:18120026, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984153 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984270 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack
> 18118578, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:35.984278 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18121474:18122922, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984283 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18122922:18124370, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:35.984420 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack
> 18120026, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:35.984428 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18124370:18125818, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.184945 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.587936 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:37.393937 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:39.003937 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:42.227933 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:48.675931 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:58:01.555935 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:58:27.347945 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:59:18.867935 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 12:00:22.673029 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [P.], seq
> 863:1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp]>
> 12:00:22.712933 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], ack
> 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:01:02.035951 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:03:02.355941 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:05:02.675947 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:07:02.995943 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:09:03.315942 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:11:03.635948 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:13:03.961655 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [S], seq
> 1551304795, win 14600, options [mss 1460,[|tcp]>
> 12:13:03.961722 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [S.], seq
> 2095554319, ack 1551304796, win 14480, options [mss 1460,[|tcp]>
> 12:13:03.961732 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack
> 1, win 14600, options [nop,nop,TS[|tcp]>
> 12:13:03.961822 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [P.], seq
> 1:201, ack 1, win 14600, options [nop,nop,TS[|tcp]>
> 12:13:03.961874 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [.], ack
> 201, win 15544, options [nop,nop,TS[|tcp]>
> 12:13:03.962070 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [P.], seq
> 1:10, ack 201, win 15544, options [nop,nop,TS[|tcp]>
> 12:13:03.962077 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack
> 10, win 14600, options [nop,nop,TS[|tcp]>
> 12:13:03.962370 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [P.], seq
> 10:282, ack 201, win 15544, options [nop,nop,TS[|tcp]>
> 12:13:03.962377 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack
> 282, win 15544, options [nop,nop,TS[|tcp]>
> 12:13:03.962819 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [F.], seq
> 1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp]>
> 12:13:03.962828 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [R], seq
> 2449506432, win 0, length 0
>
> Here's the same thing, captured from the server side:
>
> 11:57:36.012908 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack
> 18109890, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:36.012967 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18109890:18111338, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.012977 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18111338:18112786, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.013020 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack
> 18112786, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:36.013036 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18112786:18114234, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.013039 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18114234:18115682, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.013041 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack
> 18115682, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:36.013123 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18115682:18117130, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.013129 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18117130:18118578, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.013155 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack
> 18118578, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:36.013163 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18118578:18120026, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.013171 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.013261 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18121474:18122922, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.013281 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack
> 18120026, win 5792, options [nop,nop,TS[|tcp]>
> 11:57:36.013288 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18122922:18124370, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.013410 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18124370:18125818, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.213941 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:36.617001 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:37.422996 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:39.033018 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:42.257206 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:57:48.705321 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:58:01.585648 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:58:27.378231 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 11:59:18.899063 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]>
> 12:00:22.704018 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [P.], seq
> 863:1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp]>
> 12:00:22.744053 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], ack
> 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:01:02.067040 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:03:02.386981 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:05:02.705227 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:07:03.021427 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:09:03.332661 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:11:03.642409 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq
> 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]>
> 12:13:03.963373 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [S], seq
> 1551304795, win 14600, options [mss 1460,[|tcp]>
> 12:13:03.963389 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [S.], seq
> 2095554319, ack 1551304796, win 14480, options [mss 1460,[|tcp]>
> 12:13:03.963446 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack
> 1, win 14600, options [nop,nop,TS[|tcp]>
> 12:13:03.963540 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [P.], seq
> 1:201, ack 1, win 14600, options [nop,nop,TS[|tcp]>
> 12:13:03.963547 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [.], ack
> 201, win 15544, options [nop,nop,TS[|tcp]>
> 12:13:03.963700 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [P.], seq
> 1:10, ack 201, win 15544, options [nop,nop,TS[|tcp]>
> 12:13:03.963794 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack
> 10, win 14600, options [nop,nop,TS[|tcp]>
> 12:13:03.964024 IP 172.17.131.32.6807 > 172.17.135.4.57628: Flags [P.], seq
> 10:282, ack 201, win 15544, options [nop,nop,TS[|tcp]>
> 12:13:03.964091 IP 172.17.135.4.57628 > 172.17.131.32.6807: Flags [.], ack
> 282, win 15544, options [nop,nop,TS[|tcp]>
> 12:13:03.964438 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [F.], seq
> 1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp]>
> 12:13:03.964542 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [R], seq
> 2449506432, win 0, length 0
>
> So if I'm reading this right, the client and server agree that
> the server ACKed 18120026 twice.  The client and server also
> agree that the client retransmitted 18120026:18121474 nine times
> from 11:57:36.213941 through 11:59:18.899063 (server clock).
>
> But the server never ACKed that packet.  Too busy?
>
> I was collecting vmstat data during the run; here's the important bits:
>
> Fri Feb 10 11:56:51 MST 2012
> vmstat -w 8 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
> 13 10          0     250272        944   37859080    0    0     7  5346 1098  444   2  5  92  1  0
> 88  8          0     260472        944   36728776    0    0     0 1329838 257602 68861  19 73   5  4  0
> 100 10          0     241952        944   36066536    0    0     0 1635891 340724 85570  22 68   6  4  0
> 105  9          0     250288        944   34750820    0    0     0 1584816 433223 111462  21 73   4  3  0
> 126  3          0     259908        944   33841696    0    0     0 749648 225707 86716   9 83   4  3  0
> 157  2          0     245032        944   31572536    0    0     0 736841 252406 99083   9 81   5  5  0
> 45 17          0     246720        944   28877640    0    0     1 755085 282177 116551   8 77   9  5  0

Holy crap! That might explain why you aren't seeing anything. You are
writing out over 1.6 million blocks/sec, and that's averaged over an
8-second interval. I bet the missed ACKs are from when this is happening.
What sort of I/O load is going through this system during those times?
What sort of filesystem and Linux system are these OSDs on?

 Sridhar



> 27  5          0     260556        944   27322948    0    0     0 553263 232682 132427   7 68  19  6  0
>   4  0          0     256552        944   26507508    0    0     0 271822 133540 113952   5 15  75  5  0
>   4  3          0     235236        944   26308308    0    0     0 181450 96027 101017   4 10  82  4  0
>   4  2          0     225372        944   26072048    0    0     0 200145 97401 100146   4 11  80  5  0
>   7  1          0     250940        944   25974752    0    0     0 92943 64015 78035   3  7  87  2  0
>   2  1          0     261712        944   25886872    0    0     0 152351 80963 99512   4  9  84  4  0
>   4  1          0     265056        944   25850216    0    0     0 92452 60790 75949   3  7  87  2  0
>   0  0          0     269164        944   25857592    0    0     0 87396 52994 67057   3  7  88  3  0
>   6  2          0     263672        944   25846192    0    0     0 110817 67707 75849   3  8  86  3  0
> Fri Feb 10 11:58:51 MST 2012
> vmstat -w 8 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
>   1  2          0     260620        944   25838996    0    0     6  7199  240  795   3  8  88  1  0
>   1  0          0     262124        944   25838936    0    0     0 108113 64640 75751   3  7  87  3  0
>   1  0          0     258700        944   25850148    0    0     0 94477 57952 68787   3  7  88  3  0
>   0  1          0     270144        944   25794068    0    0     0 92113 54901 73329   3  6  88  2  0
>   2  2          0     272036        944   25768044    0    0     0 57449 45552 61373   3  5  90  2  0
>   1  1          0     270024        944   25832600    0    0     0 47651 44594 60577   2  5  91  1  0
>   1  0          0     280648        944   25862304    0    0     1 54773 42668 58636   2  6  90  2  0
>   1  1          0     272132        944   25848136    0    0     0 41938 42310 57425   3  6  91  1  0
>   2  0          0     291272        944   25806644    0    0     1 41896 42259 58833   2  5  91  1  0
>   0  0          0     289392        944   25804128    0    0     0 32031 36699 51119   2  5  92  1  0
>   2  1          0     288420        944   25824956    0    0     0 42997 40542 55109   2  5  91  1  0
>   2  0          0     289076        944   25832792    0    0     0 31843 36438 49974   2  4  92  1  0
>   1  1          0     294600        944   25795512    0    0     0 35685 39307 56293   2  5  92  1  0
>   3  1          0     268708        944   25937656    0    0     0 148219 79498 87394   4  8  85  3  0
>   2  0          0     300100        944   25928888    0    0     0 87999 59708 73501   3  6  88  2  0
>   1  0          0     279988        944   25966636    0    0     0 71014 52225 69119   3  6  90  2  0
>
> So the server might have been busy when 18120026:18121474 was first
> received, but it was nearly idle for several of the retransmits.
>
> What am I missing?
>
>
>>>
>>>
>>>>
>>>> I will take a look at the additional output and see if anything pops
>>>> out. I am assuming the below output was immediately after what you
>>>> posted in your earlier email.
>>
>>
>> i don't see anything out of the ordinary once things recover, the
>> sender even starts to do TSO after a short while.
>
>
> That's what I thought as well.
>
> Thanks -- Jim
>
>
>>
>>  Sridhar
>>
>>
>>>
>>>
>>> Yes.
>>>
>>> Thanks -- Jim
>>>
>>>
>>>>
>>>>
>>>>  Sridhar
>>>>
>>>
>>>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-11  0:05                                                 ` sridhar basam
@ 2012-02-13 15:26                                                   ` Jim Schutt
  0 siblings, 0 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-13 15:26 UTC (permalink / raw)
  To: sridhar basam; +Cc: ceph-devel, netdev

On 02/10/2012 05:05 PM, sridhar basam wrote:
>> >  But the server never ACKed that packet.  Too busy?
>> >
>> >  I was collecting vmstat data during the run; here's the important bits:
>> >
>> >  Fri Feb 10 11:56:51 MST 2012
>> >  vmstat -w 8 16
>> >  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>> >    r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
>> >  13 10          0     250272        944   37859080    0    0     7  5346 1098  444   2  5  92  1  0
>> >  88  8          0     260472        944   36728776    0    0     0 1329838 257602 68861  19 73   5  4  0
>> >  100 10          0     241952        944   36066536    0    0     0 1635891 340724 85570  22 68   6  4  0
>> >  105  9          0     250288        944   34750820    0    0     0 1584816 433223 111462  21 73   4  3  0
>> >  126  3          0     259908        944   33841696    0    0     0 749648 225707 86716   9 83   4  3  0
>> >  157  2          0     245032        944   31572536    0    0     0 736841 252406 99083   9 81   5  5  0
>> >  45 17          0     246720        944   28877640    0    0     1 755085 282177 116551   8 77   9  5  0
> Holy crap! That might explain why you aren't seeing anything. You are
> writing out over 1.6 million blocks/sec, and that's averaged over an
> 8-second interval. I bet the missed ACKs are from when this is happening.
> What sort of I/O load is going through this system during those times?
> What sort of filesystem and Linux system are these OSDs on?

Dual socket Nehalem EP @ 3 GHz, 24 ea. 7200RPM SAS drives w/ 64 MB cache,
3 LSI SAS HBAs w/8 drives per HBA, btrfs, 3.2.0 kernel.  Each OSD
has a ceph journal and a ceph data store on a single drive.

I'm running 24 OSDs on such a box; all that write load is the result
of dd from 166 linux ceph clients.

FWIW, I've seen these boxes sustain > 2 GB/s for 60 sec or so under
this load, when I have TSO/GSO/GRO turned on, and am writing to
a freshly created ceph filesystem.

That lasts until my OSDs get stalled reading from a socket, as
documented by those packet traces I posted.

If you compare the timestamps on the retransmits to the times
that vmstat is dumping reports, at least some of the retransmits
hit the system when it is ~80% idle.

-- Jim

>
>   Sridhar
>
>
>



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-02 17:52         ` Gregory Farnum
  2012-02-02 19:06           ` [EXTERNAL] " Jim Schutt
@ 2012-02-24 15:38           ` Jim Schutt
  2012-02-24 18:31             ` Tommi Virtanen
  2013-02-21  0:12             ` Sage Weil
  1 sibling, 2 replies; 47+ messages in thread
From: Jim Schutt @ 2012-02-24 15:38 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, sri

On 02/02/2012 10:52 AM, Gregory Farnum wrote:
> On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
>> I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
>> per OSD.  During a test I watch both OSD servers with both
>> vmstat and iostat.
>>
>> During a "good" period, vmstat says the server is sustaining > 2 GB/s
>> for multiple tens of seconds.  Since I use replication factor 2, that
>> means that server is sustaining > 500 MB/s aggregate client throughput,
>> right?  During such a period vmstat also reports ~10% CPU idle.
>>
>> During a "bad" period, vmstat says the server is doing ~200 MB/s,
>> with lots of idle cycles.  It is during these periods that
>> messages stuck in the policy throttler build up such long
>> wait times.  Sometimes I see really bad periods with aggregate
>> throughput per server < 100 MB/s.
>>
>> The typical pattern I see is that a run starts with tens of seconds
>> of aggregate throughput > 2 GB/s.  Then it drops and bounces around
>> 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
>> it ramps back up near 2 GB/s again.
>
> Hmm. 100MB/s is awfully low for this theory, but have you tried to
> correlate the drops in throughput with the OSD journals running out of
> space? I assume from your setup that they're sharing the disk with the
> store (although it works either way), and your description makes me
> think that throughput is initially constrained by sequential journal
> writes but then the journal runs out of space and the OSD has to wait
> for the main store to catch up (with random IO), and that sends the IO
> patterns all to hell. (If you can say that random 4MB IOs are
> hellish.)
> I'm also curious about memory usage as a possible explanation for the
> more dramatic drops.

I've finally figured out what is going on with this behaviour.
Memory usage was on the right track.

It turns out to be an unfortunate interaction between the
number of OSDs/server, number of clients, TCP socket buffer
autotuning, the policy throttler, and limits on the total
memory used by the TCP stack (net/ipv4/tcp_mem sysctl).

What happens is that for throttled reader threads, the
TCP stack will continue to receive data as long as there
is available socket buffer, and the sender has data to send.

As each reader thread receives successive messages, the
TCP socket buffer autotuning increases the size of the
socket buffer.  Eventually, due to the number of OSDs
per server and the number of clients trying to write,
all the memory the TCP stack is allowed by net/ipv4/tcp_mem
to use is consumed by the socket buffers of throttled
reader threads.  When this happens, TCP processing is affected
to the point that the TCP stack cannot send ACKs on behalf
of the reader threads that aren't throttled.  At that point
the OSD stalls until the TCP retransmit count on some connection
is exceeded, causing it to be reset.
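
To put rough numbers on that (illustrative only, not measured values):
24 OSDs/server x ~166 writing clients is on the order of 4000 server-side
sockets, and if autotuning has grown each throttled socket's receive
buffer to even ~1 MB, that is roughly 4 GB of buffered-but-unread data,
which is easily past a typical net/ipv4/tcp_mem pressure threshold.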

Since my OSD servers don't run anything else, the simplest
solution for me is to turn off socket buffer autotuning
(net/ipv4/tcp_moderate_rcvbuf), and set the default socket
buffer size to something reasonable.  256k seems to be
working well for me right now.
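
For reference, a sketch of those settings as sysctls, assuming the
default buffer is set via net.ipv4.tcp_rmem (the min/max values here
are just examples):

  # disable TCP receive-buffer autotuning
  net.ipv4.tcp_moderate_rcvbuf = 0
  # "min default max" in bytes; 262144 = 256k default receive buffer
  net.ipv4.tcp_rmem = 4096 262144 4194304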

-- Jim

> -Greg
>
>



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-24 15:38           ` Jim Schutt
@ 2012-02-24 18:31             ` Tommi Virtanen
  2012-02-24 18:38               ` Tommi Virtanen
  2013-02-21  0:12             ` Sage Weil
  1 sibling, 1 reply; 47+ messages in thread
From: Tommi Virtanen @ 2012-02-24 18:31 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Gregory Farnum, ceph-devel, sri

On Fri, Feb 24, 2012 at 07:38, Jim Schutt <jaschut@sandia.gov> wrote:
> I've finally figured out what is going on with this behaviour.
> Memory usage was on the right track.
>
> It turns out to be an unfortunate interaction between the
> number of OSDs/server, number of clients, TCP socket buffer
> autotuning, the policy throttler, and limits on the total
> memory used by the TCP stack (net/ipv4/tcp_mem sysctl).
>
> What happens is that for throttled reader threads, the
> TCP stack will continue to receive data as long as there
> is available socket buffer, and the sender has data to send.

Ohh! Yes, if userspace stops reading a socket, the kernel will buffer
data up to SO_RCVBUF etc. And TCP has global memory limits, and all
that buffered data pushes the stack uncomfortably close to them.

Ceph *could* manipulate the SO_RCVBUF size at the time it decides to
throttle a client; that would limit the TCP buffer space consumed by
throttled clients (except for a race where the data got received
before Ceph called setsockopt). I recall seeing a trick like that
pulled off somewhere, but I can't find an example right now.
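
A minimal sketch of that idea (illustrative only, not Ceph's actual
code; the fd name and the 256 KiB figure are assumptions):

  #include <sys/socket.h>

  /* Shrink the kernel receive buffer on a connection's fd once its
   * reader thread gets throttled.  (Linux roughly doubles the value
   * passed in to leave room for bookkeeping overhead.) */
  static int clamp_rcvbuf(int fd, int bytes)
  {
      return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
  }

  /* e.g. clamp_rcvbuf(connection_fd, 256 * 1024) at the point the
   * policy throttler decides to block the reader; connection_fd is a
   * hypothetical name for the pipe's socket descriptor. */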

Or perhaps we just say "sorry your server is swamped with too much
work for the resources it's given; you need more of them". That's not
nice though, when throttling can slow down the non-throttled
connections.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-24 18:31             ` Tommi Virtanen
@ 2012-02-24 18:38               ` Tommi Virtanen
  0 siblings, 0 replies; 47+ messages in thread
From: Tommi Virtanen @ 2012-02-24 18:38 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Gregory Farnum, ceph-devel, sri

I created ticket http://tracker.newdream.net/issues/2100 for this.

On Fri, Feb 24, 2012 at 10:31, Tommi Virtanen
<tommi.virtanen@dreamhost.com> wrote:
> On Fri, Feb 24, 2012 at 07:38, Jim Schutt <jaschut@sandia.gov> wrote:
>> I've finally figured out what is going on with this behaviour.
>> Memory usage was on the right track.
>>
>> It turns out to be an unfortunate interaction between the
>> number of OSDs/server, number of clients, TCP socket buffer
>> autotuning, the policy throttler, and limits on the total
>> memory used by the TCP stack (net/ipv4/tcp_mem sysctl).
>>
>> What happens is that for throttled reader threads, the
>> TCP stack will continue to receive data as long as there
>> is available socket buffer, and the sender has data to send.
>
> Ohh! Yes, if userspace stops reading a socket, the kernel will buffer
> data up to SO_RCVBUF etc. And TCP has global memory limits, and all
> that buffered data pushes the stack uncomfortably close to them.
>
> Ceph *could* manipulate the SO_RCVBUF size at the time it decides to
> throttle a client; that would limit the TCP buffer space consumed by
> throttled clients (except for a race where the data got received
> before Ceph called setsockopt). I recall seeing a trick like that
> pulled off somewhere, but I can't find an example right now.
>
> Or perhaps we just say "sorry your server is swamped with too much
> work for the resources it's given; you need more of them". That's not
> nice though, when throttling can slow down the non-throttled
> connections.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2012-02-24 15:38           ` Jim Schutt
  2012-02-24 18:31             ` Tommi Virtanen
@ 2013-02-21  0:12             ` Sage Weil
  2013-02-26 19:16               ` Jim Schutt
  1 sibling, 1 reply; 47+ messages in thread
From: Sage Weil @ 2013-02-21  0:12 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Gregory Farnum, ceph-devel, sri

Hi Jim,

I'm resurrecting an ancient thread here, but: we've just observed this on 
another big cluster and remembered that this hasn't actually been fixed.

I think the right solution is to make an option that will setsockopt on 
SO_RCVBUF to some value (say, 256KB).  I pushed a branch that does this, 
wip-tcp.  Do you mind checking to see if this addresses the issue (without 
manually adjusting things in /proc)?

And perhaps we should consider making this default to 256KB...

Thanks!
sage



On Fri, 24 Feb 2012, Jim Schutt wrote:

> On 02/02/2012 10:52 AM, Gregory Farnum wrote:
> > On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt<jaschut@sandia.gov>  wrote:
> > > I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
> > > per OSD.  During a test I watch both OSD servers with both
> > > vmstat and iostat.
> > > 
> > > During a "good" period, vmstat says the server is sustaining > 2 GB/s
> > > for multiple tens of seconds.  Since I use replication factor 2, that
> > > means that server is sustaining > 500 MB/s aggregate client throughput,
> > > right?  During such a period vmstat also reports ~10% CPU idle.
> > > 
> > > During a "bad" period, vmstat says the server is doing ~200 MB/s,
> > > with lots of idle cycles.  It is during these periods that
> > > messages stuck in the policy throttler build up such long
> > > wait times.  Sometimes I see really bad periods with aggregate
> > > throughput per server < 100 MB/s.
> > > 
> > > The typical pattern I see is that a run starts with tens of seconds
> > > of aggregate throughput > 2 GB/s.  Then it drops and bounces around
> > > 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
> > > it ramps back up near 2 GB/s again.
> > 
> > Hmm. 100MB/s is awfully low for this theory, but have you tried to
> > correlate the drops in throughput with the OSD journals running out of
> > space? I assume from your setup that they're sharing the disk with the
> > store (although it works either way), and your description makes me
> > think that throughput is initially constrained by sequential journal
> > writes but then the journal runs out of space and the OSD has to wait
> > for the main store to catch up (with random IO), and that sends the IO
> > patterns all to hell. (If you can say that random 4MB IOs are
> > hellish.)
> > I'm also curious about memory usage as a possible explanation for the
> > more dramatic drops.
> 
> I've finally figured out what is going on with this behaviour.
> Memory usage was on the right track.
> 
> It turns out to be an unfortunate interaction between the
> number of OSDs/server, number of clients, TCP socket buffer
> autotuning, the policy throttler, and limits on the total
> memory used by the TCP stack (net/ipv4/tcp_mem sysctl).
> 
> What happens is that for throttled reader threads, the
> TCP stack will continue to receive data as long as there
> is available socket buffer, and the sender has data to send.
> 
> As each reader thread receives successive messages, the
> TCP socket buffer autotuning increases the size of the
> socket buffer.  Eventually, due to the number of OSDs
> per server and the number of clients trying to write,
> all the memory the TCP stack is allowed by net/ipv4/tcp_mem
> to use is consumed by the socket buffers of throttled
> reader threads.  When this happens, TCP processing is affected
> to the point that the TCP stack cannot send ACKs on behalf
> of the reader threads that aren't throttled.  At that point
> the OSD stalls until the TCP retransmit count on some connection
> is exceeded, causing it to be reset.
> 
> Since my OSD servers don't run anything else, the simplest
> solution for me is to turn off socket buffer autotuning
> (net/ipv4/tcp_moderate_rcvbuf), and set the default socket
> buffer size to something reasonable.  256k seems to be
> working well for me right now.
> 
> -- Jim
> 
> > -Greg
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2013-02-21  0:12             ` Sage Weil
@ 2013-02-26 19:16               ` Jim Schutt
  2013-02-26 19:36                 ` Sage Weil
  0 siblings, 1 reply; 47+ messages in thread
From: Jim Schutt @ 2013-02-26 19:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel, sri

Hi Sage,

On 02/20/2013 05:12 PM, Sage Weil wrote:
> Hi Jim,
> 
> I'm resurrecting an ancient thread here, but: we've just observed this on 
> another big cluster and remembered that this hasn't actually been fixed.

Sorry for the delayed reply - I missed this in a backlog
of unread email...

> 
> I think the right solution is to make an option that will setsockopt on 
> SO_RECVBUF to some value (say, 256KB).  I pushed a branch that does this, 
> wip-tcp.  Do you mind checking to see if this addresses the issue (without 
> manually adjusting things in /proc)?

I'll be happy to test it out...

> 
> And perhaps we should consider making this default to 256KB...

That's the value I've been using with my /proc adjustments
since I figured out what was going on.  My servers use
a 10 GbE port for each of the cluster and public networks,
with cephfs clients using 1 GbE, and I've not detected any
issues resulting from that value.  So, it seems like a decent
starting point for a default...

-- Jim

> 
> Thanks!
> sage
> 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2013-02-26 19:16               ` Jim Schutt
@ 2013-02-26 19:36                 ` Sage Weil
  2013-02-28 19:37                   ` Jim Schutt
  0 siblings, 1 reply; 47+ messages in thread
From: Sage Weil @ 2013-02-26 19:36 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Gregory Farnum, ceph-devel, sri

On Tue, 26 Feb 2013, Jim Schutt wrote:
> > I think the right solution is to make an option that will setsockopt on 
> > SO_RCVBUF to some value (say, 256KB).  I pushed a branch that does this, 
> > wip-tcp.  Do you mind checking to see if this addresses the issue (without 
> > manually adjusting things in /proc)?
> 
> I'll be happy to test it out...

That would be great!  It's branch wip-tcp, and the setting is 'ms tcp 
rcvbuf'.

Thanks!
sage

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2013-02-26 19:36                 ` Sage Weil
@ 2013-02-28 19:37                   ` Jim Schutt
  2013-02-28 21:06                     ` Sage Weil
  0 siblings, 1 reply; 47+ messages in thread
From: Jim Schutt @ 2013-02-28 19:37 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel, sri

Hi Sage,

On 02/26/2013 12:36 PM, Sage Weil wrote:
> On Tue, 26 Feb 2013, Jim Schutt wrote:
>>> I think the right solution is to make an option that will setsockopt on 
>>> SO_RCVBUF to some value (say, 256KB).  I pushed a branch that does this, 
>>> wip-tcp.  Do you mind checking to see if this addresses the issue (without 
>>> manually adjusting things in /proc)?
>>
>> I'll be happy to test it out...
> 
> That would be great!  It's branch wip-tcp, and the setting is 'ms tcp 
> rcvbuf'.

I've verified that I can reproduce the slowdown with the
default value of 1 for /proc/sys/net/ipv4/tcp_moderate_rcvbuf,
and 'ms tcp rcvbuf' at 0.

I've also verified that I could not reproduce any slowdown when
I configure 'ms tcp rcvbuf' to 256 KiB on OSDs.
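
For anyone following along, a sketch of how that might look in ceph.conf
(the section placement and the unit are assumptions on my part; the value
is in bytes, 262144 = 256 KiB):

  [osd]
      ms tcp rcvbuf = 262144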

So, that's great news - sorry for the delay in testing.

Also, FWIW, I ended up testing with commits cb15e6e0f4 and
c346282940 cherry-picked on top of next as of a day or
so ago (commit f58601d681), as for some reason wip-tcp
wouldn't work for me - ceph-mon was non-responsive in
some way I didn't dig into.

-- Jim

> 
> Thanks!
> sage
> 
> 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
  2013-02-28 19:37                   ` Jim Schutt
@ 2013-02-28 21:06                     ` Sage Weil
  0 siblings, 0 replies; 47+ messages in thread
From: Sage Weil @ 2013-02-28 21:06 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Gregory Farnum, ceph-devel, sri

On Thu, 28 Feb 2013, Jim Schutt wrote:
> Hi Sage,
> 
> On 02/26/2013 12:36 PM, Sage Weil wrote:
> > On Tue, 26 Feb 2013, Jim Schutt wrote:
> >>> I think the right solution is to make an option that will setsockopt on 
> >>> SO_RCVBUF to some value (say, 256KB).  I pushed a branch that does this, 
> >>> wip-tcp.  Do you mind checking to see if this addresses the issue (without 
> >>> manually adjusting things in /proc)?
> >>
> >> I'll be happy to test it out...
> > 
> > That would be great!  It's branch wip-tcp, and the setting is 'ms tcp 
> > rcvbuf'.
> 
> I've verified that I can reproduce the slowdown with the
> default value of 1 for /proc/sys/net/ipv4/tcp_moderate_rcvbuf,
> and 'ms tcp rcvbuf' at 0.
> 
> I've also verified that I could not reproduce any slowdown when
> I configure 'ms tcp rcvbuf' to 256 KiB on OSDs.
> 
> So, that's great news - sorry for the delay in testing.

Awesome--thanks so much for testing that!  Pulling it into master now.
 
> Also, FWIW, I ended up testing with commits cb15e6e0f4 and
> c346282940 cherry-picked on top of next as of a day or
> so ago (commit f58601d681), as for some reason wip-tcp
> wouldn't work for me - ceph-mon was non-responsive in
> some way I didn't dig into.

Yeah, sorry about that.  I rebased wip-tcp a few days ago but you may have 
picked up the previous version.

sage

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2013-02-28 21:06 UTC | newest]

Thread overview: 47+ messages
2012-02-01 15:54 [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 1/6] msgr: print message sequence number and tid when receiving message envelope Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 2/6] common/Throttle: track sleep/wake sequences in Throttle, report them for policy throttler Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 3/6] common/Throttle: throttle in FIFO order Jim Schutt
2012-02-02 17:53   ` Gregory Farnum
2012-02-02 18:31     ` Jim Schutt
2012-02-02 19:01       ` Gregory Farnum
2012-02-01 15:54 ` [RFC PATCH 4/6] common/Throttle: FIFO throttler doesn't need to signal waiters when max changes Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 5/6] common/Throttle: make get() report number of waiters on entry/exit Jim Schutt
2012-02-01 15:54 ` [RFC PATCH 6/6] msg: log Message interactions with throttler Jim Schutt
2012-02-01 22:33 ` [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load Gregory Farnum
2012-02-02 15:38   ` Jim Schutt
     [not found]   ` <4F29CDAA.408@sandia.gov>
     [not found]     ` <CAF3hT9BZEP_FWS=qt8ivA++aDpPGGFzuD_PtMcvDRS2aDEN+hw@mail.gmail.com>
     [not found]       ` <4F2AABF5.6050803@sandia.gov>
2012-02-02 17:52         ` Gregory Farnum
2012-02-02 19:06           ` [EXTERNAL] " Jim Schutt
2012-02-02 19:15             ` Sage Weil
2012-02-02 19:33               ` Jim Schutt
2012-02-02 19:32             ` Gregory Farnum
2012-02-02 20:22               ` Jim Schutt
2012-02-02 20:31                 ` Jim Schutt
2012-02-03  0:28                 ` [EXTERNAL] " Gregory Farnum
2012-02-03 16:17                   ` Jim Schutt
2012-02-03 17:06                     ` Gregory Farnum
2012-02-03 23:33                       ` Jim Schutt
     [not found]                         ` <CAC-hyiHSNv_VgLcyVCrJ66HxTGFNBONrmmBddJk5326dLTKgkw@mail.gmail.com>
2012-02-04  0:04                           ` Yehuda Sadeh Weinraub
2012-02-06 16:20                           ` Jim Schutt
2012-02-06 17:22                             ` Yehuda Sadeh Weinraub
2012-02-06 18:20                               ` Jim Schutt
2012-02-06 18:35                                 ` Gregory Farnum
2012-02-09 20:53                                   ` Jim Schutt
2012-02-09 22:40                                     ` sridhar basam
2012-02-09 23:15                                       ` Jim Schutt
2012-02-10  0:34                                         ` Tommi Virtanen
2012-02-10  1:26                                         ` sridhar basam
2012-02-10 15:32                                           ` [EXTERNAL] " Jim Schutt
2012-02-10 17:13                                             ` sridhar basam
2012-02-10 23:09                                               ` Jim Schutt
2012-02-11  0:05                                                 ` sridhar basam
2012-02-13 15:26                                                   ` Jim Schutt
2012-02-03 17:07                     ` Sage Weil
2012-02-24 15:38           ` Jim Schutt
2012-02-24 18:31             ` Tommi Virtanen
2012-02-24 18:38               ` Tommi Virtanen
2013-02-21  0:12             ` Sage Weil
2013-02-26 19:16               ` Jim Schutt
2013-02-26 19:36                 ` Sage Weil
2013-02-28 19:37                   ` Jim Schutt
2013-02-28 21:06                     ` Sage Weil
