* scaling issues
@ 2012-03-08 23:31 Jim Schutt
  2012-03-09  0:26 ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Jim Schutt @ 2012-03-08 23:31 UTC (permalink / raw)
  To: ceph-devel

Hi,

I've been trying to scale up a Ceph filesystem to as big
as I have hardware for - up to 288 OSDs right now.

(I'm using commit ed0f605365e - tip of master branch from
a few days ago.)

My problem is that I cannot get a 288 OSD filesystem to go active
(that's with 1 mon and 1 MDS).  Pretty quickly I start seeing
"mds e4 e4: 1/1/1 up {0=cs33=up:creating(laggy or crashed)}".
Note that as this is happening all the OSDs and the MDS are
essentially idle; only the mon is busy.

While tailing the mon log I noticed there was a periodic pause;
after adding a little more debug printing, I learned that the
pause was due to encoding pg_stat_t before writing the pg_map to disk.

Here's the result of a scaling study I did on startup time for
a freshly created filesystem.  I normally run 24 OSDs/server on
these machines with no trouble, for small numbers of OSDs.

                    seconds from      seconds from     seconds to
    OSD       PG   store() mount     store() mount      encode
                        to            to all PGs       pg_stat_t   Notes
                    up:active        active+clean*

     48     9504       58                63              0.30
     72    14256       70                89              0.65
     96    19008       93               117              1.1
    120    23760      132               138              1.7
    144    28512       92               165              2.3
    168    33264      215               218              3.2       periods of "up:creating(laggy or crashed)"
    192    38016      392               344              4.0       periods of "up:creating(laggy or crashed)"
    240    47520     1189               644              6.3       periods of "up:creating(laggy or crashed)"
    288    57024   >14400            >14400              9.0       never went active; >200 OSDs out, reporting "wrongly marked me down"

*   active+clean includes active+clean+scrubbing, i.e., no peering or creating
**  all runs up to 288 used mon osd down out interval = 30; 288 used that for
       first hour, then switched to 300

It might be that the filesystem never went to active at 288 OSDs due
to some lurking bugs, but even so, the results for the time to encode
pg_stat_t are worrisome; gnuplot fit it for me to
     2.18341 * exp(OSDs/171.373) - 2.67065

----
After 79 iterations the fit converged.
final sum of squares of residuals : 0.0363573
rel. change during last iteration : -4.77639e-06

degrees of freedom    (FIT_NDF)                        : 6
rms of residuals      (FIT_STDFIT) = sqrt(WSSR/ndf)    : 0.0778431
variance of residuals (reduced chisquare) = WSSR/ndf   : 0.00605955

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

a               = 2.18341          +/- 0.2276       (10.42%)
b               = 171.373          +/- 8.344        (4.869%)
c               = -2.67065         +/- 0.3049       (11.42%)
----

I haven't dug deeply into what all goes into a pg_stat_t; how is that
expected to scale?  I tried to fit it to some other functions, but
they didn't look as good to me (not very scientific).

If that fit is correct, and I had the hardware to double my cluster
size to 576 OSDs, the time to encode pg_stat_t for such a cluster
would be ~60 seconds.  That seems unlikely to work well, and what
I'd really like to get to is thousands of OSDs.
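
(For reference, that's 2.18341 * exp(576/171.373) - 2.67065, i.e. roughly
2.18 * 28.8 - 2.67, or about 60 seconds.)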

Let me know if there is anything I can do to help with this.  I've still
got the mon logs for the above runs, with debug ms = 1 and debug mon = 10.

-- Jim

P.S. Here's how I instrumented to get the above results:


diff --git a/src/mon/PGMap.cc b/src/mon/PGMap.cc
index d961ac1..58198d7 100644
--- a/src/mon/PGMap.cc
+++ b/src/mon/PGMap.cc
@@ -5,6 +5,7 @@

  #define DOUT_SUBSYS mon
  #include "common/debug.h"
+#include "common/Clock.h"

  #include "common/Formatter.h"

@@ -311,8 +312,17 @@ void PGMap::encode(bufferlist &bl) const
    __u8 v = 3;
    ::encode(v, bl);
    ::encode(version, bl);
+
+  utime_t start = ceph_clock_now(g_ceph_context);
    ::encode(pg_stat, bl);
+  utime_t end = ceph_clock_now(g_ceph_context);
+  dout(10) << "PGMap::encode pg_stat took " << end - start << dendl;
+
+  start = end;
    ::encode(osd_stat, bl);
+  end = ceph_clock_now(g_ceph_context);
+  dout(10) << "PGMap::encode osd_stat took " << end - start << dendl;
+
    ::encode(last_osdmap_epoch, bl);
    ::encode(last_pg_scan, bl);
    ::encode(full_ratio, bl);
-- 
1.7.8.2
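
(With debug mon = 10, each pgmap write then shows up in the mon log as a
line containing "PGMap::encode pg_stat took <seconds>"; that's where the
encode times in the table above come from.)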



* Re: scaling issues
  2012-03-08 23:31 scaling issues Jim Schutt
@ 2012-03-09  0:26 ` Sage Weil
  2012-03-09 19:39   ` Jim Schutt
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2012-03-09  0:26 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Thu, 8 Mar 2012, Jim Schutt wrote:
> Hi,
> 
> I've been trying to scale up a Ceph filesystem to as big
> as I have hardware for - up to 288 OSDs right now.
> 
> (I'm using commit ed0f605365e - tip of master branch from
> a few days ago.)
> 
> My problem is that I cannot get a 288 OSD filesystem to go active
> (that's with 1 mon and 1 MDS).  Pretty quickly I start seeing
> "mds e4 e4: 1/1/1 up {0=cs33=up:creating(laggy or crashed)}".
> Note that as this is happening all the OSDs and the MDS are
> essentially idle; only the mon is busy.
> 
> While tailing the mon log I noticed there was a periodic pause;
> after adding a little more debug printing, I learned that the
> pause was due to encoding pg_stat_t before writing the pg_map to disk.
> 
> Here's the result of a scaling study I did on startup time for
> a freshly created filesystem.  I normally run 24 OSDs/server on
> these machines with no trouble, for small numbers of OSDs.
> 
>                    seconds from      seconds from     seconds to
>    OSD       PG   store() mount     store() mount      encode
>                        to            to all PGs       pg_stat_t   Notes
>                    up:active        active+clean*
> 
>     48     9504       58                63              0.30
>     72    14256       70                89              0.65
>     96    19008       93               117              1.1
>    120    23760      132               138              1.7
>    144    28512       92               165              2.3
>    168    33264      215               218              3.2       periods of "up:creating(laggy or crashed)"
>    192    38016      392               344              4.0       periods of "up:creating(laggy or crashed)"
>    240    47520     1189               644              6.3       periods of "up:creating(laggy or crashed)"
>    288    57024   >14400            >14400              9.0       never went active; >200 OSDs out, reporting "wrongly marked me down"

Weird, pg_stat_t really shouldn't be growing quadratically.  Can you look 
at the size of the monitor's pg/latest file, and see if those are growing 
quadratically as well?  I would expect it to be proportional to the 
encode time.

And maybe send us a copy of one of the big ones?

Thanks-
sage


> 
> *   active+clean includes active+clean+scrubbing, i.e., no peering or creating
> **  all runs up to 288 used mon osd down out interval = 30; 288 used that for
>       first hour, then switched to 300
> 
> It might be that the filesystem never went to active at 288 OSDs due
> to some lurking bugs, but even so, the results for time to encode
> pg_stat_t is worrisome; gnuplot fit it for me to
>     2.18341 * exp(OSDs/171.373) - 2.67065
> 
> ----
> After 79 iterations the fit converged.
> final sum of squares of residuals : 0.0363573
> rel. change during last iteration : -4.77639e-06
> 
> degrees of freedom    (FIT_NDF)                        : 6
> rms of residuals      (FIT_STDFIT) = sqrt(WSSR/ndf)    : 0.0778431
> variance of residuals (reduced chisquare) = WSSR/ndf   : 0.00605955
> 
> Final set of parameters            Asymptotic Standard Error
> =======================            ==========================
> 
> a               = 2.18341          +/- 0.2276       (10.42%)
> b               = 171.373          +/- 8.344        (4.869%)
> c               = -2.67065         +/- 0.3049       (11.42%)
> ----
> 
> I haven't dug deeply into what all goes into a pg_stat_t; how is that
> expected to scale?  I tried to fit it to some other functions, but
> they didn't look as good to me (not very scientific).
> 
> If that fit is correct, and I had the hardware to double my cluster
> size to 576 OSDs, the time to encode pg_stat_t for such a cluster
> would be ~60 seconds.  That seems unlikely to work well, and what
> I'd really like to get to is thousands of OSDs.
> 
> Let me know if there is anything I can do to help with this.  I've still
> got the mon logs for the above runs, with debug ms = 1 and debug mon = 10;
> 
> -- Jim
> 
> P.S. Here's how I instrumented to get above results:
> 
> 
> diff --git a/src/mon/PGMap.cc b/src/mon/PGMap.cc
> index d961ac1..58198d7 100644
> --- a/src/mon/PGMap.cc
> +++ b/src/mon/PGMap.cc
> @@ -5,6 +5,7 @@
> 
>  #define DOUT_SUBSYS mon
>  #include "common/debug.h"
> +#include "common/Clock.h"
> 
>  #include "common/Formatter.h"
> 
> @@ -311,8 +312,17 @@ void PGMap::encode(bufferlist &bl) const
>    __u8 v = 3;
>    ::encode(v, bl);
>    ::encode(version, bl);
> +
> +  utime_t start = ceph_clock_now(g_ceph_context);
>    ::encode(pg_stat, bl);
> +  utime_t end = ceph_clock_now(g_ceph_context);
> +  dout(10) << "PGMap::encode pg_stat took " << end - start << dendl;
> +
> +  start = end;
>    ::encode(osd_stat, bl);
> +  end = ceph_clock_now(g_ceph_context);
> +  dout(10) << "PGMap::encode osd_stat took " << end - start << dendl;
> +
>    ::encode(last_osdmap_epoch, bl);
>    ::encode(last_pg_scan, bl);
>    ::encode(full_ratio, bl);
> -- 
> 1.7.8.2
> 


* Re: scaling issues
  2012-03-09  0:26 ` Sage Weil
@ 2012-03-09 19:39   ` Jim Schutt
  2012-03-09 23:21     ` Jim Schutt
  0 siblings, 1 reply; 8+ messages in thread
From: Jim Schutt @ 2012-03-09 19:39 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 03/08/2012 05:26 PM, Sage Weil wrote:
> On Thu, 8 Mar 2012, Jim Schutt wrote:
>> Hi,
>>
>> I've been trying to scale up a Ceph filesystem to as big
>> as I have hardware for - up to 288 OSDs right now.
>>
>> (I'm using commit ed0f605365e - tip of master branch from
>> a few days ago.)
>>
>> My problem is that I cannot get a 288 OSD filesystem to go active
>> (that's with 1 mon and 1 MDS).  Pretty quickly I start seeing
>> "mds e4 e4: 1/1/1 up {0=cs33=up:creating(laggy or crashed)}".
>> Note that as this is happening all the OSDs and the MDS are
>> essentially idle; only the mon is busy.
>>
>> While tailing the mon log I noticed there was a periodic pause;
>> after adding a little more debug printing, I learned that the
>> pause was due to encoding pg_stat_t before writing the pg_map to disk.
>>
>> Here's the result of a scaling study I did on startup time for
>> a freshly created filesystem.  I normally run 24 OSDs/server on
>> these machines with no trouble, for small numbers of OSDs.
>>
>>                     seconds from      seconds from     seconds to
>>     OSD       PG   store() mount     store() mount      encode
>>                         to            to all PGs       pg_stat_t   Notes
>>                     up:active        active+clean*
>>
>>      48     9504       58                63              0.30
>>      72    14256       70                89              0.65
>>      96    19008       93               117              1.1
>>     120    23760      132               138              1.7
>>     144    28512       92               165              2.3
>>     168    33264      215               218              3.2       periods of "up:creating(laggy or crashed)"
>>     192    38016      392               344              4.0       periods of "up:creating(laggy or crashed)"
>>     240    47520     1189               644              6.3       periods of "up:creating(laggy or crashed)"
>>     288    57024   >14400            >14400              9.0       never went active; >200 OSDs out, reporting "wrongly marked me down"
>
> Weird, pg_stat_t really shouldn't be growing quadratically.  Can you look
> at the size of the monitors pg/latest file, and see if those are growing
> quadratically as well?  I would expect it to be proportional to the
> encode time.

Here's the size of the last written mon/pgmap/latest for each run,
and the time it took to encode the pg_stat_t part of that last
instance of the file:

OSDs  size of  pg_stat_t
        latest   encode
                  time

  48   2976397  0.323052
  72   4472477  0.666633
  96   5969461  1.159198
120   7466021  1.738096
144   8963141  2.428229
168  10460309  3.203832
192  11956709  4.083013
240  14950445  6.453171
288  17916589  9.462052

Looks like the file size is growing linearly, but the
encode time is not?
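
To put rough numbers on that: latest stays at ~62000 bytes/OSD across the
whole range (2976397/48 vs 17916589/288), while encode time per OSD grows
from ~0.0067 s to ~0.033 s, about 5x.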

>
> And maybe send us a copy of one of the big ones?

On its way.

-- Jim

>
> Thanks-
> sage
>
>



* Re: scaling issues
  2012-03-09 19:39   ` Jim Schutt
@ 2012-03-09 23:21     ` Jim Schutt
  2012-04-10 16:22       ` Jim Schutt
  0 siblings, 1 reply; 8+ messages in thread
From: Jim Schutt @ 2012-03-09 23:21 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 03/09/2012 12:39 PM, Jim Schutt wrote:
> On 03/08/2012 05:26 PM, Sage Weil wrote:
>> On Thu, 8 Mar 2012, Jim Schutt wrote:
>>> Hi,
>>>
>>> I've been trying to scale up a Ceph filesystem to as big
>>> as I have hardware for - up to 288 OSDs right now.
>>>
>>> (I'm using commit ed0f605365e - tip of master branch from
>>> a few days ago.)
>>>
>>> My problem is that I cannot get a 288 OSD filesystem to go active
>>> (that's with 1 mon and 1 MDS). Pretty quickly I start seeing
>>> "mds e4 e4: 1/1/1 up {0=cs33=up:creating(laggy or crashed)}".
>>> Note that as this is happening all the OSDs and the MDS are
>>> essentially idle; only the mon is busy.
>>>
>>> While tailing the mon log I noticed there was a periodic pause;
>>> after adding a little more debug printing, I learned that the
>>> pause was due to encoding pg_stat_t before writing the pg_map to disk.
>>>
>>> Here's the result of a scaling study I did on startup time for
>>> a freshly created filesystem. I normally run 24 OSDs/server on
>>> these machines with no trouble, for small numbers of OSDs.
>>>
>>>                     seconds from      seconds from     seconds to
>>>     OSD       PG   store() mount     store() mount      encode
>>>                         to            to all PGs       pg_stat_t   Notes
>>>                     up:active        active+clean*
>>>
>>>      48     9504       58                63              0.30
>>>      72    14256       70                89              0.65
>>>      96    19008       93               117              1.1
>>>     120    23760      132               138              1.7
>>>     144    28512       92               165              2.3
>>>     168    33264      215               218              3.2       periods of "up:creating(laggy or crashed)"
>>>     192    38016      392               344              4.0       periods of "up:creating(laggy or crashed)"
>>>     240    47520     1189               644              6.3       periods of "up:creating(laggy or crashed)"
>>>     288    57024   >14400            >14400              9.0       never went active; >200 OSDs out, reporting "wrongly marked me down"
>>
>> Weird, pg_stat_t really shouldn't be growing quadratically. Can you look
>> at the size of the monitors pg/latest file, and see if those are growing
>> quadratically as well? I would expect it to be proportional to the
>> encode time.
>
> Here's the size of the last written mon/pgmap/latest for each run,
> and the time it took to encode the pg_stat_t part of that last
> instance of the file:
>
> OSDs  size of  pg_stat_t
>         latest   encode
>                   time
>
>   48   2976397  0.323052
>   72   4472477  0.666633
>   96   5969461  1.159198
>  120   7466021  1.738096
>  144   8963141  2.428229
>  168  10460309  3.203832
>  192  11956709  4.083013
>  240  14950445  6.453171
>  288  17916589  9.462052
>
> Looks like the file size is growing linearly, but the
> encode time is not?

I remembered I've been compiling with just -g (i.e., no optimization) since
I started working with ceph (so I could poke around easily in gdb, and at
small sizes it didn't seem to hurt anything).  That's how I generated the
above results.

I recompiled with -g -O2, and got this:

OSDs  size of  pg_stat_t
        latest   encode
                  time

   48   2976461  0.052731
   72   4472477  0.107187
   96   5969477  0.194690
  120   7466021  0.311586
  144   8963141  0.465111
  168  10460317  0.680222
  192  11956709  0.713398
  240  14950437  1.159426
  288  17944413  1.714004
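
(Comparing endpoints: latest grew ~6x from 48 to 288 OSDs, but the encode
time grew ~32x.)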

It seems that encoding time still isn't proportional to the
size of pgmap/latest.  However, things have improved enough
that my 288 OSD filesystem goes active pretty quickly (~90 sec),
so I can continue testing at that scale.

-- Jim


>
>>
>> And maybe send us a copy of one of the big ones?
>
> On its way.
>
> -- Jim
>
>>
>> Thanks-
>> sage
>>
>>
>




* Re: scaling issues
  2012-03-09 23:21     ` Jim Schutt
@ 2012-04-10 16:22       ` Jim Schutt
  2012-04-10 16:39         ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Jim Schutt @ 2012-04-10 16:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 03/09/2012 04:21 PM, Jim Schutt wrote:
> On 03/09/2012 12:39 PM, Jim Schutt wrote:
>> On 03/08/2012 05:26 PM, Sage Weil wrote:
>>> On Thu, 8 Mar 2012, Jim Schutt wrote:
>>>> Hi,
>>>>
>>>> I've been trying to scale up a Ceph filesystem to as big
>>>> as I have hardware for - up to 288 OSDs right now.
>>>>
>>>> (I'm using commit ed0f605365e - tip of master branch from
>>>> a few days ago.)
>>>>
>>>> My problem is that I cannot get a 288 OSD filesystem to go active
>>>> (that's with 1 mon and 1 MDS). Pretty quickly I start seeing
>>>> "mds e4 e4: 1/1/1 up {0=cs33=up:creating(laggy or crashed)}".
>>>> Note that as this is happening all the OSDs and the MDS are
>>>> essentially idle; only the mon is busy.
>>>>
>>>> While tailing the mon log I noticed there was a periodic pause;
>>>> after adding a little more debug printing, I learned that the
>>>> pause was due to encoding pg_stat_t before writing the pg_map to disk.
>>>>
>>>> Here's the result of a scaling study I did on startup time for
>>>> a freshly created filesystem. I normally run 24 OSDs/server on
>>>> these machines with no trouble, for small numbers of OSDs.

[snip]

>
> I recompiled with -g -O2, and got this:
>
> OSDs size of pg_stat_t
>      latest  encode
>              time
>
> 48  2976461  0.052731
> 72  4472477  0.107187
> 96  5969477  0.194690
> 120 7466021  0.311586
> 144 8963141  0.465111
> 168 10460317 0.680222
> 192 11956709 0.713398
> 240 14950437 1.159426
> 288 17944413 1.714004
>
> It seems that encoding time still isn't proportional to the
> size of pgmap/latest. However, things have improved enough
> that my 288 OSD filesystem goes active pretty quickly (~90 sec),
> so I can continue testing at that scale.
>

I'm still having trouble at 288 OSDs under heavy write load
(166 linux clients running dd simultaneously).  I'm currently
running with master branch from last week - commit e792cd938897.

The symptom is that the cluster cycles between "up:active"
and "up:active(laggy or crashed)".  When the cluster goes into
"laggy or crashed" the client caps go stale, and cluster throughput
(as monitored by vmstat on OSD servers) trails off to zero.  After a
short idle period, the cluster goes back "up:active", clients
renew their caps, and cluster throughput goes back to its maximum
until the next cycle starts.

I believe this is a scaling issue because when I use pg_bits = 5
and pgp_bits = 5 (instead of the default 6) to build the filesystem,
I can write >20 TB using the same test, with no instances of the
cluster going  "laggy or crashed".  Perhaps it is related to
the encoding time for pg_stat_t that I reported above?

The problem with using pg_bits = 5 is that the data distribution
is not particularly even; after writing 20 TB to 288 OSDs I see
(max OSD use)/(min OSD use) = ~2. Even with pg_bits = 6 after
writing 20 TB I see (max OSD use)/(min OSD use) = ~1.5.
I think I'd like that variability to be even smaller.
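
As a sanity check on how much of that spread is just statistics, here's the
kind of toy balls-into-bins sketch I have in mind (plain uniform random
placement of PGs onto OSDs - not CRUSH, and it ignores replication and
object sizes, so the numbers are only indicative of the trend):

// Toy simulation: place (n_osds << pg_bits) PGs uniformly at random onto
// n_osds OSDs and report the per-OSD min/max PG count.  Not CRUSH; just a
// rough illustration of how fewer PGs per OSD widens the spread.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
  std::mt19937 rng(12345);
  const int n_osds = 288;
  for (int pg_bits = 5; pg_bits <= 7; ++pg_bits) {
    const int n_pgs = n_osds << pg_bits;
    std::vector<int> per_osd(n_osds, 0);
    std::uniform_int_distribution<int> pick(0, n_osds - 1);
    for (int pg = 0; pg < n_pgs; ++pg)
      per_osd[pick(rng)]++;
    int lo = *std::min_element(per_osd.begin(), per_osd.end());
    int hi = *std::max_element(per_osd.begin(), per_osd.end());
    printf("pg_bits=%d: %d PGs, PGs/OSD min/max = %d/%d (ratio %.2f)\n",
           pg_bits, n_pgs, lo, hi, (double)hi / lo);
  }
  return 0;
}

The trend it shows (the spread tightening as pg_bits grows) matches the
direction of what I'm seeing, which is part of why I'd rather not drop
to pg_bits = 5.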

AFAICS I'm getting 3 pools of (n_OSDs << pg_bits) PGs, one pool
each for each of CEPH_DATA_RULE, CEPH_METADATA_RULE, and
CEPH_RBD_RULE.  So, for 288 OSDs I get 3*(288<<6) = 55296 PGs,
plus a few thousand more for the localized PGs.

I can't seem to find any use of CEPH_RBD_RULE in the code, other
than to create that pool.  What am I missing?  I'd like to just
not create that pool to reduce my PG count - what problems might
that cause?

Also, what would be the downside if I tried to not create the
CEPH_METADATA_RULE pool, and just put everything into the
CEPH_DATA_RULE pool?  That way I could run with just one pool.

In the longer run, can anything be done to keep the monitor
daemon responsive when running with thousands of OSDs under a
heavy write load?

Thanks -- Jim



* Re: scaling issues
  2012-04-10 16:22       ` Jim Schutt
@ 2012-04-10 16:39         ` Sage Weil
  2012-04-10 19:01           ` [EXTERNAL] " Jim Schutt
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2012-04-10 16:39 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Tue, 10 Apr 2012, Jim Schutt wrote:
> On 03/09/2012 04:21 PM, Jim Schutt wrote:
> > On 03/09/2012 12:39 PM, Jim Schutt wrote:
> > > On 03/08/2012 05:26 PM, Sage Weil wrote:
> > > > On Thu, 8 Mar 2012, Jim Schutt wrote:
> > > > > Hi,
> > > > > 
> > > > > I've been trying to scale up a Ceph filesystem to as big
> > > > > as I have hardware for - up to 288 OSDs right now.
> > > > > 
> > > > > (I'm using commit ed0f605365e - tip of master branch from
> > > > > a few days ago.)
> > > > > 
> > > > > My problem is that I cannot get a 288 OSD filesystem to go active
> > > > > (that's with 1 mon and 1 MDS). Pretty quickly I start seeing
> > > > > "mds e4 e4: 1/1/1 up {0=cs33=up:creating(laggy or crashed)}".
> > > > > Note that as this is happening all the OSDs and the MDS are
> > > > > essentially idle; only the mon is busy.
> > > > > 
> > > > > While tailing the mon log I noticed there was a periodic pause;
> > > > > after adding a little more debug printing, I learned that the
> > > > > pause was due to encoding pg_stat_t before writing the pg_map to disk.
> > > > > 
> > > > > Here's the result of a scaling study I did on startup time for
> > > > > a freshly created filesystem. I normally run 24 OSDs/server on
> > > > > these machines with no trouble, for small numbers of OSDs.
> 
> [snip]
> 
> > 
> > I recompiled with -g -O2, and got this:
> > 
> > OSDs size of pg_stat_t
> >      latest  encode
> >              time
> > 
> > 48  2976461  0.052731
> > 72  4472477  0.107187
> > 96  5969477  0.194690
> > 120 7466021  0.311586
> > 144 8963141  0.465111
> > 168 10460317 0.680222
> > 192 11956709 0.713398
> > 240 14950437 1.159426
> > 288 17944413 1.714004
> > 
> > It seems that encoding time still isn't proportional to the
> > size of pgmap/latest. However, things have improved enough
> > that my 288 OSD filesystem goes active pretty quickly (~90 sec),
> > so I can continue testing at that scale.

A fix for this was just merged into master last night.

> I'm still having trouble at 288 OSDs with under heavy write load
> (166 linux clients running dd simultaneously).  I'm currently
> running with master branch from last week - commit e792cd938897.
> 
> The symptom is that the cluster cycles between "up:active"
> and "up:active(laggy or crashed)".  When the cluster goes into
> "laggy or crashed" the client caps go stale, and cluster throughput
> (as monitored by vmstat on OSD servers) trails off to zero.  After a
> short idle period, the cluster goes back "up:active", clients
> renew their caps, and cluster throughput goes back to its maximum
> until the next cycle starts.
> 
> I believe this is a scaling issue because when I use pg_bits = 5
> and pgp_bits = 5 (instead of the default 6) to build the filesystem,
> I can write >20 TB using the same test, with no instances of the
> cluster going  "laggy or crashed".  Perhaps it is related to
> the encoding time for pg_stat_t that I reported above?

Yeah, that sounds like the culprit to me.  Can you try with the latest 
master?

> The problem with using pg_bits = 5 is that the data distribution
> is not particularly even; after writing 20 TB to 288 OSDs I see
> (max OSD use)/(min OSD use) = ~2. Even with pg_bits = 6 after
> writing 20 TB I see (max OSD use)/(min OSD use) = ~1.5.
> I think I'd like that variability to be even smaller.

There is some infrastructure in the monitor to correct for the statistical 
imbalance, but it isn't triggered automatically yet.  It's probably time 
to look at that.

> AFAICS I'm getting 3 pools of (n_OSDs << pg_bits) PGs, one pool
> each for each of CEPH_DATA_RULE, CEPH_METADATA_RULE, and
> CEPH_RBD_RULE.  So, for 288 OSDs I get 3*(288<<6) = 55296 PGs,
> plus a few thousand more for the localized PGs.
> 
> I can't seem to find any use of CEPH_RBD_RULE in the code, other
> than to create that pool.  What am I missing?  I'd like to just
> not create that pool to reduce my PG count - what problems might
> that cause?

None.   We create the rbd pool by default but it isn't used by the 
filesystem; it's just the default pool used by the 'rbd' command line 
tool.

> Also, what would be the downside if I tried to not create the
> CEPH_METADATA_RULE pool, and just put everything into the
> CEPH_DATA_RULE pool?  That way I could run with just one pool.

You could do that too.  The idea was that people might want a different 
replication level or placement for metadata (faster nodes, more replicas, 
whatever).

But.. try with master first, as the PG scaling issue needs fixing 
regardless, is hopefully fixed now, and will probably make all of this 
moot... :)

> In the longer run, can anything be done to keep the monitor
> daemon responsive when running with thousands of OSDs under a
> heavy write load?

Right now the monitor is being used to aggregate usage information, which 
is probably not the best use of its time.  I don't expect it will become a 
real problem for a while, though (as long as we avoid bugs like this one).

Thanks!
sage


* Re: [EXTERNAL] Re: scaling issues
  2012-04-10 16:39         ` Sage Weil
@ 2012-04-10 19:01           ` Jim Schutt
  2012-04-10 22:38             ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Jim Schutt @ 2012-04-10 19:01 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 04/10/2012 10:39 AM, Sage Weil wrote:
> On Tue, 10 Apr 2012, Jim Schutt wrote:
>> On 03/09/2012 04:21 PM, Jim Schutt wrote:
>>> On 03/09/2012 12:39 PM, Jim Schutt wrote:
>>>> On 03/08/2012 05:26 PM, Sage Weil wrote:
>>>>> On Thu, 8 Mar 2012, Jim Schutt wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been trying to scale up a Ceph filesystem to as big
>>>>>> as I have hardware for - up to 288 OSDs right now.
>>>>>>
>>>>>> (I'm using commit ed0f605365e - tip of master branch from
>>>>>> a few days ago.)
>>>>>>
>>>>>> My problem is that I cannot get a 288 OSD filesystem to go active
>>>>>> (that's with 1 mon and 1 MDS). Pretty quickly I start seeing
>>>>>> "mds e4 e4: 1/1/1 up {0=cs33=up:creating(laggy or crashed)}".
>>>>>> Note that as this is happening all the OSDs and the MDS are
>>>>>> essentially idle; only the mon is busy.
>>>>>>
>>>>>> While tailing the mon log I noticed there was a periodic pause;
>>>>>> after adding a little more debug printing, I learned that the
>>>>>> pause was due to encoding pg_stat_t before writing the pg_map to disk.
>>>>>>
>>>>>> Here's the result of a scaling study I did on startup time for
>>>>>> a freshly created filesystem. I normally run 24 OSDs/server on
>>>>>> these machines with no trouble, for small numbers of OSDs.
>>
>> [snip]
>>
>>>
>>> I recompiled with -g -O2, and got this:
>>>
>>> OSDs size of pg_stat_t
>>>       latest  encode
>>>               time
>>>
>>> 48  2976461  0.052731
>>> 72  4472477  0.107187
>>> 96  5969477  0.194690
>>> 120 7466021  0.311586
>>> 144 8963141  0.465111
>>> 168 10460317 0.680222
>>> 192 11956709 0.713398
>>> 240 14950437 1.159426
>>> 288 17944413 1.714004
>>>
>>> It seems that encoding time still isn't proportional to the
>>> size of pgmap/latest. However, things have improved enough
>>> that my 288 OSD filesystem goes active pretty quickly (~90 sec),
>>> so I can continue testing at that scale.
>
> A fix for this was just merged into master last night.

That's great!

>
>> I'm still having trouble at 288 OSDs with under heavy write load
>> (166 linux clients running dd simultaneously).  I'm currently
>> running with master branch from last week - commit e792cd938897.
>>
>> The symptom is that the cluster cycles between "up:active"
>> and "up:active(laggy or crashed)".  When the cluster goes into
>> "laggy or crashed" the client caps go stale, and cluster throughput
>> (as monitored by vmstat on OSD servers) trails off to zero.  After a
>> short idle period, the cluster goes back "up:active", clients
>> renew their caps, and cluster throughput goes back to its maximum
>> until the next cycle starts.
>>
>> I believe this is a scaling issue because when I use pg_bits = 5
>> and pgp_bits = 5 (instead of the default 6) to build the filesystem,
>> I can write >20 TB using the same test, with no instances of the
>> cluster going  "laggy or crashed".  Perhaps it is related to
>> the encoding time for pg_stat_t that I reported above?
>
> Yeah, that sounds like the culprit to me.  Can you try with the latest
> master?

I'm 24 TB into a 25 TB run with pg_bits = 6, and the cluster
has been up:active the whole time.  Seems fixed to me!

<checks running ceph -w output>

Interesting. One of my servers just died.

FWIW, when all 24 OSDs on that server dropped out, the cluster went
"up:active(laggy or crashed)" for about 8 minutes while recovery started,
then went back "up:active" while recovery is finishing.

I thought you'd be interested to know - is that behavior unremarkable?

FWIW, here's what happened on that server:

[510342.517157] ------------[ cut here ]------------
[510342.521855] kernel BUG at fs/btrfs/extent_io.c:3982!
[510342.526894] invalid opcode: 0000 [#1] SMP
[510342.531102] CPU 4
[510342.533028] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun kvm uinput sg sd_mod joydev ata_piix libata button microcode mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_mad ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ehci_hcd uhci_hcd ioatdma dm_mod i7core_edac edac_core nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[510342.587836]
[510342.589412] Pid: 16609, comm: kworker/4:2 Not tainted 3.3.1-00162-gd8b2857 #15 Supermicro X8DTH-i/6/iF/6F/X8DTH
[510342.599601] RIP: 0010:[<ffffffffa057924c>]  [<ffffffffa057924c>] btrfs_release_extent_buffer_page.clone.0+0x2c/0x130 [btrfs]
[510342.610893] RSP: 0018:ffff88015fb6ba10  EFLAGS: 00010202
[510342.616277] RAX: 0000000000000004 RBX: ffff880ab81865a0 RCX: ffff880174bc0230
[510342.623476] RDX: ffff8801335bf9b1 RSI: 00000000000d0fb8 RDI: ffff880ab81865a0
[510342.630675] RBP: ffff88015fb6ba40 R08: 0000000000000038 R09: 0000000000000003
[510342.637874] R10: 0000000000000008 R11: ffff8804658c9e40 R12: ffff88015fb6a000
[510342.645069] R13: ffff880ab81865a0 R14: 000000000000000e R15: ffff88015fb6bc10
[510342.652268] FS:  0000000000000000(0000) GS:ffff880627c80000(0000) knlGS:0000000000000000
[510342.660418] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[510342.666234] CR2: ffffffffff600400 CR3: 0000000001a05000 CR4: 00000000000006e0
[510342.673427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[510342.680627] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[510342.687827] Process kworker/4:2 (pid: 16609, threadinfo ffff88015fb6a000, task ffff880102ca4410)
[510342.696669] Stack:
[510342.698769]  ffff880100000000 ffff880ab81865a0 ffff88015fb6a000 ffff8806057d2eb0
[510342.706297]  000000000000000e ffff88015fb6bc10 ffff88015fb6ba70 ffffffffa05793f2
[510342.713825]  ffff88015fb6bb80 ffff880ab81865a0 ffff88015fb6bb50 0000000000000008
[510342.721362] Call Trace:
[510342.723912]  [<ffffffffa05793f2>] release_extent_buffer+0xa2/0xe0 [btrfs]
[510342.730790]  [<ffffffffa05795b4>] free_extent_buffer+0x34/0x80 [btrfs]
[510342.737407]  [<ffffffffa057a126>] btree_write_cache_pages+0x246/0x410 [btrfs]
[510342.744637]  [<ffffffffa054e96a>] btree_writepages+0x3a/0x50 [btrfs]
[510342.751060]  [<ffffffff810fc421>] do_writepages+0x21/0x40
[510342.756537]  [<ffffffff810f0b0b>] __filemap_fdatawrite_range+0x5b/0x60
[510342.763136]  [<ffffffff810f0de3>] filemap_fdatawrite_range+0x13/0x20
[510342.769568]  [<ffffffffa0554ecf>] btrfs_write_marked_extents+0x7f/0xe0 [btrfs]
[510342.776867]  [<ffffffffa0554f5e>] btrfs_write_and_wait_marked_extents+0x2e/0x60 [btrfs]
[510342.784951]  [<ffffffffa0554fbb>] btrfs_write_and_wait_transaction+0x2b/0x50 [btrfs]
[510342.792768]  [<ffffffffa055604c>] btrfs_commit_transaction+0x7ac/0xa10 [btrfs]
[510342.800060]  [<ffffffff81079540>] ? set_next_entity+0x90/0xa0
[510342.805875]  [<ffffffff8105f5d0>] ? wake_up_bit+0x40/0x40
[510342.811365]  [<ffffffffa0556590>] ? btrfs_end_transaction+0x20/0x20 [btrfs]
[510342.818403]  [<ffffffffa05565af>] do_async_commit+0x1f/0x30 [btrfs]
[510342.824748]  [<ffffffffa0556590>] ? btrfs_end_transaction+0x20/0x20 [btrfs]
[510342.831774]  [<ffffffff81058680>] process_one_work+0x140/0x490
[510342.837673]  [<ffffffff8105a417>] worker_thread+0x187/0x3f0
[510342.843319]  [<ffffffff8105a290>] ? manage_workers+0x120/0x120
[510342.849225]  [<ffffffff8105f02e>] kthread+0x9e/0xb0
[510342.854176]  [<ffffffff81486c64>] kernel_thread_helper+0x4/0x10
[510342.860168]  [<ffffffff8147d84a>] ? retint_restore_args+0xe/0xe
[510342.866161]  [<ffffffff8105ef90>] ? kthread_freezable_should_stop+0x80/0x80
[510342.873198]  [<ffffffff81486c60>] ? gs_change+0xb/0xb
[510342.878322] Code: 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 66 66 66 66 90 8b 47 38 49 89 fd 85 c0 75 0c 48 8b 47 20 4c 8d 7f 20 84 c0 79 04 <0f> 0b eb fe 48 8b 47 20 a8 04 75 f4 48 8b 07 49 89 c4 4c 03 67
[510342.898331] RIP  [<ffffffffa057924c>] btrfs_release_extent_buffer_page.clone.0+0x2c/0x130 [btrfs]
[510342.907294]  RSP <ffff88015fb6ba10>
[510342.911241] ---[ end trace 62013c6b6e2e5135 ]---

I'm running 3.3.1 + Chris Mason's btrfs for-linus branch (commit bc3f116fec194f,
just before merge into 3.4-rc1).  This kernel also has the ceph for-next branch
(commit c666601a935b, just before merge into 3.4-rc1), but that shouldn't matter
for this.

>
>> The problem with using pg_bits = 5 is that the data distribution
>> is not particularly even; after writing 20 TB to 288 OSDs I see
>> (max OSD use)/(min OSD use) = ~2. Even with pg_bits = 6 after
>> writing 20 TB I see (max OSD use)/(min OSD use) = ~1.5.
>> I think I'd like that variability to be even smaller.
>
> There is some infrastructure in the monitor to correct for the statistical
> imbalance, but it isn't triggered automatically yet.  It's probably time
> to look at that.

FWIW I've noticed while learning about how Ceph hashes objects into
PGs and onto OSDs that Bob Jenkins has a hash function newer than
what Ceph is using; see http://burtleburtle.net/bob/c/lookup3.c.

I haven't been able to find any comparisons of the older hash
to this newer one, wrt. uniformity; I wonder if it would be
more suitable, and thus reduce the imbalance without the need
for correction?
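
If it would help, the kind of uniformity comparison I have in mind is
something like the sketch below: bucket a pile of synthetic object names
into PGs with a candidate hash and compute a chi-square statistic against
the uniform expectation.  (std::hash is only a stand-in here for whichever
hash is being tested; neither Ceph's current hash nor lookup3 is included.)

// Rough uniformity check for a candidate object->PG hash.  For a good
// hash the chi-square value should land near (buckets - 1).
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

double chi_square(const std::vector<long>& bucket, long n_objects) {
  double expected = (double)n_objects / bucket.size();
  double chi2 = 0.0;
  for (long count : bucket) {
    double d = count - expected;
    chi2 += d * d / expected;
  }
  return chi2;
}

int main() {
  const int n_pgs = 288 << 6;        // data-pool PG count from above
  const long n_objects = 1000000;
  std::hash<std::string> hash_fn;    // placeholder for the hash under test
  std::vector<long> bucket(n_pgs, 0);
  for (long i = 0; i < n_objects; ++i)
    bucket[hash_fn("obj." + std::to_string(i)) % n_pgs]++;
  printf("chi-square = %.1f (expect ~%d for a uniform hash)\n",
         chi_square(bucket, n_objects), n_pgs - 1);
  return 0;
}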

>
>> AFAICS I'm getting 3 pools of (n_OSDs << pg_bits) PGs, one pool
>> each for each of CEPH_DATA_RULE, CEPH_METADATA_RULE, and
>> CEPH_RBD_RULE.  So, for 288 OSDs I get 3*(288<<6) = 55296 PGs,
>> plus a few thousand more for the localized PGs.
>>
>> I can't seem to find any use of CEPH_RBD_RULE in the code, other
>> than to create that pool.  What am I missing?  I'd like to just
>> not create that pool to reduce my PG count - what problems might
>> that cause?
>
> None.   We create the rbd pool by default but it isn't used by the
> filesystem; it's just the default pool used by the 'rbd' command line
> tool.

OK, thanks for verifying.
>
>> Also, what would be the downside if I tried to not create the
>> CEPH_METADATA_RULE pool, and just put everything into the
>> CEPH_DATA_RULE pool?  That way I could run with just one pool.
>
> You could do that too.  The idea was that people might want a different
> replication level or placement for metadata (faster nodes, more replicas,
> whatever).

Again, thanks for verifying.

>
> But.. try with master first, as the PG scaling issue needs fixing
> regardless, is hopefully fixed now, and will probably make all of this
> moot... :)

I'll keep those tricks in mind if I ever get enough hardware
that the number of PGs becomes an issue again.

>
>> In the longer run, can anything be done to keep the monitor
>> daemon responsive when running with thousands of OSDs under a
>> heavy write load?
>
> Right now the monitor is being used to aggregate usage information, which
> is probably not the best use of its time.  I don't expect it will become a
> real problem for a while, though (as long as we avoid bugs like this one).

Hmm, if I did a few runs with bigger pg_bits values, that would give
some idea of how far away that time is, right?

>
> Thanks!

Thanks for the timely bug fix!  I really appreciate it.

-- Jim

> sage
>
>




* Re: [EXTERNAL] Re: scaling issues
  2012-04-10 19:01           ` [EXTERNAL] " Jim Schutt
@ 2012-04-10 22:38             ` Sage Weil
  0 siblings, 0 replies; 8+ messages in thread
From: Sage Weil @ 2012-04-10 22:38 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Tue, 10 Apr 2012, Jim Schutt wrote:
> On 04/10/2012 10:39 AM, Sage Weil wrote:
> > On Tue, 10 Apr 2012, Jim Schutt wrote:
> > > On 03/09/2012 04:21 PM, Jim Schutt wrote:
> > > > On 03/09/2012 12:39 PM, Jim Schutt wrote:
> > > > > On 03/08/2012 05:26 PM, Sage Weil wrote:
> > > > > > On Thu, 8 Mar 2012, Jim Schutt wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > I've been trying to scale up a Ceph filesystem to as big
> > > > > > > as I have hardware for - up to 288 OSDs right now.
> > > > > > > 
> > > > > > > (I'm using commit ed0f605365e - tip of master branch from
> > > > > > > a few days ago.)
> > > > > > > 
> > > > > > > My problem is that I cannot get a 288 OSD filesystem to go active
> > > > > > > (that's with 1 mon and 1 MDS). Pretty quickly I start seeing
> > > > > > > "mds e4 e4: 1/1/1 up {0=cs33=up:creating(laggy or crashed)}".
> > > > > > > Note that as this is happening all the OSDs and the MDS are
> > > > > > > essentially idle; only the mon is busy.
> > > > > > > 
> > > > > > > While tailing the mon log I noticed there was a periodic pause;
> > > > > > > after adding a little more debug printing, I learned that the
> > > > > > > pause was due to encoding pg_stat_t before writing the pg_map to
> > > > > > > disk.
> > > > > > > 
> > > > > > > Here's the result of a scaling study I did on startup time for
> > > > > > > a freshly created filesystem. I normally run 24 OSDs/server on
> > > > > > > these machines with no trouble, for small numbers of OSDs.
> > > 
> > > [snip]
> > > 
> > > > 
> > > > I recompiled with -g -O2, and got this:
> > > > 
> > > > OSDs size of pg_stat_t
> > > >       latest  encode
> > > >               time
> > > > 
> > > > 48  2976461  0.052731
> > > > 72  4472477  0.107187
> > > > 96  5969477  0.194690
> > > > 120 7466021  0.311586
> > > > 144 8963141  0.465111
> > > > 168 10460317 0.680222
> > > > 192 11956709 0.713398
> > > > 240 14950437 1.159426
> > > > 288 17944413 1.714004
> > > > 
> > > > It seems that encoding time still isn't proportional to the
> > > > size of pgmap/latest. However, things have improved enough
> > > > that my 288 OSD filesystem goes active pretty quickly (~90 sec),
> > > > so I can continue testing at that scale.
> > 
> > A fix for this was just merged into master last night.
> 
> That's great!
> 
> > 
> > > I'm still having trouble at 288 OSDs with under heavy write load
> > > (166 linux clients running dd simultaneously).  I'm currently
> > > running with master branch from last week - commit e792cd938897.
> > > 
> > > The symptom is that the cluster cycles between "up:active"
> > > and "up:active(laggy or crashed)".  When the cluster goes into
> > > "laggy or crashed" the client caps go stale, and cluster throughput
> > > (as monitored by vmstat on OSD servers) trails off to zero.  After a
> > > short idle period, the cluster goes back "up:active", clients
> > > renew their caps, and cluster throughput goes back to its maximum
> > > until the next cycle starts.
> > > 
> > > I believe this is a scaling issue because when I use pg_bits = 5
> > > and pgp_bits = 5 (instead of the default 6) to build the filesystem,
> > > I can write >20 TB using the same test, with no instances of the
> > > cluster going  "laggy or crashed".  Perhaps it is related to
> > > the encoding time for pg_stat_t that I reported above?
> > 
> > Yeah, that sounds like the culprit to me.  Can you try with the latest
> > master?
> 
> I'm 24 TB into a 25 TB run with pg_bits = 6, and the cluster
> has been up:active the whole time.  Seems fixed to me!

Hooray!

> <checks running ceph -w output>
> 
> Interesting. One of my servers just died.
> 
> FWIW, when all 24 OSDs on that server dropped out, the cluster went
> "up:active(laggy or crashed)" for about 8 minutes while recovery started,
> then went back "up:active" while recovery is finishing.
> 
> I thought you'd be interested to know - is that behavior unremarkable?
> 
> FWIW, here's what happened on that server:
> 
> [510342.517157] ------------[ cut here ]------------
> [510342.521855] kernel BUG at fs/btrfs/extent_io.c:3982!
> [510342.526894] invalid opcode: 0000 [#1] SMP
> [510342.531102] CPU 4
> [510342.533028] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm
> ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror
> dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap
> macvlan tun kvm uinput sg sd_mod joydev ata_piix libata button microcode
> mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_mad
> ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support
> ehci_hcd uhci_hcd ioatdma dm_mod i7core_edac edac_core nfs nfs_acl auth_rpcgss
> fscache lockd sunrpc tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
> [510342.587836]
> [510342.589412] Pid: 16609, comm: kworker/4:2 Not tainted 3.3.1-00162-gd8b2857
> #15 Supermicro X8DTH-i/6/iF/6F/X8DTH
> [510342.599601] RIP: 0010:[<ffffffffa057924c>]  [<ffffffffa057924c>]
> btrfs_release_extent_buffer_page.clone.0+0x2c/0x130 [btrfs]
> [510342.610893] RSP: 0018:ffff88015fb6ba10  EFLAGS: 00010202
> [510342.616277] RAX: 0000000000000004 RBX: ffff880ab81865a0 RCX:
> ffff880174bc0230
> [510342.623476] RDX: ffff8801335bf9b1 RSI: 00000000000d0fb8 RDI:
> ffff880ab81865a0
> [510342.630675] RBP: ffff88015fb6ba40 R08: 0000000000000038 R09:
> 0000000000000003
> [510342.637874] R10: 0000000000000008 R11: ffff8804658c9e40 R12:
> ffff88015fb6a000
> [510342.645069] R13: ffff880ab81865a0 R14: 000000000000000e R15:
> ffff88015fb6bc10
> [510342.652268] FS:  0000000000000000(0000) GS:ffff880627c80000(0000)
> knlGS:0000000000000000
> [510342.660418] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [510342.666234] CR2: ffffffffff600400 CR3: 0000000001a05000 CR4:
> 00000000000006e0
> [510342.673427] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [510342.680627] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [510342.687827] Process kworker/4:2 (pid: 16609, threadinfo ffff88015fb6a000,
> task ffff880102ca4410)
> [510342.696669] Stack:
> [510342.698769]  ffff880100000000 ffff880ab81865a0 ffff88015fb6a000
> ffff8806057d2eb0
> [510342.706297]  000000000000000e ffff88015fb6bc10 ffff88015fb6ba70
> ffffffffa05793f2
> [510342.713825]  ffff88015fb6bb80 ffff880ab81865a0 ffff88015fb6bb50
> 0000000000000008
> [510342.721362] Call Trace:
> [510342.723912]  [<ffffffffa05793f2>] release_extent_buffer+0xa2/0xe0 [btrfs]
> [510342.730790]  [<ffffffffa05795b4>] free_extent_buffer+0x34/0x80 [btrfs]
> [510342.737407]  [<ffffffffa057a126>] btree_write_cache_pages+0x246/0x410
> [btrfs]
> [510342.744637]  [<ffffffffa054e96a>] btree_writepages+0x3a/0x50 [btrfs]
> [510342.751060]  [<ffffffff810fc421>] do_writepages+0x21/0x40
> [510342.756537]  [<ffffffff810f0b0b>] __filemap_fdatawrite_range+0x5b/0x60
> [510342.763136]  [<ffffffff810f0de3>] filemap_fdatawrite_range+0x13/0x20
> [510342.769568]  [<ffffffffa0554ecf>] btrfs_write_marked_extents+0x7f/0xe0
> [btrfs]
> [510342.776867]  [<ffffffffa0554f5e>]
> btrfs_write_and_wait_marked_extents+0x2e/0x60 [btrfs]
> [510342.784951]  [<ffffffffa0554fbb>]
> btrfs_write_and_wait_transaction+0x2b/0x50 [btrfs]
> [510342.792768]  [<ffffffffa055604c>] btrfs_commit_transaction+0x7ac/0xa10
> [btrfs]
> [510342.800060]  [<ffffffff81079540>] ? set_next_entity+0x90/0xa0
> [510342.805875]  [<ffffffff8105f5d0>] ? wake_up_bit+0x40/0x40
> [510342.811365]  [<ffffffffa0556590>] ? btrfs_end_transaction+0x20/0x20
> [btrfs]
> [510342.818403]  [<ffffffffa05565af>] do_async_commit+0x1f/0x30 [btrfs]
> [510342.824748]  [<ffffffffa0556590>] ? btrfs_end_transaction+0x20/0x20
> [btrfs]
> [510342.831774]  [<ffffffff81058680>] process_one_work+0x140/0x490
> [510342.837673]  [<ffffffff8105a417>] worker_thread+0x187/0x3f0
> [510342.843319]  [<ffffffff8105a290>] ? manage_workers+0x120/0x120
> [510342.849225]  [<ffffffff8105f02e>] kthread+0x9e/0xb0
> [510342.854176]  [<ffffffff81486c64>] kernel_thread_helper+0x4/0x10
> [510342.860168]  [<ffffffff8147d84a>] ? retint_restore_args+0xe/0xe
> [510342.866161]  [<ffffffff8105ef90>] ?
> kthread_freezable_should_stop+0x80/0x80
> [510342.873198]  [<ffffffff81486c60>] ? gs_change+0xb/0xb
> [510342.878322] Code: 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 66 66 66
> 66 90 8b 47 38 49 89 fd 85 c0 75 0c 48 8b 47 20 4c 8d 7f 20 84 c0 79 04 <0f>
> 0b eb fe 48 8b 47 20 a8 04 75 f4 48 8b 07 49 89 c4 4c 03 67
> [510342.898331] RIP  [<ffffffffa057924c>]
> btrfs_release_extent_buffer_page.clone.0+0x2c/0x130 [btrfs]
> [510342.907294]  RSP <ffff88015fb6ba10>
> [510342.911241] ---[ end trace 62013c6b6e2e5135 ]---
> 
> I'm running 3.3.1 + Chris Mason's btrfs for-linus branch (commit
> bc3f116fec194f,
> just before merge into 3.4-rc1).  This kernel also has the ceph for-next
> branch
> (commit c666601a935b, just before merge into 3.4-rc1), but that shouldn't
> matter
> for this.
> 
> > 
> > > The problem with using pg_bits = 5 is that the data distribution
> > > is not particularly even; after writing 20 TB to 288 OSDs I see
> > > (max OSD use)/(min OSD use) = ~2. Even with pg_bits = 6 after
> > > writing 20 TB I see (max OSD use)/(min OSD use) = ~1.5.
> > > I think I'd like that variability to be even smaller.
> > 
> > There is some infrastructure in the monitor to correct for the statistical
> > imbalance, but it isn't triggered automatically yet.  It's probably time
> > to look at that.
> 
> FWIW I've noticed while learning about how Ceph hashes objects into
> PGs and onto OSDs that Bob Jenkins has a hash function newer than
> what Ceph is using; see http://burtleburtle.net/bob/c/lookup3.c.
> 
> I haven't been able to find any comparisons of the older hash
> to this newer one, wrt. uniformity; I wonder if it would be
> more suitable, and thus reduce the imbalance without the need
> for correction?

It's possible.  We plan to have a hard look at CRUSH and do some 
improvements in a month or two.  This will be part of that.

> > > AFAICS I'm getting 3 pools of (n_OSDs << pg_bits) PGs, one pool
> > > each for each of CEPH_DATA_RULE, CEPH_METADATA_RULE, and
> > > CEPH_RBD_RULE.  So, for 288 OSDs I get 3*(288<<6) = 55296 PGs,
> > > plus a few thousand more for the localized PGs.
> > > 
> > > I can't seem to find any use of CEPH_RBD_RULE in the code, other
> > > than to create that pool.  What am I missing?  I'd like to just
> > > not create that pool to reduce my PG count - what problems might
> > > that cause?
> > 
> > None.   We create the rbd pool by default but it isn't used by the
> > filesystem; it's just the default pool used by the 'rbd' command line
> > tool.
> 
> OK, thanks for verifying.
> > 
> > > Also, what would be the downside if I tried to not create the
> > > CEPH_METADATA_RULE pool, and just put everything into the
> > > CEPH_DATA_RULE pool?  That way I could run with just one pool.
> > 
> > You could do that too.  The idea was that people might want a different
> > replication level or placement for metadata (faster nodes, more replicas,
> > whatever).
> 
> Again, thanks for verifying.
> 
> > 
> > But.. try with master first, as the PG scaling issue needs fixing
> > regardless, is hopefully fixed now, and will probably make all of this
> > moot... :)
> 
> I'll keep those tricks in mind if I ever get enough hardware
> that the number of PGs becomes an issue again.
> 
> > 
> > > In the longer run, can anything be done to keep the monitor
> > > daemon responsive when running with thousands of OSDs under a
> > > heavy write load?
> > 
> > Right now the monitor is being used to aggregate usage information, which
> > is probably not the best use of its time.  I don't expect it will become a
> > real problem for a while, though (as long as we avoid bugs like this one).
> 
> Hmm, if I did a few runs with bigger pg_bits values, that would give
> some idea of how far away that time is, right?

Yeah.  Just watching ceph-mon CPU and memory utilization will give you a 
good idea.. it should scale linearly with the size of the cluster.  I 
suspect we can get pretty far by throwing hardware at it, but I'm not sure 
how close we are to it being problematic.

> Thanks for the timely bug fix!  I really appreciate it.

And thanks for testing!

sage


>
> 
> -- Jim
> 
> > sage
> > 
> > 
> 
> 

