* Crash and strange things on MDS
@ 2013-02-04 18:01 Kevin Decherf
  2013-02-11 13:05 ` Kevin Decherf
  0 siblings, 1 reply; 27+ messages in thread
From: Kevin Decherf @ 2013-02-04 18:01 UTC (permalink / raw)
  To: ceph-devel; +Cc: support

Hey everyone,

This is my first post here; I want to report a potential issue I found
today using Ceph 0.56.1.

The cluster configuration is, briefly: 27 OSDs of ~900 GB each and 3
MON/MDS nodes. All nodes run Exherbo (a source-based distribution) with
Ceph 0.56.1 and Linux 3.7.0. We only use CephFS on this cluster; it is
mounted on ~60 clients (a number that increases each day). Objects are
replicated three times and the cluster currently holds only 7 GB of data
for 350k objects.

Under certain conditions (I don't yet know which ones), some clients hang,
cause CPU overloads (kworker) and are unable to perform any I/O on Ceph.
The active MDS has ~20 Mbps in/out during the issue (less than 2 Mbps in
normal activity). I don't know whether it is directly related, but we also
observe a lot of missing files at the same time.

The problem is similar to this one [1].

Restarting the client or the MDS was enough until today, but we have found
a new behavior: the active MDS consumes a lot of CPU for 3 to 5 hours, with
~25% of the clients hanging.

In the logs I found a segfault with this backtrace [2] and 100,000 dumped
events during the first hang. We observed another hang which produced a
lot of these events (in debug mode):
   - "mds.0.server FAIL on ESTALE but attempting recovery"
   - "mds.0.server reply_request -116 (Stale NFS file handle)
      client_request(client.10991:1031 getattr As #1000004bab0
      RETRY=132)"

We have no profiling tools available on these nodes, and I don't know what
I should search for in the 35 GB log file.

Note: the segmentation fault occurred only once, but the problem was
observed four times on this cluster.

Any help would be appreciated.

References:
[1] http://www.spinics.net/lists/ceph-devel/msg04903.html
[2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
    1: /usr/bin/ceph-mds() [0x817e82]
    2: (()+0xf140) [0x7f9091d30140]
    3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
    4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
    5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
    6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
    7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
    8: (Server::kill_session(Session*)+0x137) [0x549c67]
    9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
    10: (MDS::tick()+0x338) [0x4da928]
    11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
    12: (SafeTimerThread::entry()+0xd) [0x782bad]
    13: (()+0x7ddf) [0x7f9091d28ddf]
    14: (clone()+0x6d) [0x7f90909cc24d]

Cheers,
-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-04 18:01 Crash and strange things on MDS Kevin Decherf
@ 2013-02-11 13:05 ` Kevin Decherf
  2013-02-11 17:00   ` Sam Lang
  0 siblings, 1 reply; 27+ messages in thread
From: Kevin Decherf @ 2013-02-11 13:05 UTC (permalink / raw)
  To: ceph-devel; +Cc: support

On Mon, Feb 04, 2013 at 07:01:54PM +0100, Kevin Decherf wrote:
> Hey everyone,
> 
> It's my first post here to expose a potential issue I found today using
> Ceph 0.56.1.
> 
> The cluster configuration is, briefly: 27 osd of ~900GB and 3 MON/MDS.
> All nodes are running Exherbo (source-based distribution) with Ceph
> 0.56.1 and Linux 3.7.0. We are only using CephFS on this cluster which
> is mounted on ~60 clients (increasing each day). Objects are replicated
> three times and the cluster handles only 7GB of data atm for 350k
> objects.
> 
> In certain conditions (I don't know them atm), some clients hang,
> generate CPU overloads (kworker) and are unable to make any IO on
> Ceph. The active MDS have ~20Mbps in/out during the issue (less than
> 2Mbps in normal activity). I don't know if it's directly linked but we
> also observe a lot of missing files at the same time.
> 
> The problem is similar to this one [1].
> 
> A restart of the client or the MDS was enough before today, but we found
> a new behavior: the active MDS consumes a lot of CPU during 3 to 5 hours
> with ~25% clients hanging.
> 
> In logs I found a segfault with this backtrace [2] and 100,000 dumped
> events during the first hang. We observed another hang which produces
> lot of these events (in debug mode):
>    - "mds.0.server FAIL on ESTALE but attempting recovery"
>    - "mds.0.server reply_request -116 (Stale NFS file handle)
>       client_request(client.10991:1031 getattr As #1000004bab0
>       RETRY=132)"
> 
> We have no profiling tools available on these nodes, and I don't know
> what I should search in the 35 GB log file.
> 
> Note: the segmentation fault occured only once but the problem was
> observed four times on this cluster.
> 
> Any help may be appreciated.
> 
> References:
> [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
> [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>     1: /usr/bin/ceph-mds() [0x817e82]
>     2: (()+0xf140) [0x7f9091d30140]
>     3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
>     4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
>     5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
>     6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
>     7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
>     8: (Server::kill_session(Session*)+0x137) [0x549c67]
>     9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
>     10: (MDS::tick()+0x338) [0x4da928]
>     11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
>     12: (SafeTimerThread::entry()+0xd) [0x782bad]
>     13: (()+0x7ddf) [0x7f9091d28ddf]
>     14: (clone()+0x6d) [0x7f90909cc24d]

I found a possible cause/way to reproduce this issue.
We now have ~90 clients for 18 GB / 650k objects, and the storm occurs
when we run an "intensive IO" command (a tar of the whole pool, or an
rsync into one folder) on one of our clients (the only one which uses
ceph-fuse; I don't know whether the issue is limited to it).
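
For reference, the commands that trigger it look roughly like this (the
paths are placeholders, run from the ceph-fuse client):

   # tar of the whole pool, which walks and reads every file in the mount
   tar czf /backup/cephfs-$(date +%F).tar.gz -C /mnt/cephfs .

   # rsync of a large tree into a single folder of the mount
   rsync -a /srv/source-data/ /mnt/cephfs/one-folder/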

Any idea?

Cheers,
-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-11 13:05 ` Kevin Decherf
@ 2013-02-11 17:00   ` Sam Lang
  2013-02-11 18:54     ` Kevin Decherf
  0 siblings, 1 reply; 27+ messages in thread
From: Sam Lang @ 2013-02-11 17:00 UTC (permalink / raw)
  To: Kevin Decherf; +Cc: ceph-devel, support

On Mon, Feb 11, 2013 at 7:05 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> On Mon, Feb 04, 2013 at 07:01:54PM +0100, Kevin Decherf wrote:
>> Hey everyone,
>>
>> It's my first post here to expose a potential issue I found today using
>> Ceph 0.56.1.
>>
>> The cluster configuration is, briefly: 27 osd of ~900GB and 3 MON/MDS.
>> All nodes are running Exherbo (source-based distribution) with Ceph
>> 0.56.1 and Linux 3.7.0. We are only using CephFS on this cluster which
>> is mounted on ~60 clients (increasing each day). Objects are replicated
>> three times and the cluster handles only 7GB of data atm for 350k
>> objects.
>>
>> In certain conditions (I don't know them atm), some clients hang,
>> generate CPU overloads (kworker) and are unable to make any IO on
>> Ceph. The active MDS have ~20Mbps in/out during the issue (less than
>> 2Mbps in normal activity). I don't know if it's directly linked but we
>> also observe a lot of missing files at the same time.
>>
>> The problem is similar to this one [1].
>>
>> A restart of the client or the MDS was enough before today, but we found
>> a new behavior: the active MDS consumes a lot of CPU during 3 to 5 hours
>> with ~25% clients hanging.
>>
>> In logs I found a segfault with this backtrace [2] and 100,000 dumped
>> events during the first hang. We observed another hang which produces
>> lot of these events (in debug mode):
>>    - "mds.0.server FAIL on ESTALE but attempting recovery"
>>    - "mds.0.server reply_request -116 (Stale NFS file handle)
>>       client_request(client.10991:1031 getattr As #1000004bab0
>>       RETRY=132)"

Hi Kevin, sorry for the delayed response.
This looks like the mds cache is thrashing quite a bit, and with
multiple MDSs the tree partitioning is causing those estale messages.
In your case, you should probably run with just a single active mds (I
assume all three MDSs are active, but ceph -s will tell you for sure),
and the others as standby.  I don't think you'll be able to do that
without starting over though.

Also, you might want to increase the size of the mds cache if you have
enough memory on that machine. mds cache size defaults to 100k; you might
increase it to 300k and see whether you still hit the same problems.
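
For example, something like this in ceph.conf on the MDS nodes (an
illustrative snippet; values and placement may need adjusting, and the mds
needs a restart to pick it up):

   [mds]
       mds cache size = 300000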

>>
>> We have no profiling tools available on these nodes, and I don't know
>> what I should search in the 35 GB log file.
>>
>> Note: the segmentation fault occured only once but the problem was
>> observed four times on this cluster.
>>
>> Any help may be appreciated.
>>
>> References:
>> [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
>> [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>>     1: /usr/bin/ceph-mds() [0x817e82]
>>     2: (()+0xf140) [0x7f9091d30140]
>>     3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
>>     4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
>>     5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
>>     6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
>>     7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
>>     8: (Server::kill_session(Session*)+0x137) [0x549c67]
>>     9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
>>     10: (MDS::tick()+0x338) [0x4da928]
>>     11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
>>     12: (SafeTimerThread::entry()+0xd) [0x782bad]
>>     13: (()+0x7ddf) [0x7f9091d28ddf]
>>     14: (clone()+0x6d) [0x7f90909cc24d]

>
> I found a possible cause/way to reproduce this issue.
> We have now ~90 clients for 18GB / 650k objects and the storm occurs
> when we execute an "intensive IO" command (tar of the whole pool / rsync
> in one folder) on one of our client (the only which uses ceph-fuse,
> don't know if it's limited to it or not).

Do you have debug logging enabled when you see this crash?  Can you
compress that mds log and post it somewhere or email it to me?

Thanks,
-sam

>
> Any idea?
>
> Cheers,
> --
> Kevin Decherf - @Kdecherf
> GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
> http://kdecherf.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Crash and strange things on MDS
  2013-02-11 17:00   ` Sam Lang
@ 2013-02-11 18:54     ` Kevin Decherf
  2013-02-11 20:25       ` Gregory Farnum
  0 siblings, 1 reply; 27+ messages in thread
From: Kevin Decherf @ 2013-02-11 18:54 UTC (permalink / raw)
  To: Sam Lang; +Cc: ceph-devel, support

On Mon, Feb 11, 2013 at 11:00:15AM -0600, Sam Lang wrote:
> Hi Kevin, sorry for the delayed response.
> This looks like the mds cache is thrashing quite a bit, and with
> multiple MDSs the tree partitioning is causing those estale messages.
> In your case, you should probably run with just a single active mds (I
> assume all three MDSs are active, but ceph -s will tell you for sure),
> and the others as standby.  I don't think you'll be able to do that
> without starting over though.

Hi Sam,

I know that MDS clustering is a bit buggy, so I have only one active MDS
on this cluster.

Here is the output of ceph -s:

   ~ # ceph -s
      health HEALTH_OK
      monmap e1: 3 mons at {a=x:6789/0,b=y:6789/0,c=z:6789/0}, election epoch 48, quorum 0,1,2 a,b,c
      osdmap e79: 27 osds: 27 up, 27 in
       pgmap v895343: 5376 pgs: 5376 active+clean; 18987 MB data, 103 GB used, 21918 GB / 23201 GB avail
      mdsmap e73: 1/1/1 up {0=b=up:active}, 2 up:standby


> Also, you might want to increase the size of the mds cache if you have
> enough memory on that machine.  mds cache size defaults to 100k, you
> might increase it to 300k and see if you get the same problems.

I have 24 GB of memory on each MDS, so I will try to increase this value.
Thanks for the advice.

> Do you have debug logging enabled when you see this crash?  Can you
> compress that mds log and post it somewhere or email it to me?

Yes, I have 34 GB of raw logs (for this issue), but I have no debug log of
the beginning of the storm itself. I will upload a compressed archive.


Furthermore, I observe another strange thing more or less related to the
storms.

During an rsync command writing ~20 GB of data to Ceph, and during (and
after) the storm, one OSD sends a lot of data to the active MDS (a
400 Mbps peak every 6 seconds). After a quick check, I found that when I
stop osd.23, osd.14 stops its peaks.

I will forward a copy of the debug-enabled log of osd.14.

The only significant difference between osd.23 and the others is its hb_in
list, where osd.14 is missing (but I think it's unrelated).

   ~ # ceph pg dump
   osdstat  kbused   kbavail  kb hb in hb out
   0  4016228  851255948   901042464   [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   1  4108748  851163428   901042464   [0,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26]  []
   2  4276584  850995592   901042464   [0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   3  3997368  851274808   901042464   [0,1,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] []
   4  4358212  850913964   901042464   [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   5  4039112  851233064   901042464   [0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   6  3971568  851300608   901042464   [0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   7  3942556  851329620   901042464   [0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   8  4275584  850996592   901042464   [0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   9  4279308  850992868   901042464   [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   10 3728136  851544040   901042464   [0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   11 3934096  851338080   901042464   [0,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   12 3991600  851280576   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   13 4211228  851060948   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   14 4169476  851102700   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26]   []
   15 4385584  850886592   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,17,18,19,20,21,22,23,24,25,26]   []
   16 3761176  851511000   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26]   []
   17 3646096  851626080   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26]   []
   18 4119448  851152728   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23,24,25,26]   []
   19 4592992  850679184   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,21,22,23,24,25,26]   []
   20 3740840  851531336   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26]   []
   21 4363552  850908624   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,22,23,24,25,26]   []
   22 3831420  851440756   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23,24,25,26]   []
   23 3681648  851590528   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,24,25,26]   []
   24 3946192  851325984   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,26]   []
   25 3954360  851317816   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26]   []
   26 3775532  851496644   901042464   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]   []
    sum  109098644   22983250108 24328146528

Cheers,
-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-11 18:54     ` Kevin Decherf
@ 2013-02-11 20:25       ` Gregory Farnum
  2013-02-11 22:24         ` Kevin Decherf
  2013-02-13 11:47         ` Kevin Decherf
  0 siblings, 2 replies; 27+ messages in thread
From: Gregory Farnum @ 2013-02-11 20:25 UTC (permalink / raw)
  To: Kevin Decherf; +Cc: Sam Lang, ceph-devel, support

On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> References:
> [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
> [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>     1: /usr/bin/ceph-mds() [0x817e82]
>     2: (()+0xf140) [0x7f9091d30140]
>     3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
>     4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
>     5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
>     6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
>     7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
>     8: (Server::kill_session(Session*)+0x137) [0x549c67]
>     9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
>     10: (MDS::tick()+0x338) [0x4da928]
>     11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
>     12: (SafeTimerThread::entry()+0xd) [0x782bad]
>     13: (()+0x7ddf) [0x7f9091d28ddf]
>     14: (clone()+0x6d) [0x7f90909cc24d]

This in particular is quite odd. Do you have any logging from when
that happened? (Oftentimes the log can have a bunch of debugging
information from shortly before the crash.)

On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> Furthermore, I observe another strange thing more or less related to the
> storms.
>
> During a rsync command to write ~20G of data on Ceph and during (and
> after) the storm, one OSD sends a lot of data to the active MDS
> (400Mbps peak each 6 seconds). After a quick check, I found that when I
> stop osd.23, osd.14 stops its peaks.

This is consistent with Sam's suggestion that MDS is thrashing its
cache, and is grabbing a directory object off of the OSDs. How large
are the directories you're using? If they're a significant fraction of
your cache size, it might be worth enabling the (sadly less stable)
directory fragmentation options, which will split them up into smaller
fragments that can be independently read and written to disk.
-Greg


* Re: Crash and strange things on MDS
  2013-02-11 20:25       ` Gregory Farnum
@ 2013-02-11 22:24         ` Kevin Decherf
  2013-02-11 22:47           ` Gregory Farnum
  2013-02-13 11:47         ` Kevin Decherf
  1 sibling, 1 reply; 27+ messages in thread
From: Kevin Decherf @ 2013-02-11 22:24 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sam Lang, ceph-devel, support

On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
> On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> > References:
> > [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
> > [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
> >     1: /usr/bin/ceph-mds() [0x817e82]
> >     2: (()+0xf140) [0x7f9091d30140]
> >     3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
> >     4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
> >     5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
> >     6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
> >     7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
> >     8: (Server::kill_session(Session*)+0x137) [0x549c67]
> >     9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
> >     10: (MDS::tick()+0x338) [0x4da928]
> >     11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
> >     12: (SafeTimerThread::entry()+0xd) [0x782bad]
> >     13: (()+0x7ddf) [0x7f9091d28ddf]
> >     14: (clone()+0x6d) [0x7f90909cc24d]
> 
> This in particular is quite odd. Do you have any logging from when
> that happened? (Oftentimes the log can have a bunch of debugging
> information from shortly before the crash.)

Yes, there is a dump of 100,000 events for this backtrace in the linked
archive (I need 7 hours to upload it).

> 
> On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> > Furthermore, I observe another strange thing more or less related to the
> > storms.
> >
> > During a rsync command to write ~20G of data on Ceph and during (and
> > after) the storm, one OSD sends a lot of data to the active MDS
> > (400Mbps peak each 6 seconds). After a quick check, I found that when I
> > stop osd.23, osd.14 stops its peaks.
> 
> This is consistent with Sam's suggestion that MDS is thrashing its
> cache, and is grabbing a directory object off of the OSDs. How large
> are the directories you're using? If they're a significant fraction of
> your cache size, it might be worth enabling the (sadly less stable)
> directory fragmentation options, which will split them up into smaller
> fragments that can be independently read and written to disk.

The distribution is heterogeneous: we have a folder of ~17 GB for 300k
objects, another of ~2 GB for 150k objects, and a lot of smaller
directories.
Are you talking about the mds bal frag and mds bal split * settings?
Do you have any advice about the values to use?

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-11 22:24         ` Kevin Decherf
@ 2013-02-11 22:47           ` Gregory Farnum
  2013-02-11 23:33             ` Kevin Decherf
  0 siblings, 1 reply; 27+ messages in thread
From: Gregory Farnum @ 2013-02-11 22:47 UTC (permalink / raw)
  To: Kevin Decherf; +Cc: Sam Lang, ceph-devel, support

On Mon, Feb 11, 2013 at 2:24 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
> On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
>> On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
>> > References:
>> > [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
>> > [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>> >     1: /usr/bin/ceph-mds() [0x817e82]
>> >     2: (()+0xf140) [0x7f9091d30140]
>> >     3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
>> >     4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
>> >     5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
>> >     6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
>> >     7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
>> >     8: (Server::kill_session(Session*)+0x137) [0x549c67]
>> >     9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
>> >     10: (MDS::tick()+0x338) [0x4da928]
>> >     11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
>> >     12: (SafeTimerThread::entry()+0xd) [0x782bad]
>> >     13: (()+0x7ddf) [0x7f9091d28ddf]
>> >     14: (clone()+0x6d) [0x7f90909cc24d]
>>
>> This in particular is quite odd. Do you have any logging from when
>> that happened? (Oftentimes the log can have a bunch of debugging
>> information from shortly before the crash.)
>
> Yes, there is a dump of 100,000 events for this backtrace in the linked
> archive (I need 7 hours to upload it).

Can you just pastebin the last couple hundred lines? I'm mostly interested
in whether there's anything from the function which actually caused the
assert/segfault. Also, the log should compress well and get much smaller!

>> On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
>> > Furthermore, I observe another strange thing more or less related to the
>> > storms.
>> >
>> > During a rsync command to write ~20G of data on Ceph and during (and
>> > after) the storm, one OSD sends a lot of data to the active MDS
>> > (400Mbps peak each 6 seconds). After a quick check, I found that when I
>> > stop osd.23, osd.14 stops its peaks.
>>
>> This is consistent with Sam's suggestion that MDS is thrashing its
>> cache, and is grabbing a directory object off of the OSDs. How large
>> are the directories you're using? If they're a significant fraction of
>> your cache size, it might be worth enabling the (sadly less stable)
>> directory fragmentation options, which will split them up into smaller
>> fragments that can be independently read and written to disk.
>
> The distribution is heterogeneous: we have a folder of ~17G for 300k
> objects, another of ~2G for 150k objects and a lof of smaller directories.

Sorry, you mean 300,000 files in a single folder?
If so, that's definitely why it's behaving so badly: your folder is larger
than your maximum cache size setting, so if you run an "ls" or anything
the MDS will read the whole thing off disk, then instantly drop most of
the folder from its cache, then re-read it all again for the next request
to list the contents, and so on.

> Are you talking about the mds bal frag and mds bal split * settings?
> Do you have any advice about the value to use?
If you set "mds bal frag = true" in your config, it will split up
those very large directories into smaller fragments and behave a lot
better. This isn't quite as stable (thus the default to "off"), so if
you have the memory to just really up your cache size I'd start with
that and see if it makes your problems better. But if it doesn't,
directory fragmentation does work reasonably well and it's something
we'd be interested in bug reports for. :)
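
For example, in ceph.conf on the MDS nodes (an illustrative snippet;
restart the mds after changing it):

   [mds]
       mds bal frag = true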
-Greg


* Re: Crash and strange things on MDS
  2013-02-11 22:47           ` Gregory Farnum
@ 2013-02-11 23:33             ` Kevin Decherf
  0 siblings, 0 replies; 27+ messages in thread
From: Kevin Decherf @ 2013-02-11 23:33 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sam Lang, ceph-devel, support

On Mon, Feb 11, 2013 at 02:47:13PM -0800, Gregory Farnum wrote:
> On Mon, Feb 11, 2013 at 2:24 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
> > On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
> > Yes, there is a dump of 100,000 events for this backtrace in the linked
> > archive (I need 7 hours to upload it).
> 
> Can you just pastebin the last couple hundred lines? I'm mostly
> interested if there's anything from the function which actually caused
> the assert/segfault. Also, the log should compress well and get much
> smaller!

Sent in pm.

And yes, I have a good compression rate but...

   % ls -lh                             
   total 38G
   -rw-r--r-- 1 kdecherf kdecherf 3.3G Feb 11 18:36 cc-ceph-log.tar.gz
   -rw------- 1 kdecherf kdecherf  66M Feb  4 17:57 ceph.log
   -rw-r--r-- 1 kdecherf kdecherf 3.5G Feb  4 14:44 ceph-mds.b.log
   -rw-r--r-- 1 kdecherf kdecherf  31G Feb  5 15:55 ceph-mds.c.log
   -rw-r--r-- 1 kdecherf kdecherf  27M Feb 11 19:46 ceph-osd.14.log

;-)

> > The distribution is heterogeneous: we have a folder of ~17G for 300k
> > objects, another of ~2G for 150k objects and a lof of smaller directories.
> 
> Sorry, you mean 300,000 files in the single folder?
> If so, that's definitely why it's behaving so badly — your folder is
> larger than your maximum cache size settings, and so if you run an
> "ls" or anything the MDS will read the whole thing off disk, then
> instantly drop most of the folder from its cache. Then re-read again
> for the next request to list contents, etc etc.

The biggest top-level folder contains 300k files, but they are split into
several subfolders (no subfolder contains more than 10,000 files at its
own level).

> > Are you talking about the mds bal frag and mds bal split * settings?
> > Do you have any advice about the value to use?
> If you set "mds bal frag = true" in your config, it will split up
> those very large directories into smaller fragments and behave a lot
> better. This isn't quite as stable (thus the default to "off"), so if
> you have the memory to just really up your cache size I'd start with
> that and see if it makes your problems better. But if it doesn't,
> directory fragmentation does work reasonably well and it's something
> we'd be interested in bug reports for. :)

I will try it, thanks!

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-11 20:25       ` Gregory Farnum
  2013-02-11 22:24         ` Kevin Decherf
@ 2013-02-13 11:47         ` Kevin Decherf
  2013-02-13 18:19           ` Gregory Farnum
  1 sibling, 1 reply; 27+ messages in thread
From: Kevin Decherf @ 2013-02-13 11:47 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sam Lang, ceph-devel, support

On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
> On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> > Furthermore, I observe another strange thing more or less related to the
> > storms.
> >
> > During a rsync command to write ~20G of data on Ceph and during (and
> > after) the storm, one OSD sends a lot of data to the active MDS
> > (400Mbps peak each 6 seconds). After a quick check, I found that when I
> > stop osd.23, osd.14 stops its peaks.
> 
> This is consistent with Sam's suggestion that MDS is thrashing its
> cache, and is grabbing a directory object off of the OSDs. How large
> are the directories you're using? If they're a significant fraction of
> your cache size, it might be worth enabling the (sadly less stable)
> directory fragmentation options, which will split them up into smaller
> fragments that can be independently read and written to disk.

I set mds cache size to 400000, but now I observe ~900 Mbps peaks from
osd.14 to the active MDS, osd.18 and osd.2.

osd.14 shares some PGs with osd.18 and osd.2:
http://pastebin.com/raw.php?i=uBAcTcu4


-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-13 11:47         ` Kevin Decherf
@ 2013-02-13 18:19           ` Gregory Farnum
  2013-02-16  1:02             ` Kevin Decherf
  0 siblings, 1 reply; 27+ messages in thread
From: Gregory Farnum @ 2013-02-13 18:19 UTC (permalink / raw)
  To: Kevin Decherf; +Cc: Sam Lang, ceph-devel, support

On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
>> On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
>> > Furthermore, I observe another strange thing more or less related to the
>> > storms.
>> >
>> > During a rsync command to write ~20G of data on Ceph and during (and
>> > after) the storm, one OSD sends a lot of data to the active MDS
>> > (400Mbps peak each 6 seconds). After a quick check, I found that when I
>> > stop osd.23, osd.14 stops its peaks.
>>
>> This is consistent with Sam's suggestion that MDS is thrashing its
>> cache, and is grabbing a directory object off of the OSDs. How large
>> are the directories you're using? If they're a significant fraction of
>> your cache size, it might be worth enabling the (sadly less stable)
>> directory fragmentation options, which will split them up into smaller
>> fragments that can be independently read and written to disk.
>
> I set mds cache size to 400000 but now I observe ~900Mbps peaks from
> osd.14 to the active mds, osd.18 and osd.2.
>
> osd.14 shares some pg with osd.18 and osd.2:
> http://pastebin.com/raw.php?i=uBAcTcu4

The high bandwidth from OSD to MDS really isn't a concern — that's the
MDS asking for data and getting it back quickly! We're concerned about
client responsiveness; has that gotten better?
-Greg


* Re: Crash and strange things on MDS
  2013-02-13 18:19           ` Gregory Farnum
@ 2013-02-16  1:02             ` Kevin Decherf
  2013-02-16 17:36               ` Sam Lang
  0 siblings, 1 reply; 27+ messages in thread
From: Kevin Decherf @ 2013-02-16  1:02 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sam Lang, ceph-devel, support

On Wed, Feb 13, 2013 at 10:19:36AM -0800, Gregory Farnum wrote:
> On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> > On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
> >> On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> >> > Furthermore, I observe another strange thing more or less related to the
> >> > storms.
> >> >
> >> > During a rsync command to write ~20G of data on Ceph and during (and
> >> > after) the storm, one OSD sends a lot of data to the active MDS
> >> > (400Mbps peak each 6 seconds). After a quick check, I found that when I
> >> > stop osd.23, osd.14 stops its peaks.
> >>
> >> This is consistent with Sam's suggestion that MDS is thrashing its
> >> cache, and is grabbing a directory object off of the OSDs. How large
> >> are the directories you're using? If they're a significant fraction of
> >> your cache size, it might be worth enabling the (sadly less stable)
> >> directory fragmentation options, which will split them up into smaller
> >> fragments that can be independently read and written to disk.
> >
> > I set mds cache size to 400000 but now I observe ~900Mbps peaks from
> > osd.14 to the active mds, osd.18 and osd.2.
> >
> > osd.14 shares some pg with osd.18 and osd.2:
> > http://pastebin.com/raw.php?i=uBAcTcu4
> 
> The high bandwidth from OSD to MDS really isn't a concern — that's the
> MDS asking for data and getting it back quickly! We're concerned about
> client responsiveness; has that gotten better?

It seems better now; I haven't seen any storm so far.

But we observe high latency on some of our clients (with no load). Is
there any documentation on how to read the perfcounters_dump output? I
would like to know whether the MDS still has a problem with its cache or
whether the latency comes from elsewhere.

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-16  1:02             ` Kevin Decherf
@ 2013-02-16 17:36               ` Sam Lang
  2013-02-16 18:24                 ` Kevin Decherf
  0 siblings, 1 reply; 27+ messages in thread
From: Sam Lang @ 2013-02-16 17:36 UTC (permalink / raw)
  To: Kevin Decherf; +Cc: Gregory Farnum, ceph-devel, support

On Fri, Feb 15, 2013 at 7:02 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
> On Wed, Feb 13, 2013 at 10:19:36AM -0800, Gregory Farnum wrote:
>> On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
>> > On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
>> >> On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
>> >> > Furthermore, I observe another strange thing more or less related to the
>> >> > storms.
>> >> >
>> >> > During a rsync command to write ~20G of data on Ceph and during (and
>> >> > after) the storm, one OSD sends a lot of data to the active MDS
>> >> > (400Mbps peak each 6 seconds). After a quick check, I found that when I
>> >> > stop osd.23, osd.14 stops its peaks.
>> >>
>> >> This is consistent with Sam's suggestion that MDS is thrashing its
>> >> cache, and is grabbing a directory object off of the OSDs. How large
>> >> are the directories you're using? If they're a significant fraction of
>> >> your cache size, it might be worth enabling the (sadly less stable)
>> >> directory fragmentation options, which will split them up into smaller
>> >> fragments that can be independently read and written to disk.
>> >
>> > I set mds cache size to 400000 but now I observe ~900Mbps peaks from
>> > osd.14 to the active mds, osd.18 and osd.2.
>> >
>> > osd.14 shares some pg with osd.18 and osd.2:
>> > http://pastebin.com/raw.php?i=uBAcTcu4
>>
>> The high bandwidth from OSD to MDS really isn't a concern — that's the
>> MDS asking for data and getting it back quickly! We're concerned about
>> client responsiveness; has that gotten better?
>
> It seems better now, I didn't see any storm so far.
>
> But we observe high latency on some of our clients (with no load). Does
> it exist any documentation on how to read the perfcounters_dump output?
> I would like to know if the MDS still has any problem with its cache or
> if the latency comes from elsewhere.

You can use perfcounters_schema to get the output format in JSON, using
something like json_pp to make it somewhat readable for a human. You can
post the dump output as well and we can try to help you parse/understand
what it's telling you.
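
For example, assuming the default admin socket path (adjust the path and
the mds name to your setup):

   ceph --admin-daemon /var/run/ceph/ceph-mds.b.asok perfcounters_schema | json_pp
   ceph --admin-daemon /var/run/ceph/ceph-mds.b.asok perfcounters_dump | json_pp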
-sam

>
> --
> Kevin Decherf - @Kdecherf
> GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
> http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-16 17:36               ` Sam Lang
@ 2013-02-16 18:24                 ` Kevin Decherf
  2013-02-19 18:15                   ` Gregory Farnum
  0 siblings, 1 reply; 27+ messages in thread
From: Kevin Decherf @ 2013-02-16 18:24 UTC (permalink / raw)
  To: Sam Lang; +Cc: Gregory Farnum, ceph-devel, support

On Sat, Feb 16, 2013 at 11:36:09AM -0600, Sam Lang wrote:
> On Fri, Feb 15, 2013 at 7:02 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
> > It seems better now, I didn't see any storm so far.
> >
> > But we observe high latency on some of our clients (with no load). Does
> > it exist any documentation on how to read the perfcounters_dump output?
> > I would like to know if the MDS still has any problem with its cache or
> > if the latency comes from elsewhere.
> 
> You can use perfcounters_schema to get the output format in json,
> using something like json_pp to make it somewhat readable for a human.
>  You can post the dump output as well and we can try to help you
> parse/understand what its telling you.

Ok, thanks.

Here you will find the output of perfcounters_dump on the active MDS:
http://pastebin.com/raw.php?i=BNerQ2fL

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-16 18:24                 ` Kevin Decherf
@ 2013-02-19 18:15                   ` Gregory Farnum
  2013-02-20  1:00                     ` Kevin Decherf
  0 siblings, 1 reply; 27+ messages in thread
From: Gregory Farnum @ 2013-02-19 18:15 UTC (permalink / raw)
  To: Kevin Decherf; +Cc: Sam Lang, ceph-devel, support

On Sat, Feb 16, 2013 at 10:24 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> On Sat, Feb 16, 2013 at 11:36:09AM -0600, Sam Lang wrote:
>> On Fri, Feb 15, 2013 at 7:02 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
>> > It seems better now, I didn't see any storm so far.
>> >
>> > But we observe high latency on some of our clients (with no load). Does
>> > it exist any documentation on how to read the perfcounters_dump output?
>> > I would like to know if the MDS still has any problem with its cache or
>> > if the latency comes from elsewhere.
>>
>> You can use perfcounters_schema to get the output format in json,
>> using something like json_pp to make it somewhat readable for a human.
>>  You can post the dump output as well and we can try to help you
>> parse/understand what its telling you.
>
> Ok, thanks.
>
> Here you will find the output of perfcounters_dump on the active MDS:
> http://pastebin.com/raw.php?i=BNerQ2fL

Looks like you've got ~424k dentries pinned, and it's trying to keep
400k inodes in cache. So you're still a bit oversubscribed, yes. This
might just be the issue where your clients are keeping a bunch of
inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
-Greg


* Re: Crash and strange things on MDS
  2013-02-19 18:15                   ` Gregory Farnum
@ 2013-02-20  1:00                     ` Kevin Decherf
  2013-02-20  1:09                       ` Gregory Farnum
  0 siblings, 1 reply; 27+ messages in thread
From: Kevin Decherf @ 2013-02-20  1:00 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sam Lang, ceph-devel, support

On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
> Looks like you've got ~424k dentries pinned, and it's trying to keep
> 400k inodes in cache. So you're still a bit oversubscribed, yes. This
> might just be the issue where your clients are keeping a bunch of
> inodes cached for the VFS (http://tracker.ceph.com/issues/3289).

Thanks for the analysis. We use only one ceph-fuse client at this time,
which runs all the "high-load" commands like rsync, tar and cp on a huge
number of files. Well, I will replace it with the kernel client.

After some tests we increased the mds cache size to 900k.

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-20  1:00                     ` Kevin Decherf
@ 2013-02-20  1:09                       ` Gregory Farnum
  2013-02-26 17:57                         ` Kevin Decherf
  0 siblings, 1 reply; 27+ messages in thread
From: Gregory Farnum @ 2013-02-20  1:09 UTC (permalink / raw)
  To: Kevin Decherf; +Cc: Sam Lang, ceph-devel, support

On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
> On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
>> Looks like you've got ~424k dentries pinned, and it's trying to keep
>> 400k inodes in cache. So you're still a bit oversubscribed, yes. This
>> might just be the issue where your clients are keeping a bunch of
>> inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
>
> Thanks for the analyze. We use only one ceph-fuse client at this time
> which makes all "high-load" commands like rsync, tar and cp on a huge
> amount of files. Well, I will replace it by the kernel client.

Oh, that bug is just an explanation of what's happening; I believe it
exists in the kernel client as well.
-Greg


* Re: Crash and strange things on MDS
  2013-02-20  1:09                       ` Gregory Farnum
@ 2013-02-26 17:57                         ` Kevin Decherf
  2013-02-26 18:10                           ` Gregory Farnum
  0 siblings, 1 reply; 27+ messages in thread
From: Kevin Decherf @ 2013-02-26 17:57 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sam Lang, ceph-devel, support

On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote:
> On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
> > On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
> >> Looks like you've got ~424k dentries pinned, and it's trying to keep
> >> 400k inodes in cache. So you're still a bit oversubscribed, yes. This
> >> might just be the issue where your clients are keeping a bunch of
> >> inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
> >
> > Thanks for the analyze. We use only one ceph-fuse client at this time
> > which makes all "high-load" commands like rsync, tar and cp on a huge
> > amount of files. Well, I will replace it by the kernel client.
> 
> Oh, that bug is just an explanation of what's happening; I believe it
> exists in the kernel client as well.

After setting the mds cache size to 900k, the storms are gone.
However, we continue to observe high latency on some clients (always the
same ones): each I/O takes between 40 and 90 ms (for example, with
WordPress it takes ~20 seconds to load all the needed files...).
With a non-laggy client, I/O requests take less than 1 ms.

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-26 17:57                         ` Kevin Decherf
@ 2013-02-26 18:10                           ` Gregory Farnum
  2013-02-26 19:58                             ` Kevin Decherf
  0 siblings, 1 reply; 27+ messages in thread
From: Gregory Farnum @ 2013-02-26 18:10 UTC (permalink / raw)
  To: Kevin Decherf; +Cc: Sam Lang, ceph-devel, support

On Tue, Feb 26, 2013 at 9:57 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote:
>> On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
>> > On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
>> >> Looks like you've got ~424k dentries pinned, and it's trying to keep
>> >> 400k inodes in cache. So you're still a bit oversubscribed, yes. This
>> >> might just be the issue where your clients are keeping a bunch of
>> >> inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
>> >
>> > Thanks for the analyze. We use only one ceph-fuse client at this time
>> > which makes all "high-load" commands like rsync, tar and cp on a huge
>> > amount of files. Well, I will replace it by the kernel client.
>>
>> Oh, that bug is just an explanation of what's happening; I believe it
>> exists in the kernel client as well.
>
> After setting the mds cache size to 900k, storms are gone.
> However we continue to observe high latency on some clients (always the
> same clients): each IO takes between 40 and 90ms (for example with
> Wordpress, it takes ~20 seconds to load all needed files...).
> With a non-laggy client, IO requests take less than 1ms.

I can't be sure from that description, but it sounds like you've got
one client which is generally holding the RW "caps" on the files, and
then another client which comes in occasionally to read those same
files. That requires the first client to drop its caps, and involves a
couple round-trip messages and is going to take some time — this is an
unavoidable consequence if you have clients sharing files, although
there's probably still room for us to optimize.

Can you describe your client workload in a bit more detail?
-Greg


* Re: Crash and strange things on MDS
  2013-02-26 18:10                           ` Gregory Farnum
@ 2013-02-26 19:58                             ` Kevin Decherf
  2013-02-26 20:26                               ` Gregory Farnum
  0 siblings, 1 reply; 27+ messages in thread
From: Kevin Decherf @ 2013-02-26 19:58 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sam Lang, ceph-devel, support

On Tue, Feb 26, 2013 at 10:10:06AM -0800, Gregory Farnum wrote:
> On Tue, Feb 26, 2013 at 9:57 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> > On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote:
> >> On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
> >> > On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
> >> >> Looks like you've got ~424k dentries pinned, and it's trying to keep
> >> >> 400k inodes in cache. So you're still a bit oversubscribed, yes. This
> >> >> might just be the issue where your clients are keeping a bunch of
> >> >> inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
> >> >
> >> > Thanks for the analyze. We use only one ceph-fuse client at this time
> >> > which makes all "high-load" commands like rsync, tar and cp on a huge
> >> > amount of files. Well, I will replace it by the kernel client.
> >>
> >> Oh, that bug is just an explanation of what's happening; I believe it
> >> exists in the kernel client as well.
> >
> > After setting the mds cache size to 900k, storms are gone.
> > However we continue to observe high latency on some clients (always the
> > same clients): each IO takes between 40 and 90ms (for example with
> > Wordpress, it takes ~20 seconds to load all needed files...).
> > With a non-laggy client, IO requests take less than 1ms.
> 
> I can't be sure from that description, but it sounds like you've got
> one client which is generally holding the RW "caps" on the files, and
> then another client which comes in occasionally to read those same
> files. That requires the first client to drop its caps, and involves a
> couple round-trip messages and is going to take some time — this is an
> unavoidable consequence if you have clients sharing files, although
> there's probably still room for us to optimize.
> 
> Can you describe your client workload in a bit more detail?

We have one folder per application (php, java, ruby). Every application
has small (<1M) files. Each folder is mounted by only one client by
default.

In case of overload, additional clients spawn to mount the same folder and
access the same files.

In the following test, only one client was used to serve the application
(a website using WordPress).

I ran the test with strace to record the time of each I/O request (strace
-T -e trace=file) and I noticed the same pattern:

...
[pid  4378] stat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.033409>
[pid  4378] lstat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.081642>
[pid  4378] open("/data/wp-includes/user.php", O_RDONLY) = 5 <0.041138>
[pid  4378] stat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.082303>
[pid  4378] lstat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.004090>
[pid  4378] open("/data/wp-includes/meta.php", O_RDONLY) = 5 <0.081929>
...

~250 files were accessed for just one request (thanks, WordPress).
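
For reference, those per-call times can be summarized with something like
this (assuming the strace output was saved to strace.log):

   awk -F'<' '/</ {t=$NF; sub(/>.*/, "", t); sum+=t; n++}
              END {printf "%d calls, avg %.4f s\n", n, sum/n}' strace.log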

The fs is mounted with these options: rw,noatime,name=<hidden>,secret=<hidden>,nodcache.

I have a debug (debug_mds=20) log of the active mds during this test if you want.
-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-26 19:58                             ` Kevin Decherf
@ 2013-02-26 20:26                               ` Gregory Farnum
  2013-02-26 21:57                                 ` Kevin Decherf
  0 siblings, 1 reply; 27+ messages in thread
From: Gregory Farnum @ 2013-02-26 20:26 UTC (permalink / raw)
  To: Kevin Decherf; +Cc: Sam Lang, ceph-devel, support

On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> We have one folder per application (php, java, ruby). Every application has
> small (<1M) files. The folder is mounted by only one client by default.
>
> In case of overload, another clients spawn to mount the same folder and
> access the same files.
>
> In the following test, only one client was used to serve the
> application (a website using wordpress).
>
> I made the test with strace to see the time of each IO request (strace -T
> -e trace=file) and I noticed the same pattern:
>
> ...
> [pid  4378] stat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.033409>
> [pid  4378] lstat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.081642>
> [pid  4378] open("/data/wp-includes/user.php", O_RDONLY) = 5 <0.041138>
> [pid  4378] stat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.082303>
> [pid  4378] lstat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.004090>
> [pid  4378] open("/data/wp-includes/meta.php", O_RDONLY) = 5 <0.081929>
> ...
>
> ~250 files were accessed for only one request (thanks Wordpress.).

Okay, that is slower than I'd expect, even for an across-the-wire request...

> The fs is mounted with these options: rw,noatime,name=<hidden>,secret=<hidden>,nodcache.

What kernel are you on, and why are you using nodcache? Did you have
problems without that mount option? It forces an MDS access for most
operations, rather than using locally cached data.

> I have a debug (debug_mds=20) log of the active mds during this test if you want.

Yeah, can you post it somewhere?
-Greg


* Re: Crash and strange things on MDS
  2013-02-26 20:26                               ` Gregory Farnum
@ 2013-02-26 21:57                                 ` Kevin Decherf
  2013-02-26 21:58                                   ` Gregory Farnum
  0 siblings, 1 reply; 27+ messages in thread
From: Kevin Decherf @ 2013-02-26 21:57 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sam Lang, ceph-devel, support

On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
> On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> > We have one folder per application (php, java, ruby). Every application has
> > small (<1M) files. The folder is mounted by only one client by default.
> >
> > In case of overload, another clients spawn to mount the same folder and
> > access the same files.
> >
> > In the following test, only one client was used to serve the
> > application (a website using wordpress).
> >
> > I made the test with strace to see the time of each IO request (strace -T
> > -e trace=file) and I noticed the same pattern:
> >
> > ...
> > [pid  4378] stat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.033409>
> > [pid  4378] lstat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.081642>
> > [pid  4378] open("/data/wp-includes/user.php", O_RDONLY) = 5 <0.041138>
> > [pid  4378] stat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.082303>
> > [pid  4378] lstat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.004090>
> > [pid  4378] open("/data/wp-includes/meta.php", O_RDONLY) = 5 <0.081929>
> > ...
> >
> > ~250 files were accessed for only one request (thanks Wordpress.).
> 
> Okay, that is slower than I'd expect, even for an across-the-wire request...
> 
> > The fs is mounted with these options: rw,noatime,name=<hidden>,secret=<hidden>,nodcache.
> 
> What kernel and why are you using nodcache?

We use kernel 3.7.0. nodcache is enabled by default (we only specify user
and secretfile as mount options), and I didn't find it in the mount.ceph
documentation.

> Did you have problems
> without that mount option? That's forcing an MDS access for most
> operations, rather than using local data.

Good question, I will try it (-o dcache?).
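
Something along these lines, I suppose (the monitor address, mount point
and credentials below are placeholders):

   mount -t ceph 10.0.0.1:6789:/ /data -o name=admin,secretfile=/etc/ceph/admin.secret,noatime,dcache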

> > I have a debug (debug_mds=20) log of the active mds during this test if you want.
> 
> Yeah, can you post it somewhere?

Upload in progress :-)

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


* Re: Crash and strange things on MDS
  2013-02-26 21:57                                 ` Kevin Decherf
@ 2013-02-26 21:58                                   ` Gregory Farnum
  2013-02-27  0:03                                     ` Yan, Zheng 
       [not found]                                     ` <20130227004923.GQ16091@kdecherf.com>
  0 siblings, 2 replies; 27+ messages in thread
From: Gregory Farnum @ 2013-02-26 21:58 UTC (permalink / raw)
  To: Kevin Decherf, Sage Weil; +Cc: Sam Lang, ceph-devel, support

On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
> On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
>> On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
>> > We have one folder per application (php, java, ruby). Every application has
>> > small (<1M) files. The folder is mounted by only one client by default.
>> >
>> > In case of overload, additional clients spawn to mount the same folder and
>> > access the same files.
>> >
>> > In the following test, only one client was used to serve the
>> > application (a website using wordpress).
>> >
>> > I made the test with strace to see the time of each IO request (strace -T
>> > -e trace=file) and I noticed the same pattern:
>> >
>> > ...
>> > [pid  4378] stat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.033409>
>> > [pid  4378] lstat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.081642>
>> > [pid  4378] open("/data/wp-includes/user.php", O_RDONLY) = 5 <0.041138>
>> > [pid  4378] stat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.082303>
>> > [pid  4378] lstat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.004090>
>> > [pid  4378] open("/data/wp-includes/meta.php", O_RDONLY) = 5 <0.081929>
>> > ...
>> >
>> > ~250 files were accessed for only one request (thanks, WordPress).
>>
>> Okay, that is slower than I'd expect, even for an across-the-wire request...
>>
>> > The fs is mounted with these options: rw,noatime,name=<hidden>,secret=<hidden>,nodcache.
>>
>> What kernel and why are you using nodcache?
>
> We use kernel 3.7.0. nodcache is enabled by default (we only specify user and
> secretfile as mount options) and I didn't find it in the documentation of
> mount.ceph.
>
>> Did you have problems
>> without that mount option? That's forcing an MDS access for most
>> operations, rather than using local data.
>
> Good question, I will try it (-o dcache?).

Oh right — I forgot Sage had enabled that by default; I don't recall
how necessary it is. (Sage?)


>> > I have a debug (debug_mds=20) log of the active mds during this test if you want.
>>
>> Yeah, can you post it somewhere?
>
> Upload in progress :-)

Looking forward to it. ;)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Crash and strange things on MDS
  2013-02-26 21:58                                   ` Gregory Farnum
@ 2013-02-27  0:03                                     ` Yan, Zheng 
  2013-02-27  0:14                                       ` Sage Weil
       [not found]                                     ` <20130227004923.GQ16091@kdecherf.com>
  1 sibling, 1 reply; 27+ messages in thread
From: Yan, Zheng  @ 2013-02-27  0:03 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Kevin Decherf, Sage Weil, Sam Lang, ceph-devel, support

On Wed, Feb 27, 2013 at 5:58 AM, Gregory Farnum <greg@inktank.com> wrote:
> On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
>> On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
>>> On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
>>> > We have one folder per application (php, java, ruby). Every application has
>>> > small (<1M) files. The folder is mounted by only one client by default.
>>> >
>>> > In case of overload, additional clients spawn to mount the same folder and
>>> > access the same files.
>>> >
>>> > In the following test, only one client was used to serve the
>>> > application (a website using wordpress).
>>> >
>>> > I made the test with strace to see the time of each IO request (strace -T
>>> > -e trace=file) and I noticed the same pattern:
>>> >
>>> > ...
>>> > [pid  4378] stat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.033409>
>>> > [pid  4378] lstat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.081642>
>>> > [pid  4378] open("/data/wp-includes/user.php", O_RDONLY) = 5 <0.041138>
>>> > [pid  4378] stat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.082303>
>>> > [pid  4378] lstat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.004090>
>>> > [pid  4378] open("/data/wp-includes/meta.php", O_RDONLY) = 5 <0.081929>
>>> > ...
>>> >
>>> > ~250 files were accessed for only one request (thanks, WordPress).
>>>
>>> Okay, that is slower than I'd expect, even for an across-the-wire request...
>>>
>>> > The fs is mounted with these options: rw,noatime,name=<hidden>,secret=<hidden>,nodcache.
>>>
>>> What kernel and why are you using nodcache?
>>
>> We use kernel 3.7.0. nodcache is enabled by default (we only specify user and
>> secretfile as mount options) and I didn't find it in the documentation of
>> mount.ceph.
>>
>>> Did you have problems
>>> without that mount option? That's forcing an MDS access for most
>>> operations, rather than using local data.
>>
>> Good question, I will try it (-o dcache?).
>
> Oh right — I forgot Sage had enabled that by default; I don't recall
> how necessary it is. (Sage?)
>

That code is buggy: ceph_dir_test_complete() always returns false.

Yan, Zheng


>>> > I have a debug (debug_mds=20) log of the active mds during this test if you want.
>>>
>>> Yeah, can you post it somewhere?
>>
>> Upload in progress :-)
>
> Looking forward to it. ;)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Crash and strange things on MDS
  2013-02-27  0:03                                     ` Yan, Zheng 
@ 2013-02-27  0:14                                       ` Sage Weil
  0 siblings, 0 replies; 27+ messages in thread
From: Sage Weil @ 2013-02-27  0:14 UTC (permalink / raw)
  To: Yan, Zheng ; +Cc: Gregory Farnum, Kevin Decherf, Sam Lang, ceph-devel, support

On Wed, 27 Feb 2013, Yan, Zheng  wrote:
> On Wed, Feb 27, 2013 at 5:58 AM, Gregory Farnum <greg@inktank.com> wrote:
> > On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
> >> On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
> >>> On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf <kevin@kdecherf.com> wrote:
> >>> > We have one folder per application (php, java, ruby). Every application has
> >>> > small (<1M) files. The folder is mounted by only one client by default.
> >>> >
> >>> > In case of overload, additional clients spawn to mount the same folder and
> >>> > access the same files.
> >>> >
> >>> > In the following test, only one client was used to serve the
> >>> > application (a website using wordpress).
> >>> >
> >>> > I made the test with strace to see the time of each IO request (strace -T
> >>> > -e trace=file) and I noticed the same pattern:
> >>> >
> >>> > ...
> >>> > [pid  4378] stat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.033409>
> >>> > [pid  4378] lstat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.081642>
> >>> > [pid  4378] open("/data/wp-includes/user.php", O_RDONLY) = 5 <0.041138>
> >>> > [pid  4378] stat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.082303>
> >>> > [pid  4378] lstat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.004090>
> >>> > [pid  4378] open("/data/wp-includes/meta.php", O_RDONLY) = 5 <0.081929>
> >>> > ...
> >>> >
> >>> > ~250 files were accessed for only one request (thanks, WordPress).
> >>>
> >>> Okay, that is slower than I'd expect, even for an across-the-wire request...
> >>>
> >>> > The fs is mounted with these options: rw,noatime,name=<hidden>,secret=<hidden>,nodcache.
> >>>
> >>> What kernel and why are you using nodcache?
> >>
> >> We use kernel 3.7.0. nodcache is enabled by default (we only specify user and
> >> secretfile as mount options) and I didn't find it in the documentation of
> >> mount.ceph.
> >>
> >>> Did you have problems
> >>> without that mount option? That's forcing an MDS access for most
> >>> operations, rather than using local data.
> >>
> >> Good question, I will try it (-o dcache?).
> >
> > Oh right — I forgot Sage had enabled that by default; I don't recall
> > how necessary it is. (Sage?)
> >
> 
> That code is buggy: ceph_dir_test_complete() always returns false.

FWIW, I think #4023 may be the root of those original bugs.  I think that 
should be up next (as far as fs/ceph goes) after this i_mutex stuff 
is sorted out.

sage


> 
> Yan, Zheng
> 
> 
> >>> > I have a debug (debug_mds=20) log of the active mds during this test if you want.
> >>>
> >>> Yeah, can you post it somewhere?
> >>
> >> Upload in progress :-)
> >
> > Looking forward to it. ;)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Crash and strange things on MDS
       [not found]                                           ` <20130308232943.GA2197-fShu9kyPgSlWk0Htik3J/w@public.gmane.org>
@ 2013-03-15 20:32                                             ` Greg Farnum
       [not found]                                               ` <ECAA10260D284057A52D78127F8634A8-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Greg Farnum @ 2013-03-15 20:32 UTC (permalink / raw)
  To: Kevin Decherf; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users

On Friday, March 8, 2013 at 3:29 PM, Kevin Decherf wrote:
> On Fri, Mar 01, 2013 at 11:12:17AM -0800, Gregory Farnum wrote:
> > On Tue, Feb 26, 2013 at 4:49 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
> > > You will find the archive here: <snip>
> > > The data is not anonymized. Interesting folders/files here are
> > > /user_309bbd38-3cff-468d-a465-dc17c260de0c/*
> >  
> >  
> >  
> > Sorry for the delay, but I have retrieved this archive locally at
> > least so if you want to remove it from your webserver you can do so.
> > :) Also, I notice when I untar it that the file name includes
> > "filtered" — what filters did you run it through?
>  
>  
>  
> Hi Gregory,
>  
> Do you have any news about it?
>  

I wrote a couple of tools to do log analysis and, as a result, opened a number of tracker issues to make the MDS more amenable to this kind of analysis.
Having spot-checked some of your longer-running requests, I can say they're all getattrs or setattrs contending on files in what look to be shared cache and PHP libraries. These cover a range from ~40 milliseconds to ~150 milliseconds. I'd look into what your split applications are sharing across those spaces.

On the up side for Ceph, >80% of your requests take "0" milliseconds and ~95% of them take less than 2 milliseconds. Hurray, it's not ridiculously slow most of the time. :)
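
For anyone who wants to reproduce that kind of breakdown: the numbers above
come from the timing data in the debug_mds log. A minimal sketch along these
lines gives the same sort of percentile summary (it assumes the per-request
durations have already been extracted into a plain file with one millisecond
value per line; the extraction step itself, and the actual tools I used, are
not shown here):

#!/usr/bin/env python
# Rough sketch: latency distribution from a file containing one
# per-request duration (in milliseconds) per line. Extracting those
# durations from the MDS debug log is assumed to have been done separately.
import sys

durations = sorted(float(line) for line in open(sys.argv[1]) if line.strip())

def percentile(values, pct):
    # nearest-rank percentile on an already-sorted list
    index = min(len(values) - 1, int(pct / 100.0 * len(values)))
    return values[index]

print('requests: %d' % len(durations))
for pct in (50, 80, 95, 99, 100):
    print('p%-3d %8.1f ms' % (pct, percentile(durations, pct)))
slow = sum(1 for d in durations if d >= 2.0)
print('>=2ms: %.1f%% of requests' % (100.0 * slow / len(durations)))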
-Greg


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Crash and strange things on MDS
       [not found]                                               ` <ECAA10260D284057A52D78127F8634A8-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
@ 2013-03-15 22:40                                                 ` Marc-Antoine Perennou
  2013-03-15 22:53                                                   ` Greg Farnum
  0 siblings, 1 reply; 27+ messages in thread
From: Marc-Antoine Perennou @ 2013-03-15 22:40 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users

On 15 March 2013, at 21:32, Greg Farnum <greg@inktank.com> wrote:

> On Friday, March 8, 2013 at 3:29 PM, Kevin Decherf wrote:
>> On Fri, Mar 01, 2013 at 11:12:17AM -0800, Gregory Farnum wrote:
>>> On Tue, Feb 26, 2013 at 4:49 PM, Kevin Decherf <kevin@kdecherf.com> wrote:
>>>> You will find the archive here: <snip>
>>>> The data is not anonymized. Interesting folders/files here are
>>>> /user_309bbd38-3cff-468d-a465-dc17c260de0c/*
>>>
>>>
>>>
>>> Sorry for the delay, but I have retrieved this archive locally at
>>> least so if you want to remove it from your webserver you can do so.
>>> :) Also, I notice when I untar it that the file name includes
>>> "filtered" — what filters did you run it through?
>>
>>
>>
>> Hi Gregory,
>>
>> Do you have any news about it?
>
> I wrote a couple of tools to do log analysis and, as a result, opened a number of tracker issues to make the MDS more amenable to this kind of analysis.
> Having spot-checked some of your longer-running requests, I can say they're all getattrs or setattrs contending on files in what look to be shared cache and PHP libraries. These cover a range from ~40 milliseconds to ~150 milliseconds. I'd look into what your split applications are sharing across those spaces.
>
> On the up side for Ceph, >80% of your requests take "0" milliseconds and ~95% of them take less than 2 milliseconds. Hurray, it's not ridiculously slow most of the time. :)
> -Greg
>


Thank you very much for these explanations; we're looking forward to these fixes!
Do you have any public bug reports on this that you can link us to?

Good luck, thank you for your great work, and have a nice weekend.

Marc-Antoine Perennou

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Crash and strange things on MDS
  2013-03-15 22:40                                                 ` Marc-Antoine Perennou
@ 2013-03-15 22:53                                                   ` Greg Farnum
  0 siblings, 0 replies; 27+ messages in thread
From: Greg Farnum @ 2013-03-15 22:53 UTC (permalink / raw)
  To: Marc-Antoine Perennou; +Cc: ceph-devel@vger.kernel.org, ceph-users

On Friday, March 15, 2013 at 3:40 PM, Marc-Antoine Perennou wrote:
> Thank you very much for these explanations; we're looking forward to these fixes!
> Do you have any public bug reports on this that you can link us to?
> 
> Good luck, thank you for your great work, and have a nice weekend.
> 
> Marc-Antoine Perennou 
Well, for now the fixes are for stuff like "make analysis take less time, and export timing information more easily". The most immediately applicable one is probably http://tracker.ceph.com/issues/4354, which I hope to start on next week and should be done by the end of the sprint.
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2013-03-15 22:53 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
2013-02-04 18:01 Crash and strange things on MDS Kevin Decherf
2013-02-11 13:05 ` Kevin Decherf
2013-02-11 17:00   ` Sam Lang
2013-02-11 18:54     ` Kevin Decherf
2013-02-11 20:25       ` Gregory Farnum
2013-02-11 22:24         ` Kevin Decherf
2013-02-11 22:47           ` Gregory Farnum
2013-02-11 23:33             ` Kevin Decherf
2013-02-13 11:47         ` Kevin Decherf
2013-02-13 18:19           ` Gregory Farnum
2013-02-16  1:02             ` Kevin Decherf
2013-02-16 17:36               ` Sam Lang
2013-02-16 18:24                 ` Kevin Decherf
2013-02-19 18:15                   ` Gregory Farnum
2013-02-20  1:00                     ` Kevin Decherf
2013-02-20  1:09                       ` Gregory Farnum
2013-02-26 17:57                         ` Kevin Decherf
2013-02-26 18:10                           ` Gregory Farnum
2013-02-26 19:58                             ` Kevin Decherf
2013-02-26 20:26                               ` Gregory Farnum
2013-02-26 21:57                                 ` Kevin Decherf
2013-02-26 21:58                                   ` Gregory Farnum
2013-02-27  0:03                                     ` Yan, Zheng 
2013-02-27  0:14                                       ` Sage Weil
     [not found]                                     ` <20130227004923.GQ16091@kdecherf.com>
     [not found]                                       ` <CAPYLRzhbygkA9=DkVr474Nw8AOC2hAFG-1D6uS4WyfR=kUBXWQ@mail.gmail.com>
     [not found]                                         ` <20130308232943.GA2197@kdecherf.com>
     [not found]                                           ` <20130308232943.GA2197-fShu9kyPgSlWk0Htik3J/w@public.gmane.org>
2013-03-15 20:32                                             ` Greg Farnum
     [not found]                                               ` <ECAA10260D284057A52D78127F8634A8-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
2013-03-15 22:40                                                 ` Marc-Antoine Perennou
2013-03-15 22:53                                                   ` Greg Farnum
