* Permanent MDS restarting under load
From: Oleksandr Natalenko @ 2015-11-10 14:32 UTC
  To: ceph-users-idqoXFIVOFJgJs9I8MT0rw; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA

Hello.

We have CephFS deployed on top of a Ceph cluster (0.94.5).

We experience constant MDS restarts under a high-IOPS workload (e.g.
rsyncing lots of small mailboxes from another storage system to CephFS
using the ceph-fuse client). First, cluster health goes to HEALTH_WARN
with the following warning:

===
mds0: Behind on trimming (321/30)
===

Also, slow requests start to appear:

===
2 requests are blocked > 32 sec
===
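
For reference, this is roughly how such blocked requests can be inspected
(just a sketch; osd.3 below is an arbitrary example id):

===
ceph health detail
# then, on the host carrying the OSD named by health detail:
ceph daemon osd.3 dump_ops_in_flight
ceph daemon osd.3 dump_historic_ops
===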

Then, after a while, one of the MDSes fails with the following log:

===
Nov 10 16:07:41 baikal bash[10122]: 2015-11-10 16:07:41.915540 7f2484f13700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 16:07:41 baikal bash[10122]: starting mds.baikal at :/0
Nov 10 16:07:42 baikal bash[10122]: 2015-11-10 16:07:42.003189 7f82b477e7c0 -1 mds.-1.0 log_to_monitors {default=true}
===

I guess writing lots of small files bloats the MDS log (journal), and the
MDS cannot keep up with trimming it in time. That's why it is marked as
failed and replaced by the standby MDS. We tried limiting
mds_log_max_events to 30 events, but that caused the MDS to fail very
quickly with the following stacktrace:

===
Stacktrace: https://gist.github.com/4c8a89682e81b0049f3e
===
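
For reference, the trimming-related knobs we are looking at instead (just
a sketch, values untested on our side; as far as I understand, the "30"
in the trimming warning above is mds_log_max_segments):

===
# current values on the active MDS, via its admin socket (on host baikal)
ceph daemon mds.baikal config show | grep mds_log

# raise the trimming thresholds at runtime instead of lowering mds_log_max_events
ceph tell mds.0 injectargs '--mds_log_max_segments 60 --mds_log_max_expiring 40'
===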

Is this a normal situation, or can client requests be rate-limited
somehow? Maybe there are additional knobs to tune CephFS for handling
such a workload?
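
For now the only client-side workaround we can think of is throttling the
rsync runs ourselves, e.g. copying one mailbox at a time with a pause in
between (the paths below are made up for illustration):

===
for box in /srv/old-mail/*; do
        rsync -a "$box" /mnt/cephfs/mail/
        sleep 10
done
===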

Cluster info follows.

CentOS 7.1, Ceph 0.94.5.

Cluster maps:

===
      osdmap e5894: 20 osds: 20 up, 20 in
       pgmap v8959901: 1024 pgs, 12 pools, 5156 GB data, 23074 kobjects
             20101 GB used, 30468 GB / 50570 GB avail
                 1024 active+clean
===

CephFS list:

===
name: myfs, metadata pool: mds_meta_storage, data pools: [mds_xattrs_storage fs_samba fs_pbx fs_misc fs_web fs_mail fs_ott ]
===

Both the CephFS metadata pool and the first (default) data pool are located on PCIe SSDs:

===
  -9  0.44800 root pcie-ssd
  -7  0.22400     host data-pcie-ssd
   7  0.22400         osd.7                                 up  1.00000          1.00000
  -8  0.22400     host baikal-pcie-ssd
   6  0.22400         osd.6                                 up  1.00000          1.00000

pool 20 'mds_meta_storage' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 4333 flags hashpspool stripe_width 0
pool 21 'mds_xattrs_storage' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 4337 flags hashpspool crash_replay_interval 45 stripe_width 0

     mds_meta_storage       20     37422k         0          169G        234714
     mds_xattrs_storage     21          0         0          169G      11271588

rule pcie-ssd {
         ruleset 2
         type replicated
         min_size 1
         max_size 2
         step take pcie-ssd
         step chooseleaf firstn 0 type host
         step emit
}
===

There is 1 active MDS as well as 1 standby MDS:

===
mdsmap e9035: 1/1/1 up {0=data=up:active}, 1 up:standby
===

We also have 10 OSDs on HDDs for the additional data pools:

===
  -6 37.00000 root sata-hdd-misc
  -4 18.50000     host data-sata-hdd-misc
   1  3.70000         osd.1                                 up  1.00000          1.00000
   3  3.70000         osd.3                                 up  1.00000          1.00000
   4  3.70000         osd.4                                 up  1.00000          1.00000
   5  3.70000         osd.5                                 up  1.00000          1.00000
  10  3.70000         osd.10                                up  1.00000          1.00000
  -5 18.50000     host baikal-sata-hdd-misc
   0  3.70000         osd.0                                 up  1.00000          1.00000
  11  3.70000         osd.11                                up  1.00000          1.00000
  12  3.70000         osd.12                                up  1.00000          1.00000
  13  3.70000         osd.13                                up  1.00000          1.00000
  14  3.70000         osd.14                                up  1.00000          1.00000

     fs_samba               22      2162G      4.28         3814G       1168619
     fs_pbx                 23      1551G      3.07         3814G       3908813
     fs_misc                24       436G      0.86         3814G        112114
     fs_web                 25     58642M      0.11         3814G        378946
     fs_mail                26       442G      0.88         3814G       6414073
     fs_ott                 27          0         0         3814G             0

rule sata-hdd-misc {
         ruleset 4
         type replicated
         min_size 1
         max_size 4
         step take sata-hdd-misc
         step choose firstn 2 type host
         step chooseleaf firstn 2 type osd
         step emit
}
===

Pool affinity for CephFS folders is set via setfattr. For example:

===
# file: mail
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 
pool=fs_mail"
===
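
The layout itself is assigned with setfattr on the directory, e.g. for
the mail folder shown above:

===
setfattr -n ceph.dir.layout.pool -v fs_mail mail
getfattr -n ceph.dir.layout mail
===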

* Re: [ceph-users] Permanent MDS restarting under load
From: Gregory Farnum @ 2015-11-10 20:38 UTC
  To: Oleksandr Natalenko; +Cc: Ceph Users, ceph-devel

On Tue, Nov 10, 2015 at 6:32 AM, Oleksandr Natalenko
<oleksandr@natalenko.name> wrote:
> Hello.
>
> We have CephFS deployed on top of a Ceph cluster (0.94.5).
>
> We experience constant MDS restarts under a high-IOPS workload (e.g.
> rsyncing lots of small mailboxes from another storage system to CephFS
> using the ceph-fuse client). First, cluster health goes to HEALTH_WARN
> with the following warning:
>
> ===
> mds0: Behind on trimming (321/30)
> ===
>
> Also, slow requests start to appear:
>
> ===
> 2 requests are blocked > 32 sec
> ===

Which requests are they? Are these MDS operations or OSD ones?

>
> Then, after a while, one of the MDSes fails with the following log:
>
> ===
> Nov 10 16:07:41 baikal bash[10122]: 2015-11-10 16:07:41.915540 7f2484f13700 -1 MDSIOContextBase: blacklisted!  Restarting...
> Nov 10 16:07:41 baikal bash[10122]: starting mds.baikal at :/0
> Nov 10 16:07:42 baikal bash[10122]: 2015-11-10 16:07:42.003189 7f82b477e7c0 -1 mds.-1.0 log_to_monitors {default=true}
> ===

So that "blacklisted" means that the monitors decided the MDS was
nonresponsive, failed over to another daemon, and blocked this one off
from the cluster.

> I guess writing lots of small files bloats the MDS log (journal), and the
> MDS cannot keep up with trimming it in time. That's why it is marked as
> failed and replaced by the standby MDS. We tried limiting
> mds_log_max_events to 30 events, but that caused the MDS to fail very
> quickly with the following stacktrace:
>
> ===
> Stacktrace: https://gist.github.com/4c8a89682e81b0049f3e
> ===
>
> Is this a normal situation, or can client requests be rate-limited
> somehow? Maybe there are additional knobs to tune CephFS for handling
> such a workload?

Yeah, the MDS doesn't really do a good job back-pressuring clients
right now when it or the OSDs aren't keeping up with the workload.
That's something we need to work on once fsck stuff is behaving. rsync
is also (sadly) a workload that frequently exposes these problems, but
I'm not used to seeing the MDS daemon get stuck quite that quickly.
How frequently is it actually getting swapped?
-Greg

* Re: [ceph-users] Permanent MDS restarting under load
From: Oleksandr Natalenko @ 2015-11-11  7:38 UTC
  To: Gregory Farnum; +Cc: Ceph Users, ceph-devel


10.11.2015 22:38, Gregory Farnum wrote:

> Which requests are they? Are these MDS operations or OSD ones?

Those requests appeared in the ceph -w output and are as follows:

https://gist.github.com/5045336f6fb7d532138f

Am I correct that these are blocked OSD operations? osd.3 is one of the
data pool HDDs, and other OSDs besides osd.3 also appear in the slow
requests warning.

I guess that may be related to the replica-4 setup of our cluster and
only 5 OSDs per host. We plan to add 6 more OSDs to each host after the
data migration is finished. Could that help spread the load?
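
Something like this should show whether the HDD OSDs themselves are the
bottleneck (just a sketch):

===
ceph osd perf       # per-OSD commit/apply latencies
ceph osd df         # per-OSD utilization
iostat -x 5         # device-level latency/utilization on the OSD hosts
===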

> So that "blacklisted" means that the monitors decided the MDS was
> nonresponsive, failed over to another daemon, and blocked this one off
> from the cluster.

So one could adjust the blacklist timeout, but there is no way to
rate-limit requests? Am I correct?
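
If I understand it correctly, the timeout in question is the MDS beacon
grace, so something like this in ceph.conf might at least make the
failovers less frequent (a guess on my side, not tested yet):

===
[global]
# needs to be visible to both mons and MDSes; default is 15 seconds
mds_beacon_grace = 60
===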

> Yeah, the MDS doesn't really do a good job back-pressuring clients
> right now when it or the OSDs aren't keeping up with the workload.
> That's something we need to work on once fsck stuff is behaving. rsync
> is also (sadly) a workload that frequently exposes these problems, but
> I'm not used to seeing the MDS daemon get stuck quite that quickly.
> How frequently is it actually getting swapped?

Quite often. MDSes are swapped roughly once per minute under heavy load:

===
Nov 10 10:40:47 data.la.net.ua bash[18112]: 2015-11-10 10:40:47.357633 7f76c42e2700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:41:49 data.la.net.ua bash[18112]: 2015-11-10 10:41:49.237962 7f1a939af700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:43:14 data.la.net.ua bash[18112]: 2015-11-10 10:43:14.899375 7f17f6eaa700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:44:11 data.la.net.ua bash[18112]: 2015-11-10 10:44:11.810116 7f693b64c700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:45:14 data.la.net.ua bash[18112]: 2015-11-10 10:45:14.761684 7f7616097700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:46:35 data.la.net.ua bash[18112]: 2015-11-10 10:46:35.927190 7fdfb7f62700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:47:41 data.la.net.ua bash[18112]: 2015-11-10 10:47:41.888064 7fb88139b700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:49:57 data.la.net.ua bash[18112]: 2015-11-10 10:49:57.542545 7fbb360eb700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:51:02 data.la.net.ua bash[18112]: 2015-11-10 10:51:02.486907 7fb488fa1700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:52:03 data.la.net.ua bash[18112]: 2015-11-10 10:52:03.871463 7f4cc0236700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:53:20 data.la.net.ua bash[18112]: 2015-11-10 10:53:20.290494 7f9dc48d3700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:54:17 data.la.net.ua bash[18112]: 2015-11-10 10:54:17.086940 7f45a9105700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:55:17 data.la.net.ua bash[18112]: 2015-11-10 10:55:17.547123 7f6c48f50700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:56:32 data.la.net.ua bash[18112]: 2015-11-10 10:56:32.558378 7f2bf0a70700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:57:34 data.la.net.ua bash[18112]: 2015-11-10 10:57:34.534306 7fc69b42c700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:58:37 data.la.net.ua bash[18112]: 2015-11-10 10:58:37.061903 7fea3de23700 -1 MDSIOContextBase: blacklisted!  Restarting...
Nov 10 10:59:52 data.la.net.ua bash[18112]: 2015-11-10 10:59:52.579594 7fe23b468700 -1 MDSIOContextBase: blacklisted!  Restarting...
===

Any idea?
