* Slow request problem - help please.
@ 2014-04-16 14:52 Ilya Storozhilov
  2014-04-16 16:05 ` Ian Colle
  0 siblings, 1 reply; 2+ messages in thread
From: Ilya Storozhilov @ 2014-04-16 14:52 UTC (permalink / raw)
  To: ceph-devel; +Cc: Andrey Kuznetsov

Hello Ceph developers,

We encountered a problem with a 3-OSD/3-MON Ceph cluster (CentOS 6.5, 64-bit, Ceph "Emperor" v0.72.2 from RPM) that causes errors on the RGW node. The relevant RGW output is:

2014-04-15 09:50:03.906527 7fcd4c5e6700  1 ====== starting new request req=0x1811b20 =====
2014-04-15 09:50:03.908509 7fcd64dfa700  1 Executing get_obj operation
2014-04-15 09:50:03.909138 7fcd64dfa700  0 ERROR: s->cio->print() returned err=-1
2014-04-15 09:50:03.909164 7fcd64dfa700  0 ERROR: s->cio->print() returned err=-1
2014-04-15 09:50:03.909168 7fcd64dfa700  0 ERROR: s->cio->print() returned err=-1
2014-04-15 09:50:03.909171 7fcd64dfa700  0 ERROR: s->cio->print() returned err=-1
2014-04-15 09:50:03.909401 7fcd64dfa700  1 ====== req done req=0x16f4c10 http_status=403 ======
2014-04-15 09:50:03.909444 7fcd64dfa700  1 ====== starting new request req=0x1816610 =====
2014-04-15 09:50:07.444606 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd66bfd700' had timed out after 600
2014-04-15 09:50:07.444628 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd245a6700' had timed out after 600
2014-04-15 09:50:07.444634 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd263a9700' had timed out after 600
2014-04-15 09:50:07.444639 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd28bad700' had timed out after 600
2014-04-15 09:50:07.444642 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd2a9b0700' had timed out after 600
2014-04-15 09:50:07.444649 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd349c0700' had timed out after 600
2014-04-15 09:50:07.444652 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd367c3700' had timed out after 600
2014-04-15 09:50:07.444668 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd385c6700' had timed out after 600
2014-04-15 09:50:07.444673 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd3adca700' had timed out after 600
2014-04-15 09:50:07.444676 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd3b7cb700' had timed out after 600
2014-04-15 09:50:07.444681 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd3e9d0700' had timed out after 600
2014-04-15 09:50:07.444685 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd3fdd2700' had timed out after 600
2014-04-15 09:50:07.444689 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd411d4700' had timed out after 600
2014-04-15 09:50:07.444694 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd475de700' had timed out after 600
2014-04-15 09:50:07.444699 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd4a7e3700' had timed out after 600
2014-04-15 09:50:07.444703 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd4bbe5700' had timed out after 600
2014-04-15 09:50:07.444708 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd4cfe7700' had timed out after 600
2014-04-15 09:50:07.444713 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd533f1700' had timed out after 600
2014-04-15 09:50:07.444717 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd547f3700' had timed out after 600
2014-04-15 09:50:07.444728 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd55bf5700' had timed out after 600
2014-04-15 09:50:07.444734 7fce4f6f8700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7fcd579f8700' had timed out after 600
2014-04-15 09:50:10.155239 7fcd2a9b0700  1 ====== req done req=0x15831f0 http_status=200 ======


The output of the "ceph -s" command is:

$ sudo ceph -s
    cluster 5c6c95b8-8234-4b34-96a9-facc3c561b72
     health HEALTH_WARN 88 requests are blocked > 32 sec
     monmap e2: 3 mons at {ceph-81=10.44.1.81:6789/0,ceph-82=10.44.1.82:6789/0,ceph-83=10.44.1.83:6789/0}, election epoch 22, quorum 0,1,2 ceph-81,ceph-82,ceph-83
     osdmap e519: 3 osds: 3 up, 3 in
      pgmap v407538: 296 pgs, 32 pools, 64060 kB data, 280 objects
            17388 MB used, 1058 GB / 1133 GB avail
                 296 active+clean
  client io 2148 B/s rd, 102 B/s wr, 3 op/s

Ceph complains with the following messages in the OSD log file:

2014-04-16 02:21:18.065305 osd.0 [WRN] slow request 30.613493 seconds old, received at 2014-04-16 02:20:47.451757: osd_op(client.5793.0:1270 default.5793.1__shadow__F0jnzwQRYyOtc-QfW0S2_M0ptLm-o-b_1 [write 1572864~524288] 172.d7dc6717 e608) v4 currently commit sent
2014-04-16 02:21:18.065309 osd.0 [WRN] slow request 30.611804 seconds old, received at 2014-04-16 02:20:47.453446: osd_op(client.5793.0:1272 2014-04-16-02-default.5793.1-s3testqwer2 [append 0~276] 104.2973847e e608) v4 currently commit sent
2014-04-16 02:21:18.065312 osd.0 [WRN] slow request 30.610639 seconds old, received at 2014-04-16 02:20:47.454611: osd_op(client.5793.0:1273 default.5793.1__shadow__F0jnzwQRYyOtc-QfW0S2_M0ptLm-o-b_1 [write 2097152~524288] 172.d7dc6717 e608) v4 currently commit sent
2014-04-16 02:21:18.065316 osd.0 [WRN] slow request 30.608710 seconds old, received at 2014-04-16 02:20:47.456540: osd_op(client.5793.0:1274 default.5793.1__shadow__F0jnzwQRYyOtc-QfW0S2_M0ptLm-o-b_1 [write 2621440~524288] 172.d7dc6717 e608) v4 currently commit sent
2014-04-16 02:21:18.065319 osd.0 [WRN] slow request 30.605942 seconds old, received at 2014-04-16 02:20:47.459308: osd_op(client.5793.0:1275 default.5793.1__shadow__F0jnzwQRYyOtc-QfW0S2_M0ptLm-o-b_1 [write 3145728~329984] 172.d7dc6717 e608) v4 currently commit sent
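In case it helps with diagnosis, a quick way to see which stage the blocked operations are stuck in is to group the slow-request warnings by their trailing state (the log path below is an assumed default; adjust it for your node):

```shell
# Group slow-request warnings by the stage they are stuck in
# ("commit sent", "waiting for subops from [N]", ...).
# /var/log/ceph/ceph-osd.0.log is an assumed default path.
grep -o 'currently .*' /var/log/ceph/ceph-osd.0.log | sort | uniq -c | sort -rn
```

A large count stuck in one state, e.g. "waiting for subops from [2]", can point at a specific replica OSD rather than the primary.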

After increasing the OSD log verbosity, we got the following messages from the OSD:

2014-04-16 03:23:32.337961 7ff93987a700  0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45949 s=2 pgs=15930 cs=2649 l=0 c=0x565a680).fault, initiating reconnect
2014-04-16 03:23:32.338939 7ff93987a700  0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45950 s=2 pgs=15931 cs=2651 l=0 c=0x565a680).fault, initiating reconnect
2014-04-16 03:23:32.339956 7ff93987a700  0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45951 s=2 pgs=15932 cs=2653 l=0 c=0x565a680).fault, initiating reconnect
2014-04-16 03:23:32.341167 7ff93987a700  0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45952 s=2 pgs=15933 cs=2655 l=0 c=0x565a680).fault, initiating reconnect
2014-04-16 03:23:32.342212 7ff93987a700  0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45953 s=2 pgs=15934 cs=2657 l=0 c=0x565a680).fault, initiating reconnect
2014-04-16 03:23:32.343330 7ff93987a700  0 -- 10.44.1.82:6801/2050730 >> 10.44.1.81:6805/1055074 pipe(0x72c0780 sd=54 :45954 s=2 pgs=15935 cs=2659 l=0 c=0x565a680).fault, initiating reconnect
2014-04-16 03:25:55.170361 7ff955f87700  0 log [WRN] : 15 slow requests, 2 included below; oldest blocked for > 33.673854 secs
2014-04-16 03:25:55.170373 7ff955f87700  0 log [WRN] : slow request 33.636609 seconds old, received at 2014-04-16 03:25:21.533670: osd_op(client.5793.0:5387 default.5793.2__shadow__MMYl0ajXAfpaBMg55xl6ZVlD7U1lstn_3 [writefull 0~87104] 172.12238d13 e617) v4 currently waiting for subops from [2]
2014-04-16 03:25:55.170380 7ff955f87700  0 log [WRN] : slow request 33.122809 seconds old, received at 2014-04-16 03:25:22.047470: osd_op(client.5793.0:5394 default.5793.2_myobjects_14 [cmpxattr user.rgw.idtag (17) op 1 mode 1,create 0~0,delete,setxattr user.rgw.idtag (17),writefull 0~524288,setxattr user.rgw.manifest (818),setxattr user.rgw.acl (185),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33)] 172.3a483fca e617) v4 currently waiting for subops from [2]
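For reference, the log-verbosity increase mentioned above was done with settings along these lines in the [osd] section of ceph.conf; the exact debug levels shown here are illustrative, not prescribed by the docs:

```ini
[osd]
    ; Higher values are more verbose; a "level/memory-level" pair
    ; such as "debug osd = 20/20" is also accepted.
    debug osd = 20
    debug ms = 1
    debug filestore = 20
    debug journal = 10
```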

This page http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ contains the following information about this problem:

Possible causes include:
 
    A bad drive (check dmesg output)
    A bug in the kernel file system (check dmesg output)
    An overloaded cluster (check system load, iostat, etc.)
    A bug in the ceph-osd daemon.
 
Possible solutions:
 
    Remove VMs (cloud solutions) from Ceph hosts
    Upgrade Kernel
    Upgrade Ceph
    Restart OSDs

We checked our environment for a bad drive, a kernel bug and excessive system load and found that everything works fine, but the OSD consumes 100% of CPU when this error appears. Restarting the OSDs did not help.

Can you help us in some way, e.g. give us some advice on where to dig? Thank you very much!

Best Regards, 

Ilya Storozhilov

Lead Software Engineer

 

EPAM Systems 

Tver office, Russia 

GMT+4 

 

EPAM Internal ext.:  55529

Office phone:        +7 (4822) 630-070 ext. 55529

Office fax:          +7 (4822) 630-073

Mobile phone:        +7 (904) 021-0763

E-mail:              Ilya_Storozhilov@epam.com 

 

http://www.epam.com 

 

CONFIDENTIALITY CAUTION AND DISCLAIMER 

This message is intended only for the use of the individual(s) or entity(ies) to which it is addressed and contains information that is legally privileged and confidential. If you are not the intended recipient, or the person responsible for delivering the
 message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. All unintended recipients are obliged to delete this message and destroy any printed copies.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Slow request problem - help please.
  2014-04-16 14:52 Slow request problem - help please Ilya Storozhilov
@ 2014-04-16 16:05 ` Ian Colle
  0 siblings, 0 replies; 2+ messages in thread
From: Ian Colle @ 2014-04-16 16:05 UTC (permalink / raw)
  To: Ilya Storozhilov, ceph-devel; +Cc: Andrey Kuznetsov, ceph-users

Moving to ceph-users.

Ian R. Colle
Director of Engineering
Inktank
Delivering the Future of Storage
http://www.linkedin.com/in/ircolle
http://www.twitter.com/ircolle
Cell: +1.303.601.7713
Email: ian@inktank.com






^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2014-04-16 16:05 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-16 14:52 Slow request problem - help please Ilya Storozhilov
2014-04-16 16:05 ` Ian Colle
