* rbd map command hangs for 15 minutes during system start up
@ 2012-11-08 22:10 Mandell Degerness
  2012-11-09  1:43 ` Josh Durgin
From: Mandell Degerness @ 2012-11-08 22:10 UTC (permalink / raw)
  To: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 2084 bytes --]

We are seeing a somewhat random, but frequent hang on our systems
during startup.  The hang happens at the point where an "rbd map
<rbdvol>" command is run.

I've attached the ceph logs from the cluster.  The map command happens
at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
be seen in the log as 172.18.0.15:0/1143980479.

It appears that the TCP socket to the OSD is opened but then times
out 15 minutes later; the process only receives data once the socket
is closed on the client side, at which point it retries.

Please help.
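In the meantime, we could bound each attempt instead of letting one
hang for the full 15 minutes. A minimal sketch of such a boot-time
guard (assuming coreutils `timeout`; the deadline, retry count, and
volume name are illustrative placeholders, not our actual scripts):

```shell
#!/bin/sh
# Retry 'rbd map' with a per-attempt deadline instead of letting a
# single attempt block boot for the full 15-minute TCP timeout.
map_with_deadline() {
    # $1 = command to run, $2 = per-attempt deadline (seconds),
    # $3 = maximum number of attempts
    _cmd=$1; _secs=$2; _tries=$3
    while [ "$_tries" -gt 0 ]; do
        if timeout "$_secs" sh -c "$_cmd"; then
            return 0
        fi
        _tries=$((_tries - 1))
        echo "attempt failed or timed out; $_tries attempts left" >&2
    done
    return 1
}

# Illustrative invocation (placeholder volume name):
# map_with_deadline 'rbd map rbdvol' 60 3
```

If every attempt fails, the kernel client's state under
/sys/kernel/debug/ceph/ (the osdc and monc files) would be worth
capturing before giving up.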

We are using ceph version 0.48.2argonaut
(commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).

We are using a 3.5.7 kernel with the following list of patches applied:

1-libceph-encapsulate-out-message-data-setup.patch
2-libceph-dont-mark-footer-complete-before-it-is.patch
3-libceph-move-init-of-bio_iter.patch
4-libceph-dont-use-bio_iter-as-a-flag.patch
5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
8-libceph-protect-ceph_con_open-with-mutex.patch
9-libceph-reset-connection-retry-on-successfully-negotiation.patch
10-rbd-only-reset-capacity-when-pointing-to-head.patch
11-rbd-set-image-size-when-header-is-updated.patch
12-libceph-fix-crypto-key-null-deref-memory-leak.patch
13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
17-libceph-check-for-invalid-mapping.patch
18-ceph-propagate-layout-error-on-osd-request-creation.patch
19-rbd-BUG-on-invalid-layout.patch
20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
21-ceph-avoid-32-bit-page-index-overflow.patch
23-ceph-fix-dentry-reference-leak-in-encode_fh.patch

Any suggestions?

One thought is that the following patch (which we could not apply) is
what is required:

22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch

Regards,
Mandell Degerness

[-- Attachment #2: hanglog_ceph.log.gz --]
[-- Type: application/x-gzip, Size: 21632 bytes --]


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-08 22:10 rbd map command hangs for 15 minutes during system start up Mandell Degerness
@ 2012-11-09  1:43 ` Josh Durgin
  2012-11-12 22:19   ` Nick Bartos
From: Josh Durgin @ 2012-11-09  1:43 UTC (permalink / raw)
  To: Mandell Degerness; +Cc: ceph-devel

On 11/08/2012 02:10 PM, Mandell Degerness wrote:
> We are seeing a somewhat random, but frequent hang on our systems
> during startup.  The hang happens at the point where an "rbd map
> <rbdvol>" command is run.
>
> I've attached the ceph logs from the cluster.  The map command happens
> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
> be seen in the log as 172.18.0.15:0/1143980479.
>
> It appears that the TCP socket to the OSD is opened but then times
> out 15 minutes later; the process only receives data once the socket
> is closed on the client side, at which point it retries.
>
> Please help.
>
> We are using ceph version 0.48.2argonaut
> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
>
> We are using a 3.5.7 kernel with the following list of patches applied:
>
> 1-libceph-encapsulate-out-message-data-setup.patch
> 2-libceph-dont-mark-footer-complete-before-it-is.patch
> 3-libceph-move-init-of-bio_iter.patch
> 4-libceph-dont-use-bio_iter-as-a-flag.patch
> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
> 8-libceph-protect-ceph_con_open-with-mutex.patch
> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
> 11-rbd-set-image-size-when-header-is-updated.patch
> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
> 17-libceph-check-for-invalid-mapping.patch
> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
> 19-rbd-BUG-on-invalid-layout.patch
> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
> 21-ceph-avoid-32-bit-page-index-overflow.patch
> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
>
> Any suggestions?

The log shows your monitors' clocks aren't synchronized closely
enough for them to make much progress (including authenticating new
connections). That's probably the real issue; 0.2s is a pretty large
clock drift.
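A quick way to quantify that skew is to sample each monitor's clock
and compare. A sketch (hypothetical helper; the 0.05s threshold
matches the default 'mon clock drift allowed' setting, and real
samples would come from the monitor hosts, e.g. via
`ssh $host date +%s.%N`):

```shell
#!/bin/sh
# Read "<mon-name> <unix-timestamp>" lines on stdin and report any
# monitor more than ALLOWED seconds away from the first sample.
ALLOWED=0.05   # ceph's default 'mon clock drift allowed'

check_drift() {
    # Exits non-zero if any monitor drifts beyond the threshold.
    awk -v allowed="$ALLOWED" '
        NR == 1 { ref = $2 }            # first monitor as reference
        {
            d = $2 - ref
            if (d < 0) d = -d
            if (d > allowed) { printf "%s drifts %.3fs\n", $1, d; bad = 1 }
        }
        END { exit bad }
    '
}

# Illustrative input; mon.b is 0.2s ahead, like the drift in the log:
# printf "mon.a 100.00\nmon.b 100.20\nmon.c 100.01\n" | check_drift
```

Running ntpd on all three monitors is the real fix; this only makes
the drift visible.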

> One thought is that the following patch (which we could not apply) is
> what is required:
>
> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch

This is certainly useful too, but I don't think it's the cause of
the delay in this case.

Josh


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-09  1:43 ` Josh Durgin
@ 2012-11-12 22:19   ` Nick Bartos
  2012-11-12 23:16     ` Sage Weil
From: Nick Bartos @ 2012-11-12 22:19 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Mandell Degerness, ceph-devel

After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
seems we no longer have this hang.

On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> On 11/08/2012 02:10 PM, Mandell Degerness wrote:
>>
>> We are seeing a somewhat random, but frequent hang on our systems
>> during startup.  The hang happens at the point where an "rbd map
>> <rbdvol>" command is run.
>>
>> I've attached the ceph logs from the cluster.  The map command happens
>> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
>> be seen in the log as 172.18.0.15:0/1143980479.
>>
>> It appears that the TCP socket to the OSD is opened but then times
>> out 15 minutes later; the process only receives data once the socket
>> is closed on the client side, at which point it retries.
>>
>> Please help.
>>
>> We are using ceph version 0.48.2argonaut
>> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
>>
>> We are using a 3.5.7 kernel with the following list of patches applied:
>>
>> 1-libceph-encapsulate-out-message-data-setup.patch
>> 2-libceph-dont-mark-footer-complete-before-it-is.patch
>> 3-libceph-move-init-of-bio_iter.patch
>> 4-libceph-dont-use-bio_iter-as-a-flag.patch
>> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
>> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
>> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
>> 8-libceph-protect-ceph_con_open-with-mutex.patch
>> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
>> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
>> 11-rbd-set-image-size-when-header-is-updated.patch
>> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
>> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
>> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
>> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
>> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
>> 17-libceph-check-for-invalid-mapping.patch
>> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
>> 19-rbd-BUG-on-invalid-layout.patch
>> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
>> 21-ceph-avoid-32-bit-page-index-overflow.patch
>> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
>>
>> Any suggestions?
>
>
> The log shows your monitors' clocks aren't synchronized closely
> enough for them to make much progress (including authenticating new
> connections). That's probably the real issue; 0.2s is a pretty large
> clock drift.
>
>
>> One thought is that the following patch (which we could not apply) is
>> what is required:
>>
>> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
>
>
> This is certainly useful too, but I don't think it's the cause of
> the delay in this case.
>
> Josh
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-12 22:19   ` Nick Bartos
@ 2012-11-12 23:16     ` Sage Weil
  2012-11-16  0:21       ` Nick Bartos
From: Sage Weil @ 2012-11-12 23:16 UTC (permalink / raw)
  To: Nick Bartos; +Cc: Josh Durgin, Mandell Degerness, ceph-devel

On Mon, 12 Nov 2012, Nick Bartos wrote:
> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
> seems we no longer have this hang.

Hmm, that's a bit disconcerting.  Did this series come from our old 3.5 
stable series?  I recently prepared a new one that backports *all* of the 
fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would 
be curious if you see problems with that.

So far, with these fixes in place, we have not seen any unexplained kernel 
crashes in this code.

I take it you're going back to a 3.5 kernel because you weren't able to 
get rid of the sync problem with 3.6?

sage



> 
> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
> >>
> >> We are seeing a somewhat random, but frequent hang on our systems
> >> during startup.  The hang happens at the point where an "rbd map
> >> <rbdvol>" command is run.
> >>
> >> I've attached the ceph logs from the cluster.  The map command happens
> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
> >> be seen in the log as 172.18.0.15:0/1143980479.
> >>
> >> It appears that the TCP socket to the OSD is opened but then times
> >> out 15 minutes later; the process only receives data once the socket
> >> is closed on the client side, at which point it retries.
> >>
> >> Please help.
> >>
> >> We are using ceph version 0.48.2argonaut
> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
> >>
> >> We are using a 3.5.7 kernel with the following list of patches applied:
> >>
> >> 1-libceph-encapsulate-out-message-data-setup.patch
> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
> >> 3-libceph-move-init-of-bio_iter.patch
> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
> >> 11-rbd-set-image-size-when-header-is-updated.patch
> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
> >> 17-libceph-check-for-invalid-mapping.patch
> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
> >> 19-rbd-BUG-on-invalid-layout.patch
> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
> >>
> >> Any suggestions?
> >
> >
> > The log shows your monitors' clocks aren't synchronized closely
> > enough for them to make much progress (including authenticating new
> > connections). That's probably the real issue; 0.2s is a pretty large
> > clock drift.
> >
> >
> >> One thought is that the following patch (which we could not apply) is
> >> what is required:
> >>
> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
> >
> >
> > This is certainly useful too, but I don't think it's the cause of
> > the delay in this case.
> >
> > Josh
> 
> 


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-12 23:16     ` Sage Weil
@ 2012-11-16  0:21       ` Nick Bartos
  2012-11-16  0:25         ` Sage Weil
From: Nick Bartos @ 2012-11-16  0:21 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, Mandell Degerness, ceph-devel

Sorry, I guess this e-mail got missed.  I believe those patches came
from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
branch patches, which all seem to be fine.  We'll stick with 3.5 and
this backport for now until we can figure out what's wrong with 3.6.

I typically ignore the wip branches just due to the naming when I'm
looking for updates.  Where should I typically look for updates that
aren't in released kernels?  Also, is there anything else in the wip*
branches that you think we may find particularly useful?


On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
> On Mon, 12 Nov 2012, Nick Bartos wrote:
>> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
>> seems we no longer have this hang.
>
> Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
> stable series?  I recently prepared a new one that backports *all* of the
> fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
> be curious if you see problems with that.
>
> So far, with these fixes in place, we have not seen any unexplained kernel
> crashes in this code.
>
> I take it you're going back to a 3.5 kernel because you weren't able to
> get rid of the sync problem with 3.6?
>
> sage
>
>
>
>>
>> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
>> >>
>> >> We are seeing a somewhat random, but frequent hang on our systems
>> >> during startup.  The hang happens at the point where an "rbd map
>> >> <rbdvol>" command is run.
>> >>
>> >> I've attached the ceph logs from the cluster.  The map command happens
>> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
>> >> be seen in the log as 172.18.0.15:0/1143980479.
>> >>
>> >> It appears that the TCP socket to the OSD is opened but then times
>> >> out 15 minutes later; the process only receives data once the socket
>> >> is closed on the client side, at which point it retries.
>> >>
>> >> Please help.
>> >>
>> >> We are using ceph version 0.48.2argonaut
>> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
>> >>
>> >> We are using a 3.5.7 kernel with the following list of patches applied:
>> >>
>> >> 1-libceph-encapsulate-out-message-data-setup.patch
>> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
>> >> 3-libceph-move-init-of-bio_iter.patch
>> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
>> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
>> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
>> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
>> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
>> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
>> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
>> >> 11-rbd-set-image-size-when-header-is-updated.patch
>> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
>> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
>> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
>> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
>> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
>> >> 17-libceph-check-for-invalid-mapping.patch
>> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
>> >> 19-rbd-BUG-on-invalid-layout.patch
>> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
>> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
>> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
>> >>
>> >> Any suggestions?
>> >
>> >
>> > The log shows your monitors' clocks aren't synchronized closely
>> > enough for them to make much progress (including authenticating new
>> > connections). That's probably the real issue; 0.2s is a pretty large
>> > clock drift.
>> >
>> >
>> >> One thought is that the following patch (which we could not apply) is
>> >> what is required:
>> >>
>> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
>> >
>> >
>> > This is certainly useful too, but I don't think it's the cause of
>> > the delay in this case.
>> >
>> > Josh
>>
>>


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-16  0:21       ` Nick Bartos
@ 2012-11-16  0:25         ` Sage Weil
  2012-11-16 18:36           ` Nick Bartos
From: Sage Weil @ 2012-11-16  0:25 UTC (permalink / raw)
  To: Nick Bartos; +Cc: Josh Durgin, Mandell Degerness, ceph-devel

On Thu, 15 Nov 2012, Nick Bartos wrote:
> Sorry, I guess this e-mail got missed.  I believe those patches came
> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
> branch patches, which all seem to be fine.  We'll stick with 3.5 and
> this backport for now until we can figure out what's wrong with 3.6.
> 
> I typically ignore the wip branches just due to the naming when I'm
> looking for updates.  Where should I typically look for updates that
> aren't in released kernels?  Also, is there anything else in the wip*
> branches that you think we may find particularly useful?

You were looking in the right place.  The problem was we weren't super 
organized with our stable patches, and changed our minds about what to 
send upstream.  These are 'wip' in the sense that they were in preparation 
for going upstream.  The goal is to push them to the mainline stable 
kernels and ideally not keep them in our tree at all.

wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but 
we're keeping it so that ubuntu can pick it up for quantal.

I'll make sure these are more clearly marked as stable.

sage


> 
> 
> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
> > On Mon, 12 Nov 2012, Nick Bartos wrote:
> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
> >> seems we no longer have this hang.
> >
> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
> > stable series?  I recently prepared a new one that backports *all* of the
> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
> > be curious if you see problems with that.
> >
> > So far, with these fixes in place, we have not seen any unexplained kernel
> > crashes in this code.
> >
> > I take it you're going back to a 3.5 kernel because you weren't able to
> > get rid of the sync problem with 3.6?
> >
> > sage
> >
> >
> >
> >>
> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
> >> >>
> >> >> We are seeing a somewhat random, but frequent hang on our systems
> >> >> during startup.  The hang happens at the point where an "rbd map
> >> >> <rbdvol>" command is run.
> >> >>
> >> >> I've attached the ceph logs from the cluster.  The map command happens
> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
> >> >> be seen in the log as 172.18.0.15:0/1143980479.
> >> >>
> >> >> It appears that the TCP socket to the OSD is opened but then times
> >> >> out 15 minutes later; the process only receives data once the socket
> >> >> is closed on the client side, at which point it retries.
> >> >>
> >> >> Please help.
> >> >>
> >> >> We are using ceph version 0.48.2argonaut
> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
> >> >>
> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
> >> >>
> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
> >> >> 3-libceph-move-init-of-bio_iter.patch
> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
> >> >> 17-libceph-check-for-invalid-mapping.patch
> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
> >> >> 19-rbd-BUG-on-invalid-layout.patch
> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
> >> >>
> >> >> Any suggestions?
> >> >
> >> >
> >> > The log shows your monitors' clocks aren't synchronized closely
> >> > enough for them to make much progress (including authenticating new
> >> > connections). That's probably the real issue; 0.2s is a pretty large
> >> > clock drift.
> >> >
> >> >
> >> >> One thought is that the following patch (which we could not apply) is
> >> >> what is required:
> >> >>
> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
> >> >
> >> >
> >> > This is certainly useful too, but I don't think it's the cause of
> >> > the delay in this case.
> >> >
> >> > Josh
> 
> 


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-16  0:25         ` Sage Weil
@ 2012-11-16 18:36           ` Nick Bartos
  2012-11-16 19:16             ` Sage Weil
From: Nick Bartos @ 2012-11-16 18:36 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, Mandell Degerness, ceph-devel

Turns out we're having the 'rbd map' hang on startup again, after we
started using the wip-3.5 patch set.  How critical is the
libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
removed before, which seemed to get rid of the problem (although I'm
not completely sure it got rid of it entirely, the hang at least
seemed to happen much less often).

It seems like we only started having this issue after we started
patching the 3.5 ceph client (we started patching to try and get rid
of a kernel oops, which the patches seem to have fixed).


On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@inktank.com> wrote:
> On Thu, 15 Nov 2012, Nick Bartos wrote:
>> Sorry, I guess this e-mail got missed.  I believe those patches came
>> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
>> branch patches, which all seem to be fine.  We'll stick with 3.5 and
>> this backport for now until we can figure out what's wrong with 3.6.
>>
>> I typically ignore the wip branches just due to the naming when I'm
>> looking for updates.  Where should I typically look for updates that
>> aren't in released kernels?  Also, is there anything else in the wip*
>> branches that you think we may find particularly useful?
>
> You were looking in the right place.  The problem was we weren't super
> organized with our stable patches, and changed our minds about what to
> send upstream.  These are 'wip' in the sense that they were in preparation
> for going upstream.  The goal is to push them to the mainline stable
> kernels and ideally not keep them in our tree at all.
>
> wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
> we're keeping it so that ubuntu can pick it up for quantal.
>
> I'll make sure these are more clearly marked as stable.
>
> sage
>
>
>>
>>
>> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
>> > On Mon, 12 Nov 2012, Nick Bartos wrote:
>> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
>> >> seems we no longer have this hang.
>> >
>> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
>> > stable series?  I recently prepared a new one that backports *all* of the
>> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
>> > be curious if you see problems with that.
>> >
>> > So far, with these fixes in place, we have not seen any unexplained kernel
>> > crashes in this code.
>> >
>> > I take it you're going back to a 3.5 kernel because you weren't able to
>> > get rid of the sync problem with 3.6?
>> >
>> > sage
>> >
>> >
>> >
>> >>
>> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
>> >> >>
>> >> >> We are seeing a somewhat random, but frequent hang on our systems
>> >> >> during startup.  The hang happens at the point where an "rbd map
>> >> >> <rbdvol>" command is run.
>> >> >>
>> >> >> I've attached the ceph logs from the cluster.  The map command happens
>> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
>> >> >> be seen in the log as 172.18.0.15:0/1143980479.
>> >> >>
>> >> >> It appears that the TCP socket to the OSD is opened but then times
>> >> >> out 15 minutes later; the process only receives data once the socket
>> >> >> is closed on the client side, at which point it retries.
>> >> >>
>> >> >> Please help.
>> >> >>
>> >> >> We are using ceph version 0.48.2argonaut
>> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
>> >> >>
>> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
>> >> >>
>> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
>> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
>> >> >> 3-libceph-move-init-of-bio_iter.patch
>> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
>> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
>> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
>> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
>> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
>> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
>> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
>> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
>> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
>> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
>> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
>> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
>> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
>> >> >> 17-libceph-check-for-invalid-mapping.patch
>> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
>> >> >> 19-rbd-BUG-on-invalid-layout.patch
>> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
>> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
>> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
>> >> >>
>> >> >> Any suggestions?
>> >> >
>> >> >
>> >> > The log shows your monitors' clocks aren't synchronized closely
>> >> > enough for them to make much progress (including authenticating new
>> >> > connections). That's probably the real issue; 0.2s is a pretty large
>> >> > clock drift.
>> >> >
>> >> >
>> >> >> One thought is that the following patch (which we could not apply) is
>> >> >> what is required:
>> >> >>
>> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
>> >> >
>> >> >
>> >> > This is certainly useful too, but I don't think it's the cause of
>> >> > the delay in this case.
>> >> >
>> >> > Josh
>>
>>


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-16 18:36           ` Nick Bartos
@ 2012-11-16 19:16             ` Sage Weil
  2012-11-16 22:01               ` Nick Bartos
From: Sage Weil @ 2012-11-16 19:16 UTC (permalink / raw)
  To: Nick Bartos; +Cc: Josh Durgin, Mandell Degerness, ceph-devel

I just realized I was mixing up this thread with the other deadlock 
thread.  

On Fri, 16 Nov 2012, Nick Bartos wrote:
> Turns out we're having the 'rbd map' hang on startup again, after we
> started using the wip-3.5 patch set.  How critical is the
> libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
> removed before, which seemed to get rid of the problem (although I'm
> not completely sure it got rid of it entirely, the hang at least
> seemed to happen much less often).
> 
> It seems like we only started having this issue after we started
> patching the 3.5 ceph client (we started patching to try and get rid
> of a kernel oops, which the patches seem to have fixed).

Right.  That patch fixes a real bug.  It also seems pretty unlikely that 
this patch is related to the startup hang.  The original log showed clock 
drift on the monitor that could very easily cause this sort of hang.  Can 
you confirm that that isn't the case with this recent instance of the 
problem?  And/or attach a log?

Thanks-
sage
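One quick way to check is to grep the monitor logs for skew warnings. A minimal sketch; the log path and exact message wording below are assumptions (approximating argonaut-era output), and the sample log line is fabricated for illustration:

```shell
# Create a sample log line of the kind a monitor emits when a peer's clock
# is off (wording approximate); in practice grep your real mon log, e.g.
# /var/log/ceph/ceph-mon.a.log.
log=/tmp/ceph-mon.sample.log
cat > "$log" <<'EOF'
2012-11-08 18:41:09.000 mon.a [WRN] message from mon.b was stamped 0.200000s in the future, clocks not synchronized
EOF
grep -c "clocks not synchronized" "$log"
```

Comparing the skew values in such lines against "mon clock drift allowed" tells you whether authentication could be stalling on clock drift.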


> 
> 
> On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@inktank.com> wrote:
> > On Thu, 15 Nov 2012, Nick Bartos wrote:
> >> Sorry I guess this e-mail got missed.  I believe those patches came
> >> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
> >> branch patches, which seem to all be fine.  We'll stick with 3.5 and
> >> this backport for now until we can figure out what's wrong with 3.6.
> >>
> >> I typically ignore the wip branches just due to the naming when I'm
> >> looking for updates.  Where should I typically look for updates that
> >> aren't in released kernels?  Also, is there anything else in the wip*
> >> branches that you think we may find particularly useful?
> >
> > You were looking in the right place.  The problem was we weren't super
> > organized with our stable patches, and changed our minds about what to
> > send upstream.  These are 'wip' in the sense that they were in preparation
> > for going upstream.  The goal is to push them to the mainline stable
> > kernels and ideally not keep them in our tree at all.
> >
> > wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
> > we're keeping it so that ubuntu can pick it up for quantal.
> >
> > I'll make sure these are more clearly marked as stable.
> >
> > sage
> >
> >
> >>
> >>
> >> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
> >> > On Mon, 12 Nov 2012, Nick Bartos wrote:
> >> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
> >> >> seems we no longer have this hang.
> >> >
> >> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
> >> > stable series?  I recently prepared a new one that backports *all* of the
> >> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
> >> > be curious if you see problems with that.
> >> >
> >> > So far, with these fixes in place, we have not seen any unexplained kernel
> >> > crashes in this code.
> >> >
> >> > I take it you're going back to a 3.5 kernel because you weren't able to
> >> > get rid of the sync problem with 3.6?
> >> >
> >> > sage
> >> >
> >> >
> >> >
> >> >>
> >> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> >> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
> >> >> >>
> >> >> >> We are seeing a somewhat random, but frequent hang on our systems
> >> >> >> during startup.  The hang happens at the point where an "rbd map
> >> >> >> <rbdvol>" command is run.
> >> >> >>
> >> >> >> I've attached the ceph logs from the cluster.  The map command happens
> >> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
> >> >> >> be seen in the log as 172.18.0.15:0/1143980479.
> >> >> >>
> >> >> >> It appears as if the TCP socket is opened to the OSD, but then times
> >> >> >> out 15 minutes later, the process gets data when the socket is closed
> >> >> >> on the client server and it retries.
> >> >> >>
> >> >> >> Please help.
> >> >> >>
> >> >> >> We are using ceph version 0.48.2argonaut
> >> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
> >> >> >>
> >> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
> >> >> >>
> >> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
> >> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
> >> >> >> 3-libceph-move-init-of-bio_iter.patch
> >> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
> >> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
> >> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
> >> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
> >> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
> >> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
> >> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
> >> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
> >> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
> >> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
> >> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
> >> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
> >> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
> >> >> >> 17-libceph-check-for-invalid-mapping.patch
> >> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
> >> >> >> 19-rbd-BUG-on-invalid-layout.patch
> >> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
> >> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
> >> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
> >> >> >>
> >> >> >> Any suggestions?
> >> >> >
> >> >> >
> >> >> > The log shows your monitors don't have time synchronized enough among
> >> >> > them to make much progress (including authenticating new connections).
> >> >> > That's probably the real issue. 0.2s is pretty large clock drift.
> >> >> >
> >> >> >
> >> >> >> One thought is that the following patch (which we could not apply) is
> >> >> >> what is required:
> >> >> >>
> >> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
> >> >> >
> >> >> >
> >> >> > This is certainly useful too, but I don't think it's the cause of
> >> >> > the delay in this case.
> >> >> >
> >> >> > Josh
> --


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-16 19:16             ` Sage Weil
@ 2012-11-16 22:01               ` Nick Bartos
  2012-11-16 22:13                 ` Sage Weil
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-11-16 22:01 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, Mandell Degerness, ceph-devel

How far off do the clocks need to be before there is a problem?  It
would seem to be hard to ensure a very large cluster has all of its
nodes synchronized within 50ms (which seems to be the default for "mon
clock drift allowed").  Does the mon clock drift allowed parameter
change anything other than the log messages?  Are there any other
tuning options that may help, assuming that this is the issue and it's
not feasible to get the clocks synchronized to better than 500ms
across all nodes?
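For reference, the options in question are set in the [mon] section of
ceph.conf.  A sketch with illustrative values only (option names and
defaults as I understand them for argonaut-era releases, not
recommendations):

```ini
[mon]
    ; warning threshold for clock skew between monitors (default 0.05 s)
    mon clock drift allowed = 0.5
    ; paxos lease interval; skew must stay well below this (default 5 s)
    mon lease = 5
```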

I'm trying to get a good way of reproducing this and get a trace on
the ceph processes to see what they're waiting on.  I'll let you know
when I have more info.


On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil <sage@inktank.com> wrote:
> I just realized I was mixing up this thread with the other deadlock
> thread.
>
> On Fri, 16 Nov 2012, Nick Bartos wrote:
>> Turns out we're having the 'rbd map' hang on startup again, after we
>> started using the wip-3.5 patch set.  How critical is the
>> libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
>> removed before which seemed to get rid of the problem (although I'm
>> not completely sure if it completely got rid of it, at least seemed to
>> happen much less often).
>>
>> It seems like we only started having this issue after we started
>> patching the 3.5 ceph client (we started patching to try and get rid
>> of a kernel oops, which the patches seem to have fixed).
>
> Right.  That patch fixes a real bug.  It also seems pretty unlikely that
> this patch is related to the startup hang.  The original log showed clock
> drift on the monitor that could very easily cause this sort of hang.  Can
> you confirm that that isn't the case with this recent instance of the
> problem?  And/or attach a log?
>
> Thanks-
> sage
>
>
>>
>>
>> On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@inktank.com> wrote:
>> > On Thu, 15 Nov 2012, Nick Bartos wrote:
>> >> Sorry I guess this e-mail got missed.  I believe those patches came
>> >> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
>> >> branch patches, which seem to all be fine.  We'll stick with 3.5 and
>> >> this backport for now until we can figure out what's wrong with 3.6.
>> >>
>> >> I typically ignore the wip branches just due to the naming when I'm
>> >> looking for updates.  Where should I typically look for updates that
>> >> aren't in released kernels?  Also, is there anything else in the wip*
>> >> branches that you think we may find particularly useful?
>> >
>> > You were looking in the right place.  The problem was we weren't super
>> > organized with our stable patches, and changed our minds about what to
>> > send upstream.  These are 'wip' in the sense that they were in preparation
>> > for going upstream.  The goal is to push them to the mainline stable
>> > kernels and ideally not keep them in our tree at all.
>> >
>> > wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
>> > we're keeping it so that ubuntu can pick it up for quantal.
>> >
>> > I'll make sure these are more clearly marked as stable.
>> >
>> > sage
>> >
>> >
>> >>
>> >>
>> >> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
>> >> > On Mon, 12 Nov 2012, Nick Bartos wrote:
>> >> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
>> >> >> seems we no longer have this hang.
>> >> >
>> >> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
>> >> > stable series?  I recently prepared a new one that backports *all* of the
>> >> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
>> >> > be curious if you see problems with that.
>> >> >
>> >> > So far, with these fixes in place, we have not seen any unexplained kernel
>> >> > crashes in this code.
>> >> >
>> >> > I take it you're going back to a 3.5 kernel because you weren't able to
>> >> > get rid of the sync problem with 3.6?
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>> >> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
>> >> >> >>
>> >> >> >> We are seeing a somewhat random, but frequent hang on our systems
>> >> >> >> during startup.  The hang happens at the point where an "rbd map
>> >> >> >> <rbdvol>" command is run.
>> >> >> >>
>> >> >> >> I've attached the ceph logs from the cluster.  The map command happens
>> >> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
>> >> >> >> be seen in the log as 172.18.0.15:0/1143980479.
>> >> >> >>
>> >> >> >> It appears as if the TCP socket is opened to the OSD, but then times
>> >> >> >> out 15 minutes later, the process gets data when the socket is closed
>> >> >> >> on the client server and it retries.
>> >> >> >>
>> >> >> >> Please help.
>> >> >> >>
>> >> >> >> We are using ceph version 0.48.2argonaut
>> >> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
>> >> >> >>
>> >> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
>> >> >> >>
>> >> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
>> >> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
>> >> >> >> 3-libceph-move-init-of-bio_iter.patch
>> >> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
>> >> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
>> >> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
>> >> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
>> >> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
>> >> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
>> >> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
>> >> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
>> >> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
>> >> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
>> >> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
>> >> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
>> >> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
>> >> >> >> 17-libceph-check-for-invalid-mapping.patch
>> >> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
>> >> >> >> 19-rbd-BUG-on-invalid-layout.patch
>> >> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
>> >> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
>> >> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
>> >> >> >>
>> >> >> >> Any suggestions?
>> >> >> >
>> >> >> >
>> >> >> > The log shows your monitors don't have time synchronized enough among
>> >> >> > them to make much progress (including authenticating new connections).
>> >> >> > That's probably the real issue. 0.2s is pretty large clock drift.
>> >> >> >
>> >> >> >
>> >> >> >> One thought is that the following patch (which we could not apply) is
>> >> >> >> what is required:
>> >> >> >>
>> >> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
>> >> >> >
>> >> >> >
>> >> >> > This is certainly useful too, but I don't think it's the cause of
>> >> >> > the delay in this case.
>> >> >> >
>> >> >> > Josh


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-16 22:01               ` Nick Bartos
@ 2012-11-16 22:13                 ` Sage Weil
  2012-11-16 22:16                   ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Sage Weil @ 2012-11-16 22:13 UTC (permalink / raw)
  To: Nick Bartos; +Cc: Josh Durgin, Mandell Degerness, ceph-devel

You can safely set the clock drift allowed as high as 500ms.  The real 
limitation is that it needs to be well under the lease interval, which is 
currently 5 seconds by default.

You might be able to reproduce more easily by lowering the threshold...
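To see why the drift has to stay well under the lease interval, here is a
toy calculation (an assumed simplification, not Ceph's actual lease code):
a lease stamped by the leader is honored until stamp + interval as measured
by the receiving monitor's own clock, so any skew is subtracted directly
from the usable lease time.

```shell
# Toy model: usable lease time seen by a monitor whose clock runs 'skew'
# seconds ahead of the leader's.  Values are illustrative.
lease=5
for skew in 0.05 0.5 4.9; do
  awk -v lease="$lease" -v skew="$skew" \
    'BEGIN { printf "skew=%ss -> usable lease=%.2fs\n", skew, lease - skew }'
done
```

With the 5s default, 0.5s of skew still leaves 4.5s of lease, which is why
that threshold is safe; skew approaching 5s leaves essentially nothing.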

sage


On Fri, 16 Nov 2012, Nick Bartos wrote:

> How far off do the clocks need to be before there is a problem?  It
> would seem to be hard to ensure a very large cluster has all of its
> nodes synchronized within 50ms (which seems to be the default for "mon
> clock drift allowed").  Does the mon clock drift allowed parameter
> change anything other than the log messages?  Are there any other
> tuning options that may help, assuming that this is the issue and it's
> not feasible to get the clocks synchronized to better than 500ms
> across all nodes?
> 
> I'm trying to get a good way of reproducing this and get a trace on
> the ceph processes to see what they're waiting on.  I'll let you know
> when I have more info.
> 
> 
> On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil <sage@inktank.com> wrote:
> > I just realized I was mixing up this thread with the other deadlock
> > thread.
> >
> > On Fri, 16 Nov 2012, Nick Bartos wrote:
> >> Turns out we're having the 'rbd map' hang on startup again, after we
> >> started using the wip-3.5 patch set.  How critical is the
> >> libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
> >> removed before which seemed to get rid of the problem (although I'm
> >> not completely sure if it completely got rid of it, at least seemed to
> >> happen much less often).
> >>
> >> It seems like we only started having this issue after we started
> >> patching the 3.5 ceph client (we started patching to try and get rid
> >> of a kernel oops, which the patches seem to have fixed).
> >
> > Right.  That patch fixes a real bug.  It also seems pretty unlikely that
> > this patch is related to the startup hang.  The original log showed clock
> > drift on the monitor that could very easily cause this sort of hang.  Can
> > you confirm that that isn't the case with this recent instance of the
> > problem?  And/or attach a log?
> >
> > Thanks-
> > sage
> >
> >
> >>
> >>
> >> On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@inktank.com> wrote:
> >> > On Thu, 15 Nov 2012, Nick Bartos wrote:
> >> >> Sorry I guess this e-mail got missed.  I believe those patches came
> >> >> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
> >> >> branch patches, which seem to all be fine.  We'll stick with 3.5 and
> >> >> this backport for now until we can figure out what's wrong with 3.6.
> >> >>
> >> >> I typically ignore the wip branches just due to the naming when I'm
> >> >> looking for updates.  Where should I typically look for updates that
> >> >> aren't in released kernels?  Also, is there anything else in the wip*
> >> >> branches that you think we may find particularly useful?
> >> >
> >> > You were looking in the right place.  The problem was we weren't super
> >> > organized with our stable patches, and changed our minds about what to
> >> > send upstream.  These are 'wip' in the sense that they were in preparation
> >> > for going upstream.  The goal is to push them to the mainline stable
> >> > kernels and ideally not keep them in our tree at all.
> >> >
> >> > wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
> >> > we're keeping it so that ubuntu can pick it up for quantal.
> >> >
> >> > I'll make sure these are more clearly marked as stable.
> >> >
> >> > sage
> >> >
> >> >
> >> >>
> >> >>
> >> >> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
> >> >> > On Mon, 12 Nov 2012, Nick Bartos wrote:
> >> >> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
> >> >> >> seems we no longer have this hang.
> >> >> >
> >> >> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
> >> >> > stable series?  I recently prepared a new one that backports *all* of the
> >> >> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
> >> >> > be curious if you see problems with that.
> >> >> >
> >> >> > So far, with these fixes in place, we have not seen any unexplained kernel
> >> >> > crashes in this code.
> >> >> >
> >> >> > I take it you're going back to a 3.5 kernel because you weren't able to
> >> >> > get rid of the sync problem with 3.6?
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> >> >> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
> >> >> >> >>
> >> >> >> >> We are seeing a somewhat random, but frequent hang on our systems
> >> >> >> >> during startup.  The hang happens at the point where an "rbd map
> >> >> >> >> <rbdvol>" command is run.
> >> >> >> >>
> >> >> >> >> I've attached the ceph logs from the cluster.  The map command happens
> >> >> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
> >> >> >> >> be seen in the log as 172.18.0.15:0/1143980479.
> >> >> >> >>
> >> >> >> >> It appears as if the TCP socket is opened to the OSD, but then times
> >> >> >> >> out 15 minutes later, the process gets data when the socket is closed
> >> >> >> >> on the client server and it retries.
> >> >> >> >>
> >> >> >> >> Please help.
> >> >> >> >>
> >> >> >> >> We are using ceph version 0.48.2argonaut
> >> >> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
> >> >> >> >>
> >> >> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
> >> >> >> >>
> >> >> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
> >> >> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
> >> >> >> >> 3-libceph-move-init-of-bio_iter.patch
> >> >> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
> >> >> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
> >> >> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
> >> >> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
> >> >> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
> >> >> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
> >> >> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
> >> >> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
> >> >> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
> >> >> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
> >> >> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
> >> >> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
> >> >> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
> >> >> >> >> 17-libceph-check-for-invalid-mapping.patch
> >> >> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
> >> >> >> >> 19-rbd-BUG-on-invalid-layout.patch
> >> >> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
> >> >> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
> >> >> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
> >> >> >> >>
> >> >> >> >> Any suggestions?
> >> >> >> >
> >> >> >> >
> >> >> >> > The log shows your monitors don't have time synchronized enough among
> >> >> >> > them to make much progress (including authenticating new connections).
> >> >> >> > That's probably the real issue. 0.2s is pretty large clock drift.
> >> >> >> >
> >> >> >> >
> >> >> >> >> One thought is that the following patch (which we could not apply) is
> >> >> >> >> what is required:
> >> >> >> >>
> >> >> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
> >> >> >> >
> >> >> >> >
> >> >> >> > This is certainly useful too, but I don't think it's the cause of
> >> >> >> > the delay in this case.
> >> >> >> >
> >> >> >> > Josh


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-16 22:13                 ` Sage Weil
@ 2012-11-16 22:16                   ` Nick Bartos
  2012-11-16 22:21                     ` Sage Weil
  2012-11-16 22:23                     ` Gregory Farnum
  0 siblings, 2 replies; 56+ messages in thread
From: Nick Bartos @ 2012-11-16 22:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, Mandell Degerness, ceph-devel

Should I be lowering the clock drift allowed, or the lease interval to
help reproduce it?

On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil <sage@inktank.com> wrote:
> You can safely set the clock drift allowed as high as 500ms.  The real
> limitation is that it needs to be well under the lease interval, which is
> currently 5 seconds by default.
>
> You might be able to reproduce more easily by lowering the threshold...
>
> sage
>
>
> On Fri, 16 Nov 2012, Nick Bartos wrote:
>
>> How far off do the clocks need to be before there is a problem?  It
>> would seem to be hard to ensure a very large cluster has all of its
>> nodes synchronized within 50ms (which seems to be the default for "mon
>> clock drift allowed").  Does the mon clock drift allowed parameter
>> change anything other than the log messages?  Are there any other
>> tuning options that may help, assuming that this is the issue and it's
>> not feasible to get the clocks synchronized to better than 500ms
>> across all nodes?
>>
>> I'm trying to get a good way of reproducing this and get a trace on
>> the ceph processes to see what they're waiting on.  I'll let you know
>> when I have more info.
>>
>>
>> On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil <sage@inktank.com> wrote:
>> > I just realized I was mixing up this thread with the other deadlock
>> > thread.
>> >
>> > On Fri, 16 Nov 2012, Nick Bartos wrote:
>> >> Turns out we're having the 'rbd map' hang on startup again, after we
>> >> started using the wip-3.5 patch set.  How critical is the
>> >> libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
>> >> removed before which seemed to get rid of the problem (although I'm
>> >> not completely sure if it completely got rid of it, at least seemed to
>> >> happen much less often).
>> >>
>> >> It seems like we only started having this issue after we started
>> >> patching the 3.5 ceph client (we started patching to try and get rid
>> >> of a kernel oops, which the patches seem to have fixed).
>> >
>> > Right.  That patch fixes a real bug.  It also seems pretty unlikely that
>> > this patch is related to the startup hang.  The original log showed clock
>> > drift on the monitor that could very easily cause this sort of hang.  Can
>> > you confirm that that isn't the case with this recent instance of the
>> > problem?  And/or attach a log?
>> >
>> > Thanks-
>> > sage
>> >
>> >
>> >>
>> >>
>> >> On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@inktank.com> wrote:
>> >> > On Thu, 15 Nov 2012, Nick Bartos wrote:
>> >> >> Sorry I guess this e-mail got missed.  I believe those patches came
>> >> >> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
>> >> >> branch patches, which seem to all be fine.  We'll stick with 3.5 and
>> >> >> this backport for now until we can figure out what's wrong with 3.6.
>> >> >>
>> >> >> I typically ignore the wip branches just due to the naming when I'm
>> >> >> looking for updates.  Where should I typically look for updates that
>> >> >> aren't in released kernels?  Also, is there anything else in the wip*
>> >> >> branches that you think we may find particularly useful?
>> >> >
>> >> > You were looking in the right place.  The problem was we weren't super
>> >> > organized with our stable patches, and changed our minds about what to
>> >> > send upstream.  These are 'wip' in the sense that they were in preparation
>> >> > for going upstream.  The goal is to push them to the mainline stable
>> >> > kernels and ideally not keep them in our tree at all.
>> >> >
>> >> > wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
>> >> > we're keeping it so that ubuntu can pick it up for quantal.
>> >> >
>> >> > I'll make sure these are more clearly marked as stable.
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
>> >> >> > On Mon, 12 Nov 2012, Nick Bartos wrote:
>> >> >> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
>> >> >> >> seems we no longer have this hang.
>> >> >> >
>> >> >> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
>> >> >> > stable series?  I recently prepared a new one that backports *all* of the
>> >> >> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
>> >> >> > be curious if you see problems with that.
>> >> >> >
>> >> >> > So far, with these fixes in place, we have not seen any unexplained kernel
>> >> >> > crashes in this code.
>> >> >> >
>> >> >> > I take it you're going back to a 3.5 kernel because you weren't able to
>> >> >> > get rid of the sync problem with 3.6?
>> >> >> >
>> >> >> > sage
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>> >> >> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
>> >> >> >> >>
>> >> >> >> >> We are seeing a somewhat random, but frequent hang on our systems
>> >> >> >> >> during startup.  The hang happens at the point where an "rbd map
>> >> >> >> >> <rbdvol>" command is run.
>> >> >> >> >>
>> >> >> >> >> I've attached the ceph logs from the cluster.  The map command happens
>> >> >> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
>> >> >> >> >> be seen in the log as 172.18.0.15:0/1143980479.
>> >> >> >> >>
>> >> >> >> >> It appears as if the TCP socket is opened to the OSD, but then times
>> >> >> >> >> out 15 minutes later, the process gets data when the socket is closed
>> >> >> >> >> on the client server and it retries.
>> >> >> >> >>
>> >> >> >> >> Please help.
>> >> >> >> >>
>> >> >> >> >> We are using ceph version 0.48.2argonaut
>> >> >> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
>> >> >> >> >>
>> >> >> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
>> >> >> >> >>
>> >> >> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
>> >> >> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
>> >> >> >> >> 3-libceph-move-init-of-bio_iter.patch
>> >> >> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
>> >> >> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
>> >> >> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
>> >> >> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
>> >> >> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
>> >> >> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
>> >> >> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
>> >> >> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
>> >> >> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
>> >> >> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
>> >> >> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
>> >> >> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
>> >> >> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
>> >> >> >> >> 17-libceph-check-for-invalid-mapping.patch
>> >> >> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
>> >> >> >> >> 19-rbd-BUG-on-invalid-layout.patch
>> >> >> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
>> >> >> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
>> >> >> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
>> >> >> >> >>
>> >> >> >> >> Any suggestions?
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > The log shows your monitors don't have time synchronized closely enough
>> >> >> >> > among them to make much progress (including authenticating new connections).
>> >> >> >> > That's probably the real issue. 0.2s is a pretty large clock drift.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >> One thought is that the following patch (which we could not apply) is
>> >> >> >> >> what is required:
>> >> >> >> >>
>> >> >> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > This is certainly useful too, but I don't think it's the cause of
>> >> >> >> > the delay in this case.
>> >> >> >> >
>> >> >> >> > Josh
>> >> >> >> > --
>> >> >> >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> >> >> > the body of a message to majordomo@vger.kernel.org
>> >> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-16 22:16                   ` Nick Bartos
@ 2012-11-16 22:21                     ` Sage Weil
  2012-11-19 23:04                       ` Nick Bartos
  2012-11-16 22:23                     ` Gregory Farnum
  1 sibling, 1 reply; 56+ messages in thread
From: Sage Weil @ 2012-11-16 22:21 UTC (permalink / raw)
  To: Nick Bartos; +Cc: Josh Durgin, Mandell Degerness, ceph-devel

On Fri, 16 Nov 2012, Nick Bartos wrote:
> Should I be lowering the clock drift allowed, or the lease interval to
> help reproduce it?

clock drift allowed.



> 
> On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil <sage@inktank.com> wrote:
> > You can safely set the clock drift allowed as high as 500ms.  The real
> > limitation is that it needs to be well under the lease interval, which is
> > currently 5 seconds by default.
> >
> > You might be able to reproduce more easily by lowering the threshold...
> >
> > sage
> >
> >
> > On Fri, 16 Nov 2012, Nick Bartos wrote:
> >
> >> How far off do the clocks need to be before there is a problem?  It
> >> would seem to be hard to ensure a very large cluster has all of its
> >> nodes synchronized within 50ms (which seems to be the default for "mon
> >> clock drift allowed").  Does the mon clock drift allowed parameter
> >> change anything other than the log messages?  Are there any other
> >> tuning options that may help, assuming that this is the issue and it's
> >> not feasible to get the clocks synchronized to better than 500ms across
> >> all nodes?
> >>
> >> I'm trying to get a good way of reproducing this and get a trace on
> >> the ceph processes to see what they're waiting on.  I'll let you know
> >> when I have more info.
> >>
> >>
> >> On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil <sage@inktank.com> wrote:
> >> > I just realized I was mixing up this thread with the other deadlock
> >> > thread.
> >> >
> >> > On Fri, 16 Nov 2012, Nick Bartos wrote:
> >> >> Turns out we're having the 'rbd map' hang on startup again, after we
> >> >> started using the wip-3.5 patch set.  How critical is the
> >> >> libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
> >> >> removed before, which seemed to get rid of the problem (although I'm
> >> >> not completely sure it eliminated it entirely; it at least seemed to
> >> >> happen much less often).
> >> >>
> >> >> It seems like we only started having this issue after we started
> >> >> patching the 3.5 ceph client (we started patching to try and get rid
> >> >> of a kernel oops, which the patches seem to have fixed).
> >> >
> >> > Right.  That patch fixes a real bug.  It also seems pretty unlikely that
> >> > this patch is related to the startup hang.  The original log showed clock
> >> > drift on the monitor that could very easily cause this sort of hang.  Can
> >> > you confirm that that isn't the case with this recent instance of the
> >> > problem?  And/or attach a log?
> >> >
> >> > Thanks-
> >> > sage
> >> >
> >> >
> >> >>
> >> >>
> >> >> On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@inktank.com> wrote:
> >> >> > On Thu, 15 Nov 2012, Nick Bartos wrote:
> >> >> >> Sorry I guess this e-mail got missed.  I believe those patches came
> >> >> >> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
> >> >> >> branch patches, which seem to all be fine.  We'll stick with 3.5 and
> >> >> >> this backport for now until we can figure out what's wrong with 3.6.
> >> >> >>
> >> >> >> I typically ignore the wip branches just due to the naming when I'm
> >> >> >> looking for updates.  Where should I typically look for updates that
> >> >> >> aren't in released kernels?  Also, is there anything else in the wip*
> >> >> >> branches that you think we may find particularly useful?
> >> >> >
> >> >> > You were looking in the right place.  The problem was we weren't super
> >> >> > organized with our stable patches, and changed our minds about what to
> >> >> > send upstream.  These are 'wip' in the sense that they were in preparation
> >> >> > for going upstream.  The goal is to push them to the mainline stable
> >> >> > kernels and ideally not keep them in our tree at all.
> >> >> >
> >> >> > wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
> >> >> > we're keeping it so that ubuntu can pick it up for quantal.
> >> >> >
> >> >> > I'll make sure these are more clearly marked as stable.
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
> >> >> >> > On Mon, 12 Nov 2012, Nick Bartos wrote:
> >> >> >> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
> >> >> >> >> seems we no longer have this hang.
> >> >> >> >
> >> >> >> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
> >> >> >> > stable series?  I recently prepared a new one that backports *all* of the
> >> >> >> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
> >> >> >> > be curious if you see problems with that.
> >> >> >> >
> >> >> >> > So far, with these fixes in place, we have not seen any unexplained kernel
> >> >> >> > crashes in this code.
> >> >> >> >
> >> >> >> > I take it you're going back to a 3.5 kernel because you weren't able to
> >> >> >> > get rid of the sync problem with 3.6?
> >> >> >> >
> >> >> >> > sage
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> >> >> >> >> > [...]


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-16 22:16                   ` Nick Bartos
  2012-11-16 22:21                     ` Sage Weil
@ 2012-11-16 22:23                     ` Gregory Farnum
  1 sibling, 0 replies; 56+ messages in thread
From: Gregory Farnum @ 2012-11-16 22:23 UTC (permalink / raw)
  To: Nick Bartos; +Cc: Sage Weil, Josh Durgin, Mandell Degerness, ceph-devel

To be clear, the monitor cluster needs to be within this clock drift —
the rest of the Ceph cluster can be off by as much as you care to.

(Well, there's also a limit imposed by cephx authorization which can
keep nodes out of the cluster, but that drift allowance is measured in
units of hours.)
-Greg
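
[For reference, the settings discussed in this thread live in ceph.conf on
the monitor hosts. A minimal sketch follows; the option names match the
argonaut-era monitor code discussed above, but the values are illustrative
only, and note that per this thread "mon clock drift allowed" governs only
the warning, not the Paxos logic itself.]

```ini
[mon]
; Warn when monitor clocks differ by more than this many seconds.
; Per this thread, this only controls the health warning.
mon clock drift allowed = 0.5

; Paxos lease interval in seconds (assumed option name; the thread
; only states the default lease interval is 5 seconds). Actual clock
; skew between monitors must stay well under this value.
mon lease = 5
```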

On Fri, Nov 16, 2012 at 2:16 PM, Nick Bartos <nick@pistoncloud.com> wrote:
> Should I be lowering the clock drift allowed, or the lease interval to
> help reproduce it?
>
> On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil <sage@inktank.com> wrote:
>> You can safely set the clock drift allowed as high as 500ms.  The real
>> limitation is that it needs to be well under the lease interval, which is
>> currently 5 seconds by default.
>>
>> You might be able to reproduce more easily by lowering the threshold...
>>
>> sage
>>
>>
>> On Fri, 16 Nov 2012, Nick Bartos wrote:
>>> [...]


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-16 22:21                     ` Sage Weil
@ 2012-11-19 23:04                       ` Nick Bartos
  2012-11-19 23:34                         ` Gregory Farnum
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-11-19 23:04 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, Mandell Degerness, ceph-devel

Making 'mon clock drift allowed' very small (0.00001) does not
reliably reproduce the hang.  I started looking at the code for 0.48.2
and it looks like this option is only used in
Paxos::warn_on_future_time, which only emits the warning; it has no
other effect.
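
[The constraint Sage described earlier, that skew must stay well under
the ~5 second lease interval, can be sketched with a toy model. This is
purely illustrative and is not Ceph's actual Paxos code; the function
name and numbers are made up for the example.]

```python
# Toy model (not Ceph code): why monitor clock skew must stay well
# under the Paxos lease interval. A peon treats the leader's lease as
# valid until lease_start + lease_interval, but it checks that against
# its *own* clock. If its clock runs ahead of the leader's by `skew`
# seconds, the usable lease window shrinks by that much; once skew
# reaches the lease interval, no lease is ever honored.

def usable_lease(lease_interval: float, skew: float) -> float:
    """Seconds of lease a skewed peon actually honors (floored at 0)."""
    return max(0.0, lease_interval - skew)

if __name__ == "__main__":
    lease = 5.0  # default lease interval per the thread, in seconds
    for skew in (0.05, 0.2, 0.5, 5.0, 6.0):
        print(f"skew={skew:>5}s -> usable lease {usable_lease(lease, skew):.2f}s")
```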


On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil <sage@inktank.com> wrote:
> On Fri, 16 Nov 2012, Nick Bartos wrote:
>> Should I be lowering the clock drift allowed, or the lease interval to
>> help reproduce it?
>
> clock drift allowed.
>
>
>
>>
>> On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil <sage@inktank.com> wrote:
>> > You can safely set the clock drift allowed as high as 500ms.  The real
>> > limitation is that it needs to be well under the lease interval, which is
>> > currently 5 seconds by default.
>> >
>> > You might be able to reproduce more easily by lowering the threshold...
>> >
>> > sage
>> >
>> >
>> > On Fri, 16 Nov 2012, Nick Bartos wrote:
>> >
>> >> How far off do the clocks need to be before there is a problem?  It
>> >> would seem to be hard to ensure a very large cluster has all of it's
>> >> nodes synchronized within 50ms (which seems to be the default for "mon
>> >> clock drift allowed").  Does the mon clock drift allowed parameter
>> >> change anything other than the log messages?  Are there any other
>> >> tuning options that may help, assuming that this is the issue and it's
>> >> not feasible to get the clocks more than 500ms in sync between all
>> >> nodes?
>> >>
>> >> I'm trying to get a good way of reproducing this and get a trace on
>> >> the ceph processes to see what they're waiting on.  I'll let you know
>> >> when I have more info.
>> >>
>> >>
>> >> On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil <sage@inktank.com> wrote:
>> >> > I just realized I was mixing up this thread with the other deadlock
>> >> > thread.
>> >> >
>> >> > On Fri, 16 Nov 2012, Nick Bartos wrote:
>> >> >> Turns out we're having the 'rbd map' hang on startup again, after we
>> >> >> started using the wip-3.5 patch set.  How critical is the
>> >> >> libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
>> >> >> removed before which seemed to get rid of the problem (although I'm
>> >> >> not completely sure if it completely got rid of it, at least seemed to
>> >> >> happen much less often).
>> >> >>
>> >> >> It seems like we only started having this issue after we started
>> >> >> patching the 3.5 ceph client (we started patching to try and get rid
>> >> >> of a kernel oops, which the patches seem to have fixed).
>> >> >
>> >> > Right.  That patch fixes a real bug.  It also seems pretty unlikely that
>> >> > this patch is related to the startup hang.  The original log showed clock
>> >> > drift on the monitor that could very easily cause this sort of hang.  Can
>> >> > you confirm that that isn't the case with this recent instance of the
>> >> > problem?  And/or attach a log?
>> >> >
>> >> > Thanks-
>> >> > sage
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@inktank.com> wrote:
>> >> >> > On Thu, 15 Nov 2012, Nick Bartos wrote:
>> >> >> >> Sorry I guess this e-mail got missed.  I believe those patches came
>> >> >> >> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
>> >> >> >> branch patches, which seem to all be fine.  We'll stick with 3.5 and
>> >> >> >> this backport for now until we can figure out what's wrong with 3.6.
>> >> >> >>
>> >> >> >> I typically ignore the wip branches just due to the naming when I'm
>> >> >> >> looking for updates.  Where should I typically look for updates that
>> >> >> >> aren't in released kernels?  Also, is there anything else in the wip*
>> >> >> >> branches that you think we may find particularly useful?
>> >> >> >
>> >> >> > You were looking in the right place.  The problem was we weren't super
>> >> >> > organized with our stable patches, and changed our minds about what to
>> >> >> > send upstream.  These are 'wip' in the sense that they were in preparation
>> >> >> > for going upstream.  The goal is to push them to the mainline stable
>> >> >> > kernels and ideally not keep them in our tree at all.
>> >> >> >
>> >> >> > wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
>> >> >> > we're keeping it so that ubuntu can pick it up for quantal.
>> >> >> >
>> >> >> > I'll make sure these are more clearly marked as stable.
>> >> >> >
>> >> >> > sage
>> >> >> >
>> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
>> >> >> >> > On Mon, 12 Nov 2012, Nick Bartos wrote:
>> >> >> >> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
>> >> >> >> >> seems we no longer have this hang.
>> >> >> >> >
>> >> >> >> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
>> >> >> >> > stable series?  I recently prepared a new one that backports *all* of the
>> >> >> >> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
>> >> >> >> > be curious if you see problems with that.
>> >> >> >> >
>> >> >> >> > So far, with these fixes in place, we have not seen any unexplained kernel
>> >> >> >> > crashes in this code.
>> >> >> >> >
>> >> >> >> > I take it you're going back to a 3.5 kernel because you weren't able to
>> >> >> >> > get rid of the sync problem with 3.6?
>> >> >> >> >
>> >> >> >> > sage
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>> >> >> >> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> We are seeing a somewhat random, but frequent hang on our systems
>> >> >> >> >> >> during startup.  The hang happens at the point where an "rbd map
>> >> >> >> >> >> <rbdvol>" command is run.
>> >> >> >> >> >>
>> >> >> >> >> >> I've attached the ceph logs from the cluster.  The map command happens
>> >> >> >> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
>> >> >> >> >> >> be seen in the log as 172.18.0.15:0/1143980479.
>> >> >> >> >> >>
>> >> >> >> >> >> It appears as if the TCP socket is opened to the OSD, but then times
>> >> >> >> >> >> out 15 minutes later, the process gets data when the socket is closed
>> >> >> >> >> >> on the client server and it retries.
>> >> >> >> >> >>
>> >> >> >> >> >> Please help.
>> >> >> >> >> >>
>> >> >> >> >> >> We are using ceph version 0.48.2argonaut
>> >> >> >> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
>> >> >> >> >> >>
>> >> >> >> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
>> >> >> >> >> >>
>> >> >> >> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
>> >> >> >> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
>> >> >> >> >> >> 3-libceph-move-init-of-bio_iter.patch
>> >> >> >> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
>> >> >> >> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
>> >> >> >> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
>> >> >> >> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
>> >> >> >> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
>> >> >> >> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
>> >> >> >> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
>> >> >> >> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
>> >> >> >> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
>> >> >> >> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
>> >> >> >> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
>> >> >> >> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
>> >> >> >> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
>> >> >> >> >> >> 17-libceph-check-for-invalid-mapping.patch
>> >> >> >> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
>> >> >> >> >> >> 19-rbd-BUG-on-invalid-layout.patch
>> >> >> >> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
>> >> >> >> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
>> >> >> >> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
>> >> >> >> >> >>
>> >> >> >> >> >> Any suggestions?
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > The log shows your monitors don't have time sychronized enough among
>> >> >> >> >> > them to make much progress (including authenticating new connections).
>> >> >> >> >> > That's probably the real issue. 0.2s is pretty large clock drift.
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >> One thought is that the following patch (which we could not apply) is
>> >> >> >> >> >> what is required:
>> >> >> >> >> >>
>> >> >> >> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > This is certainly useful too, but I don't think it's the cause of
>> >> >> >> >> > the delay in this case.
>> >> >> >> >> >
>> >> >> >> >> > Josh
>> >> >> >> >> > --
>> >> >> >> >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> >> >> >> > the body of a message to majordomo@vger.kernel.org
>> >> >> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-19 23:04                       ` Nick Bartos
@ 2012-11-19 23:34                         ` Gregory Farnum
  2012-11-20 21:53                           ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Gregory Farnum @ 2012-11-19 23:34 UTC (permalink / raw)
  To: Nick Bartos; +Cc: Sage Weil, Josh Durgin, Mandell Degerness, ceph-devel

Hmm, yep — that param is actually only used for the warning; I guess
we forgot what it actually covers. :(

Have your monitor clocks been off by more than 5 seconds at any point?
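[Editor's note: the 5-second figure corresponds to the default monitor lease interval that Sage mentions further down the thread. A toy sketch of why drift beyond the lease stalls progress — the names and the model are illustrative only, not Ceph's actual code:]

```python
# Illustrative model: a peon judges lease expiry with its OWN clock
# against the leader's timestamp, so a peon clock running 'drift'
# seconds ahead makes the lease appear to expire 'drift' seconds early.
LEASE_INTERVAL = 5.0  # seconds, the Ceph default ('mon lease')

def lease_still_valid(leader_grant_time, leader_now, drift):
    # As seen by a peon whose clock is 'drift' seconds ahead, the lease
    # effectively expires 'drift' seconds before it really would.
    expiry_as_seen_by_peon = leader_grant_time + LEASE_INTERVAL - drift
    return leader_now < expiry_as_seen_by_peon

# 0.2s drift: a lease granted at t=0 still looks valid at t=4.0.
print(lease_still_valid(0.0, 4.0, 0.2))   # True
# Drift larger than the lease interval: every lease looks expired
# the moment it is granted, so the quorum can't make progress.
print(lease_still_valid(0.0, 0.0, 5.5))   # False
```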

On Mon, Nov 19, 2012 at 3:04 PM, Nick Bartos <nick@pistoncloud.com> wrote:
> Making 'mon clock drift allowed' very small (0.00001) does not
> reliably reproduce the hang.  I started looking at the code for 0.48.2
> and it looks like this is only used in Paxos::warn_on_future_time,
> which only handles the warning, nothing else.
>
>
> On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil <sage@inktank.com> wrote:
>> On Fri, 16 Nov 2012, Nick Bartos wrote:
>>> Should I be lowering the clock drift allowed, or the lease interval to
>>> help reproduce it?
>>
>> clock drift allowed.
>>
>>
>>
>>>
>>> On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil <sage@inktank.com> wrote:
>>> > You can safely set the clock drift allowed as high as 500ms.  The real
>>> > limitation is that it needs to be well under the lease interval, which is
>>> > currently 5 seconds by default.
>>> >
>>> > You might be able to reproduce more easily by lowering the threshold...
>>> >
>>> > sage
>>> >
>>> >
>>> > On Fri, 16 Nov 2012, Nick Bartos wrote:
>>> >
>>> >> How far off do the clocks need to be before there is a problem?  It
>>> >> would seem to be hard to ensure a very large cluster has all of its
>>> >> nodes synchronized within 50ms (which seems to be the default for "mon
>>> >> clock drift allowed").  Does the mon clock drift allowed parameter
>>> >> change anything other than the log messages?  Are there any other
>>> >> tuning options that may help, assuming that this is the issue and it's
>>> >> not feasible to get the clocks more than 500ms in sync between all
>>> >> nodes?
>>> >>
>>> >> I'm trying to get a good way of reproducing this and get a trace on
>>> >> the ceph processes to see what they're waiting on.  I'll let you know
>>> >> when I have more info.
>>> >>
>>> >>
>>> >> On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil <sage@inktank.com> wrote:
>>> >> > I just realized I was mixing up this thread with the other deadlock
>>> >> > thread.
>>> >> >
>>> >> > On Fri, 16 Nov 2012, Nick Bartos wrote:
>>> >> >> Turns out we're having the 'rbd map' hang on startup again, after we
>>> >> >> started using the wip-3.5 patch set.  How critical is the
>>> >> >> libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
>>> >> >> removed before which seemed to get rid of the problem (although I'm
>>> >> >> not completely sure if it completely got rid of it, at least seemed to
>>> >> >> happen much less often).
>>> >> >>
>>> >> >> It seems like we only started having this issue after we started
>>> >> >> patching the 3.5 ceph client (we started patching to try and get rid
>>> >> >> of a kernel oops, which the patches seem to have fixed).
>>> >> >
>>> >> > Right.  That patch fixes a real bug.  It also seems pretty unlikely that
>>> >> > this patch is related to the startup hang.  The original log showed clock
>>> >> > drift on the monitor that could very easily cause this sort of hang.  Can
>>> >> > you confirm that that isn't the case with this recent instance of the
>>> >> > problem?  And/or attach a log?
>>> >> >
>>> >> > Thanks-
>>> >> > sage
>>> >> >
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@inktank.com> wrote:
>>> >> >> > On Thu, 15 Nov 2012, Nick Bartos wrote:
>>> >> >> >> Sorry I guess this e-mail got missed.  I believe those patches came
>>> >> >> >> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
>>> >> >> >> branch patches, which seem to all be fine.  We'll stick with 3.5 and
>>> >> >> >> this backport for now until we can figure out what's wrong with 3.6.
>>> >> >> >>
>>> >> >> >> I typically ignore the wip branches just due to the naming when I'm
>>> >> >> >> looking for updates.  Where should I typically look for updates that
>>> >> >> >> aren't in released kernels?  Also, is there anything else in the wip*
>>> >> >> >> branches that you think we may find particularly useful?
>>> >> >> >
>>> >> >> > You were looking in the right place.  The problem was we weren't super
>>> >> >> > organized with our stable patches, and changed our minds about what to
>>> >> >> > send upstream.  These are 'wip' in the sense that they were in preparation
>>> >> >> > for going upstream.  The goal is to push them to the mainline stable
>>> >> >> > kernels and ideally not keep them in our tree at all.
>>> >> >> >
>>> >> >> > wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
>>> >> >> > we're keeping it so that ubuntu can pick it up for quantal.
>>> >> >> >
>>> >> >> > I'll make sure these are more clearly marked as stable.
>>> >> >> >
>>> >> >> > sage
>>> >> >> >
>>> >> >> >
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
>>> >> >> >> > On Mon, 12 Nov 2012, Nick Bartos wrote:
>>> >> >> >> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
>>> >> >> >> >> seems we no longer have this hang.
>>> >> >> >> >
>>> >> >> >> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
>>> >> >> >> > stable series?  I recently prepared a new one that backports *all* of the
>>> >> >> >> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
>>> >> >> >> > be curious if you see problems with that.
>>> >> >> >> >
>>> >> >> >> > So far, with these fixes in place, we have not seen any unexplained kernel
>>> >> >> >> > crashes in this code.
>>> >> >> >> >
>>> >> >> >> > I take it you're going back to a 3.5 kernel because you weren't able to
>>> >> >> >> > get rid of the sync problem with 3.6?
>>> >> >> >> >
>>> >> >> >> > sage
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >>
>>> >> >> >> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>>> >> >> >> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
>>> >> >> >> >> >>
>>> >> >> >> >> >> We are seeing a somewhat random, but frequent hang on our systems
>>> >> >> >> >> >> during startup.  The hang happens at the point where an "rbd map
>>> >> >> >> >> >> <rbdvol>" command is run.
>>> >> >> >> >> >>
>>> >> >> >> >> >> I've attached the ceph logs from the cluster.  The map command happens
>>> >> >> >> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
>>> >> >> >> >> >> be seen in the log as 172.18.0.15:0/1143980479.
>>> >> >> >> >> >>
>>> >> >> >> >> >> It appears as if the TCP socket is opened to the OSD, but then times
>>> >> >> >> >> >> out 15 minutes later, the process gets data when the socket is closed
>>> >> >> >> >> >> on the client server and it retries.
>>> >> >> >> >> >>
>>> >> >> >> >> >> Please help.
>>> >> >> >> >> >>
>>> >> >> >> >> >> We are using ceph version 0.48.2argonaut
>>> >> >> >> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
>>> >> >> >> >> >>
>>> >> >> >> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
>>> >> >> >> >> >>
>>> >> >> >> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
>>> >> >> >> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
>>> >> >> >> >> >> 3-libceph-move-init-of-bio_iter.patch
>>> >> >> >> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
>>> >> >> >> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
>>> >> >> >> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
>>> >> >> >> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
>>> >> >> >> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
>>> >> >> >> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
>>> >> >> >> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
>>> >> >> >> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
>>> >> >> >> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
>>> >> >> >> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
>>> >> >> >> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
>>> >> >> >> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
>>> >> >> >> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
>>> >> >> >> >> >> 17-libceph-check-for-invalid-mapping.patch
>>> >> >> >> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
>>> >> >> >> >> >> 19-rbd-BUG-on-invalid-layout.patch
>>> >> >> >> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
>>> >> >> >> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
>>> >> >> >> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
>>> >> >> >> >> >>
>>> >> >> >> >> >> Any suggestions?
>>> >> >> >> >> >
>>> >> >> >> >> >
>>> >> >> >> >> >> > The log shows your monitors don't have time synchronized enough among
>>> >> >> >> >> > them to make much progress (including authenticating new connections).
>>> >> >> >> >> > That's probably the real issue. 0.2s is pretty large clock drift.
>>> >> >> >> >> >
>>> >> >> >> >> >
>>> >> >> >> >> >> One thought is that the following patch (which we could not apply) is
>>> >> >> >> >> >> what is required:
>>> >> >> >> >> >>
>>> >> >> >> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
>>> >> >> >> >> >
>>> >> >> >> >> >
>>> >> >> >> >> > This is certainly useful too, but I don't think it's the cause of
>>> >> >> >> >> > the delay in this case.
>>> >> >> >> >> >
>>> >> >> >> >> > Josh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-19 23:34                         ` Gregory Farnum
@ 2012-11-20 21:53                           ` Nick Bartos
  2012-11-21  1:31                             ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-11-20 21:53 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Josh Durgin, Mandell Degerness, ceph-devel

I reproduced the problem and got several sysrq states captured.
During this run, the monitor running on the host complained a few
times about the clocks being off, but all messages were for under 0.55
seconds.

Here are the kernel logs.  Note that there are several traces, I
thought multiple during the incident may help:
https://raw.github.com/gist/4121395/a6dda7552ed8a45725ee5d632fe3ba38703f8cfc/gistfile1.txt
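[Editor's note: for reference, task-state traces like the ones above can be captured with the kernel's magic sysrq facility. A minimal sketch, assuming root and the standard procfs paths; 'w' dumps blocked (uninterruptible) tasks, 't' dumps all tasks:]

```shell
# Sketch: dump task backtraces via magic sysrq (requires root).
if [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
    echo w > /proc/sysrq-trigger      # dump blocked (D-state) tasks
    echo t > /proc/sysrq-trigger      # dump all task states
    dmesg | tail -n 500               # the traces land in the ring buffer
else
    echo "run as root to trigger sysrq" >&2
fi
```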


On Mon, Nov 19, 2012 at 3:34 PM, Gregory Farnum <greg@inktank.com> wrote:
> Hmm, yep — that param is actually only used for the warning; I guess
> we forgot what it actually covers. :(
>
> Have your monitor clocks been off by more than 5 seconds at any point?
>
> On Mon, Nov 19, 2012 at 3:04 PM, Nick Bartos <nick@pistoncloud.com> wrote:
>> Making 'mon clock drift allowed' very small (0.00001) does not
>> reliably reproduce the hang.  I started looking at the code for 0.48.2
>> and it looks like this is only used in Paxos::warn_on_future_time,
>> which only handles the warning, nothing else.
>>
>>
>> On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil <sage@inktank.com> wrote:
>>> On Fri, 16 Nov 2012, Nick Bartos wrote:
>>>> Should I be lowering the clock drift allowed, or the lease interval to
>>>> help reproduce it?
>>>
>>> clock drift allowed.
>>>
>>>
>>>
>>>>
>>>> On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil <sage@inktank.com> wrote:
>>>> > You can safely set the clock drift allowed as high as 500ms.  The real
>>>> > limitation is that it needs to be well under the lease interval, which is
>>>> > currently 5 seconds by default.
>>>> >
>>>> > You might be able to reproduce more easily by lowering the threshold...
>>>> >
>>>> > sage
>>>> >
>>>> >
>>>> > On Fri, 16 Nov 2012, Nick Bartos wrote:
>>>> >
>>>> >> How far off do the clocks need to be before there is a problem?  It
>>>> >> would seem to be hard to ensure a very large cluster has all of its
>>>> >> nodes synchronized within 50ms (which seems to be the default for "mon
>>>> >> clock drift allowed").  Does the mon clock drift allowed parameter
>>>> >> change anything other than the log messages?  Are there any other
>>>> >> tuning options that may help, assuming that this is the issue and it's
>>>> >> not feasible to get the clocks more than 500ms in sync between all
>>>> >> nodes?
>>>> >>
>>>> >> I'm trying to get a good way of reproducing this and get a trace on
>>>> >> the ceph processes to see what they're waiting on.  I'll let you know
>>>> >> when I have more info.
>>>> >>
>>>> >>
>>>> >> On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil <sage@inktank.com> wrote:
>>>> >> > I just realized I was mixing up this thread with the other deadlock
>>>> >> > thread.
>>>> >> >
>>>> >> > On Fri, 16 Nov 2012, Nick Bartos wrote:
>>>> >> >> Turns out we're having the 'rbd map' hang on startup again, after we
>>>> >> >> started using the wip-3.5 patch set.  How critical is the
>>>> >> >> libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
>>>> >> >> removed before which seemed to get rid of the problem (although I'm
>>>> >> >> not completely sure if it completely got rid of it, at least seemed to
>>>> >> >> happen much less often).
>>>> >> >>
>>>> >> >> It seems like we only started having this issue after we started
>>>> >> >> patching the 3.5 ceph client (we started patching to try and get rid
>>>> >> >> of a kernel oops, which the patches seem to have fixed).
>>>> >> >
>>>> >> > Right.  That patch fixes a real bug.  It also seems pretty unlikely that
>>>> >> > this patch is related to the startup hang.  The original log showed clock
>>>> >> > drift on the monitor that could very easily cause this sort of hang.  Can
>>>> >> > you confirm that that isn't the case with this recent instance of the
>>>> >> > problem?  And/or attach a log?
>>>> >> >
>>>> >> > Thanks-
>>>> >> > sage
>>>> >> >
>>>> >> >
>>>> >> >>
>>>> >> >>
>>>> >> >> On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@inktank.com> wrote:
>>>> >> >> > On Thu, 15 Nov 2012, Nick Bartos wrote:
>>>> >> >> >> Sorry I guess this e-mail got missed.  I believe those patches came
>>>> >> >> >> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
>>>> >> >> >> branch patches, which seem to all be fine.  We'll stick with 3.5 and
>>>> >> >> >> this backport for now until we can figure out what's wrong with 3.6.
>>>> >> >> >>
>>>> >> >> >> I typically ignore the wip branches just due to the naming when I'm
>>>> >> >> >> looking for updates.  Where should I typically look for updates that
>>>> >> >> >> aren't in released kernels?  Also, is there anything else in the wip*
>>>> >> >> >> branches that you think we may find particularly useful?
>>>> >> >> >
>>>> >> >> > You were looking in the right place.  The problem was we weren't super
>>>> >> >> > organized with our stable patches, and changed our minds about what to
>>>> >> >> > send upstream.  These are 'wip' in the sense that they were in preparation
>>>> >> >> > for going upstream.  The goal is to push them to the mainline stable
>>>> >> >> > kernels and ideally not keep them in our tree at all.
>>>> >> >> >
>>>> >> >> > wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
>>>> >> >> > we're keeping it so that ubuntu can pick it up for quantal.
>>>> >> >> >
>>>> >> >> > I'll make sure these are more clearly marked as stable.
>>>> >> >> >
>>>> >> >> > sage
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
>>>> >> >> >> > On Mon, 12 Nov 2012, Nick Bartos wrote:
>>>> >> >> >> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
>>>> >> >> >> >> seems we no longer have this hang.
>>>> >> >> >> >
>>>> >> >> >> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
>>>> >> >> >> > stable series?  I recently prepared a new one that backports *all* of the
>>>> >> >> >> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
>>>> >> >> >> > be curious if you see problems with that.
>>>> >> >> >> >
>>>> >> >> >> > So far, with these fixes in place, we have not seen any unexplained kernel
>>>> >> >> >> > crashes in this code.
>>>> >> >> >> >
>>>> >> >> >> > I take it you're going back to a 3.5 kernel because you weren't able to
>>>> >> >> >> > get rid of the sync problem with 3.6?
>>>> >> >> >> >
>>>> >> >> >> > sage
>>>> >> >> >> >
>>>> >> >> >> >
>>>> >> >> >> >
>>>> >> >> >> >>
>>>> >> >> >> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>>>> >> >> >> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
>>>> >> >> >> >> >>
>>>> >> >> >> >> >> We are seeing a somewhat random, but frequent hang on our systems
>>>> >> >> >> >> >> during startup.  The hang happens at the point where an "rbd map
>>>> >> >> >> >> >> <rbdvol>" command is run.
>>>> >> >> >> >> >>
>>>> >> >> >> >> >> I've attached the ceph logs from the cluster.  The map command happens
>>>> >> >> >> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
>>>> >> >> >> >> >> be seen in the log as 172.18.0.15:0/1143980479.
>>>> >> >> >> >> >>
>>>> >> >> >> >> >> It appears as if the TCP socket is opened to the OSD, but then times
>>>> >> >> >> >> >> out 15 minutes later, the process gets data when the socket is closed
>>>> >> >> >> >> >> on the client server and it retries.
>>>> >> >> >> >> >>
>>>> >> >> >> >> >> Please help.
>>>> >> >> >> >> >>
>>>> >> >> >> >> >> We are using ceph version 0.48.2argonaut
>>>> >> >> >> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
>>>> >> >> >> >> >>
>>>> >> >> >> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
>>>> >> >> >> >> >>
>>>> >> >> >> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
>>>> >> >> >> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
>>>> >> >> >> >> >> 3-libceph-move-init-of-bio_iter.patch
>>>> >> >> >> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
>>>> >> >> >> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
>>>> >> >> >> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
>>>> >> >> >> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
>>>> >> >> >> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
>>>> >> >> >> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
>>>> >> >> >> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
>>>> >> >> >> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
>>>> >> >> >> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
>>>> >> >> >> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
>>>> >> >> >> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
>>>> >> >> >> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
>>>> >> >> >> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
>>>> >> >> >> >> >> 17-libceph-check-for-invalid-mapping.patch
>>>> >> >> >> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
>>>> >> >> >> >> >> 19-rbd-BUG-on-invalid-layout.patch
>>>> >> >> >> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
>>>> >> >> >> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
>>>> >> >> >> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
>>>> >> >> >> >> >>
>>>> >> >> >> >> >> Any suggestions?
>>>> >> >> >> >> >
>>>> >> >> >> >> >
>>>> >> >> >> >> > The log shows your monitors don't have time synchronized enough among
>>>> >> >> >> >> > them to make much progress (including authenticating new connections).
>>>> >> >> >> >> > That's probably the real issue. 0.2s is pretty large clock drift.
>>>> >> >> >> >> >
>>>> >> >> >> >> >
>>>> >> >> >> >> >> One thought is that the following patch (which we could not apply) is
>>>> >> >> >> >> >> what is required:
>>>> >> >> >> >> >>
>>>> >> >> >> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
>>>> >> >> >> >> >
>>>> >> >> >> >> >
>>>> >> >> >> >> > This is certainly useful too, but I don't think it's the cause of
>>>> >> >> >> >> > the delay in this case.
>>>> >> >> >> >> >
>>>> >> >> >> >> > Josh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-20 21:53                           ` Nick Bartos
@ 2012-11-21  1:31                             ` Nick Bartos
  2012-11-21 16:50                               ` Sage Weil
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-11-21  1:31 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Josh Durgin, Mandell Degerness, ceph-devel

Since I now have a decent script which can reproduce this, I decided
to re-test with the same 3.5.7 kernel, but just not applying the
patches from the wip-3.5 branch.  With the patches, I can only go 2
builds before I run into a hang.  Without the patches, I have gone 9
consecutive builds (and still going) without seeing the hang.  So it
seems like a reasonable assumption that the problem was introduced in
one of those patches.

We started seeing the problem before applying all the 3.5 patches, so
it seems like one of these is the culprit:

1-libceph-encapsulate-out-message-data-setup.patch
2-libceph-dont-mark-footer-complete-before-it-is.patch
3-libceph-move-init-of-bio_iter.patch
4-libceph-dont-use-bio_iter-as-a-flag.patch
5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
8-libceph-protect-ceph_con_open-with-mutex.patch
9-libceph-reset-connection-retry-on-successfully-negotiation.patch
10-rbd-only-reset-capacity-when-pointing-to-head.patch
11-rbd-set-image-size-when-header-is-updated.patch
12-libceph-fix-crypto-key-null-deref-memory-leak.patch
13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
17-libceph-check-for-invalid-mapping.patch
18-ceph-propagate-layout-error-on-osd-request-creation.patch
19-rbd-BUG-on-invalid-layout.patch
20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
21-ceph-avoid-32-bit-page-index-overflow.patch
23-ceph-fix-dentry-reference-leak-in-encode_fh.patch

I'll start doing some other builds to try and narrow down the patch
introducing the problem more specifically.


On Tue, Nov 20, 2012 at 1:53 PM, Nick Bartos <nick@pistoncloud.com> wrote:
> I reproduced the problem and got several sysrq states captured.
> During this run, the monitor running on the host complained a few
> times about the clocks being off, but all messages were for under 0.55
> seconds.
>
> Here are the kernel logs.  Note that there are several traces, I
> thought multiple during the incident may help:
> https://raw.github.com/gist/4121395/a6dda7552ed8a45725ee5d632fe3ba38703f8cfc/gistfile1.txt
>
>
> On Mon, Nov 19, 2012 at 3:34 PM, Gregory Farnum <greg@inktank.com> wrote:
>> Hmm, yep — that param is actually only used for the warning; I guess
>> we forgot what it actually covers. :(
>>
>> Have your monitor clocks been off by more than 5 seconds at any point?
>>
>> On Mon, Nov 19, 2012 at 3:04 PM, Nick Bartos <nick@pistoncloud.com> wrote:
>>> Making 'mon clock drift allowed' very small (0.00001) does not
>>> reliably reproduce the hang.  I started looking at the code for 0.48.2
>>> and it looks like this is only used in Paxos::warn_on_future_time,
>>> which only handles the warning, nothing else.
>>>
>>>
>>> On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil <sage@inktank.com> wrote:
>>>> On Fri, 16 Nov 2012, Nick Bartos wrote:
>>>>> Should I be lowering the clock drift allowed, or the lease interval to
>>>>> help reproduce it?
>>>>
>>>> clock drift allowed.
>>>>
>>>>
>>>>
>>>>>
>>>>> On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil <sage@inktank.com> wrote:
>>>>> > You can safely set the clock drift allowed as high as 500ms.  The real
>>>>> > limitation is that it needs to be well under the lease interval, which is
>>>>> > currently 5 seconds by default.
>>>>> >
>>>>> > You might be able to reproduce more easily by lowering the threshold...
>>>>> >
>>>>> > sage
>>>>> >
>>>>> >
>>>>> > On Fri, 16 Nov 2012, Nick Bartos wrote:
>>>>> >
>>>>> >> How far off do the clocks need to be before there is a problem?  It
>>>>> >> would seem to be hard to ensure a very large cluster has all of its
>>>>> >> nodes synchronized within 50ms (which seems to be the default for "mon
>>>>> >> clock drift allowed").  Does the mon clock drift allowed parameter
>>>>> >> change anything other than the log messages?  Are there any other
>>>>> >> tuning options that may help, assuming that this is the issue and it's
>>>>> >> not feasible to get the clocks more than 500ms in sync between all
>>>>> >> nodes?
>>>>> >>
>>>>> >> I'm trying to get a good way of reproducing this and get a trace on
>>>>> >> the ceph processes to see what they're waiting on.  I'll let you know
>>>>> >> when I have more info.
>>>>> >>
>>>>> >>
>>>>> >> On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil <sage@inktank.com> wrote:
>>>>> >> > I just realized I was mixing up this thread with the other deadlock
>>>>> >> > thread.
>>>>> >> >
>>>>> >> > On Fri, 16 Nov 2012, Nick Bartos wrote:
>>>>> >> >> Turns out we're having the 'rbd map' hang on startup again, after we
>>>>> >> >> started using the wip-3.5 patch set.  How critical is the
>>>>> >> >> libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
>>>>> >> >> removed before which seemed to get rid of the problem (although I'm
>>>>> >> >> not completely sure if it completely got rid of it, at least seemed to
>>>>> >> >> happen much less often).
>>>>> >> >>
>>>>> >> >> It seems like we only started having this issue after we started
>>>>> >> >> patching the 3.5 ceph client (we started patching to try and get rid
>>>>> >> >> of a kernel oops, which the patches seem to have fixed).
>>>>> >> >
>>>>> >> > Right.  That patch fixes a real bug.  It also seems pretty unlikely that
>>>>> >> > this patch is related to the startup hang.  The original log showed clock
>>>>> >> > drift on the monitor that could very easily cause this sort of hang.  Can
>>>>> >> > you confirm that that isn't the case with this recent instance of the
>>>>> >> > problem?  And/or attach a log?
>>>>> >> >
>>>>> >> > Thanks-
>>>>> >> > sage
>>>>> >> >
>>>>> >> >
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@inktank.com> wrote:
>>>>> >> >> > On Thu, 15 Nov 2012, Nick Bartos wrote:
>>>>> >> >> >> Sorry I guess this e-mail got missed.  I believe those patches came
>>>>> >> >> >> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
>>>>> >> >> >> branch patches, which seem to all be fine.  We'll stick with 3.5 and
>>>>> >> >> >> this backport for now until we can figure out what's wrong with 3.6.
>>>>> >> >> >>
>>>>> >> >> >> I typically ignore the wip branches just due to the naming when I'm
>>>>> >> >> >> looking for updates.  Where should I typically look for updates that
>>>>> >> >> >> aren't in released kernels?  Also, is there anything else in the wip*
>>>>> >> >> >> branches that you think we may find particularly useful?
>>>>> >> >> >
>>>>> >> >> > You were looking in the right place.  The problem was we weren't super
>>>>> >> >> > organized with our stable patches, and changed our minds about what to
>>>>> >> >> > send upstream.  These are 'wip' in the sense that they were in preparation
>>>>> >> >> > for going upstream.  The goal is to push them to the mainline stable
>>>>> >> >> > kernels and ideally not keep them in our tree at all.
>>>>> >> >> >
>>>>> >> >> > wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
>>>>> >> >> > we're keeping it so that ubuntu can pick it up for quantal.
>>>>> >> >> >
>>>>> >> >> > I'll make sure these are more clearly marked as stable.
>>>>> >> >> >
>>>>> >> >> > sage
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >> >> >> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
>>>>> >> >> >> > On Mon, 12 Nov 2012, Nick Bartos wrote:
>>>>> >> >> >> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
>>>>> >> >> >> >> seems we no longer have this hang.
>>>>> >> >> >> >
>>>>> >> >> >> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
>>>>> >> >> >> > stable series?  I recently prepared a new one that backports *all* of the
>>>>> >> >> >> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
>>>>> >> >> >> > be curious if you see problems with that.
>>>>> >> >> >> >
>>>>> >> >> >> > So far, with these fixes in place, we have not seen any unexplained kernel
>>>>> >> >> >> > crashes in this code.
>>>>> >> >> >> >
>>>>> >> >> >> > I take it you're going back to a 3.5 kernel because you weren't able to
>>>>> >> >> >> > get rid of the sync problem with 3.6?
>>>>> >> >> >> >
>>>>> >> >> >> > sage
>>>>> >> >> >> >
>>>>> >> >> >> >
>>>>> >> >> >> >
>>>>> >> >> >> >>
>>>>> >> >> >> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>>>>> >> >> >> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
>>>>> >> >> >> >> >>
>>>>> >> >> >> >> >> We are seeing a somewhat random, but frequent hang on our systems
>>>>> >> >> >> >> >> during startup.  The hang happens at the point where an "rbd map
>>>>> >> >> >> >> >> <rbdvol>" command is run.
>>>>> >> >> >> >> >>
>>>>> >> >> >> >> >> I've attached the ceph logs from the cluster.  The map command happens
>>>>> >> >> >> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
>>>>> >> >> >> >> >> be seen in the log as 172.18.0.15:0/1143980479.
>>>>> >> >> >> >> >>
>>>>> >> >> >> >> >> It appears as if the TCP socket is opened to the OSD, but then times
>>>>> >> >> >> >> >> out 15 minutes later, the process gets data when the socket is closed
>>>>> >> >> >> >> >> on the client server and it retries.
>>>>> >> >> >> >> >>
>>>>> >> >> >> >> >> Please help.
>>>>> >> >> >> >> >>
>>>>> >> >> >> >> >> We are using ceph version 0.48.2argonaut
>>>>> >> >> >> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
>>>>> >> >> >> >> >>
>>>>> >> >> >> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
>>>>> >> >> >> >> >>
>>>>> >> >> >> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
>>>>> >> >> >> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
>>>>> >> >> >> >> >> 3-libceph-move-init-of-bio_iter.patch
>>>>> >> >> >> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
>>>>> >> >> >> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
>>>>> >> >> >> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
>>>>> >> >> >> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
>>>>> >> >> >> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
>>>>> >> >> >> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
>>>>> >> >> >> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
>>>>> >> >> >> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
>>>>> >> >> >> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
>>>>> >> >> >> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
>>>>> >> >> >> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
>>>>> >> >> >> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
>>>>> >> >> >> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
>>>>> >> >> >> >> >> 17-libceph-check-for-invalid-mapping.patch
>>>>> >> >> >> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
>>>>> >> >> >> >> >> 19-rbd-BUG-on-invalid-layout.patch
>>>>> >> >> >> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
>>>>> >> >> >> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
>>>>> >> >> >> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
>>>>> >> >> >> >> >>
>>>>> >> >> >> >> >> Any suggestions?
>>>>> >> >> >> >> >
>>>>> >> >> >> >> >
>>>>> >> >> >> >> > The log shows your monitors' clocks aren't synchronized closely enough
>>>>> >> >> >> >> > for them to make much progress (including authenticating new connections).
>>>>> >> >> >> >> > That's probably the real issue; 0.2s is pretty large clock drift.
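To make the numbers concrete: the drift in question is the spread between monitor clock samples, compared against the 'mon clock drift allowed' threshold (0.05s by default). A minimal sketch, with hypothetical monitor names and readings:

```python
def max_clock_skew(mon_times):
    """Largest pairwise clock offset, in seconds, among monitors.

    `mon_times` maps monitor name -> a clock sample taken at nominally
    the same instant (names and values here are made up for illustration).
    """
    samples = list(mon_times.values())
    return max(samples) - min(samples)

# A 0.2s spread, as seen in the log, is four times the default
# 'mon clock drift allowed' of 0.05s.
skew = max_clock_skew({"mon.a": 100.00, "mon.b": 100.20, "mon.c": 100.05})
```

Running NTP against a common source on all monitor hosts is the usual way to keep this spread down.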
>>>>> >> >> >> >> >
>>>>> >> >> >> >> >
>>>>> >> >> >> >> >> One thought is that the following patch (which we could not apply) is
>>>>> >> >> >> >> >> what is required:
>>>>> >> >> >> >> >>
>>>>> >> >> >> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
>>>>> >> >> >> >> >
>>>>> >> >> >> >> >
>>>>> >> >> >> >> > This is certainly useful too, but I don't think it's the cause of
>>>>> >> >> >> >> > the delay in this case.
>>>>> >> >> >> >> >
>>>>> >> >> >> >> > Josh
>>>>> >> >> >> >> > --
>>>>> >> >> >> >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> >> >> >> >> > the body of a message to majordomo@vger.kernel.org
>>>>> >> >> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> >> >> >> >>
>>>>> >> >> >> >>
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >> >>
>>>>> >> >>
>>>>> >>
>>>>> >>
>>>>>
>>>>>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-21  1:31                             ` Nick Bartos
@ 2012-11-21 16:50                               ` Sage Weil
  2012-11-21 17:02                                 ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Sage Weil @ 2012-11-21 16:50 UTC (permalink / raw)
  To: Nick Bartos; +Cc: Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On Tue, 20 Nov 2012, Nick Bartos wrote:
> Since I now have a decent script which can reproduce this, I decided
> to re-test with the same 3.5.7 kernel, but just not applying the
> patches from the wip-3.5 branch.  With the patches, I can only go 2
> builds before I run into a hang.  Without the patches, I have gone 9
> consecutive builds (and still going) without seeing the hang.  So it
> seems like a reasonable assumption that the problem was introduced in
> one of those patches.
> 
> We started seeing the problem before applying all the 3.5 patches, so
> it seems like one of these is the culprit:
> 
> 1-libceph-encapsulate-out-message-data-setup.patch
> 2-libceph-dont-mark-footer-complete-before-it-is.patch
> 3-libceph-move-init-of-bio_iter.patch
> 4-libceph-dont-use-bio_iter-as-a-flag.patch
> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
> 8-libceph-protect-ceph_con_open-with-mutex.patch
> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
> 11-rbd-set-image-size-when-header-is-updated.patch
> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
> 17-libceph-check-for-invalid-mapping.patch
> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
> 19-rbd-BUG-on-invalid-layout.patch
> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
> 21-ceph-avoid-32-bit-page-index-overflow.patch
> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
> 
> I'll start doing some other builds to try and narrow down the patch
> introducing the problem more specifically.

Thanks for hunting this down.  I'm very curious what the culprit is...

sage



> 
> 
> On Tue, Nov 20, 2012 at 1:53 PM, Nick Bartos <nick@pistoncloud.com> wrote:
> > I reproduced the problem and got several sysrq states captured.
> > During this run, the monitor running on the host complained a few
> > times about the clocks being off, but all messages were for under 0.55
> > seconds.
> >
> > Here are the kernel logs.  Note that there are several traces, I
> > thought multiple during the incident may help:
> > https://raw.github.com/gist/4121395/a6dda7552ed8a45725ee5d632fe3ba38703f8cfc/gistfile1.txt
> >
> >
> > On Mon, Nov 19, 2012 at 3:34 PM, Gregory Farnum <greg@inktank.com> wrote:
> >> Hmm, yep, that param is actually only used for the warning; I guess
> >> we forgot what it actually covers. :(
> >>
> >> Have your monitor clocks been off by more than 5 seconds at any point?
> >>
> >> On Mon, Nov 19, 2012 at 3:04 PM, Nick Bartos <nick@pistoncloud.com> wrote:
> >>> Making 'mon clock drift allowed' very small (0.00001) does not
> >>> reliably reproduce the hang.  I started looking at the code for 0.48.2
> >>> and it looks like this is only used in Paxos::warn_on_future_time,
> >>> which only handles the warning, nothing else.
> >>>
> >>>
> >>> On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil <sage@inktank.com> wrote:
> >>>> On Fri, 16 Nov 2012, Nick Bartos wrote:
> >>>>> Should I be lowering the clock drift allowed, or the lease interval to
> >>>>> help reproduce it?
> >>>>
> >>>> clock drift allowed.
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>> On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil <sage@inktank.com> wrote:
> >>>>> > You can safely set the clock drift allowed as high as 500ms.  The real
> >>>>> > limitation is that it needs to be well under the lease interval, which is
> >>>>> > currently 5 seconds by default.
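The two settings Sage refers to live in the monitor section of ceph.conf; a minimal sketch (argonaut-era option names, with the values from this thread rather than recommendations):

```ini
[mon]
    ; warning threshold for clock drift, in seconds; keep well under the lease
    mon clock drift allowed = 0.5
    ; paxos lease interval, in seconds (the default)
    mon lease = 5
```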
> >>>>> >
> >>>>> > You might be able to reproduce more easily by lowering the threshold...
> >>>>> >
> >>>>> > sage
> >>>>> >
> >>>>> >
> >>>>> > On Fri, 16 Nov 2012, Nick Bartos wrote:
> >>>>> >
> >>>>> >> How far off do the clocks need to be before there is a problem?  It
> >>>>> >> would seem to be hard to ensure a very large cluster has all of its
> >>>>> >> nodes synchronized within 50ms (which seems to be the default for "mon
> >>>>> >> clock drift allowed").  Does the mon clock drift allowed parameter
> >>>>> >> change anything other than the log messages?  Are there any other
> >>>>> >> tuning options that may help, assuming that this is the issue and it's
> >>>>> >> not feasible to get the clocks more than 500ms in sync between all
> >>>>> >> nodes?
> >>>>> >>
> >>>>> >> I'm trying to get a good way of reproducing this and get a trace on
> >>>>> >> the ceph processes to see what they're waiting on.  I'll let you know
> >>>>> >> when I have more info.
> >>>>> >>
> >>>>> >>
> >>>>> >> On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil <sage@inktank.com> wrote:
> >>>>> >> > I just realized I was mixing up this thread with the other deadlock
> >>>>> >> > thread.
> >>>>> >> >
> >>>>> >> > On Fri, 16 Nov 2012, Nick Bartos wrote:
> >>>>> >> >> Turns out we're having the 'rbd map' hang on startup again, after we
> >>>>> >> >> started using the wip-3.5 patch set.  How critical is the
> >>>>> >> >> libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
> >>>>> >> >> removed before, which seemed to get rid of the problem (although I'm
> >>>>> >> >> not completely sure it got rid of it entirely; it at least seemed to
> >>>>> >> >> happen much less often).
> >>>>> >> >>
> >>>>> >> >> It seems like we only started having this issue after we started
> >>>>> >> >> patching the 3.5 ceph client (we started patching to try and get rid
> >>>>> >> >> of a kernel oops, which the patches seem to have fixed).
> >>>>> >> >
> >>>>> >> > Right.  That patch fixes a real bug.  It also seems pretty unlikely that
> >>>>> >> > this patch is related to the startup hang.  The original log showed clock
> >>>>> >> > drift on the monitor that could very easily cause this sort of hang.  Can
> >>>>> >> > you confirm that that isn't the case with this recent instance of the
> >>>>> >> > problem?  And/or attach a log?
> >>>>> >> >
> >>>>> >> > Thanks-
> >>>>> >> > sage
> >>>>> >> >
> >>>>> >> >
> >>>>> >> >>
> >>>>> >> >>
> >>>>> >> >> On Thu, Nov 15, 2012 at 4:25 PM, Sage Weil <sage@inktank.com> wrote:
> >>>>> >> >> > On Thu, 15 Nov 2012, Nick Bartos wrote:
> >>>>> >> >> >> Sorry I guess this e-mail got missed.  I believe those patches came
> >>>>> >> >> >> from the ceph/linux-3.5.5-ceph branch.  I'm now using the wip-3.5
> >>>>> >> >> >> branch patches, which seem to all be fine.  We'll stick with 3.5 and
> >>>>> >> >> >> this backport for now until we can figure out what's wrong with 3.6.
> >>>>> >> >> >>
> >>>>> >> >> >> I typically ignore the wip branches just due to the naming when I'm
> >>>>> >> >> >> looking for updates.  Where should I typically look for updates that
> >>>>> >> >> >> aren't in released kernels?  Also, is there anything else in the wip*
> >>>>> >> >> >> branches that you think we may find particularly useful?
> >>>>> >> >> >
> >>>>> >> >> > You were looking in the right place.  The problem was we weren't super
> >>>>> >> >> > organized with our stable patches, and changed our minds about what to
> >>>>> >> >> > send upstream.  These are 'wip' in the sense that they were in preparation
> >>>>> >> >> > for going upstream.  The goal is to push them to the mainline stable
> >>>>> >> >> > kernels and ideally not keep them in our tree at all.
> >>>>> >> >> >
> >>>>> >> >> > wip-3.5 is an oddity because the mainline stable kernel is EOL'd, but
> >>>>> >> >> > we're keeping it so that ubuntu can pick it up for quantal.
> >>>>> >> >> >
> >>>>> >> >> > I'll make sure these are more clearly marked as stable.
> >>>>> >> >> >
> >>>>> >> >> > sage
> >>>>> >> >> >
> >>>>> >> >> >
> >>>>> >> >> >>
> >>>>> >> >> >>
> >>>>> >> >> >> On Mon, Nov 12, 2012 at 3:16 PM, Sage Weil <sage@inktank.com> wrote:
> >>>>> >> >> >> > On Mon, 12 Nov 2012, Nick Bartos wrote:
> >>>>> >> >> >> >> After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
> >>>>> >> >> >> >> seems we no longer have this hang.
> >>>>> >> >> >> >
> >>>>> >> >> >> > Hmm, that's a bit disconcerting.  Did this series come from our old 3.5
> >>>>> >> >> >> > stable series?  I recently prepared a new one that backports *all* of the
> >>>>> >> >> >> > fixes from 3.6 to 3.5 (and 3.4); see wip-3.5 in ceph-client.git.  I would
> >>>>> >> >> >> > be curious if you see problems with that.
> >>>>> >> >> >> >
> >>>>> >> >> >> > So far, with these fixes in place, we have not seen any unexplained kernel
> >>>>> >> >> >> > crashes in this code.
> >>>>> >> >> >> >
> >>>>> >> >> >> > I take it you're going back to a 3.5 kernel because you weren't able to
> >>>>> >> >> >> > get rid of the sync problem with 3.6?
> >>>>> >> >> >> >
> >>>>> >> >> >> > sage
> >>>>> >> >> >> >
> >>>>> >> >> >> >
> >>>>> >> >> >> >
> >>>>> >> >> >> >>
> >>>>> >> >> >> >> On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> >>>>> >> >> >> >> > On 11/08/2012 02:10 PM, Mandell Degerness wrote:
> >>>>> >> >> >> >> >>
> >>>>> >> >> >> >> >> We are seeing a somewhat random, but frequent hang on our systems
> >>>>> >> >> >> >> >> during startup.  The hang happens at the point where an "rbd map
> >>>>> >> >> >> >> >> <rbdvol>" command is run.
> >>>>> >> >> >> >> >>
> >>>>> >> >> >> >> >> I've attached the ceph logs from the cluster.  The map command happens
> >>>>> >> >> >> >> >> at Nov  8 18:41:09 on server 172.18.0.15.  The process which hung can
> >>>>> >> >> >> >> >> be seen in the log as 172.18.0.15:0/1143980479.
> >>>>> >> >> >> >> >>
> >>>>> >> >> >> >> >> It appears as if the TCP socket is opened to the OSD, but then times
> >>>>> >> >> >> >> >> out 15 minutes later, the process gets data when the socket is closed
> >>>>> >> >> >> >> >> on the client server and it retries.
> >>>>> >> >> >> >> >>
> >>>>> >> >> >> >> >> Please help.
> >>>>> >> >> >> >> >>
> >>>>> >> >> >> >> >> We are using ceph version 0.48.2argonaut
> >>>>> >> >> >> >> >> (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe).
> >>>>> >> >> >> >> >>
> >>>>> >> >> >> >> >> We are using a 3.5.7 kernel with the following list of patches applied:
> >>>>> >> >> >> >> >>
> >>>>> >> >> >> >> >> 1-libceph-encapsulate-out-message-data-setup.patch
> >>>>> >> >> >> >> >> 2-libceph-dont-mark-footer-complete-before-it-is.patch
> >>>>> >> >> >> >> >> 3-libceph-move-init-of-bio_iter.patch
> >>>>> >> >> >> >> >> 4-libceph-dont-use-bio_iter-as-a-flag.patch
> >>>>> >> >> >> >> >> 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
> >>>>> >> >> >> >> >> 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
> >>>>> >> >> >> >> >> 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
> >>>>> >> >> >> >> >> 8-libceph-protect-ceph_con_open-with-mutex.patch
> >>>>> >> >> >> >> >> 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
> >>>>> >> >> >> >> >> 10-rbd-only-reset-capacity-when-pointing-to-head.patch
> >>>>> >> >> >> >> >> 11-rbd-set-image-size-when-header-is-updated.patch
> >>>>> >> >> >> >> >> 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
> >>>>> >> >> >> >> >> 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
> >>>>> >> >> >> >> >> 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
> >>>>> >> >> >> >> >> 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
> >>>>> >> >> >> >> >> 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
> >>>>> >> >> >> >> >> 17-libceph-check-for-invalid-mapping.patch
> >>>>> >> >> >> >> >> 18-ceph-propagate-layout-error-on-osd-request-creation.patch
> >>>>> >> >> >> >> >> 19-rbd-BUG-on-invalid-layout.patch
> >>>>> >> >> >> >> >> 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
> >>>>> >> >> >> >> >> 21-ceph-avoid-32-bit-page-index-overflow.patch
> >>>>> >> >> >> >> >> 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
> >>>>> >> >> >> >> >>
> >>>>> >> >> >> >> >> Any suggestions?
> >>>>> >> >> >> >> >
> >>>>> >> >> >> >> >
> >>>>> >> >> >> >> > The log shows your monitors' clocks aren't synchronized closely enough
> >>>>> >> >> >> >> > for them to make much progress (including authenticating new connections).
> >>>>> >> >> >> >> > That's probably the real issue; 0.2s is pretty large clock drift.
> >>>>> >> >> >> >> >
> >>>>> >> >> >> >> >
> >>>>> >> >> >> >> >> One thought is that the following patch (which we could not apply) is
> >>>>> >> >> >> >> >> what is required:
> >>>>> >> >> >> >> >>
> >>>>> >> >> >> >> >> 22-rbd-reset-BACKOFF-if-unable-to-re-queue.patch
> >>>>> >> >> >> >> >
> >>>>> >> >> >> >> >
> >>>>> >> >> >> >> > This is certainly useful too, but I don't think it's the cause of
> >>>>> >> >> >> >> > the delay in this case.
> >>>>> >> >> >> >> >
> >>>>> >> >> >> >> > Josh
> >>>>> >> >> >> >>
> >>>>> >> >> >> >>
> >>>>> >> >> >>
> >>>>> >> >> >>
> >>>>> >> >>
> >>>>> >> >>
> >>>>> >>
> >>>>> >>
> >>>>>
> >>>>>
> 
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-21 16:50                               ` Sage Weil
@ 2012-11-21 17:02                                 ` Nick Bartos
  2012-11-21 17:34                                   ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-11-21 17:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

It's really looking like it's the
libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
 So far I have gone through 4 successful installs with no hang with
only 1-49 applied.  I'm still leaving my test run to make sure it's
not a fluke, but since previously it hangs within the first couple of
builds, it really looks like this is where the problem originated.

1-libceph_eliminate_connection_state_DEAD.patch
2-libceph_kill_bad_proto_ceph_connection_op.patch
3-libceph_rename_socket_callbacks.patch
4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
6-libceph_start_separating_connection_flags_from_state.patch
7-libceph_start_tracking_connection_socket_state.patch
8-libceph_provide_osd_number_when_creating_osd.patch
9-libceph_set_CLOSED_state_bit_in_con_init.patch
10-libceph_embed_ceph_connection_structure_in_mon_client.patch
11-libceph_drop_connection_refcounting_for_mon_client.patch
12-libceph_init_monitor_connection_when_opening.patch
13-libceph_fully_initialize_connection_in_con_init.patch
14-libceph_tweak_ceph_alloc_msg.patch
15-libceph_have_messages_point_to_their_connection.patch
16-libceph_have_messages_take_a_connection_reference.patch
17-libceph_make_ceph_con_revoke_a_msg_operation.patch
18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
19-libceph_fix_overflow_in___decode_pool_names.patch
20-libceph_fix_overflow_in_osdmap_decode.patch
21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
22-libceph_transition_socket_state_prior_to_actual_connect.patch
23-libceph_fix_NULL_dereference_in_reset_connection.patch
24-libceph_use_con_get_put_methods.patch
25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
26-libceph_encapsulate_out_message_data_setup.patch
27-libceph_encapsulate_advancing_msg_page.patch
28-libceph_don_t_mark_footer_complete_before_it_is.patch
29-libceph_move_init_bio__functions_up.patch
30-libceph_move_init_of_bio_iter.patch
31-libceph_don_t_use_bio_iter_as_a_flag.patch
32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
33-libceph_don_t_change_socket_state_on_sock_event.patch
34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
35-libceph_don_t_touch_con_state_in_con_close_socket.patch
36-libceph_clear_CONNECTING_in_ceph_con_close.patch
37-libceph_clear_NEGOTIATING_when_done.patch
38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
39-libceph_separate_banner_and_connect_writes.patch
40-libceph_distinguish_two_phases_of_connect_sequence.patch
41-libceph_small_changes_to_messenger.c.patch
42-libceph_add_some_fine_ASCII_art.patch
43-libceph_set_peer_name_on_con_open_not_init.patch
44-libceph_initialize_mon_client_con_only_once.patch
45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
46-libceph_initialize_msgpool_message_types.patch
47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
48-libceph_report_socket_read_write_error_message.patch
49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
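Narrowing the series down by testing prefixes one at a time is a linear scan; a bisection over prefixes of the patch list would cut roughly 50 candidate builds down to about six. A sketch (hypothetical helper, not the actual build tooling; assumes a single culprit patch and a reliable reproducer):

```python
def first_bad_patch(patches, hangs):
    """Return the first patch whose inclusion reproduces the hang.

    `hangs(n)` stands in for building a kernel with patches[:n] applied
    and checking whether the 'rbd map' hang reproduces. Assumed
    monotonic: once the culprit is included, every later prefix hangs.
    """
    lo, hi = 0, len(patches)          # patches[:lo] is good, patches[:hi] is bad
    assert hangs(hi) and not hangs(lo)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hangs(mid):
            hi = mid                  # culprit is within patches[:mid]
        else:
            lo = mid                  # patches[:mid] is clean
    return patches[hi - 1]
```

With a 50-patch series and patch 50 as the culprit, this converges in six test builds instead of up to fifty.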


On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
> Thanks for hunting this down.  I'm very curious what the culprit is...
>
> sage

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-21 17:02                                 ` Nick Bartos
@ 2012-11-21 17:34                                   ` Nick Bartos
  2012-11-21 21:41                                     ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-11-21 17:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

With 8 successful installs already done, I'm reasonably confident that
it's patch #50.  I'm making another build which applies all patches
from the 3.5 backport branch, excluding that specific one.  I'll let
you know if that turns up any unexpected failures.

What will the potential fall out be for removing that specific patch?


On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
> It's really looking like it's the
> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>  So far I have gone through 4 successful installs with no hang with
> only 1-49 applied.  I'm still leaving my test run to make sure it's
> not a fluke, but since previously it hangs within the first couple of
> builds, it really looks like this is where the problem originated.
>
> 1-libceph_eliminate_connection_state_DEAD.patch
> 2-libceph_kill_bad_proto_ceph_connection_op.patch
> 3-libceph_rename_socket_callbacks.patch
> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
> 6-libceph_start_separating_connection_flags_from_state.patch
> 7-libceph_start_tracking_connection_socket_state.patch
> 8-libceph_provide_osd_number_when_creating_osd.patch
> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
> 11-libceph_drop_connection_refcounting_for_mon_client.patch
> 12-libceph_init_monitor_connection_when_opening.patch
> 13-libceph_fully_initialize_connection_in_con_init.patch
> 14-libceph_tweak_ceph_alloc_msg.patch
> 15-libceph_have_messages_point_to_their_connection.patch
> 16-libceph_have_messages_take_a_connection_reference.patch
> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
> 19-libceph_fix_overflow_in___decode_pool_names.patch
> 20-libceph_fix_overflow_in_osdmap_decode.patch
> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
> 24-libceph_use_con_get_put_methods.patch
> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
> 26-libceph_encapsulate_out_message_data_setup.patch
> 27-libceph_encapsulate_advancing_msg_page.patch
> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
> 29-libceph_move_init_bio__functions_up.patch
> 30-libceph_move_init_of_bio_iter.patch
> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
> 33-libceph_don_t_change_socket_state_on_sock_event.patch
> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
> 37-libceph_clear_NEGOTIATING_when_done.patch
> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
> 39-libceph_separate_banner_and_connect_writes.patch
> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
> 41-libceph_small_changes_to_messenger.c.patch
> 42-libceph_add_some_fine_ASCII_art.patch
> 43-libceph_set_peer_name_on_con_open_not_init.patch
> 44-libceph_initialize_mon_client_con_only_once.patch
> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
> 46-libceph_initialize_msgpool_message_types.patch
> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
> 48-libceph_report_socket_read_write_error_message.patch
> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>
>
> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>
>> sage

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-21 17:34                                   ` Nick Bartos
@ 2012-11-21 21:41                                     ` Nick Bartos
  2012-11-22  4:47                                       ` Sage Weil
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-11-21 21:41 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

FYI the build which included all 3.5 backports except patch #50 is
still going strong after 21 builds.

On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
> With 8 successful installs already done, I'm reasonably confident that
> it's patch #50.  I'm making another build which applies all patches
> from the 3.5 backport branch, excluding that specific one.  I'll let
> you know if that turns up any unexpected failures.
>
> What will the potential fallout be from removing that specific patch?
>
>
> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>> It's really looking like it's the
>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>  So far I have gone through 4 successful installs with no hang with
>> only 1-49 applied.  I'm still leaving my test run to make sure it's
>> not a fluke, but since previously it hangs within the first couple of
>> builds, it really looks like this is where the problem originated.
>>
>> 1-libceph_eliminate_connection_state_DEAD.patch
>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>> 3-libceph_rename_socket_callbacks.patch
>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>> 6-libceph_start_separating_connection_flags_from_state.patch
>> 7-libceph_start_tracking_connection_socket_state.patch
>> 8-libceph_provide_osd_number_when_creating_osd.patch
>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>> 12-libceph_init_monitor_connection_when_opening.patch
>> 13-libceph_fully_initialize_connection_in_con_init.patch
>> 14-libceph_tweak_ceph_alloc_msg.patch
>> 15-libceph_have_messages_point_to_their_connection.patch
>> 16-libceph_have_messages_take_a_connection_reference.patch
>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>> 24-libceph_use_con_get_put_methods.patch
>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>> 26-libceph_encapsulate_out_message_data_setup.patch
>> 27-libceph_encapsulate_advancing_msg_page.patch
>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>> 29-libceph_move_init_bio__functions_up.patch
>> 30-libceph_move_init_of_bio_iter.patch
>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>> 37-libceph_clear_NEGOTIATING_when_done.patch
>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>> 39-libceph_separate_banner_and_connect_writes.patch
>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>> 41-libceph_small_changes_to_messenger.c.patch
>> 42-libceph_add_some_fine_ASCII_art.patch
>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>> 44-libceph_initialize_mon_client_con_only_once.patch
>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>> 46-libceph_initialize_msgpool_message_types.patch
>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>> 48-libceph_report_socket_read_write_error_message.patch
>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>
>>
>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
>>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>>
>>> sage

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-21 21:41                                     ` Nick Bartos
@ 2012-11-22  4:47                                       ` Sage Weil
  2012-11-22  5:49                                         ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Sage Weil @ 2012-11-22  4:47 UTC (permalink / raw)
  To: Nick Bartos
  Cc: elder, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On Wed, 21 Nov 2012, Nick Bartos wrote:
> FYI the build which included all 3.5 backports except patch #50 is
> still going strong after 21 builds.

Okay, that one at least makes some sense.  I've opened

	http://tracker.newdream.net/issues/3519

How easy is this to reproduce?  If it is something you can trigger with 
debugging enabled ('echo module libceph +p > 
/sys/kernel/debug/dynamic_debug/control') that would help tremendously.
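
A minimal sketch of toggling that debug output (the control path is the
standard debugfs location; it is made overridable here only so the commands
can be exercised against a scratch file — root and a mounted debugfs are
assumed for real use):

```shell
# Toggle libceph dynamic debug messages.  CTL defaults to the standard
# debugfs control file; override it only for dry runs.
CTL="${CTL:-/sys/kernel/debug/dynamic_debug/control}"

libceph_debug() {  # usage: libceph_debug +p   (or -p to disable)
    echo "module libceph $1" > "$CTL"
}

# Typical sequence: enable, reproduce the hang, save the log, disable.
#   libceph_debug +p
#   rbd map <rbdvol>                 # reproduce the hang
#   dmesg > /tmp/libceph-debug.txt
#   libceph_debug -p
```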

I'm guessing that during this startup time the OSDs are still in the 
process of starting?

Alex, I bet that a test that does a lot of map/unmap stuff in a loop while 
thrashing OSDs could hit this.
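
A rough sketch of what that loop could look like (the image name and device
are placeholders, the map/unmap commands are made overridable so the loop
logic can be dry-run, and the OSD thrashing would happen in a separate
process):

```shell
# Hypothetical stress test: map and unmap an rbd image repeatedly
# while OSDs are being restarted elsewhere.  MAP_CMD and UNMAP_CMD
# default to plausible rbd invocations; the image name and device
# node are placeholders, not taken from this thread.
MAP_CMD="${MAP_CMD:-rbd map testimg}"
UNMAP_CMD="${UNMAP_CMD:-rbd unmap /dev/rbd0}"

map_unmap_loop() {  # usage: map_unmap_loop <iterations>
    i=0
    while [ "$i" -lt "$1" ]; do
        $MAP_CMD   || { echo "map failed/hung on iteration $i" >&2; return 1; }
        $UNMAP_CMD || { echo "unmap failed on iteration $i" >&2; return 1; }
        i=$((i + 1))
    done
}
```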

Thanks!
sage


> 
> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
> > With 8 successful installs already done, I'm reasonably confident that
> > it's patch #50.  I'm making another build which applies all patches
> > from the 3.5 backport branch, excluding that specific one.  I'll let
> > you know if that turns up any unexpected failures.
> >
> > What will the potential fall out be for removing that specific patch?
> >
> >
> > On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
> >> It's really looking like it's the
> >> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
> >> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
> >>  So far I have gone through 4 successful installs with no hang with
> >> only 1-49 applied.  I'm still leaving my test run to make sure it's
> >> not a fluke, but since previously it hangs within the first couple of
> >> builds, it really looks like this is where the problem originated.
> >>
> >> [patches 1-50 snipped; identical to the list quoted earlier in the thread]
> >>
> >>
> >> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
> >>> Thanks for hunting this down.  I'm very curious what the culprit is...
> >>>
> >>> sage
> 
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-22  4:47                                       ` Sage Weil
@ 2012-11-22  5:49                                         ` Nick Bartos
  2012-11-22 18:04                                           ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-11-22  5:49 UTC (permalink / raw)
  To: Sage Weil
  Cc: elder, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

It's very easy to reproduce now with my automated install script: the
most I've seen it succeed with that patch is 2 installs in a row before
hanging on the 3rd, and it hangs on most builds.  So it shouldn't take
much to get it to do it again.  I'll try to get to that tomorrow,
when I'm a bit more rested and my brain is working better.

Yes, during this the OSDs are probably all syncing up.  All the osd and
mon daemons have started by the time the rbd commands are run, though.

On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
> On Wed, 21 Nov 2012, Nick Bartos wrote:
>> FYI the build which included all 3.5 backports except patch #50 is
>> still going strong after 21 builds.
>
> Okay, that one at least makes some sense.  I've opened
>
>         http://tracker.newdream.net/issues/3519
>
> How easy is this to reproduce?  If it is something you can trigger with
> debugging enabled ('echo module libceph +p >
> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>
> I'm guessing that during this startup time the OSDs are still in the
> process of starting?
>
> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
> thrashing OSDs could hit this.
>
> Thanks!
> sage
>
>
>>
>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>> > With 8 successful installs already done, I'm reasonably confident that
>> > it's patch #50.  I'm making another build which applies all patches
>> > from the 3.5 backport branch, excluding that specific one.  I'll let
>> > you know if that turns up any unexpected failures.
>> >
>> > What will the potential fall out be for removing that specific patch?
>> >
>> >
>> > On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>> >> It's really looking like it's the
>> >> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>> >> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>> >>  So far I have gone through 4 successful installs with no hang with
>> >> only 1-49 applied.  I'm still leaving my test run to make sure it's
>> >> not a fluke, but since previously it hangs within the first couple of
>> >> builds, it really looks like this is where the problem originated.
>> >>
>> >> [patches 1-50 snipped; identical to the list quoted earlier in the thread]
>> >>
>> >>
>> >> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
>> >>> Thanks for hunting this down.  I'm very curious what the culprit is...
>> >>>
>> >>> sage
>>
>>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-22  5:49                                         ` Nick Bartos
@ 2012-11-22 18:04                                           ` Nick Bartos
  2012-11-29 20:37                                             ` Alex Elder
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-11-22 18:04 UTC (permalink / raw)
  To: Sage Weil
  Cc: elder, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

Here are the ceph log messages (including the libceph kernel debug
stuff you asked for) from a node boot with the rbd command hung for a
couple of minutes:

https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt

On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com> wrote:
> It's very easy to reproduce now with my automated install script, the
> most I've seen it succeed with that patch is 2 in a row, and hanging
> on the 3rd, although it hangs on most builds.  So it shouldn't take
> much to get it to do it again.  I'll try and get to that tomorrow,
> when I'm a bit more rested and my brain is working better.
>
> Yes, during this the OSDs are probably all syncing up.  All the osd and
> mon daemons have started by the time the rbd commands are run, though.
>
> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>> FYI the build which included all 3.5 backports except patch #50 is
>>> still going strong after 21 builds.
>>
>> Okay, that one at least makes some sense.  I've opened
>>
>>         http://tracker.newdream.net/issues/3519
>>
>> How easy is this to reproduce?  If it is something you can trigger with
>> debugging enabled ('echo module libceph +p >
>> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>>
>> I'm guessing that during this startup time the OSDs are still in the
>> process of starting?
>>
>> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
>> thrashing OSDs could hit this.
>>
>> Thanks!
>> sage
>>
>>
>>>
>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>> > With 8 successful installs already done, I'm reasonably confident that
>>> > it's patch #50.  I'm making another build which applies all patches
>>> > from the 3.5 backport branch, excluding that specific one.  I'll let
>>> > you know if that turns up any unexpected failures.
>>> >
>>> > What will the potential fall out be for removing that specific patch?
>>> >
>>> >
>>> > On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>> >> It's really looking like it's the
>>> >> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>> >> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>> >>  So far I have gone through 4 successful installs with no hang with
>>> >> only 1-49 applied.  I'm still leaving my test run to make sure it's
>>> >> not a fluke, but since previously it hangs within the first couple of
>>> >> builds, it really looks like this is where the problem originated.
>>> >>
>>> >> [patches 1-50 snipped; identical to the list quoted earlier in the thread]
>>> >>
>>> >>
>>> >> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
>>> >>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>> >>>
>>> >>> sage
>>>
>>>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-22 18:04                                           ` Nick Bartos
@ 2012-11-29 20:37                                             ` Alex Elder
  2012-11-30 18:49                                               ` Nick Bartos
  2012-11-30 23:22                                               ` Alex Elder
  0 siblings, 2 replies; 56+ messages in thread
From: Alex Elder @ 2012-11-29 20:37 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 11/22/2012 12:04 PM, Nick Bartos wrote:
> Here are the ceph log messages (including the libceph kernel debug
> stuff you asked for) from a node boot with the rbd command hung for a
> couple of minutes:

Nick, I have put together a branch that includes two fixes
that might be helpful.  I don't expect these fixes will
necessarily *fix* what you're seeing, but one of them
pulls a big hunk of processing out of the picture and
might help eliminate some potential causes.  I had to
pull in several other patches as prerequisites in order
to get those fixes to apply cleanly.

Would you be able to give it a try, and let us know what
results you get?  The branch contains:
- Linux 3.5.5
- Plus the first 49 patches you listed
- Plus four patches, which are prerequisites...
    libceph: define ceph_extract_encoded_string()
    rbd: define some new format constants
    rbd: define rbd_dev_image_id()
    rbd: kill create_snap sysfs entry
- ...for these two bug fixes:
    libceph: remove 'osdtimeout' option
    ceph: don't reference req after put

The branch is available in the ceph-client git repository
under the name "wip-nick" and has commit id dd9323aa.
    https://github.com/ceph/ceph-client/tree/wip-nick

> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt

This full debug output is very helpful.  Please supply
that again as well.

Thanks.

					-Alex

> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com> wrote:
>> It's very easy to reproduce now with my automated install script, the
>> most I've seen it succeed with that patch is 2 in a row, and hanging
>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>> much to get it to do it again.  I'll try and get to that tomorrow,
>> when I'm a bit more rested and my brain is working better.
>>
>> Yes, during this the OSDs are probably all syncing up.  All the osd and
>> mon daemons have started by the time the rbd commands are run, though.
>>
>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>> still going strong after 21 builds.
>>>
>>> Okay, that one at least makes some sense.  I've opened
>>>
>>>         http://tracker.newdream.net/issues/3519
>>>
>>> How easy is this to reproduce?  If it is something you can trigger with
>>> debugging enabled ('echo module libceph +p >
>>> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>>>
>>> I'm guessing that during this startup time the OSDs are still in the
>>> process of starting?
>>>
>>> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
>>> thrashing OSDs could hit this.
>>>
>>> Thanks!
>>> sage
>>>
>>>
>>>>
>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>> With 8 successful installs already done, I'm reasonably confident that
>>>>> it's patch #50.  I'm making another build which applies all patches
>>>>> from the 3.5 backport branch, excluding that specific one.  I'll let
>>>>> you know if that turns up any unexpected failures.
>>>>>
>>>>> What will the potential fall out be for removing that specific patch?
>>>>>
>>>>>
>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>> It's really looking like it's the
>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>>>>>  So far I have gone through 4 successful installs with no hang with
>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure it's
>>>>>> not a fluke, but since previously it hangs within the first couple of
>>>>>> builds, it really looks like this is where the problem originated.
>>>>>>
>>>>>> [patches 1-50 snipped; identical to the list quoted earlier in the thread]
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
>>>>>>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>>>>>>
>>>>>>> sage
>>>>
>>>>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-29 20:37                                             ` Alex Elder
@ 2012-11-30 18:49                                               ` Nick Bartos
  2012-11-30 19:10                                                 ` Alex Elder
  2012-11-30 23:22                                               ` Alex Elder
  1 sibling, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-11-30 18:49 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

My initial tests using a 3.5.7 kernel with the 55 patches from
wip-nick are going well.  So far I've gone through 8 installs without
an incident; I'll let it run for a bit longer to see if the hang crops
up again.

Can I get a branch with these patches integrated into all of the
backported patches to 3.5.x?  I'd like to get this into our main
testing branch, which is currently running 3.5.7 with the patches from
wip-3.5 excluding the
libceph_resubmit_linger_ops_when_pg_mapping_changes patch.

Note that we had a case of an rbd map hang with our main testing
branch, but I don't have a script that can reproduce that yet.  It
happened after the cluster was all up and working, while we were doing
a rolling reboot (cycling through each node).


On Thu, Nov 29, 2012 at 12:37 PM, Alex Elder <elder@inktank.com> wrote:
> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>> Here are the ceph log messages (including the libceph kernel debug
>> stuff you asked for) from a node boot with the rbd command hung for a
>> couple of minutes:
>
> Nick, I have put together a branch that includes two fixes
> that might be helpful.  I don't expect these fixes will
> necessarily *fix* what you're seeing, but one of them
> pulls a big hunk of processing out of the picture and
> might help eliminate some potential causes.  I had to
> pull in several other patches as prerequisites in order
> to get those fixes to apply cleanly.
>
> Would you be able to give it a try, and let us know what
> results you get?  The branch contains:
> - Linux 3.5.5
> - Plus the first 49 patches you listed
> - Plus four patches, which are prerequisites...
>     libceph: define ceph_extract_encoded_string()
>     rbd: define some new format constants
>     rbd: define rbd_dev_image_id()
>     rbd: kill create_snap sysfs entry
> - ...for these two bug fixes:
>     libceph: remove 'osdtimeout' option
>     ceph: don't reference req after put
>
> The branch is available in the ceph-client git repository
> under the name "wip-nick" and has commit id dd9323aa.
>     https://github.com/ceph/ceph-client/tree/wip-nick
>
>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>
> This full debug output is very helpful.  Please supply
> that again as well.
>
> Thanks.
>
>                                         -Alex
>
>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com> wrote:
>>> It's very easy to reproduce now with my automated install script, the
>>> most I've seen it succeed with that patch is 2 in a row, and hanging
>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>> when I'm a bit more rested and my brain is working better.
>>>
>>> Yes, during this the OSDs are probably all syncing up.  All the osd and
>>> mon daemons have started by the time the rbd commands are run, though.
>>>
>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>> still going strong after 21 builds.
>>>>
>>>> Okay, that one at least makes some sense.  I've opened
>>>>
>>>>         http://tracker.newdream.net/issues/3519
>>>>
>>>> How easy is this to reproduce?  If it is something you can trigger with
>>>> debugging enabled ('echo module libceph +p >
>>>> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>>>>
>>>> I'm guessing that during this startup time the OSDs are still in the
>>>> process of starting?
>>>>
>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
>>>> thrashing OSDs could hit this.
>>>>
>>>> Thanks!
>>>> sage
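
[Editorial note: Sage's two suggestions above (enabling libceph dynamic
debug, then exercising map/unmap while OSDs are being thrashed) could be
combined into a reproducer along these lines.  It prints the commands by
default; the pool/image names and the loop count are assumptions, not
from the thread.]

```shell
# Hypothetical reproducer sketch: turn on libceph kernel debug, then map
# and unmap an rbd image repeatedly.  Needs root and debugfs when run
# for real; here run() only prints each command.
run() { echo "+ $*"; }          # replace with: run() { "$@"; }  to execute

run sh -c "echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control"
i=1
while [ "$i" -le 20 ]; do       # 20 iterations is an arbitrary choice
    run rbd map rbd/testimg     # pool/image names are placeholders
    run rbd unmap /dev/rbd0
    i=$((i + 1))
done
run dmesg                       # collect the libceph debug output
```

Restarting OSD daemons in a second loop while this runs would cover the
"thrashing OSDs" half of the suggestion.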
>>>>
>>>>
>>>>>
>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>> With 8 successful installs already done, I'm reasonably confident that
>>>>>> it's patch #50.  I'm making another build which applies all patches
>>>>>> from the 3.5 backport branch, excluding that specific one.  I'll let
>>>>>> you know if that turns up any unexpected failures.
>>>>>>
>>>>>> What will the potential fallout be for removing that specific patch?
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>>> It's really looking like it's the
>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>>>>>>  So far I have gone through 4 successful installs with no hang with
>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure it's
>>>>>>> not a fluke, but since previously it hangs within the first couple of
>>>>>>> builds, it really looks like this is where the problem originated.
>>>>>>>
>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>
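
[Editorial note: the bisection method described above (apply the series
in numeric order, excluding the suspect patch) can be sketched as below.
The directory layout is an assumption, and the real `patch` invocation
is shown only in a comment; the function just prints what it would do.]

```shell
# Sketch: walk a numbered patch series in numeric order, skipping one
# suspect patch.  Prints "apply"/"skip" per file rather than patching.
SKIP=50

apply_series() {
    dir=$1
    # Sort on the leading number before the first "-" in each filename.
    for f in $(ls "$dir" | sort -t- -k1,1n); do
        n=${f%%-*}
        if [ "$n" = "$SKIP" ]; then
            echo "skip $f"
        else
            echo "apply $f"     # real run: patch -p1 -d linux-3.5.7 < "$dir/$f"
        fi
    done
}

# Demonstration with stub patch files in a temporary directory.
dir=$(mktemp -d)
touch "$dir/2-b.patch" "$dir/10-c.patch" "$dir/50-x.patch"
apply_series "$dir"
```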
>>>>>>>
>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>>>>>>>
>>>>>>>> sage
>>>>>
>>>>>
>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-30 18:49                                               ` Nick Bartos
@ 2012-11-30 19:10                                                 ` Alex Elder
  2012-11-30 19:31                                                   ` Sage Weil
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-11-30 19:10 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 11/30/2012 12:49 PM, Nick Bartos wrote:
> My initial tests using a 3.5.7 kernel with the 55 patches from
> wip-nick are going well.  So far I've gone through 8 installs without
> incident; I'll let it run a bit longer to see if it crops up again.

This is great news!  Now I wonder which of the two fixes took
care of the problem...

> Can I get a branch with these patches integrated into all of the
> backported patches to 3.5.x?  I'd like to get this into our main
> testing branch, which is currently running 3.5.7 with the patches from
> wip-3.5 excluding the
> libceph_resubmit_linger_ops_when_pg_mapping_changes patch.

I will put together a new branch that includes the remainder
of those patches for you shortly.

> Note that we had a case of an rbd map hang with our main testing
> branch, but I don't have a script that can reproduce that yet.  It was
> after the cluster was all up and working, and we were doing a rolling
> reboot (cycling through each node).

If you are able to reproduce this please let us know.

					-Alex

> 
> 
> On Thu, Nov 29, 2012 at 12:37 PM, Alex Elder <elder@inktank.com> wrote:
>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>> Here are the ceph log messages (including the libceph kernel debug
>>> stuff you asked for) from a node boot with the rbd command hung for a
>>> couple of minutes:
>>
>> Nick, I have put together a branch that includes two fixes
>> that might be helpful.  I don't expect these fixes will
>> necessarily *fix* what you're seeing, but one of them
>> pulls a big hunk of processing out of the picture and
>> might help eliminate some potential causes.  I had to
>> pull in several other patches as prerequisites in order
>> to get those fixes to apply cleanly.
>>
>> Would you be able to give it a try, and let us know what
>> results you get?  The branch contains:
>> - Linux 3.5.5
>> - Plus the first 49 patches you listed
>> - Plus four patches, which are prerequisites...
>>     libceph: define ceph_extract_encoded_string()
>>     rbd: define some new format constants
>>     rbd: define rbd_dev_image_id()
>>     rbd: kill create_snap sysfs entry
>> - ...for these two bug fixes:
>>     libceph: remove 'osdtimeout' option
>>     ceph: don't reference req after put
>>
>> The branch is available in the ceph-client git repository
>> under the name "wip-nick" and has commit id dd9323aa.
>>     https://github.com/ceph/ceph-client/tree/wip-nick
>>
>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>
>> This full debug output is very helpful.  Please supply
>> that again as well.
>>
>> Thanks.
>>
>>                                         -Alex
>>
>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>> It's very easy to reproduce now with my automated install script, the
>>>> most I've seen it succeed with that patch is 2 in a row, and hanging
>>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>>> when I'm a bit more rested and my brain is working better.
>>>>
>>>> Yes, during this the OSDs are probably all syncing up.  All the osd and
>>>> mon daemons have started by the time the rbd commands are run, though.
>>>>
>>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>>> still going strong after 21 builds.
>>>>>
>>>>> Okay, that one at least makes some sense.  I've opened
>>>>>
>>>>>         http://tracker.newdream.net/issues/3519
>>>>>
>>>>> How easy is this to reproduce?  If it is something you can trigger with
>>>>> debugging enabled ('echo module libceph +p >
>>>>> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>>>>>
>>>>> I'm guessing that during this startup time the OSDs are still in the
>>>>> process of starting?
>>>>>
>>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
>>>>> thrashing OSDs could hit this.
>>>>>
>>>>> Thanks!
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>>> With 8 successful installs already done, I'm reasonably confident that
>>>>>>> it's patch #50.  I'm making another build which applies all patches
>>>>>>> from the 3.5 backport branch, excluding that specific one.  I'll let
>>>>>>> you know if that turns up any unexpected failures.
>>>>>>>
>>>>>>> What will the potential fallout be for removing that specific patch?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>>>> It's really looking like it's the
>>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>>>>>>>  So far I have gone through 4 successful installs with no hang with
>>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure it's
>>>>>>>> not a fluke, but since previously it hangs within the first couple of
>>>>>>>> builds, it really looks like this is where the problem originated.
>>>>>>>>
>>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>>>>>>>>
>>>>>>>>> sage
>>>>>>
>>>>>>
>>



* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-30 19:10                                                 ` Alex Elder
@ 2012-11-30 19:31                                                   ` Sage Weil
  0 siblings, 0 replies; 56+ messages in thread
From: Sage Weil @ 2012-11-30 19:31 UTC (permalink / raw)
  To: Alex Elder
  Cc: Nick Bartos, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On Fri, 30 Nov 2012, Alex Elder wrote:
> On 11/30/2012 12:49 PM, Nick Bartos wrote:
> > My initial tests using a 3.5.7 kernel with the 55 patches from
> > wip-nick are going well.  So far I've gone through 8 installs without
> > incident; I'll let it run a bit longer to see if it crops up again.
> 
> This is great news!  Now I wonder which of the two fixes took
> care of the problem...
> 
> > Can I get a branch with these patches integrated into all of the
> > backported patches to 3.5.x?  I'd like to get this into our main
> > testing branch, which is currently running 3.5.7 with the patches from
> > wip-3.5 excluding the
> > libceph_resubmit_linger_ops_when_pg_mapping_changes patch.
> 
> I will put together a new branch that includes the remainder
> of those patches for you shortly.
> 
> > Note that we had a case of an rbd map hang with our main testing
> > branch, but I don't have a script that can reproduce that yet.  It was
> > after the cluster was all up and working, and we were doing a rolling
> > reboot (cycling through each node).
> 
> If you are able to reproduce this please let us know.

It sounds to me like it might be the same problem.  If we're lucky, those 
2 patches will resolve this as well.

(says the optimist!)
sage


> 
> 					-Alex
> 
> > 
> > 
> > On Thu, Nov 29, 2012 at 12:37 PM, Alex Elder <elder@inktank.com> wrote:
> >> On 11/22/2012 12:04 PM, Nick Bartos wrote:
> >>> Here are the ceph log messages (including the libceph kernel debug
> >>> stuff you asked for) from a node boot with the rbd command hung for a
> >>> couple of minutes:
> >>
> >> Nick, I have put together a branch that includes two fixes
> >> that might be helpful.  I don't expect these fixes will
> >> necessarily *fix* what you're seeing, but one of them
> >> pulls a big hunk of processing out of the picture and
> >> might help eliminate some potential causes.  I had to
> >> pull in several other patches as prerequisites in order
> >> to get those fixes to apply cleanly.
> >>
> >> Would you be able to give it a try, and let us know what
> >> results you get?  The branch contains:
> >> - Linux 3.5.5
> >> - Plus the first 49 patches you listed
> >> - Plus four patches, which are prerequisites...
> >>     libceph: define ceph_extract_encoded_string()
> >>     rbd: define some new format constants
> >>     rbd: define rbd_dev_image_id()
> >>     rbd: kill create_snap sysfs entry
> >> - ...for these two bug fixes:
> >>     libceph: remove 'osdtimeout' option
> >>     ceph: don't reference req after put
> >>
> >> The branch is available in the ceph-client git repository
> >> under the name "wip-nick" and has commit id dd9323aa.
> >>     https://github.com/ceph/ceph-client/tree/wip-nick
> >>
> >>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
> >>
> >> This full debug output is very helpful.  Please supply
> >> that again as well.
> >>
> >> Thanks.
> >>
> >>                                         -Alex
> >>
> >>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com> wrote:
> >>>> It's very easy to reproduce now with my automated install script, the
> >>>> most I've seen it succeed with that patch is 2 in a row, and hanging
> >>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
> >>>> much to get it to do it again.  I'll try and get to that tomorrow,
> >>>> when I'm a bit more rested and my brain is working better.
> >>>>
> >>>> Yes, during this the OSDs are probably all syncing up.  All the osd and
> >>>> mon daemons have started by the time the rbd commands are run, though.
> >>>>
> >>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
> >>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
> >>>>>> FYI the build which included all 3.5 backports except patch #50 is
> >>>>>> still going strong after 21 builds.
> >>>>>
> >>>>> Okay, that one at least makes some sense.  I've opened
> >>>>>
> >>>>>         http://tracker.newdream.net/issues/3519
> >>>>>
> >>>>> How easy is this to reproduce?  If it is something you can trigger with
> >>>>> debugging enabled ('echo module libceph +p >
> >>>>> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
> >>>>>
> >>>>> I'm guessing that during this startup time the OSDs are still in the
> >>>>> process of starting?
> >>>>>
> >>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
> >>>>> thrashing OSDs could hit this.
> >>>>>
> >>>>> Thanks!
> >>>>> sage
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
> >>>>>>> With 8 successful installs already done, I'm reasonably confident that
> >>>>>>> it's patch #50.  I'm making another build which applies all patches
> >>>>>>> from the 3.5 backport branch, excluding that specific one.  I'll let
> >>>>>>> you know if that turns up any unexpected failures.
> >>>>>>>
> >>>>>>> What will the potential fallout be for removing that specific patch?
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
> >>>>>>>> It's really looking like it's the
> >>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
> >>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
> >>>>>>>>  So far I have gone through 4 successful installs with no hang with
> >>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure it's
> >>>>>>>> not a fluke, but since previously it hangs within the first couple of
> >>>>>>>> builds, it really looks like this is where the problem originated.
> >>>>>>>>
> >>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
> >>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
> >>>>>>>> 3-libceph_rename_socket_callbacks.patch
> >>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
> >>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
> >>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
> >>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
> >>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
> >>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
> >>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
> >>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
> >>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
> >>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
> >>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
> >>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
> >>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
> >>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
> >>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
> >>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
> >>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
> >>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
> >>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
> >>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
> >>>>>>>> 24-libceph_use_con_get_put_methods.patch
> >>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
> >>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
> >>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
> >>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
> >>>>>>>> 29-libceph_move_init_bio__functions_up.patch
> >>>>>>>> 30-libceph_move_init_of_bio_iter.patch
> >>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
> >>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
> >>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
> >>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
> >>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
> >>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
> >>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
> >>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
> >>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
> >>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
> >>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
> >>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
> >>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
> >>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
> >>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
> >>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
> >>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
> >>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
> >>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
> >>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
> >>>>>>>>> Thanks for hunting this down.  I'm very curious what the culprit is...
> >>>>>>>>>
> >>>>>>>>> sage
> >>>>>>
> >>>>>>
> >>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-29 20:37                                             ` Alex Elder
  2012-11-30 18:49                                               ` Nick Bartos
@ 2012-11-30 23:22                                               ` Alex Elder
  2012-12-02  5:34                                                 ` Nick Bartos
  1 sibling, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-11-30 23:22 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 11/29/2012 02:37 PM, Alex Elder wrote:
> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>> Here are the ceph log messages (including the libceph kernel debug
>> stuff you asked for) from a node boot with the rbd command hung for a
>> couple of minutes:

I'm sorry, but I did something stupid...

Yes, the branch I gave you includes these fixes.  However
it does *not* include the commit that was giving you trouble
to begin with.

So...

I have updated that same branch (wip-nick) to contain:
- Linux 3.5.5
- Plus the first *50* (not 49) patches you listed
- Plus the ones I added before.

The new commit id for that branch begins with be3198d6.
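
[Editorial note: fetching and verifying the updated branch might look
like the sketch below.  The repository URL, branch name, and commit id
prefix come from this thread; everything else (the dry-run wrapper, the
clone directory) is an assumption.]

```shell
# Sketch: check out the updated wip-nick branch and confirm its head
# commit.  cmd() prints each command instead of executing it; drop the
# echo to run them for real.
REPO=https://github.com/ceph/ceph-client.git
BRANCH=wip-nick

cmd() { echo "+ $*"; }
cmd git clone --branch "$BRANCH" --single-branch "$REPO"
cmd cd ceph-client
cmd git rev-parse --short=8 HEAD    # expect be3198d6
```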

I'm really sorry for this mistake.  Please try this new
branch and report back what you find.

					-Alex


> Nick, I have put together a branch that includes two fixes
> that might be helpful.  I don't expect these fixes will
> necessarily *fix* what you're seeing, but one of them
> pulls a big hunk of processing out of the picture and
> might help eliminate some potential causes.  I had to
> pull in several other patches as prerequisites in order
> to get those fixes to apply cleanly.
> 
> Would you be able to give it a try, and let us know what
> results you get?  The branch contains:
> - Linux 3.5.5
> - Plus the first 49 patches you listed
> - Plus four patches, which are prerequisites...
>     libceph: define ceph_extract_encoded_string()
>     rbd: define some new format constants
>     rbd: define rbd_dev_image_id()
>     rbd: kill create_snap sysfs entry
> - ...for these two bug fixes:
>     libceph: remove 'osdtimeout' option
>     ceph: don't reference req after put
> 
> The branch is available in the ceph-client git repository
> under the name "wip-nick" and has commit id dd9323aa.
>     https://github.com/ceph/ceph-client/tree/wip-nick
> 
>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
> 
> This full debug output is very helpful.  Please supply
> that again as well.
> 
> Thanks.
> 
> 					-Alex
> 
>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com> wrote:
>>> It's very easy to reproduce now with my automated install script, the
>>> most I've seen it succeed with that patch is 2 in a row, and hanging
>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>> when I'm a bit more rested and my brain is working better.
>>>
>>> Yes, during this the OSDs are probably all syncing up.  All the osd and
>>> mon daemons have started by the time the rbd commands are run, though.
>>>
>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>> still going strong after 21 builds.
>>>>
>>>> Okay, that one at least makes some sense.  I've opened
>>>>
>>>>         http://tracker.newdream.net/issues/3519
>>>>
>>>> How easy is this to reproduce?  If it is something you can trigger with
>>>> debugging enabled ('echo module libceph +p >
>>>> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>>>>
>>>> I'm guessing that during this startup time the OSDs are still in the
>>>> process of starting?
>>>>
>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
>>>> thrashing OSDs could hit this.
>>>>
>>>> Thanks!
>>>> sage
>>>>
>>>>
>>>>>
>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>> With 8 successful installs already done, I'm reasonably confident that
>>>>>> it's patch #50.  I'm making another build which applies all patches
>>>>>> from the 3.5 backport branch, excluding that specific one.  I'll let
>>>>>> you know if that turns up any unexpected failures.
>>>>>>
>>>>>> What will the potential fallout be for removing that specific patch?
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>>> It's really looking like it's the
>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>>>>>>  So far I have gone through 4 successful installs with no hang with
>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure it's
>>>>>>> not a fluke, but since previously it hangs within the first couple of
>>>>>>> builds, it really looks like this is where the problem originated.
>>>>>>>
>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>>>>>>>
>>>>>>>> sage
>>>>>
>>>>>
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-11-30 23:22                                               ` Alex Elder
@ 2012-12-02  5:34                                                 ` Nick Bartos
  2012-12-03  4:43                                                   ` Alex Elder
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-12-02  5:34 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

Unfortunately the hangs happen with the new set of patches.  Here's
some debug info:

https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt
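
In the meantime, while the root cause is being chased down, one way to keep
boot from blocking for the full 15 minutes is to give each map attempt its
own timeout and retry.  This is only a rough workaround sketch, not tested
against a real cluster, and the pool/volume names in the usage comment are
illustrative:

```shell
# Workaround sketch: wrap a command (e.g. "rbd map") in a per-attempt
# timeout with retries, instead of letting it block until the kernel
# gives up on the stuck socket.
map_with_timeout() {
    local tries=$1 per_try=$2
    shift 2
    local i
    for ((i = 1; i <= tries; i++)); do
        if timeout "$per_try" "$@"; then
            return 0
        fi
        echo "attempt $i/$tries failed or timed out, retrying" >&2
        sleep 1
    done
    return 1
}

# e.g. in the boot script (names illustrative):
#   map_with_timeout 5 60 rbd map rbd/myvol
```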


On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder <elder@inktank.com> wrote:
> On 11/29/2012 02:37 PM, Alex Elder wrote:
>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>> Here are the ceph log messages (including the libceph kernel debug
>>> stuff you asked for) from a node boot with the rbd command hung for a
>>> couple of minutes:
>
> I'm sorry, but I did something stupid...
>
> Yes, the branch I gave you includes these fixes.  However
> it does *not* include the commit that was giving you trouble
> to begin with.
>
> So...
>
> I have updated that same branch (wip-nick) to contain:
> - Linux 3.5.5
> - Plus the first *50* (not 49) patches you listed
> - Plus the ones I added before.
>
> The new commit id for that branch begins with be3198d6.
>
> I'm really sorry for this mistake.  Please try this new
> branch and report back what you find.
>
>                                         -Alex
>
>
>> Nick, I have put together a branch that includes two fixes
>> that might be helpful.  I don't expect these fixes will
>> necessarily *fix* what you're seeing, but one of them
>> pulls a big hunk of processing out of the picture and
>> might help eliminate some potential causes.  I had to
>> pull in several other patches as prerequisites in order
>> to get those fixes to apply cleanly.
>>
>> Would you be able to give it a try, and let us know what
>> results you get?  The branch contains:
>> - Linux 3.5.5
>> - Plus the first 49 patches you listed
>> - Plus four patches, which are prerequisites...
>>     libceph: define ceph_extract_encoded_string()
>>     rbd: define some new format constants
>>     rbd: define rbd_dev_image_id()
>>     rbd: kill create_snap sysfs entry
>> - ...for these two bug fixes:
>>     libceph: remove 'osdtimeout' option
>>     ceph: don't reference req after put
>>
>> The branch is available in the ceph-client git repository
>> under the name "wip-nick" and has commit id dd9323aa.
>>     https://github.com/ceph/ceph-client/tree/wip-nick
>>
>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>
>> This full debug output is very helpful.  Please supply
>> that again as well.
>>
>> Thanks.
>>
>>                                       -Alex
>>
>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>> It's very easy to reproduce now with my automated install script; the
>>>> most I've seen it succeed with that patch is 2 in a row before hanging
>>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>>> when I'm a bit more rested and my brain is working better.
>>>>
>>>> Yes, during this the OSDs are probably all syncing up.  All the osd and
>>>> mon daemons have started by the time the rbd commands are run, though.
>>>>
>>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>>> still going strong after 21 builds.
>>>>>
>>>>> Okay, that one at least makes some sense.  I've opened
>>>>>
>>>>>         http://tracker.newdream.net/issues/3519
>>>>>
>>>>> How easy is this to reproduce?  If it is something you can trigger with
>>>>> debugging enabled ('echo module libceph +p >
>>>>> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>>>>>
>>>>> I'm guessing that during this startup time the OSDs are still in the
>>>>> process of starting?
>>>>>
>>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
>>>>> thrashing OSDs could hit this.
>>>>>
>>>>> Thanks!
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>>> With 8 successful installs already done, I'm reasonably confident that
>>>>>>> it's patch #50.  I'm making another build which applies all patches
>>>>>>> from the 3.5 backport branch, excluding that specific one.  I'll let
>>>>>>> you know if that turns up any unexpected failures.
>>>>>>>
>>>>>>> What will the potential fallout be for removing that specific patch?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>>>> It's really looking like it's the
>>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>>>>>>>  So far I have gone through 4 successful installs with no hang with
>>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure it's
>>>>>>>> not a fluke, but since previously it hangs within the first couple of
>>>>>>>> builds, it really looks like this is where the problem originated.
>>>>>>>>
>>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>>>>>>>>
>>>>>>>>> sage
>>>>>>
>>>>>>
>>
>
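
For reference, Sage's dynamic-debug suggestion quoted above can be wrapped in
a small guard so it degrades to a warning instead of an error when debugfs
isn't mounted or the script isn't running as root.  A sketch (the control
path is exactly as quoted; the function name is mine):

```shell
# Enable libceph dynamic debug output, as suggested above.  Takes an
# optional override of the control file path so the logic is testable.
enable_libceph_debug() {
    ctrl=${1:-/sys/kernel/debug/dynamic_debug/control}
    if [ -w "$ctrl" ]; then
        echo 'module libceph +p' > "$ctrl"
    else
        echo "warning: $ctrl not writable; mount debugfs and run as root" >&2
        return 1
    fi
}
```

The resulting debug messages show up in dmesg/syslog once enabled.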

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-02  5:34                                                 ` Nick Bartos
@ 2012-12-03  4:43                                                   ` Alex Elder
  2012-12-10 21:57                                                     ` Alex Elder
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-12-03  4:43 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/01/2012 11:34 PM, Nick Bartos wrote:
> Unfortunately the hangs happen with the new set of patches.  Here's
> some debug info:
>
> https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt

Well I'm sorry to hear that but I'm glad to have the new info.

In retrospect, running the new patches *without* the one that
seems to cause the hang (#50) was good validation that they
didn't lead to any new problems.

I'll look at this some more in the morning, and I think I'll
confer with Sage whenever he's available for ideas on how to
proceed.

					-Alex


> On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder <elder@inktank.com> wrote:
>> On 11/29/2012 02:37 PM, Alex Elder wrote:
>>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>>> Here are the ceph log messages (including the libceph kernel debug
>>>> stuff you asked for) from a node boot with the rbd command hung for a
>>>> couple of minutes:
>>
>> I'm sorry, but I did something stupid...
>>
>> Yes, the branch I gave you includes these fixes.  However
>> it does *not* include the commit that was giving you trouble
>> to begin with.
>>
>> So...
>>
>> I have updated that same branch (wip-nick) to contain:
>> - Linux 3.5.5
>> - Plus the first *50* (not 49) patches you listed
>> - Plus the ones I added before.
>>
>> The new commit id for that branch begins with be3198d6.
>>
>> I'm really sorry for this mistake.  Please try this new
>> branch and report back what you find.
>>
>>                                          -Alex
>>
>>
>>> Nick, I have put together a branch that includes two fixes
>>> that might be helpful.  I don't expect these fixes will
>>> necessarily *fix* what you're seeing, but one of them
>>> pulls a big hunk of processing out of the picture and
>>> might help eliminate some potential causes.  I had to
>>> pull in several other patches as prerequisites in order
>>> to get those fixes to apply cleanly.
>>>
>>> Would you be able to give it a try, and let us know what
>>> results you get?  The branch contains:
>>> - Linux 3.5.5
>>> - Plus the first 49 patches you listed
>>> - Plus four patches, which are prerequisites...
>>>      libceph: define ceph_extract_encoded_string()
>>>      rbd: define some new format constants
>>>      rbd: define rbd_dev_image_id()
>>>      rbd: kill create_snap sysfs entry
>>> - ...for these two bug fixes:
>>>      libceph: remove 'osdtimeout' option
>>>      ceph: don't reference req after put
>>>
>>> The branch is available in the ceph-client git repository
>>> under the name "wip-nick" and has commit id dd9323aa.
>>>      https://github.com/ceph/ceph-client/tree/wip-nick
>>>
>>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>>
>>> This full debug output is very helpful.  Please supply
>>> that again as well.
>>>
>>> Thanks.
>>>
>>>                                        -Alex
>>>
>>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>> It's very easy to reproduce now with my automated install script; the
>>>>> most I've seen it succeed with that patch is 2 in a row before hanging
>>>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>>>> when I'm a bit more rested and my brain is working better.
>>>>>
>>>>> Yes, during this the OSDs are probably all syncing up.  All the osd and
>>>>> mon daemons have started by the time the rbd commands are run, though.
>>>>>
>>>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>>>> still going strong after 21 builds.
>>>>>>
>>>>>> Okay, that one at least makes some sense.  I've opened
>>>>>>
>>>>>>          http://tracker.newdream.net/issues/3519
>>>>>>
>>>>>> How easy is this to reproduce?  If it is something you can trigger with
>>>>>> debugging enabled ('echo module libceph +p >
>>>>>> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>>>>>>
>>>>>> I'm guessing that during this startup time the OSDs are still in the
>>>>>> process of starting?
>>>>>>
>>>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
>>>>>> thrashing OSDs could hit this.
>>>>>>
>>>>>> Thanks!
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>>>> With 8 successful installs already done, I'm reasonably confident that
>>>>>>>> it's patch #50.  I'm making another build which applies all patches
>>>>>>>> from the 3.5 backport branch, excluding that specific one.  I'll let
>>>>>>>> you know if that turns up any unexpected failures.
>>>>>>>>
>>>>>>>> What will the potential fallout be for removing that specific patch?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>>>>> It's really looking like it's the
>>>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>>>>>>>>   So far I have gone through 4 successful installs with no hang with
>>>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure it's
>>>>>>>>> not a fluke, but since previously it hangs within the first couple of
>>>>>>>>> builds, it really looks like this is where the problem originated.
>>>>>>>>>
>>>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>>>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>>>>>>>>>
>>>>>>>>>> sage
>>>>>>>
>>>>>>>
>>>
>>
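
The map/unmap-in-a-loop test Alex and Sage discuss above might look roughly
like the sketch below.  The map and unmap commands are passed in as
parameters so the loop itself can be exercised without a live cluster; the
pool/image names in the usage comment are illustrative:

```shell
# Stress sketch: run N map/unmap cycles, stopping at the first failure.
# The commands are arguments so this can run against a real cluster or a stub.
thrash_cycles() {
    local n=$1 map_cmd=$2 unmap_cmd=$3 i
    for ((i = 1; i <= n; i++)); do
        $map_cmd   || { echo "map failed on cycle $i" >&2; return 1; }
        $unmap_cmd || { echo "unmap failed on cycle $i" >&2; return 1; }
    done
    echo "completed $n cycles"
}

# Against a real cluster (while an OSD-thrashing script runs elsewhere):
#   thrash_cycles 100 'rbd map rbd/testimg' 'rbd unmap /dev/rbd0'
```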


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-03  4:43                                                   ` Alex Elder
@ 2012-12-10 21:57                                                     ` Alex Elder
  2012-12-11 17:26                                                       ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-12-10 21:57 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/02/2012 10:43 PM, Alex Elder wrote:
> On 12/01/2012 11:34 PM, Nick Bartos wrote:
>> Unfortunately the hangs happen with the new set of patches.  Here's
>> some debug info:
>>
>> https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt
>>
> 
> Well I'm sorry to hear that but I'm glad to have the new info.
> 
> In retrospect, running the new patches *without* the one that
> seems to cause the hang (#50) was good validation that they
> didn't lead to any new problems.
> 
> I'll look at this some more in the morning, and I think I'll
> confer with Sage whenever he's available for ideas on how to
> proceed.

Over the course of last week I have been finding and fixing a
few problems in rbd, the osd client, and the messenger in the
Linux kernel code.  I've added a handful of new patches to the
end of the ones I gave you last time.

At this point I don't expect these changes to directly affect
the hangs you have been seeing, but a couple of these are
real problems you could (also) hit, and I'd like to avoid
that.

I haven't done rigorous testing on this but I believe
the changes are correct (and Sage has looked at them
and says they look OK to him).

The new version is available in the branch "wip-nick-new"
in the ceph-client git repository.  If you reproduce your
hang with this updated code (or do not), please let me know.

Thanks.

					-Alex


>                     -Alex
> 
> 
>> On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder <elder@inktank.com> wrote:
>>> On 11/29/2012 02:37 PM, Alex Elder wrote:
>>>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>>>> Here are the ceph log messages (including the libceph kernel debug
>>>>> stuff you asked for) from a node boot with the rbd command hung for a
>>>>> couple of minutes:
>>>
>>> I'm sorry, but I did something stupid...
>>>
>>> Yes, the branch I gave you includes these fixes.  However
>>> it does *not* include the commit that was giving you trouble
>>> to begin with.
>>>
>>> So...
>>>
>>> I have updated that same branch (wip-nick) to contain:
>>> - Linux 3.5.5
>>> - Plus the first *50* (not 49) patches you listed
>>> - Plus the ones I added before.
>>>
>>> The new commit id for that branch begins with be3198d6.
>>>
>>> I'm really sorry for this mistake.  Please try this new
>>> branch and report back what you find.
>>>
>>>                                          -Alex
>>>
>>>
>>>> Nick, I have put together a branch that includes two fixes
>>>> that might be helpful.  I don't expect these fixes will
>>>> necessarily *fix* what you're seeing, but one of them
>>>> pulls a big hunk of processing out of the picture and
>>>> might help eliminate some potential causes.  I had to
>>>> pull in several other patches as prerequisites in order
>>>> to get those fixes to apply cleanly.
>>>>
>>>> Would you be able to give it a try, and let us know what
>>>> results you get?  The branch contains:
>>>> - Linux 3.5.5
>>>> - Plus the first 49 patches you listed
>>>> - Plus four patches, which are prerequisites...
>>>>      libceph: define ceph_extract_encoded_string()
>>>>      rbd: define some new format constants
>>>>      rbd: define rbd_dev_image_id()
>>>>      rbd: kill create_snap sysfs entry
>>>> - ...for these two bug fixes:
>>>>      libceph: remove 'osdtimeout' option
>>>>      ceph: don't reference req after put
>>>>
>>>> The branch is available in the ceph-client git repository
>>>> under the name "wip-nick" and has commit id dd9323aa.
>>>>      https://github.com/ceph/ceph-client/tree/wip-nick
>>>>
>>>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>>>>
>>>>
>>>> This full debug output is very helpful.  Please supply
>>>> that again as well.
>>>>
>>>> Thanks.
>>>>
>>>>                                        -Alex
>>>>
>>>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com>
>>>>> wrote:
>>>>>> It's very easy to reproduce now with my automated install script; the
>>>>>> most I've seen it succeed with that patch is 2 in a row before hanging
>>>>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>>>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>>>>> when I'm a bit more rested and my brain is working better.
>>>>>>
>>>>>> Yes, during this the OSDs are probably all syncing up.  All the osd and
>>>>>> mon daemons have started by the time the rbd commands are run, though.
>>>>>>
>>>>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>>>>> still going strong after 21 builds.
>>>>>>>
>>>>>>> Okay, that one at least makes some sense.  I've opened
>>>>>>>
>>>>>>>          http://tracker.newdream.net/issues/3519
>>>>>>>
>>>>>>> How easy is this to reproduce?  If it is something you can
>>>>>>> trigger with
>>>>>>> debugging enabled ('echo module libceph +p >
>>>>>>> /sys/kernel/debug/dynamic_debug/control') that would help
>>>>>>> tremendously.
>>>>>>>
>>>>>>> I'm guessing that during this startup time the OSDs are still in the
>>>>>>> process of starting?
>>>>>>>
>>>>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a
>>>>>>> loop while
>>>>>>> thrashing OSDs could hit this.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> sage
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos
>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>> With 8 successful installs already done, I'm reasonably
>>>>>>>>> confident that
>>>>>>>>> it's patch #50.  I'm making another build which applies all
>>>>>>>>> patches
>>>>>>>>> from the 3.5 backport branch, excluding that specific one. 
>>>>>>>>> I'll let
>>>>>>>>> you know if that turns up any unexpected failures.
>>>>>>>>>
>>>>>>>>> What will the potential fallout be for removing that specific
>>>>>>>>> patch?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos
>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>> It's really looking like it's the
>>>>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is
>>>>>>>>>> present.
>>>>>>>>>>   So far I have gone through 4 successful installs with no
>>>>>>>>>> hang with
>>>>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure
>>>>>>>>>> it's
>>>>>>>>>> not a fluke, but since previously it hangs within the first
>>>>>>>>>> couple of
>>>>>>>>>> builds, it really looks like this is where the problem
>>>>>>>>>> originated.
>>>>>>>>>>
>>>>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>>>>>
>>>>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> Thanks for hunting this down.  I'm very curious what the
>>>>>>>>>>> culprit is...
>>>>>>>>>>>
>>>>>>>>>>> sage
>>>>>>>>
>>>>>>>>
>>>>
>>>
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-10 21:57                                                     ` Alex Elder
@ 2012-12-11 17:26                                                       ` Nick Bartos
  2012-12-11 18:01                                                         ` Alex Elder
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-12-11 17:26 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

Thanks! I'm creating a build with the new patches now.  I'll let you
know how testing goes.

On Mon, Dec 10, 2012 at 1:57 PM, Alex Elder <elder@inktank.com> wrote:
> On 12/02/2012 10:43 PM, Alex Elder wrote:
>> On 12/01/2012 11:34 PM, Nick Bartos wrote:
>>> Unfortunately the hangs happen with the new set of patches.  Here's
>>> some debug info:
>>>
>>> https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt
>>>
>>
>> Well I'm sorry to hear that but I'm glad to have the new info.
>>
>> In retrospect, running the new patches *without* the one that
>> seems to cause the hang (#50) was good validation that they
>> didn't lead to any new problems.
>>
>> I'll look at this some more in the morning, and I think I'll
>> confer with Sage whenever he's available for ideas on how to
>> proceed.
>
> Over the course of last week I have been finding and fixing a
> few problems in rbd, the osd client, and the messenger in the
> Linux kernel code.  I've added a handful of new patches to the
> end of the ones I gave you last time.
>
> At this point I don't expect these changes to directly affect
> the hangs you have been seeing, but a couple of these are
> real problems you could (also) hit, and I'd like to avoid
> that.
>
> I haven't done rigorous testing on this but I believe
> the changes are correct (and Sage has looked at them
> and says they look OK to him).
>
> The new version is available in the branch "wip-nick-new"
> in the ceph-client git repository.  If you reproduce your
> hang with this updated code (or do not), please let me know.
>
> Thanks.
>
>                                         -Alex
>
>
>>                     -Alex
>>
>>
>>> On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder <elder@inktank.com> wrote:
>>>> On 11/29/2012 02:37 PM, Alex Elder wrote:
>>>>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>>>>> Here are the ceph log messages (including the libceph kernel debug
>>>>>> stuff you asked for) from a node boot with the rbd command hung for a
>>>>>> couple of minutes:
>>>>
>>>> I'm sorry, but I did something stupid...
>>>>
>>>> Yes, the branch I gave you includes these fixes.  However
>>>> it does *not* include the commit that was giving you trouble
>>>> to begin with.
>>>>
>>>> So...
>>>>
>>>> I have updated that same branch (wip-nick) to contain:
>>>> - Linux 3.5.5
>>>> - Plus the first *50* (not 49) patches you listed
>>>> - Plus the ones I added before.
>>>>
>>>> The new commit id for that branch begins with be3198d6.
>>>>
>>>> I'm really sorry for this mistake.  Please try this new
>>>> branch and report back what you find.
>>>>
>>>>                                          -Alex
>>>>
>>>>
>>>>> Nick, I have put together a branch that includes two fixes
>>>>> that might be helpful.  I don't expect these fixes will
>>>>> necessarily *fix* what you're seeing, but one of them
>>>>> pulls a big hunk of processing out of the picture and
>>>>> might help eliminate some potential causes.  I had to
>>>>> pull in several other patches as prerequisites in order
>>>>> to get those fixes to apply cleanly.
>>>>>
>>>>> Would you be able to give it a try, and let us know what
>>>>> results you get?  The branch contains:
>>>>> - Linux 3.5.5
>>>>> - Plus the first 49 patches you listed
>>>>> - Plus four patches, which are prerequisites...
>>>>>      libceph: define ceph_extract_encoded_string()
>>>>>      rbd: define some new format constants
>>>>>      rbd: define rbd_dev_image_id()
>>>>>      rbd: kill create_snap sysfs entry
>>>>> - ...for these two bug fixes:
>>>>>      libceph: remove 'osdtimeout' option
>>>>>      ceph: don't reference req after put
>>>>>
>>>>> The branch is available in the ceph-client git repository
>>>>> under the name "wip-nick" and has commit id dd9323aa.
>>>>>      https://github.com/ceph/ceph-client/tree/wip-nick
>>>>>
>>>>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>>>>>
>>>>>
>>>>> This full debug output is very helpful.  Please supply
>>>>> that again as well.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>                                        -Alex
>>>>>
>>>>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com>
>>>>>> wrote:
>>>>>>> It's very easy to reproduce now with my automated install script, the
>>>>>>> most I've seen it succeed with that patch is 2 in a row, and hanging
>>>>>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>>>>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>>>>>> when I'm a bit more rested and my brain is working better.
>>>>>>>
>>>>>>> Yes during this the OSDs are probably all syncing up.  All the osd
>>>>>>> and
>>>>>>> mon daemons have started by the time the rbd commands are run,
>>>>>>> though.
>>>>>>>
>>>>>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>>>>>> still going strong after 21 builds.
>>>>>>>>
>>>>>>>> Okay, that one at least makes some sense.  I've opened
>>>>>>>>
>>>>>>>>          http://tracker.newdream.net/issues/3519
>>>>>>>>
>>>>>>>> How easy is this to reproduce?  If it is something you can
>>>>>>>> trigger with
>>>>>>>> debugging enabled ('echo module libceph +p >
>>>>>>>> /sys/kernel/debug/dynamic_debug/control') that would help
>>>>>>>> tremendously.
>>>>>>>>
>>>>>>>> I'm guessing that during this startup time the OSDs are still in the
>>>>>>>> process of starting?
>>>>>>>>
>>>>>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a
>>>>>>>> loop while
>>>>>>>> thrashing OSDs could hit this.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> sage
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos
>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>> With 8 successful installs already done, I'm reasonably
>>>>>>>>>> confident that
>>>>>>>>>> it's patch #50.  I'm making another build which applies all
>>>>>>>>>> patches
>>>>>>>>>> from the 3.5 backport branch, excluding that specific one.
>>>>>>>>>> I'll let
>>>>>>>>>> you know if that turns up any unexpected failures.
>>>>>>>>>>
>>>>>>>>>> What will the potential fallout be for removing that specific
>>>>>>>>>> patch?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos
>>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>>> It's really looking like it's the
>>>>>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is
>>>>>>>>>>> present.
>>>>>>>>>>>   So far I have gone through 4 successful installs with no
>>>>>>>>>>> hang with
>>>>>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure
>>>>>>>>>>> it's
>>>>>>>>>>> not a fluke, but since previously it hangs within the first
>>>>>>>>>>> couple of
>>>>>>>>>>> builds, it really looks like this is where the problem
>>>>>>>>>>> originated.
>>>>>>>>>>>
>>>>>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>>>>>>
>>>>>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Thanks for hunting this down.  I'm very curious what the
>>>>>>>>>>>> culprit is...
>>>>>>>>>>>>
>>>>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>
>>
>
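[Editor's note: Sage's suggested stress test above (lots of map/unmap in a loop while thrashing OSDs) could be sketched roughly as below. This is a minimal sketch, not code from the thread: the image name `test0`, the iteration count, and the `/dev/rbd0` device path are all made-up placeholders, and the `RBD` override exists only so the loop can be dry-run without a live cluster.]

```shell
# Sketch of a map/unmap stress loop; pool/image names are hypothetical.
# Run this while restarting ceph-osd daemons in another shell to
# approximate the "thrashing OSDs" part of the suggestion.
# RBD can be overridden, e.g. RBD="echo rbd" for a dry run.
RBD=${RBD:-rbd}

stress_map_unmap() {
    img=$1; n=$2; i=1
    while [ "$i" -le "$n" ]; do
        $RBD map "$img" || return 1      # this is the call that hangs
        $RBD unmap /dev/rbd0 || return 1
        i=$((i + 1))
    done
}

# Example dry run (no cluster needed): just echoes the commands.
RBD="echo rbd" stress_map_unmap test0 2
```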

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-11 17:26                                                       ` Nick Bartos
@ 2012-12-11 18:01                                                         ` Alex Elder
  2012-12-11 19:44                                                           ` Alex Elder
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-12-11 18:01 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/11/2012 11:26 AM, Nick Bartos wrote:
> Thanks! I'm creating a build with the new patches now.  I'll let you
> know how testing goes.

FYI, I've been testing with these changes and have *not* been
hitting the kinds of problems I'd seen previously.  However
those problems were different from yours, so I'm offering no
promises...  But there's a chance it'll be more helpful than
I thought.

I am preparing yet another branch for you, this time adding
all the rest of the commits you started with, just in case
this does improve things.

Please keep me informed how your testing goes.

					-Alex


> On Mon, Dec 10, 2012 at 1:57 PM, Alex Elder <elder@inktank.com> wrote:
>> On 12/02/2012 10:43 PM, Alex Elder wrote:
>>> On 12/01/2012 11:34 PM, Nick Bartos wrote:
>>>> Unfortunately the hangs happen with the new set of patches.  Here's
>>>> some debug info:
>>>>
>>>> https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt
>>>>
>>>
>>> Well I'm sorry to hear that but I'm glad to have the new info.
>>>
>>> In retrospect running the new patches *without* the one that
>>> seems to cause the hang (#50) was good validation that they
>>> didn't lead to any new problems.
>>>
>>> I'll look at this some more in the morning, and I think I'll
>>> confer with Sage whenever he's available for ideas on how to
>>> proceed.
>>
>> Over the course of last week I have been finding and fixing a
>> few problems in rbd, the osd client, and the messenger in the
>> Linux kernel code.  I've added a handful of new patches to the
>> end of the ones I gave you last time.
>>
>> At this point I don't expect these changes to directly affect
>> the hangs you have been seeing, but a couple of these are
>> real problems you could (also) hit, and I'd like to avoid
>> that.
>>
>> I haven't done rigorous testing on this but I believe
>> the changes are correct (and Sage has looked at them
>> and says they look OK to him).
>>
>> The new version is available in the branch "wip-nick-new"
>> in the ceph-client git repository.  If you reproduce your
>> hang with this updated code (or do not), please let me know.
>>
>> Thanks.
>>
>>                                         -Alex
>>
>>
>>>                     -Alex
>>>
>>>
>>>> On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder <elder@inktank.com> wrote:
>>>>> On 11/29/2012 02:37 PM, Alex Elder wrote:
>>>>>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>>>>>> Here are the ceph log messages (including the libceph kernel debug
>>>>>>> stuff you asked for) from a node boot with the rbd command hung for a
>>>>>>> couple of minutes:
>>>>>
>>>>> I'm sorry, but I did something stupid...
>>>>>
>>>>> Yes, the branch I gave you includes these fixes.  However
>>>>> it does *not* include the commit that was giving you trouble
>>>>> to begin with.
>>>>>
>>>>> So...
>>>>>
>>>>> I have updated that same branch (wip-nick) to contain:
>>>>> - Linux 3.5.5
>>>>> - Plus the first *50* (not 49) patches you listed
>>>>> - Plus the ones I added before.
>>>>>
>>>>> The new commit id for that branch begins with be3198d6.
>>>>>
>>>>> I'm really sorry for this mistake.  Please try this new
>>>>> branch and report back what you find.
>>>>>
>>>>>                                          -Alex
>>>>>
>>>>>
>>>>>> Nick, I have put together a branch that includes two fixes
>>>>>> that might be helpful.  I don't expect these fixes will
>>>>>> necessarily *fix* what you're seeing, but one of them
>>>>>> pulls a big hunk of processing out of the picture and
>>>>>> might help eliminate some potential causes.  I had to
>>>>>> pull in several other patches as prerequisites in order
>>>>>> to get those fixes to apply cleanly.
>>>>>>
>>>>>> Would you be able to give it a try, and let us know what
>>>>>> results you get?  The branch contains:
>>>>>> - Linux 3.5.5
>>>>>> - Plus the first 49 patches you listed
>>>>>> - Plus four patches, which are prerequisites...
>>>>>>      libceph: define ceph_extract_encoded_string()
>>>>>>      rbd: define some new format constants
>>>>>>      rbd: define rbd_dev_image_id()
>>>>>>      rbd: kill create_snap sysfs entry
>>>>>> - ...for these two bug fixes:
>>>>>>      libceph: remove 'osdtimeout' option
>>>>>>      ceph: don't reference req after put
>>>>>>
>>>>>> The branch is available in the ceph-client git repository
>>>>>> under the name "wip-nick" and has commit id dd9323aa.
>>>>>>      https://github.com/ceph/ceph-client/tree/wip-nick
>>>>>>
>>>>>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>>>>>>
>>>>>>
>>>>>> This full debug output is very helpful.  Please supply
>>>>>> that again as well.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>                                        -Alex
>>>>>>
>>>>>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com>
>>>>>>> wrote:
>>>>>>>> It's very easy to reproduce now with my automated install script, the
>>>>>>>> most I've seen it succeed with that patch is 2 in a row, and hanging
>>>>>>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>>>>>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>>>>>>> when I'm a bit more rested and my brain is working better.
>>>>>>>>
>>>>>>>> Yes during this the OSDs are probably all syncing up.  All the osd
>>>>>>>> and
>>>>>>>> mon daemons have started by the time the rbd commands are run,
>>>>>>>> though.
>>>>>>>>
>>>>>>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>>>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>>>>>>> still going strong after 21 builds.
>>>>>>>>>
>>>>>>>>> Okay, that one at least makes some sense.  I've opened
>>>>>>>>>
>>>>>>>>>          http://tracker.newdream.net/issues/3519
>>>>>>>>>
>>>>>>>>> How easy is this to reproduce?  If it is something you can
>>>>>>>>> trigger with
>>>>>>>>> debugging enabled ('echo module libceph +p >
>>>>>>>>> /sys/kernel/debug/dynamic_debug/control') that would help
>>>>>>>>> tremendously.
>>>>>>>>>
>>>>>>>>> I'm guessing that during this startup time the OSDs are still in the
>>>>>>>>> process of starting?
>>>>>>>>>
>>>>>>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a
>>>>>>>>> loop while
>>>>>>>>> thrashing OSDs could hit this.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos
>>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>>> With 8 successful installs already done, I'm reasonably
>>>>>>>>>>> confident that
>>>>>>>>>>> it's patch #50.  I'm making another build which applies all
>>>>>>>>>>> patches
>>>>>>>>>>> from the 3.5 backport branch, excluding that specific one.
>>>>>>>>>>> I'll let
>>>>>>>>>>> you know if that turns up any unexpected failures.
>>>>>>>>>>>
>>>>>>>>>>> What will the potential fallout be for removing that specific
>>>>>>>>>>> patch?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos
>>>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>>>> It's really looking like it's the
>>>>>>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is
>>>>>>>>>>>> present.
>>>>>>>>>>>>   So far I have gone through 4 successful installs with no
>>>>>>>>>>>> hang with
>>>>>>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure
>>>>>>>>>>>> it's
>>>>>>>>>>>> not a fluke, but since previously it hangs within the first
>>>>>>>>>>>> couple of
>>>>>>>>>>>> builds, it really looks like this is where the problem
>>>>>>>>>>>> originated.
>>>>>>>>>>>>
>>>>>>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>>>>>>>
>>>>>>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Thanks for hunting this down.  I'm very curious what the
>>>>>>>>>>>>> culprit is...
>>>>>>>>>>>>>
>>>>>>>>>>>>> sage
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>
>>>
>>
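[Editor's note: the dynamic-debug toggle Sage quotes above can be wrapped in a small helper so the extra libceph output is only enabled around the command being reproduced. A sketch follows; it assumes a kernel built with CONFIG_DYNAMIC_DEBUG, debugfs mounted at the standard location, and root privileges. The `CONTROL` override and the `libceph_debug` helper name are illustrative, not from the thread.]

```shell
# Enable libceph pr_debug output, run one command, then disable it again.
# The resulting debug messages appear in the kernel log (dmesg).
# CONTROL can be pointed at a scratch file to exercise the helper
# without debugfs.
CONTROL=${CONTROL:-/sys/kernel/debug/dynamic_debug/control}

libceph_debug() {
    echo 'module libceph +p' > "$CONTROL"   # turn on all libceph debug sites
    "$@"                                    # e.g. rbd map <rbdvol>
    status=$?
    echo 'module libceph -p' > "$CONTROL"   # turn them back off
    return "$status"
}
```

Used as `libceph_debug rbd map <rbdvol>`, followed by `dmesg` to collect the trace that the thread asks for.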


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-11 18:01                                                         ` Alex Elder
@ 2012-12-11 19:44                                                           ` Alex Elder
  2012-12-13  0:57                                                             ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-12-11 19:44 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/11/2012 12:01 PM, Alex Elder wrote:
> On 12/11/2012 11:26 AM, Nick Bartos wrote:
>> Thanks! I'm creating a build with the new patches now.  I'll let you
>> know how testing goes.
> 
> FYI, I've been testing with these changes and have *not* been
> hitting the kinds of problems I'd seen previously.  However
> those problems were different from yours, so I'm offering no
> promises...  But there's a chance it'll be more helpful than
> I thought.
> 
> I am preparing yet another branch for you, this time adding
> all the rest of the commits you started with, just in case
> this does improve things.

This new branch is ready.  Feel free to try it out, and again
let me know how it works for you.

The branch is "wip-nick-newer"

					-Alex


> Please keep me informed how your testing goes.
> 
> 					-Alex
> 
> 
>> On Mon, Dec 10, 2012 at 1:57 PM, Alex Elder <elder@inktank.com> wrote:
>>> On 12/02/2012 10:43 PM, Alex Elder wrote:
>>>> On 12/01/2012 11:34 PM, Nick Bartos wrote:
>>>>> Unfortunately the hangs happen with the new set of patches.  Here's
>>>>> some debug info:
>>>>>
>>>>> https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt
>>>>>
>>>>
>>>> Well I'm sorry to hear that but I'm glad to have the new info.
>>>>
>>>> In retrospect running the new patches *without* the one that
>>>> seems to cause the hang (#50) was good validation that they
>>>> didn't lead to any new problems.
>>>>
>>>> I'll look at this some more in the morning, and I think I'll
>>>> confer with Sage whenever he's available for ideas on how to
>>>> proceed.
>>>
>>> Over the course of last week I have been finding and fixing a
>>> few problems in rbd, the osd client, and the messenger in the
>>> Linux kernel code.  I've added a handful of new patches to the
>>> end of the ones I gave you last time.
>>>
>>> At this point I don't expect these changes to directly affect
>>> the hangs you have been seeing, but a couple of these are
>>> real problems you could (also) hit, and I'd like to avoid
>>> that.
>>>
>>> I haven't done rigorous testing on this but I believe
>>> the changes are correct (and Sage has looked at them
>>> and says they look OK to him).
>>>
>>> The new version is available in the branch "wip-nick-new"
>>> in the ceph-client git repository.  If you reproduce your
>>> hang with this updated code (or do not), please let me know.
>>>
>>> Thanks.
>>>
>>>                                         -Alex
>>>
>>>
>>>>                     -Alex
>>>>
>>>>
>>>>> On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder <elder@inktank.com> wrote:
>>>>>> On 11/29/2012 02:37 PM, Alex Elder wrote:
>>>>>>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>>>>>>> Here are the ceph log messages (including the libceph kernel debug
>>>>>>>> stuff you asked for) from a node boot with the rbd command hung for a
>>>>>>>> couple of minutes:
>>>>>>
>>>>>> I'm sorry, but I did something stupid...
>>>>>>
>>>>>> Yes, the branch I gave you includes these fixes.  However
>>>>>> it does *not* include the commit that was giving you trouble
>>>>>> to begin with.
>>>>>>
>>>>>> So...
>>>>>>
>>>>>> I have updated that same branch (wip-nick) to contain:
>>>>>> - Linux 3.5.5
>>>>>> - Plus the first *50* (not 49) patches you listed
>>>>>> - Plus the ones I added before.
>>>>>>
>>>>>> The new commit id for that branch begins with be3198d6.
>>>>>>
>>>>>> I'm really sorry for this mistake.  Please try this new
>>>>>> branch and report back what you find.
>>>>>>
>>>>>>                                          -Alex
>>>>>>
>>>>>>
>>>>>>> Nick, I have put together a branch that includes two fixes
>>>>>>> that might be helpful.  I don't expect these fixes will
>>>>>>> necessarily *fix* what you're seeing, but one of them
>>>>>>> pulls a big hunk of processing out of the picture and
>>>>>>> might help eliminate some potential causes.  I had to
>>>>>>> pull in several other patches as prerequisites in order
>>>>>>> to get those fixes to apply cleanly.
>>>>>>>
>>>>>>> Would you be able to give it a try, and let us know what
>>>>>>> results you get?  The branch contains:
>>>>>>> - Linux 3.5.5
>>>>>>> - Plus the first 49 patches you listed
>>>>>>> - Plus four patches, which are prerequisites...
>>>>>>>      libceph: define ceph_extract_encoded_string()
>>>>>>>      rbd: define some new format constants
>>>>>>>      rbd: define rbd_dev_image_id()
>>>>>>>      rbd: kill create_snap sysfs entry
>>>>>>> - ...for these two bug fixes:
>>>>>>>      libceph: remove 'osdtimeout' option
>>>>>>>      ceph: don't reference req after put
>>>>>>>
>>>>>>> The branch is available in the ceph-client git repository
>>>>>>> under the name "wip-nick" and has commit id dd9323aa.
>>>>>>>      https://github.com/ceph/ceph-client/tree/wip-nick
>>>>>>>
>>>>>>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>>>>>>>
>>>>>>>
>>>>>>> This full debug output is very helpful.  Please supply
>>>>>>> that again as well.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>                                        -Alex
>>>>>>>
>>>>>>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com>
>>>>>>>> wrote:
>>>>>>>>> It's very easy to reproduce now with my automated install script, the
>>>>>>>>> most I've seen it succeed with that patch is 2 in a row, and hanging
>>>>>>>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>>>>>>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>>>>>>>> when I'm a bit more rested and my brain is working better.
>>>>>>>>>
>>>>>>>>> Yes during this the OSDs are probably all syncing up.  All the osd
>>>>>>>>> and
>>>>>>>>> mon daemons have started by the time the rbd commands are run,
>>>>>>>>> though.
>>>>>>>>>
>>>>>>>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>>>>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>>>>>>>> still going strong after 21 builds.
>>>>>>>>>>
>>>>>>>>>> Okay, that one at least makes some sense.  I've opened
>>>>>>>>>>
>>>>>>>>>>          http://tracker.newdream.net/issues/3519
>>>>>>>>>>
>>>>>>>>>> How easy is this to reproduce?  If it is something you can
>>>>>>>>>> trigger with
>>>>>>>>>> debugging enabled ('echo module libceph +p >
>>>>>>>>>> /sys/kernel/debug/dynamic_debug/control') that would help
>>>>>>>>>> tremendously.
>>>>>>>>>>
>>>>>>>>>> I'm guessing that during this startup time the OSDs are still in the
>>>>>>>>>> process of starting?
>>>>>>>>>>
>>>>>>>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a
>>>>>>>>>> loop while
>>>>>>>>>> thrashing OSDs could hit this.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> sage
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos
>>>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>>>> With 8 successful installs already done, I'm reasonably
>>>>>>>>>>>> confident that
>>>>>>>>>>>> it's patch #50.  I'm making another build which applies all
>>>>>>>>>>>> patches
>>>>>>>>>>>> from the 3.5 backport branch, excluding that specific one.
>>>>>>>>>>>> I'll let
>>>>>>>>>>>> you know if that turns up any unexpected failures.
>>>>>>>>>>>>
>>>>>>>>>>>> What will the potential fallout be for removing that specific
>>>>>>>>>>>> patch?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos
>>>>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>>>>> It's really looking like it's the
>>>>>>>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is
>>>>>>>>>>>>> present.
>>>>>>>>>>>>>   So far I have gone through 4 successful installs with no
>>>>>>>>>>>>> hang with
>>>>>>>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure
>>>>>>>>>>>>> it's
>>>>>>>>>>>>> not a fluke, but since previously it hangs within the first
>>>>>>>>>>>>> couple of
>>>>>>>>>>>>> builds, it really looks like this is where the problem
>>>>>>>>>>>>> originated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>>>>>>>>
>>>>>>>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Thanks for hunting this down.  I'm very curious what the
>>>>>>>>>>>>>> culprit is...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sage
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-11 19:44                                                           ` Alex Elder
@ 2012-12-13  0:57                                                             ` Nick Bartos
  2012-12-13 19:00                                                               ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-12-13  0:57 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

Using wip-nick-newer, the problem still presented itself after 4
successful runs (so it may be a fluke, but it got slightly further
than before).  The log is here:
https://gist.github.com/raw/4273114/9085ed00d5bdd5ebab9a94b48f4a562d1fbac431/rbd-hang-1355359129.log

Unfortunately I forgot to enable libceph debugging, I'll do that in a
bit and get you another log later.
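[For reference, the libceph debugging in question is the dynamic-debug toggle Sage describes further down the thread. A minimal sketch of enabling it, assuming a kernel built with CONFIG_DYNAMIC_DEBUG and debugfs mounted at the usual /sys/kernel/debug:]

```shell
# Enable dynamic debug output for the ceph-related kernel modules.
# Assumes root, CONFIG_DYNAMIC_DEBUG, and debugfs at /sys/kernel/debug.
CTRL=/sys/kernel/debug/dynamic_debug/control
for mod in libceph rbd; do
    echo "module $mod +p" > "$CTRL"
done
# The extra pr_debug output then lands in the kernel log (dmesg/syslog).
```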


On Tue, Dec 11, 2012 at 11:44 AM, Alex Elder <elder@inktank.com> wrote:
> On 12/11/2012 12:01 PM, Alex Elder wrote:
>> On 12/11/2012 11:26 AM, Nick Bartos wrote:
>>> Thanks! I'm creating a build with the new patches now.  I'll let you
>>> know how testing goes.
>>
>> FYI, I've been testing with these changes and have *not* been
>> hitting the kinds of problems I'd been previously.  However
>> those problems were different from yours, so I'm offering no
>> promises...  But there's a chance it'll be more helpful than
>> I thought.
>>
>> I am preparing yet another branch for you, this time adding
>> all the rest of the commits you started with, just in case
>> this does improve things.
>
> This new branch is ready.  Feel free to try it out, and again
> let me know how it works for you.
>
> The branch is "wip-nick-newer"
>
>                                         -Alex
>
>
>> Please keep me informed how your testing goes.
>>
>>                                       -Alex
>>
>>
>>> On Mon, Dec 10, 2012 at 1:57 PM, Alex Elder <elder@inktank.com> wrote:
>>>> On 12/02/2012 10:43 PM, Alex Elder wrote:
>>>>> On 12/01/2012 11:34 PM, Nick Bartos wrote:
>>>>>> Unfortunately the hangs happen with the new set of patches.  Here's
>>>>>> some debug info:
>>>>>>
>>>>>> https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt
>>>>>>
>>>>>
>>>>> Well I'm sorry to hear that but I'm glad to have the new info.
>>>>>
>>>>> In retrospect running the new patches *without* the one that
>>>>> seems to cause the hang (#50) was good validation that they
>>>>> didn't lead to any new problems.
>>>>>
>>>>> I'll look at this some more in the morning, and I think I'll
>>>>> confer with Sage whenever he's available for ideas on how to
>>>>> proceed.
>>>>
>>>> Over the course of last week I have been finding and fixing a
>>>> few problems in rbd, the osd client, and the messenger in the
>>>> Linux kernel code.  I've added a handful of new patches to the
>>>> end of the ones I gave you last time.
>>>>
>>>> At this point I don't expect these changes to directly affect
>>>> the hangs you have been seeing, but a couple of these are
>>>> real problems you could (also) hit, and I'd like to avoid
>>>> that.
>>>>
>>>> I haven't done rigorous testing on this but I believe
>>>> the changes are correct (and Sage has looked at them
>>>> and says they look OK to him).
>>>>
>>>> The new version is available in the branch "wip-nick-new"
>>>> in the ceph-client git repository.  If you reproduce your
>>>> hang with this updated code (or do not), please let me know.
>>>>
>>>> Thanks.
>>>>
>>>>                                         -Alex
>>>>
>>>>
>>>>>                     -Alex
>>>>>
>>>>>
>>>>>> On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder <elder@inktank.com> wrote:
>>>>>>> On 11/29/2012 02:37 PM, Alex Elder wrote:
>>>>>>>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>>>>>>>> Here are the ceph log messages (including the libceph kernel debug
>>>>>>>>> stuff you asked for) from a node boot with the rbd command hung for a
>>>>>>>>> couple of minutes:
>>>>>>>
>>>>>>> I'm sorry, but I did something stupid...
>>>>>>>
>>>>>>> Yes, the branch I gave you includes these fixes.  However
>>>>>>> it does *not* include the commit that was giving you trouble
>>>>>>> to begin with.
>>>>>>>
>>>>>>> So...
>>>>>>>
>>>>>>> I have updated that same branch (wip-nick) to contain:
>>>>>>> - Linux 3.5.5
>>>>>>> - Plus the first *50* (not 49) patches you listed
>>>>>>> - Plus the ones I added before.
>>>>>>>
>>>>>>> The new commit id for that branch begins with be3198d6.
>>>>>>>
>>>>>>> I'm really sorry for this mistake.  Please try this new
>>>>>>> branch and report back what you find.
>>>>>>>
>>>>>>>                                          -Alex
>>>>>>>
>>>>>>>
>>>>>>>> Nick, I have put together a branch that includes two fixes
>>>>>>>> that might be helpful.  I don't expect these fixes will
>>>>>>>> necessarily *fix* what you're seeing, but one of them
>>>>>>>> pulls a big hunk of processing out of the picture and
>>>>>>>> might help eliminate some potential causes.  I had to
>>>>>>>> pull in several other patches as prerequisites in order
>>>>>>>> to get those fixes to apply cleanly.
>>>>>>>>
>>>>>>>> Would you be able to give it a try, and let us know what
>>>>>>>> results you get?  The branch contains:
>>>>>>>> - Linux 3.5.5
>>>>>>>> - Plus the first 49 patches you listed
>>>>>>>> - Plus four patches, which are prerequisites...
>>>>>>>>      libceph: define ceph_extract_encoded_string()
>>>>>>>>      rbd: define some new format constants
>>>>>>>>      rbd: define rbd_dev_image_id()
>>>>>>>>      rbd: kill create_snap sysfs entry
>>>>>>>> - ...for these two bug fixes:
>>>>>>>>      libceph: remove 'osdtimeout' option
>>>>>>>>      ceph: don't reference req after put
>>>>>>>>
>>>>>>>> The branch is available in the ceph-client git repository
>>>>>>>> under the name "wip-nick" and has commit id dd9323aa.
>>>>>>>>      https://github.com/ceph/ceph-client/tree/wip-nick
>>>>>>>>
>>>>>>>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>>>>>>>>
>>>>>>>>
>>>>>>>> This full debug output is very helpful.  Please supply
>>>>>>>> that again as well.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>                                        -Alex
>>>>>>>>
>>>>>>>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com>
>>>>>>>>> wrote:
>>>>>>>>>> It's very easy to reproduce now with my automated install script, the
>>>>>>>>>> most I've seen it succeed with that patch is 2 in a row, and hanging
>>>>>>>>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>>>>>>>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>>>>>>>>> when I'm a bit more rested and my brain is working better.
>>>>>>>>>>
>>>>>>>>>> Yes during this the OSDs are probably all syncing up.  All the osd
>>>>>>>>>> and
>>>>>>>>>> mon daemons have started by the time the rbd commands are run,
>>>>>>>>>> though.
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>>>>>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>>>>>>>>> still going strong after 21 builds.
>>>>>>>>>>>
>>>>>>>>>>> Okay, that one at least makes some sense.  I've opened
>>>>>>>>>>>
>>>>>>>>>>>          http://tracker.newdream.net/issues/3519
>>>>>>>>>>>
>>>>>>>>>>> How easy is this to reproduce?  If it is something you can
>>>>>>>>>>> trigger with
>>>>>>>>>>> debugging enabled ('echo module libceph +p >
>>>>>>>>>>> /sys/kernel/debug/dynamic_debug/control') that would help
>>>>>>>>>>> tremendously.
>>>>>>>>>>>
>>>>>>>>>>> I'm guessing that during this startup time the OSDs are still in the
>>>>>>>>>>> process of starting?
>>>>>>>>>>>
>>>>>>>>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a
>>>>>>>>>>> loop while
>>>>>>>>>>> thrashing OSDs could hit this.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>> sage
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos
>>>>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>>>>> With 8 successful installs already done, I'm reasonably
>>>>>>>>>>>>> confident that
>>>>>>>>>>>>> it's patch #50.  I'm making another build which applies all
>>>>>>>>>>>>> patches
>>>>>>>>>>>>> from the 3.5 backport branch, excluding that specific one.
>>>>>>>>>>>>> I'll let
>>>>>>>>>>>>> you know if that turns up any unexpected failures.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What will the potential fallout be for removing that specific
>>>>>>>>>>>>> patch?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos
>>>>>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>>>>>> It's really looking like it's the
>>>>>>>>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>>>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is
>>>>>>>>>>>>>> present.
>>>>>>>>>>>>>>   So far I have gone through 4 successful installs with no
>>>>>>>>>>>>>> hang with
>>>>>>>>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure
>>>>>>>>>>>>>> it's
>>>>>>>>>>>>>> not a fluke, but since previously it hangs within the first
>>>>>>>>>>>>>> couple of
>>>>>>>>>>>>>> builds, it really looks like this is where the problem
>>>>>>>>>>>>>> originated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>>>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>>>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>>>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>>>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>>>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>>>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>>>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>>>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>>>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>>>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>>>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>>>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>>>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>>>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>>>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>>>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>>>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>>>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>>>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>>>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>>>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>>>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>>>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>>>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>>>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>>>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>>>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>>>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>>>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>>>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>>>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>>>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>>>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>>>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>>>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>>>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>>>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>>>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>>>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>>>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>>>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>>>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>>>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>>>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>>>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>>>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>>>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>>>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> Thanks for hunting this down.  I'm very curious what the
>>>>>>>>>>>>>>> culprit is...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>
>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-13  0:57                                                             ` Nick Bartos
@ 2012-12-13 19:00                                                               ` Nick Bartos
  2012-12-13 19:07                                                                 ` Alex Elder
  2012-12-14 16:46                                                                 ` Alex Elder
  0 siblings, 2 replies; 56+ messages in thread
From: Nick Bartos @ 2012-12-13 19:00 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

Here's another log with the kernel debugging enabled:
https://gist.github.com/raw/4278697/1c9e41d275e614783fbbdee8ca5842680f46c249/rbd-hang-1355424455.log

Note that it hung on the 2nd try.


On Wed, Dec 12, 2012 at 4:57 PM, Nick Bartos <nick@pistoncloud.com> wrote:
> Using wip-nick-newer, the problem still presented itself after 4
> successful runs (so it may be a fluke, but it got slightly further
> than before).  The log is here:
> https://gist.github.com/raw/4273114/9085ed00d5bdd5ebab9a94b48f4a562d1fbac431/rbd-hang-1355359129.log
>
> Unfortunately I forgot to enable libceph debugging, I'll do that in a
> bit and get you another log later.
>
>
> On Tue, Dec 11, 2012 at 11:44 AM, Alex Elder <elder@inktank.com> wrote:
>> On 12/11/2012 12:01 PM, Alex Elder wrote:
>>> On 12/11/2012 11:26 AM, Nick Bartos wrote:
>>>> Thanks! I'm creating a build with the new patches now.  I'll let you
>>>> know how testing goes.
>>>
>>> FYI, I've been testing with these changes and have *not* been
>>> hitting the kinds of problems I'd been previously.  However
>>> those problems were different from yours, so I'm offering no
>>> promises...  But there's a chance it'll be more helpful than
>>> I thought.
>>>
>>> I am preparing yet another branch for you, this time adding
>>> all the rest of the commits you started with, just in case
>>> this does improve things.
>>
>> This new branch is ready.  Feel free to try it out, and again
>> let me know how it works for you.
>>
>> The branch is "wip-nick-newer"
>>
>>                                         -Alex
>>
>>
>>> Please keep me informed how your testing goes.
>>>
>>>                                       -Alex
>>>
>>>
>>>> On Mon, Dec 10, 2012 at 1:57 PM, Alex Elder <elder@inktank.com> wrote:
>>>>> On 12/02/2012 10:43 PM, Alex Elder wrote:
>>>>>> On 12/01/2012 11:34 PM, Nick Bartos wrote:
>>>>>>> Unfortunately the hangs happen with the new set of patches.  Here's
>>>>>>> some debug info:
>>>>>>>
>>>>>>> https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt
>>>>>>>
>>>>>>
>>>>>> Well I'm sorry to hear that but I'm glad to have the new info.
>>>>>>
>>>>>> In retrospect running the new patches *without* the one that
>>>>>> seems to cause the hang (#50) was good validation that they
>>>>>> didn't lead to any new problems.
>>>>>>
>>>>>> I'll look at this some more in the morning, and I think I'll
>>>>>> confer with Sage whenever he's available for ideas on how to
>>>>>> proceed.
>>>>>
>>>>> Over the course of last week I have been finding and fixing a
>>>>> few problems in rbd, the osd client, and the messenger in the
>>>>> Linux kernel code.  I've added a handful of new patches to the
>>>>> end of the ones I gave you last time.
>>>>>
>>>>> At this point I don't expect these changes to directly affect
>>>>> the hangs you have been seeing, but a couple of these are
>>>>> real problems you could (also) hit, and I'd like to avoid
>>>>> that.
>>>>>
>>>>> I haven't done rigorous testing on this but I believe
>>>>> the changes are correct (and Sage has looked at them
>>>>> and says they look OK to him).
>>>>>
>>>>> The new version is available in the branch "wip-nick-new"
>>>>> in the ceph-client git repository.  If you reproduce your
>>>>> hang with this updated code (or do not), please let me know.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>                                         -Alex
>>>>>
>>>>>
>>>>>>                     -Alex
>>>>>>
>>>>>>
>>>>>>> On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder <elder@inktank.com> wrote:
>>>>>>>> On 11/29/2012 02:37 PM, Alex Elder wrote:
>>>>>>>>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>>>>>>>>> Here are the ceph log messages (including the libceph kernel debug
>>>>>>>>>> stuff you asked for) from a node boot with the rbd command hung for a
>>>>>>>>>> couple of minutes:
>>>>>>>>
>>>>>>>> I'm sorry, but I did something stupid...
>>>>>>>>
>>>>>>>> Yes, the branch I gave you includes these fixes.  However
>>>>>>>> it does *not* include the commit that was giving you trouble
>>>>>>>> to begin with.
>>>>>>>>
>>>>>>>> So...
>>>>>>>>
>>>>>>>> I have updated that same branch (wip-nick) to contain:
>>>>>>>> - Linux 3.5.5
>>>>>>>> - Plus the first *50* (not 49) patches you listed
>>>>>>>> - Plus the ones I added before.
>>>>>>>>
>>>>>>>> The new commit id for that branch begins with be3198d6.
>>>>>>>>
>>>>>>>> I'm really sorry for this mistake.  Please try this new
>>>>>>>> branch and report back what you find.
>>>>>>>>
>>>>>>>>                                          -Alex
>>>>>>>>
>>>>>>>>
>>>>>>>>> Nick, I have put together a branch that includes two fixes
>>>>>>>>> that might be helpful.  I don't expect these fixes will
>>>>>>>>> necessarily *fix* what you're seeing, but one of them
>>>>>>>>> pulls a big hunk of processing out of the picture and
>>>>>>>>> might help eliminate some potential causes.  I had to
>>>>>>>>> pull in several other patches as prerequisites in order
>>>>>>>>> to get those fixes to apply cleanly.
>>>>>>>>>
>>>>>>>>> Would you be able to give it a try, and let us know what
>>>>>>>>> results you get?  The branch contains:
>>>>>>>>> - Linux 3.5.5
>>>>>>>>> - Plus the first 49 patches you listed
>>>>>>>>> - Plus four patches, which are prerequisites...
>>>>>>>>>      libceph: define ceph_extract_encoded_string()
>>>>>>>>>      rbd: define some new format constants
>>>>>>>>>      rbd: define rbd_dev_image_id()
>>>>>>>>>      rbd: kill create_snap sysfs entry
>>>>>>>>> - ...for these two bug fixes:
>>>>>>>>>      libceph: remove 'osdtimeout' option
>>>>>>>>>      ceph: don't reference req after put
>>>>>>>>>
>>>>>>>>> The branch is available in the ceph-client git repository
>>>>>>>>> under the name "wip-nick" and has commit id dd9323aa.
>>>>>>>>>      https://github.com/ceph/ceph-client/tree/wip-nick
>>>>>>>>>
>>>>>>>>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This full debug output is very helpful.  Please supply
>>>>>>>>> that again as well.
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>>                                        -Alex
>>>>>>>>>
>>>>>>>>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> It's very easy to reproduce now with my automated install script, the
>>>>>>>>>>> most I've seen it succeed with that patch is 2 in a row, and hanging
>>>>>>>>>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>>>>>>>>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>>>>>>>>>> when I'm a bit more rested and my brain is working better.
>>>>>>>>>>>
>>>>>>>>>>> Yes during this the OSDs are probably all syncing up.  All the osd
>>>>>>>>>>> and
>>>>>>>>>>> mon daemons have started by the time the rbd commands are run,
>>>>>>>>>>> though.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>>>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>>>>>>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>>>>>>>>>> still going strong after 21 builds.
>>>>>>>>>>>>
>>>>>>>>>>>> Okay, that one at least makes some sense.  I've opened
>>>>>>>>>>>>
>>>>>>>>>>>>          http://tracker.newdream.net/issues/3519
>>>>>>>>>>>>
>>>>>>>>>>>> How easy is this to reproduce?  If it is something you can
>>>>>>>>>>>> trigger with
>>>>>>>>>>>> debugging enabled ('echo module libceph +p >
>>>>>>>>>>>> /sys/kernel/debug/dynamic_debug/control') that would help
>>>>>>>>>>>> tremendously.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm guessing that during this startup time the OSDs are still in the
>>>>>>>>>>>> process of starting?
>>>>>>>>>>>>
>>>>>>>>>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a
>>>>>>>>>>>> loop while
>>>>>>>>>>>> thrashing OSDs could hit this.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> sage
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos
>>>>>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>>>>>> With 8 successful installs already done, I'm reasonably
>>>>>>>>>>>>>> confident that
>>>>>>>>>>>>>> it's patch #50.  I'm making another build which applies all
>>>>>>>>>>>>>> patches
>>>>>>>>>>>>>> from the 3.5 backport branch, excluding that specific one.
>>>>>>>>>>>>>> I'll let
>>>>>>>>>>>>>> you know if that turns up any unexpected failures.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What will the potential fallout be for removing that specific
>>>>>>>>>>>>>> patch?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos
>>>>>>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>>>>>>> It's really looking like it's the
>>>>>>>>>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>>>>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is
>>>>>>>>>>>>>>> present.
>>>>>>>>>>>>>>>   So far I have gone through 4 successful installs with no
>>>>>>>>>>>>>>> hang with
>>>>>>>>>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure
>>>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>> not a fluke, but since previously it hangs within the first
>>>>>>>>>>>>>>> couple of
>>>>>>>>>>>>>>> builds, it really looks like this is where the problem
>>>>>>>>>>>>>>> originated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>>>>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>>>>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>>>>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>>>>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>>>>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>>>>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>>>>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>>>>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>>>>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>>>>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>>>>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>>>>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>>>>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>>>>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>>>>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>>>>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>>>>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>>>>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>>>>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>>>>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>>>>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>>>>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>>>>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>>>>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>>>>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>>>>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>>>>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>>>>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>>>>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>>>>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>>>>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>>>>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>>>>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>>>>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>>>>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>>>>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>>>>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>>>>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>>>>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>>>>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>>>>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>>>>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>>>>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>>>>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>>>>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>>>>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>>>>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>>>>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> Thanks for hunting this down.  I'm very curious what the
>>>>>>>>>>>>>>>> culprit is...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>
>>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-13 19:00                                                               ` Nick Bartos
@ 2012-12-13 19:07                                                                 ` Alex Elder
  2012-12-14 16:46                                                                 ` Alex Elder
  1 sibling, 0 replies; 56+ messages in thread
From: Alex Elder @ 2012-12-13 19:07 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/13/2012 01:00 PM, Nick Bartos wrote:
> Here's another log with the kernel debugging enabled:
> https://gist.github.com/raw/4278697/1c9e41d275e614783fbbdee8ca5842680f46c249/rbd-hang-1355424455.log
> 
> Note that it hung on the 2nd try.

OK, thanks for the info.  We'll keep looking.	-Alex

> 
> On Wed, Dec 12, 2012 at 4:57 PM, Nick Bartos <nick@pistoncloud.com> wrote:
>> Using wip-nick-newer, the problem still presented itself after 4
>> successful runs (so it may be a fluke, but it got slightly further
>> than before).  The log is here:
>> https://gist.github.com/raw/4273114/9085ed00d5bdd5ebab9a94b48f4a562d1fbac431/rbd-hang-1355359129.log
>>
>> Unfortunately I forgot to enable libceph debugging, I'll do that in a
>> bit and get you another log later.
>>
>>
>> On Tue, Dec 11, 2012 at 11:44 AM, Alex Elder <elder@inktank.com> wrote:
>>> On 12/11/2012 12:01 PM, Alex Elder wrote:
>>>> On 12/11/2012 11:26 AM, Nick Bartos wrote:
>>>>> Thanks! I'm creating a build with the new patches now.  I'll let you
>>>>> know how testing goes.
>>>>
>>>> FYI, I've been testing with these changes and have *not* been
>>>> hitting the kinds of problems I'd been previously.  However
>>>> those problems were different from yours, so I'm offering no
>>>> promises...  But there's a chance it'll be more helpful than
>>>> I thought.
>>>>
>>>> I am preparing yet another branch for you, this time adding
>>>> all the rest of the commits you started with, just in case
>>>> this does improve things.
>>>
>>> This new branch is ready.  Feel free to try it out, and again
>>> let me know how it works for you.
>>>
>>> The branch is "wip-nick-newer"
>>>
>>>                                         -Alex
>>>
>>>
>>>> Please keep me informed how your testing goes.
>>>>
>>>>                                       -Alex
>>>>
>>>>
>>>>> On Mon, Dec 10, 2012 at 1:57 PM, Alex Elder <elder@inktank.com> wrote:
>>>>>> On 12/02/2012 10:43 PM, Alex Elder wrote:
>>>>>>> On 12/01/2012 11:34 PM, Nick Bartos wrote:
>>>>>>>> Unfortunately the hangs happen with the new set of patches.  Here's
>>>>>>>> some debug info:
>>>>>>>>
>>>>>>>> https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt
>>>>>>>>
>>>>>>>
>>>>>>> Well I'm sorry to hear that but I'm glad to have the new info.
>>>>>>>
>>>>>>> In retrospect running the new patches *without* the one that
>>>>>>> seems to cause the hang (#50) was good validation that they
>>>>>>> didn't lead to any new problems.
>>>>>>>
>>>>>>> I'll look at this some more in the morning, and I think I'll
>>>>>>> confer with Sage whenever he's available for ideas on how to
>>>>>>> proceed.
>>>>>>
>>>>>> Over the course of last week I have been finding and fixing a
>>>>>> few problems in rbd, the osd client, and the messenger in the
>>>>>> Linux kernel code.  I've added a handful of new patches to the
>>>>>> end of the ones I gave you last time.
>>>>>>
>>>>>> At this point I don't expect these changes directly affect
>>>>>> the hangs you have been seeing, but a couple of these are
>>>>>> real problems you could (also) hit, and I'd like to avoid
>>>>>> that.
>>>>>>
>>>>>> I haven't done rigorous testing on this but I believe
>>>>>> the changes are correct (and Sage has looked at them
>>>>>> and says they look OK to him).
>>>>>>
>>>>>> The new version is available in the branch "wip-nick-new"
>>>>>> in the ceph-client git repository.  If you reproduce your
>>>>>> hang with this updated code (or do not), please let me know.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>                                         -Alex
>>>>>>
>>>>>>
>>>>>>>                     -Alex
>>>>>>>
>>>>>>>
>>>>>>>> On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder <elder@inktank.com> wrote:
>>>>>>>>> On 11/29/2012 02:37 PM, Alex Elder wrote:
>>>>>>>>>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>>>>>>>>>> Here are the ceph log messages (including the libceph kernel debug
>>>>>>>>>>> stuff you asked for) from a node boot with the rbd command hung for a
>>>>>>>>>>> couple of minutes:
>>>>>>>>>
>>>>>>>>> I'm sorry, but I did something stupid...
>>>>>>>>>
>>>>>>>>> Yes, the branch I gave you includes these fixes.  However
>>>>>>>>> it does *not* include the commit that was giving you trouble
>>>>>>>>> to begin with.
>>>>>>>>>
>>>>>>>>> So...
>>>>>>>>>
>>>>>>>>> I have updated that same branch (wip-nick) to contain:
>>>>>>>>> - Linux 3.5.5
>>>>>>>>> - Plus the first *50* (not 49) patches you listed
>>>>>>>>> - Plus the ones I added before.
>>>>>>>>>
>>>>>>>>> The new commit id for that branch begins with be3198d6.
>>>>>>>>>
>>>>>>>>> I'm really sorry for this mistake.  Please try this new
>>>>>>>>> branch and report back what you find.
>>>>>>>>>
>>>>>>>>>                                          -Alex
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Nick, I have put together a branch that includes two fixes
>>>>>>>>>> that might be helpful.  I don't expect these fixes will
>>>>>>>>>> necessarily *fix* what you're seeing, but one of them
>>>>>>>>>> pulls a big hunk of processing out of the picture and
>>>>>>>>>> might help eliminate some potential causes.  I had to
>>>>>>>>>> pull in several other patches as prerequisites in order
>>>>>>>>>> to get those fixes to apply cleanly.
>>>>>>>>>>
>>>>>>>>>> Would you be able to give it a try, and let us know what
>>>>>>>>>> results you get?  The branch contains:
>>>>>>>>>> - Linux 3.5.5
>>>>>>>>>> - Plus the first 49 patches you listed
>>>>>>>>>> - Plus four patches, which are prerequisites...
>>>>>>>>>>      libceph: define ceph_extract_encoded_string()
>>>>>>>>>>      rbd: define some new format constants
>>>>>>>>>>      rbd: define rbd_dev_image_id()
>>>>>>>>>>      rbd: kill create_snap sysfs entry
>>>>>>>>>> - ...for these two bug fixes:
>>>>>>>>>>      libceph: remove 'osdtimeout' option
>>>>>>>>>>      ceph: don't reference req after put
>>>>>>>>>>
>>>>>>>>>> The branch is available in the ceph-client git repository
>>>>>>>>>> under the name "wip-nick" and has commit id dd9323aa.
>>>>>>>>>>      https://github.com/ceph/ceph-client/tree/wip-nick
>>>>>>>>>>
>>>>>>>>>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This full debug output is very helpful.  Please supply
>>>>>>>>>> that again as well.
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>>                                        -Alex
>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> It's very easy to reproduce now with my automated install script, the
>>>>>>>>>>>> most I've seen it succeed with that patch is 2 in a row, and hanging
>>>>>>>>>>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>>>>>>>>>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>>>>>>>>>>> when I'm a bit more rested and my brain is working better.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes during this the OSDs are probably all syncing up.  All the osd
>>>>>>>>>>>> and
>>>>>>>>>>>> mon daemons have started by the time the rbd commands are run,
>>>>>>>>>>>> though.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>>>>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>>>>>>>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>>>>>>>>>>> still going strong after 21 builds.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Okay, that one at least makes some sense.  I've opened
>>>>>>>>>>>>>
>>>>>>>>>>>>>          http://tracker.newdream.net/issues/3519
>>>>>>>>>>>>>
>>>>>>>>>>>>> How easy is this to reproduce?  If it is something you can
>>>>>>>>>>>>> trigger with
>>>>>>>>>>>>> debugging enabled ('echo module libceph +p >
>>>>>>>>>>>>> /sys/kernel/debug/dynamic_debug/control') that would help
>>>>>>>>>>>>> tremendously.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm guessing that during this startup time the OSDs are still in the
>>>>>>>>>>>>> process of starting?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a
>>>>>>>>>>>>> loop while
>>>>>>>>>>>>> thrashing OSDs could hit this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>> sage
>>>>>>>>>>>>>
>>>>>>>>>>>>>
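The map/unmap-in-a-loop test Sage suggests above might be sketched as follows. The pool/image name "rbd/test", the device path /dev/rbd0, and the iteration count are illustrative assumptions, not from this thread; setting `RBD=echo` lets the loop dry-run on a machine without ceph installed.

```shell
# Sketch of a map/unmap stress loop, to be run while OSDs are thrashed.
# "rbd/test" and /dev/rbd0 are assumed names, not from the thread.
RBD="${RBD:-rbd}"   # set RBD=echo to dry-run without a ceph cluster

map_unmap_loop() {  # map_unmap_loop ITERATIONS
    i=1
    while [ "$i" -le "$1" ]; do
        $RBD map rbd/test || return 1     # a hang here reproduces the bug
        $RBD unmap /dev/rbd0 || return 1
        i=$((i + 1))
    done
}

# usage: map_unmap_loop 100   # while restarting OSDs in another shell
```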
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos
>>>>>>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>>>>>>> With 8 successful installs already done, I'm reasonably
>>>>>>>>>>>>>>> confident that
>>>>>>>>>>>>>>> it's patch #50.  I'm making another build which applies all
>>>>>>>>>>>>>>> patches
>>>>>>>>>>>>>>> from the 3.5 backport branch, excluding that specific one.
>>>>>>>>>>>>>>> I'll let
>>>>>>>>>>>>>>> you know if that turns up any unexpected failures.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What will the potential fall out be for removing that specific
>>>>>>>>>>>>>>> patch?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos
>>>>>>>>>>>>>>> <nick@pistoncloud.com> wrote:
>>>>>>>>>>>>>>>> It's really looking like it's the
>>>>>>>>>>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>>>>>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is
>>>>>>>>>>>>>>>> present.
>>>>>>>>>>>>>>>>   So far I have gone through 4 successful installs with no
>>>>>>>>>>>>>>>> hang with
>>>>>>>>>>>>>>>> only 1-49 applied.  I'm still leaving my test run to make sure
>>>>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>> not a fluke, but since previously it hangs within the first
>>>>>>>>>>>>>>>> couple of
>>>>>>>>>>>>>>>> builds, it really looks like this is where the problem
>>>>>>>>>>>>>>>> originated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>>>>>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>>>>>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>>>>>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>>>>>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>>>>>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>>>>>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>>>>>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>>>>>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>>>>>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>>>>>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>>>>>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>>>>>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>>>>>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>>>>>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>>>>>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>>>>>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>>>>>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>>>>>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>>>>>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>>>>>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>>>>>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>>>>>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>>>>>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>>>>>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>>>>>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>>>>>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>>>>>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>>>>>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>>>>>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>>>>>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>>>>>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>>>>>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>>>>>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>>>>>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>>>>>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>>>>>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>>>>>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>>>>>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>>>>>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>>>>>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>>>>>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>>>>>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>>>>>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>>>>>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>>>>>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>>>>>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>>>>>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>>>>>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>>>>>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> Thanks for hunting this down.  I'm very curious what the
>>>>>>>>>>>>>>>>> culprit is...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-13 19:00                                                               ` Nick Bartos
  2012-12-13 19:07                                                                 ` Alex Elder
@ 2012-12-14 16:46                                                                 ` Alex Elder
  2012-12-14 16:53                                                                   ` Nick Bartos
  1 sibling, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-12-14 16:46 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/13/2012 01:00 PM, Nick Bartos wrote:
> Here's another log with the kernel debugging enabled:
> https://gist.github.com/raw/4278697/1c9e41d275e614783fbbdee8ca5842680f46c249/rbd-hang-1355424455.log
> 
> Note that it hung on the 2nd try.

Just to make sure I'm working with the right code base, can
you confirm that you're using a kernel built with the equivalent
of what's now in the "wip-nick-newer" branch (commit id 1728893)?


Also, looking at this log I don't think I see any rbd debug output.
Does that make sense to you?

How are you activating debugging to get these messages?
If it includes something like:

    echo module libceph +p > /sys/kernel/debug/dynamic_debug/control

it might be that you need to also do:

    echo module rbd +p > /sys/kernel/debug/dynamic_debug/control

This information would be helpful in providing some more context
about what rbd is doing that's leading to the various messaging
activity I see in this log.

Please send me a log with that info if you are able to produce
one.  Thanks a lot.
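For what it's worth, the two dynamic-debug writes above can be wrapped in one small script; this is a sketch (it assumes root and a kernel with CONFIG_DYNAMIC_DEBUG, and the control-file path is the standard debugfs location):

```shell
# Enable dynamic debug output for both kernel modules before reproducing.
# build_ctl only formats the control string, so it can be checked without
# touching debugfs; enable_debug does the actual (root-only) writes.
CTL=/sys/kernel/debug/dynamic_debug/control

build_ctl() { printf 'module %s +p\n' "$1"; }

enable_debug() {
    for m in libceph rbd; do
        build_ctl "$m" > "$CTL"
    done
}

# usage (as root): enable_debug     # then collect output from dmesg
```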

					-Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-14 16:46                                                                 ` Alex Elder
@ 2012-12-14 16:53                                                                   ` Nick Bartos
  2012-12-14 18:03                                                                     ` Alex Elder
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-12-14 16:53 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

The kernel is 3.5.7 with the following patches applied (and in the
order specified below):

001-libceph_eliminate_connection_state_DEAD_13_days_ago.patch
002-libceph_kill_bad_proto_ceph_connection_op_13_days_ago.patch
003-libceph_rename_socket_callbacks_13_days_ago.patch
004-libceph_rename_kvec_reset_and_kvec_add_functions_13_days_ago.patch
005-libceph_embed_ceph_messenger_structure_in_ceph_client_13_days_ago.patch
006-libceph_start_separating_connection_flags_from_state_13_days_ago.patch
007-libceph_start_tracking_connection_socket_state_13_days_ago.patch
008-libceph_provide_osd_number_when_creating_osd_13_days_ago.patch
009-libceph_set_CLOSED_state_bit_in_con_init_13_days_ago.patch
010-libceph_embed_ceph_connection_structure_in_mon_client_13_days_ago.patch
011-libceph_drop_connection_refcounting_for_mon_client_13_days_ago.patch
012-libceph_init_monitor_connection_when_opening_13_days_ago.patch
013-libceph_fully_initialize_connection_in_con_init_13_days_ago.patch
014-libceph_tweak_ceph_alloc_msg_13_days_ago.patch
015-libceph_have_messages_point_to_their_connection_13_days_ago.patch
016-libceph_have_messages_take_a_connection_reference_13_days_ago.patch
017-libceph_make_ceph_con_revoke_a_msg_operation_13_days_ago.patch
018-libceph_make_ceph_con_revoke_message_a_msg_op_13_days_ago.patch
019-libceph_fix_overflow_in___decode_pool_names_13_days_ago.patch
020-libceph_fix_overflow_in_osdmap_decode_13_days_ago.patch
021-libceph_fix_overflow_in_osdmap_apply_incremental_13_days_ago.patch
022-libceph_transition_socket_state_prior_to_actual_connect_13_days_ago.patch
023-libceph_fix_NULL_dereference_in_reset_connection_13_days_ago.patch
024-libceph_use_con_get_put_methods_13_days_ago.patch
025-libceph_drop_ceph_con_get_put_helpers_and_nref_member_13_days_ago.patch
026-libceph_encapsulate_out_message_data_setup_13_days_ago.patch
027-libceph_encapsulate_advancing_msg_page_13_days_ago.patch
028-libceph_don_t_mark_footer_complete_before_it_is_13_days_ago.patch
029-libceph_move_init_bio__functions_up_13_days_ago.patch
030-libceph_move_init_of_bio_iter_13_days_ago.patch
031-libceph_don_t_use_bio_iter_as_a_flag_13_days_ago.patch
032-libceph_SOCK_CLOSED_is_a_flag_not_a_state_13_days_ago.patch
033-libceph_don_t_change_socket_state_on_sock_event_13_days_ago.patch
034-libceph_just_set_SOCK_CLOSED_when_state_changes_13_days_ago.patch
035-libceph_don_t_touch_con_state_in_con_close_socket_13_days_ago.patch
036-libceph_clear_CONNECTING_in_ceph_con_close_13_days_ago.patch
037-libceph_clear_NEGOTIATING_when_done_13_days_ago.patch
038-libceph_define_and_use_an_explicit_CONNECTED_state_13_days_ago.patch
039-libceph_separate_banner_and_connect_writes_13_days_ago.patch
040-libceph_distinguish_two_phases_of_connect_sequence_13_days_ago.patch
041-libceph_small_changes_to_messenger.c_13_days_ago.patch
042-libceph_add_some_fine_ASCII_art_13_days_ago.patch
043-libceph_set_peer_name_on_con_open_not_init_13_days_ago.patch
044-libceph_initialize_mon_client_con_only_once_13_days_ago.patch
045-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED_13_days_ago.patch
046-libceph_initialize_msgpool_message_types_13_days_ago.patch
047-libceph_prevent_the_race_of_incoming_work_during_teardown_13_days_ago.patch
048-libceph_report_socket_read_write_error_message_13_days_ago.patch
049-libceph_fix_mutex_coverage_for_ceph_con_close_13_days_ago.patch
050-libceph_resubmit_linger_ops_when_pg_mapping_changes_12_days_ago.patch
051-libceph_re_initialize_bio_iter_on_start_of_message_receive_28_hours_ago.patch
052-libceph_protect_ceph_con_open_with_mutex_28_hours_ago.patch
053-libceph_reset_connection_retry_on_successfully_negotiation_28_hours_ago.patch
054-libceph_fix_fault_locking_close_socket_on_lossy_fault_28_hours_ago.patch
055-libceph_move_msgr_clear_standby_under_con_mutex_protection_28_hours_ago.patch
056-libceph_move_ceph_con_send_closed_check_under_the_con_mutex_28_hours_ago.patch
057-libceph_drop_gratuitous_socket_close_calls_in_con_work_28_hours_ago.patch
058-libceph_close_socket_directly_from_ceph_con_close_28_hours_ago.patch
059-libceph_drop_unnecessary_CLOSED_check_in_socket_state_change_callback_28_hours_ago.patch
060-libceph_replace_connection_state_bits_with_states_28_hours_ago.patch
061-libceph_clean_up_con_flags_28_hours_ago.patch
062-libceph_clear_all_flags_on_con_close_28_hours_ago.patch
063-libceph_fix_handling_of_immediate_socket_connect_failure_28_hours_ago.patch
064-libceph_revoke_mon_client_messages_on_session_restart_28_hours_ago.patch
065-libceph_verify_state_after_retaking_con_lock_after_dispatch_28_hours_ago.patch
066-libceph_avoid_dropping_con_mutex_before_fault_28_hours_ago.patch
067-libceph_change_ceph_con_in_msg_alloc_convention_to_be_less_weird_28_hours_ago.patch
068-libceph_recheck_con_state_after_allocating_incoming_message_28_hours_ago.patch
069-libceph_fix_crypto_key_null_deref_memory_leak_28_hours_ago.patch
070-libceph_delay_debugfs_initialization_until_we_learn_global_id_28_hours_ago.patch
071-libceph_avoid_truncation_due_to_racing_banners_28_hours_ago.patch
072-libceph_only_kunmap_kmapped_pages_28_hours_ago.patch
073-rbd_reset_BACKOFF_if_unable_to_re-queue_28_hours_ago.patch
074-libceph_avoid_NULL_kref_put_when_osd_reset_races_with_alloc_msg_28_hours_ago.patch
075-ceph_fix_dentry_reference_leak_in_encode_fh_28_hours_ago.patch
076-ceph_Fix_oops_when_handling_mdsmap_that_decreases_max_mds_28_hours_ago.patch
077-libceph_check_for_invalid_mapping_28_hours_ago.patch
078-ceph_avoid_32-bit_page_index_overflow_28_hours_ago.patch
079-libceph_define_ceph_extract_encoded_string_28_hours_ago.patch
080-rbd_define_some_new_format_constants_28_hours_ago.patch
081-rbd_define_rbd_dev_image_id_28_hours_ago.patch
082-rbd_kill_create_snap_sysfs_entry_28_hours_ago.patch
083-libceph_remove_osdtimeout_option_28_hours_ago.patch
084-ceph_don_t_reference_req_after_put_28_hours_ago.patch
085-libceph_avoid_using_freed_osd_in___kick_osd_requests_28_hours_ago.patch
086-libceph_register_request_before_unregister_linger_28_hours_ago.patch
087-libceph_socket_can_close_in_any_connection_state_28_hours_ago.patch
088-libceph_init_osd-_o_node_in_create_osd_28_hours_ago.patch
089-rbd_remove_linger_unconditionally_28_hours_ago.patch
090-HEAD_ceph_wip-nick-newer_libceph_reformat___reset_osd_28_hours_ago.patch
linux-3.4.4-ignoresync-hack.patch

Yes I was only enabling debugging for libceph.  I'm adding debugging
for rbd as well.  I'll do a repro later today when a test cluster
opens up.


On Fri, Dec 14, 2012 at 8:46 AM, Alex Elder <elder@inktank.com> wrote:
> On 12/13/2012 01:00 PM, Nick Bartos wrote:
>> Here's another log with the kernel debugging enabled:
>> https://gist.github.com/raw/4278697/1c9e41d275e614783fbbdee8ca5842680f46c249/rbd-hang-1355424455.log
>>
>> Note that it hung on the 2nd try.
>
> Just to make sure I'm working with the right code base, can
> you confirm that you're using a kernel built with the equivalent
> of what's now in the "wip-nick-newer" branch (commit id 1728893)?
>
>
> Also, looking at this log I don't think I see any rbd debug output.
> Does that make sense to you?
>
> How are you activating debugging to get these messages?
> If it includes something like:
>
>     echo module libceph +p > /sys/kernel/debug/dynamic_debug/control
>
> it might be that you need to also do:
>
>     echo module rbd +p > /sys/kernel/debug/dynamic_debug/control
>
> This information would be helpful in providing some more context
> about what rbd is doing that's leading to the various messaging
> activity I see in this log.
>
> Please send me a log with that info if you are able to produce
> one.  Thanks a lot.
>
>                                         -Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-14 16:53                                                                   ` Nick Bartos
@ 2012-12-14 18:03                                                                     ` Alex Elder
  2012-12-17 17:12                                                                       ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-12-14 18:03 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/14/2012 10:53 AM, Nick Bartos wrote:
> Yes I was only enabling debugging for libceph.  I'm adding debugging
> for rbd as well.  I'll do a repro later today when a test cluster
> opens up.

Excellent, thank you.	-Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-14 18:03                                                                     ` Alex Elder
@ 2012-12-17 17:12                                                                       ` Nick Bartos
  2012-12-18 16:09                                                                         ` Alex Elder
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-12-17 17:12 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

Here's a log with the rbd debugging enabled:

https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log

On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>> for rbd as well.  I'll do a repro later today when a test cluster
>> opens up.
>
> Excellent, thank you.   -Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-17 17:12                                                                       ` Nick Bartos
@ 2012-12-18 16:09                                                                         ` Alex Elder
  2012-12-18 18:05                                                                           ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-12-18 16:09 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/17/2012 11:12 AM, Nick Bartos wrote:
> Here's a log with the rbd debugging enabled:
> 
> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
> 
> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>> for rbd as well.  I'll do a repro later today when a test cluster
>>> opens up.
>>
>> Excellent, thank you.   -Alex

I looked through these debugging messages.  Looking only at the
rbd debugging, what I see seems to indicate that rbd is idle at
the point the "hang" seems to start.  This suggests that the hang
is not due to rbd itself, but rather whatever it is that might
be responsible for using the rbd image once it has been mapped.

Is that possible?  I don't know what process you have that is
mapping the rbd image, and what is supposed to be the next thing
it does.  (I realize this may not make a lot of sense, given
a patch in rbd seems to have caused the hang to begin occurring.)

Also note that the debugging information available (i.e., the
lines in the code that can output debugging information) may
well be incomplete.  So if you don't find anything it may be
necessary to provide you with another update which might include
more debugging.

Anyway, could you provide a little more context about what
is going on sort of *around* rbd when activity seems to stop?

Thanks a lot.

					-Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-18 16:09                                                                         ` Alex Elder
@ 2012-12-18 18:05                                                                           ` Nick Bartos
  2012-12-19 21:25                                                                             ` Alex Elder
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-12-18 18:05 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

I've added the output of "ps -ef" in addition to triggering a trace
when a hang is detected.  Not much is generally running at that point,
but you can have a look:

https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
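The capture described above (a trace plus "ps -ef" once a hang is detected) could be automated with a wrapper along these lines. This is a sketch under assumptions: the timeout threshold, the output path, and the use of timeout(1) and sysrq are illustrative, not what our scripts actually do.

```shell
# Watchdog around a command such as "rbd map": if it runs past the
# timeout, record "ps -ef" and ask the kernel for task backtraces.
watchdog() {   # watchdog TIMEOUT_SECONDS CMD [ARGS...]
    tmo="$1"; shift
    timeout "$tmo" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then                # 124: timeout(1) killed it
        ps -ef > "/tmp/rbd-hang-$(date +%s).txt"
        # dump all task stacks to dmesg (needs root and sysrq enabled)
        echo t 2>/dev/null > /proc/sysrq-trigger || true
    fi
    return "$rc"
}

# usage: watchdog 60 rbd map rbd/test
```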

Is it possible that there is some sort of deadlock going on?  We are
doing the rbd maps (and subsequent filesystem mounts) on the same
systems which are running the ceph-osd and ceph-mon processes.  To get
around the 'sync' deadlock problem, we are using a patch from Sage
which ignores system wide sync's on filesystems mounted with the
'mand' option (and we mount the underlying osd filesystems with
'mand').  However I am wondering if there is potential for other types
of deadlocks in this environment.

Also, we recently saw an rbd hang in a much older version, running
kernel 3.5.3 with only the sync hack patch, alongside ceph 0.48.1.
It's possible that this issue has been around for some time, and the
recent patches just made it happen more often (and thus more
reproducible) for us.


On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@inktank.com> wrote:
> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>> Here's a log with the rbd debugging enabled:
>>
>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>
>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>> opens up.
>>>
>>> Excellent, thank you.   -Alex
>
> I looked through these debugging messages.  Looking only at the
> rbd debugging, what I see seems to indicate that rbd is idle at
> the point the "hang" seems to start.  This suggests that the hang
> is not due to rbd itself, but rather whatever it is that might
> be responsible for using the rbd image once it has been mapped.
>
> Is that possible?  I don't know what process you have that is
> mapping the rbd image, and what is supposed to be the next thing
> it does.  (I realize this may not make a lot of sense, given
>> a patch in rbd seems to have caused the hang to begin occurring.)
>
> Also note that the debugging information available (i.e., the
> lines in the code that can output debugging information) may
> well be incomplete.  So if you don't find anything it may be
> necessary to provide you with another update which might include
> more debugging.
>
> Anyway, could you provide a little more context about what
> is going on sort of *around* rbd when activity seems to stop?
>
> Thanks a lot.
>
>                                         -Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-18 18:05                                                                           ` Nick Bartos
@ 2012-12-19 21:25                                                                             ` Alex Elder
  2012-12-19 22:42                                                                               ` Alex Elder
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-12-19 21:25 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/18/2012 12:05 PM, Nick Bartos wrote:
> I've added the output of "ps -ef" in addition to triggering a trace
> when a hang is detected.  Not much is generally running at that point,
> but you can have a look:
> 
> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt

This helped a lot.  I updated the bug with a little more info.

    http://tracker.newdream.net/issues/3519

I also think I have now found something that could explain what you
are seeing, and am developing a fix.  I'll provide you an update
as soon as I have tested what I come up with, almost certainly
this afternoon.

					-Alex

> Is it possible that there is some sort of deadlock going on?  We are
> doing the rbd maps (and subsequent filesystem mounts) on the same
> systems which are running the ceph-osd and ceph-mon processes.  To get
> around the 'sync' deadlock problem, we are using a patch from Sage
> which ignores system wide sync's on filesystems mounted with the
> 'mand' option (and we mount the underlying osd filesystems with
> 'mand').  However I am wondering if there is potential for other types
> of deadlocks in this environment.
> 
> Also, we recently saw an rbd hang in a much older version, running
> kernel 3.5.3 with only the sync hack patch, alongside ceph 0.48.1.
> It's possible that this issue has been around for some time, and the
> recent patches just made it happen more often (and thus more
> reproducible) for us.
> 
> 
> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@inktank.com> wrote:
>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>> Here's a log with the rbd debugging enabled:
>>>
>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>
>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>> opens up.
>>>>
>>>> Excellent, thank you.   -Alex
>>
>> I looked through these debugging messages.  Looking only at the
>> rbd debugging, what I see seems to indicate that rbd is idle at
>> the point the "hang" seems to start.  This suggests that the hang
>> is not due to rbd itself, but rather whatever it is that might
>> be responsible for using the rbd image once it has been mapped.
>>
>> Is that possible?  I don't know what process you have that is
>> mapping the rbd image, and what is supposed to be the next thing
>> it does.  (I realize this may not make a lot of sense, given
>> a patch in rbd seems to have caused the hang to begin occurring.)
>>
>> Also note that the debugging information available (i.e., the
>> lines in the code that can output debugging information) may
>> well be incomplete.  So if you don't find anything it may be
>> necessary to provide you with another update which might include
>> more debugging.
>>
>> Anyway, could you provide a little more context about what
>> is going on sort of *around* rbd when activity seems to stop?
>>
>> Thanks a lot.
>>
>>                                         -Alex


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-19 21:25                                                                             ` Alex Elder
@ 2012-12-19 22:42                                                                               ` Alex Elder
  2012-12-20 17:48                                                                                 ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-12-19 22:42 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/19/2012 03:25 PM, Alex Elder wrote:
> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>> I've added the output of "ps -ef" in addition to triggering a trace
>> when a hang is detected.  Not much is generally running at that point,
>> but you can have a look:
>>
>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
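A watchdog of the kind described above (dump "ps -ef" and trigger a trace when a hang is detected) can be sketched roughly as follows; the timeout, pool, and volume names are hypothetical, and `echo t > /proc/sysrq-trigger` is one way to dump all task stack traces to the kernel log:

```shell
# Hypothetical hang watchdog: give "rbd map" two minutes; on failure or
# timeout, record the process list and all kernel task stack traces.
STAMP=$(date +%s)
if ! timeout 120 rbd map rbdvol --pool rbd; then
    ps -ef > "/tmp/rbd-hang-$STAMP.txt"
    echo t > /proc/sysrq-trigger        # dump task traces to the kernel log
    dmesg >> "/tmp/rbd-hang-$STAMP.txt"
fi
```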
> 
> This helped a lot.  I updated the bug with a little more info.
> 
>     http://tracker.newdream.net/issues/3519
> 
> I also think I have now found something that could explain what you
> are seeing, and am developing a fix.  I'll provide you an update
> as soon as I have tested what I come up with, almost certainly
> this afternoon.

Nick, I have a new branch for you to try with a new fix in place.
As you might have predicted, it's named "wip-nick-newest".

Please give it a try and let me know whether it resolves the hang
you've been seeing.  If it continues to hang, please provide the
logs as before; they've been very helpful.

Thanks a lot.

					-Alex
> 
> 					-Alex
> 
>> Is it possible that there is some sort of deadlock going on?  We are
>> doing the rbd maps (and subsequent filesystem mounts) on the same
>> systems which are running the ceph-osd and ceph-mon processes.  To get
>> around the 'sync' deadlock problem, we are using a patch from Sage
>> which ignores system wide sync's on filesystems mounted with the
>> 'mand' option (and we mount the underlying osd filesystems with
>> 'mand').  However I am wondering if there is potential for other types
>> of deadlocks in this environment.
>>
>> Also, we recently saw an rbd hang in a much older version, running
>> kernel 3.5.3 with only the sync hack patch, along side ceph 0.48.1.
>> It's possible that this issue was around for some time, just the
>> recent patches made it happen more often (and thus more reproducible)
>> for us.
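The startup sequence being described can be sketched roughly as follows (device, pool, and volume names are hypothetical; the `-o mand` mount option is what the sync-hack patch keys on to exempt the OSD filesystem from a global sync):

```shell
# All on one host: the OSD's backing store and the rbd client.
mount -o mand /dev/sdb1 /var/lib/ceph/osd/ceph-0  # OSD data fs, exempt from global sync
# ... ceph-mon and ceph-osd are started here ...
rbd map rbdvol --pool rbd       # the step that intermittently hangs
mount /dev/rbd0 /mnt/rbdvol     # subsequent filesystem mount on the mapped device
```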
>>
>>
>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@inktank.com> wrote:
>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>> Here's a log with the rbd debugging enabled:
>>>>
>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>
>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>>> opens up.
>>>>>
>>>>> Excellent, thank you.   -Alex
>>>
>>> I looked through these debugging messages.  Looking only at the
>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>> the point the "hang" seems to start.  This suggests that the hang
>>> is not due to rbd itself, but rather whatever it is that might
>>> be responsible for using the rbd image once it has been mapped.
>>>
>>> Is that possible?  I don't know what process you have that is
>>> mapping the rbd image, and what is supposed to be the next thing
>>> it does.  (I realize this may not make a lot of sense, given
>>> a patch in rdb seems to have caused the hang to begin occurring.)
>>>
>>> Also note that the debugging information available (i.e., the
>>> lines in the code that can output debugging information) may
>>> well be incomplete.  So if you don't find anything it may be
>>> necessary to provide you with another update which might include
>>> more debugging.
>>>
>>> Anyway, could you provide a little more context about what
>>> is going on sort of *around* rbd when activity seems to stop?
>>>
>>> Thanks a lot.
>>>
>>>                                         -Alex
> 



* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-19 22:42                                                                               ` Alex Elder
@ 2012-12-20 17:48                                                                                 ` Nick Bartos
  2012-12-20 21:59                                                                                   ` Alex Elder
  0 siblings, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-12-20 17:48 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

Unfortunately, we still have a hang:

https://gist.github.com/4347052/download


On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder <elder@inktank.com> wrote:
> On 12/19/2012 03:25 PM, Alex Elder wrote:
>> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>>> I've added the output of "ps -ef" in addition to triggering a trace
>>> when a hang is detected.  Not much is generally running at that point,
>>> but you can have a look:
>>>
>>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
>>
>> This helped a lot.  I updated the bug with a little more info.
>>
>>     http://tracker.newdream.net/issues/3519
>>
>> I also think I have now found something that could explain what you
>> are seeing, and am developing a fix.  I'll provide you an update
>> as soon as I have tested what I come up with, almost certainly
>> this afternoon.
>
> Nick, I have a new branch for you to try with a new fix in place.
> As you might have predicted, it's named "wip-nick-newest".
>
> Please give it a try to see if it resolved the hang you've
> been seeing and let me know how it goes.  If it continues
> to hang, please provide the logs as you have before, it's
> been very helpful.
>
> Thanks a lot.
>
>                                         -Alex
>>
>>                                       -Alex
>>
>>> Is it possible that there is some sort of deadlock going on?  We are
>>> doing the rbd maps (and subsequent filesystem mounts) on the same
>>> systems which are running the ceph-osd and ceph-mon processes.  To get
>>> around the 'sync' deadlock problem, we are using a patch from Sage
>>> which ignores system wide sync's on filesystems mounted with the
>>> 'mand' option (and we mount the underlying osd filesystems with
>>> 'mand').  However I am wondering if there is potential for other types
>>> of deadlocks in this environment.
>>>
>>> Also, we recently saw an rbd hang in a much older version, running
>>> kernel 3.5.3 with only the sync hack patch, along side ceph 0.48.1.
>>> It's possible that this issue was around for some time, just the
>>> recent patches made it happen more often (and thus more reproducible)
>>> for us.
>>>
>>>
>>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@inktank.com> wrote:
>>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>>> Here's a log with the rbd debugging enabled:
>>>>>
>>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>>
>>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>>>> opens up.
>>>>>>
>>>>>> Excellent, thank you.   -Alex
>>>>
>>>> I looked through these debugging messages.  Looking only at the
>>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>>> the point the "hang" seems to start.  This suggests that the hang
>>>> is not due to rbd itself, but rather whatever it is that might
>>>> be responsible for using the rbd image once it has been mapped.
>>>>
>>>> Is that possible?  I don't know what process you have that is
>>>> mapping the rbd image, and what is supposed to be the next thing
>>>> it does.  (I realize this may not make a lot of sense, given
>>>> a patch in rdb seems to have caused the hang to begin occurring.)
>>>>
>>>> Also note that the debugging information available (i.e., the
>>>> lines in the code that can output debugging information) may
>>>> well be incomplete.  So if you don't find anything it may be
>>>> necessary to provide you with another update which might include
>>>> more debugging.
>>>>
>>>> Anyway, could you provide a little more context about what
>>>> is going on sort of *around* rbd when activity seems to stop?
>>>>
>>>> Thanks a lot.
>>>>
>>>>                                         -Alex
>>
>


* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-20 17:48                                                                                 ` Nick Bartos
@ 2012-12-20 21:59                                                                                   ` Alex Elder
  2012-12-26 17:45                                                                                     ` Nick Bartos
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-12-20 21:59 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/20/2012 11:48 AM, Nick Bartos wrote:
> Unfortunately, we still have a hang:
> 
> https://gist.github.com/4347052/download

The saga continues, and each time we get a little more
information.  Please try branch: "wip-nick-newerest"

Thank you.

					-Alex


> On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder <elder@inktank.com> wrote:
>> On 12/19/2012 03:25 PM, Alex Elder wrote:
>>> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>>>> I've added the output of "ps -ef" in addition to triggering a trace
>>>> when a hang is detected.  Not much is generally running at that point,
>>>> but you can have a look:
>>>>
>>>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
>>>
>>> This helped a lot.  I updated the bug with a little more info.
>>>
>>>     http://tracker.newdream.net/issues/3519
>>>
>>> I also think I have now found something that could explain what you
>>> are seeing, and am developing a fix.  I'll provide you an update
>>> as soon as I have tested what I come up with, almost certainly
>>> this afternoon.
>>
>> Nick, I have a new branch for you to try with a new fix in place.
>> As you might have predicted, it's named "wip-nick-newest".
>>
>> Please give it a try to see if it resolved the hang you've
>> been seeing and let me know how it goes.  If it continues
>> to hang, please provide the logs as you have before, it's
>> been very helpful.
>>
>> Thanks a lot.
>>
>>                                         -Alex
>>>
>>>                                       -Alex
>>>
>>>> Is it possible that there is some sort of deadlock going on?  We are
>>>> doing the rbd maps (and subsequent filesystem mounts) on the same
>>>> systems which are running the ceph-osd and ceph-mon processes.  To get
>>>> around the 'sync' deadlock problem, we are using a patch from Sage
>>>> which ignores system wide sync's on filesystems mounted with the
>>>> 'mand' option (and we mount the underlying osd filesystems with
>>>> 'mand').  However I am wondering if there is potential for other types
>>>> of deadlocks in this environment.
>>>>
>>>> Also, we recently saw an rbd hang in a much older version, running
>>>> kernel 3.5.3 with only the sync hack patch, along side ceph 0.48.1.
>>>> It's possible that this issue was around for some time, just the
>>>> recent patches made it happen more often (and thus more reproducible)
>>>> for us.
>>>>
>>>>
>>>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@inktank.com> wrote:
>>>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>>>> Here's a log with the rbd debugging enabled:
>>>>>>
>>>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>>>
>>>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>>>>> opens up.
>>>>>>>
>>>>>>> Excellent, thank you.   -Alex
>>>>>
>>>>> I looked through these debugging messages.  Looking only at the
>>>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>>>> the point the "hang" seems to start.  This suggests that the hang
>>>>> is not due to rbd itself, but rather whatever it is that might
>>>>> be responsible for using the rbd image once it has been mapped.
>>>>>
>>>>> Is that possible?  I don't know what process you have that is
>>>>> mapping the rbd image, and what is supposed to be the next thing
>>>>> it does.  (I realize this may not make a lot of sense, given
>>>>> a patch in rdb seems to have caused the hang to begin occurring.)
>>>>>
>>>>> Also note that the debugging information available (i.e., the
>>>>> lines in the code that can output debugging information) may
>>>>> well be incomplete.  So if you don't find anything it may be
>>>>> necessary to provide you with another update which might include
>>>>> more debugging.
>>>>>
>>>>> Anyway, could you provide a little more context about what
>>>>> is going on sort of *around* rbd when activity seems to stop?
>>>>>
>>>>> Thanks a lot.
>>>>>
>>>>>                                         -Alex
>>>
>>



* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-20 21:59                                                                                   ` Alex Elder
@ 2012-12-26 17:45                                                                                     ` Nick Bartos
  2012-12-26 17:50                                                                                       ` Alex Elder
  2012-12-26 21:36                                                                                       ` Alex Elder
  0 siblings, 2 replies; 56+ messages in thread
From: Nick Bartos @ 2012-12-26 17:45 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

Here's a log with a hang on the updated branch:

https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log


On Thu, Dec 20, 2012 at 1:59 PM, Alex Elder <elder@inktank.com> wrote:
> On 12/20/2012 11:48 AM, Nick Bartos wrote:
>> Unfortunately, we still have a hang:
>>
>> https://gist.github.com/4347052/download
>
> The saga continues, and each time we get a little more
> information.  Please try branch: "wip-nick-newerest"
>
> Thank you.
>
>                                         -Alex
>
>
>> On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder <elder@inktank.com> wrote:
>>> On 12/19/2012 03:25 PM, Alex Elder wrote:
>>>> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>>>>> I've added the output of "ps -ef" in addition to triggering a trace
>>>>> when a hang is detected.  Not much is generally running at that point,
>>>>> but you can have a look:
>>>>>
>>>>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
>>>>
>>>> This helped a lot.  I updated the bug with a little more info.
>>>>
>>>>     http://tracker.newdream.net/issues/3519
>>>>
>>>> I also think I have now found something that could explain what you
>>>> are seeing, and am developing a fix.  I'll provide you an update
>>>> as soon as I have tested what I come up with, almost certainly
>>>> this afternoon.
>>>
>>> Nick, I have a new branch for you to try with a new fix in place.
>>> As you might have predicted, it's named "wip-nick-newest".
>>>
>>> Please give it a try to see if it resolved the hang you've
>>> been seeing and let me know how it goes.  If it continues
>>> to hang, please provide the logs as you have before, it's
>>> been very helpful.
>>>
>>> Thanks a lot.
>>>
>>>                                         -Alex
>>>>
>>>>                                       -Alex
>>>>
>>>>> Is it possible that there is some sort of deadlock going on?  We are
>>>>> doing the rbd maps (and subsequent filesystem mounts) on the same
>>>>> systems which are running the ceph-osd and ceph-mon processes.  To get
>>>>> around the 'sync' deadlock problem, we are using a patch from Sage
>>>>> which ignores system wide sync's on filesystems mounted with the
>>>>> 'mand' option (and we mount the underlying osd filesystems with
>>>>> 'mand').  However I am wondering if there is potential for other types
>>>>> of deadlocks in this environment.
>>>>>
>>>>> Also, we recently saw an rbd hang in a much older version, running
>>>>> kernel 3.5.3 with only the sync hack patch, along side ceph 0.48.1.
>>>>> It's possible that this issue was around for some time, just the
>>>>> recent patches made it happen more often (and thus more reproducible)
>>>>> for us.
>>>>>
>>>>>
>>>>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>>>>> Here's a log with the rbd debugging enabled:
>>>>>>>
>>>>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>>>>
>>>>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>>>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>>>>>> opens up.
>>>>>>>>
>>>>>>>> Excellent, thank you.   -Alex
>>>>>>
>>>>>> I looked through these debugging messages.  Looking only at the
>>>>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>>>>> the point the "hang" seems to start.  This suggests that the hang
>>>>>> is not due to rbd itself, but rather whatever it is that might
>>>>>> be responsible for using the rbd image once it has been mapped.
>>>>>>
>>>>>> Is that possible?  I don't know what process you have that is
>>>>>> mapping the rbd image, and what is supposed to be the next thing
>>>>>> it does.  (I realize this may not make a lot of sense, given
>>>>>> a patch in rdb seems to have caused the hang to begin occurring.)
>>>>>>
>>>>>> Also note that the debugging information available (i.e., the
>>>>>> lines in the code that can output debugging information) may
>>>>>> well be incomplete.  So if you don't find anything it may be
>>>>>> necessary to provide you with another update which might include
>>>>>> more debugging.
>>>>>>
>>>>>> Anyway, could you provide a little more context about what
>>>>>> is going on sort of *around* rbd when activity seems to stop?
>>>>>>
>>>>>> Thanks a lot.
>>>>>>
>>>>>>                                         -Alex
>>>>
>>>
>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-26 17:45                                                                                     ` Nick Bartos
@ 2012-12-26 17:50                                                                                       ` Alex Elder
  2012-12-26 21:36                                                                                       ` Alex Elder
  1 sibling, 0 replies; 56+ messages in thread
From: Alex Elder @ 2012-12-26 17:50 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/26/2012 11:45 AM, Nick Bartos wrote:
> Here's a log with a hang on the updated branch:
> 
> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log

I'm starting to look this over.  Thanks a lot for supplying it.
Sorry we still haven't nailed the problem.

					-Alex
> 
> On Thu, Dec 20, 2012 at 1:59 PM, Alex Elder <elder@inktank.com> wrote:
>> On 12/20/2012 11:48 AM, Nick Bartos wrote:
>>> Unfortunately, we still have a hang:
>>>
>>> https://gist.github.com/4347052/download
>>
>> The saga continues, and each time we get a little more
>> information.  Please try branch: "wip-nick-newerest"
>>
>> Thank you.
>>
>>                                         -Alex
>>
>>
>>> On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder <elder@inktank.com> wrote:
>>>> On 12/19/2012 03:25 PM, Alex Elder wrote:
>>>>> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>>>>>> I've added the output of "ps -ef" in addition to triggering a trace
>>>>>> when a hang is detected.  Not much is generally running at that point,
>>>>>> but you can have a look:
>>>>>>
>>>>>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
>>>>>
>>>>> This helped a lot.  I updated the bug with a little more info.
>>>>>
>>>>>     http://tracker.newdream.net/issues/3519
>>>>>
>>>>> I also think I have now found something that could explain what you
>>>>> are seeing, and am developing a fix.  I'll provide you an update
>>>>> as soon as I have tested what I come up with, almost certainly
>>>>> this afternoon.
>>>>
>>>> Nick, I have a new branch for you to try with a new fix in place.
>>>> As you might have predicted, it's named "wip-nick-newest".
>>>>
>>>> Please give it a try to see if it resolved the hang you've
>>>> been seeing and let me know how it goes.  If it continues
>>>> to hang, please provide the logs as you have before, it's
>>>> been very helpful.
>>>>
>>>> Thanks a lot.
>>>>
>>>>                                         -Alex
>>>>>
>>>>>                                       -Alex
>>>>>
>>>>>> Is it possible that there is some sort of deadlock going on?  We are
>>>>>> doing the rbd maps (and subsequent filesystem mounts) on the same
>>>>>> systems which are running the ceph-osd and ceph-mon processes.  To get
>>>>>> around the 'sync' deadlock problem, we are using a patch from Sage
>>>>>> which ignores system wide sync's on filesystems mounted with the
>>>>>> 'mand' option (and we mount the underlying osd filesystems with
>>>>>> 'mand').  However I am wondering if there is potential for other types
>>>>>> of deadlocks in this environment.
>>>>>>
>>>>>> Also, we recently saw an rbd hang in a much older version, running
>>>>>> kernel 3.5.3 with only the sync hack patch, along side ceph 0.48.1.
>>>>>> It's possible that this issue was around for some time, just the
>>>>>> recent patches made it happen more often (and thus more reproducible)
>>>>>> for us.
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>>>>>> Here's a log with the rbd debugging enabled:
>>>>>>>>
>>>>>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>>>>>
>>>>>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>>>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>>>>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>>>>>>> opens up.
>>>>>>>>>
>>>>>>>>> Excellent, thank you.   -Alex
>>>>>>>
>>>>>>> I looked through these debugging messages.  Looking only at the
>>>>>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>>>>>> the point the "hang" seems to start.  This suggests that the hang
>>>>>>> is not due to rbd itself, but rather whatever it is that might
>>>>>>> be responsible for using the rbd image once it has been mapped.
>>>>>>>
>>>>>>> Is that possible?  I don't know what process you have that is
>>>>>>> mapping the rbd image, and what is supposed to be the next thing
>>>>>>> it does.  (I realize this may not make a lot of sense, given
>>>>>>> a patch in rdb seems to have caused the hang to begin occurring.)
>>>>>>>
>>>>>>> Also note that the debugging information available (i.e., the
>>>>>>> lines in the code that can output debugging information) may
>>>>>>> well be incomplete.  So if you don't find anything it may be
>>>>>>> necessary to provide you with another update which might include
>>>>>>> more debugging.
>>>>>>>
>>>>>>> Anyway, could you provide a little more context about what
>>>>>>> is going on sort of *around* rbd when activity seems to stop?
>>>>>>>
>>>>>>> Thanks a lot.
>>>>>>>
>>>>>>>                                         -Alex
>>>>>
>>>>
>>



* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-26 17:45                                                                                     ` Nick Bartos
  2012-12-26 17:50                                                                                       ` Alex Elder
@ 2012-12-26 21:36                                                                                       ` Alex Elder
  2012-12-27 17:33                                                                                         ` Nick Bartos
  2012-12-31 18:22                                                                                         ` Alex Elder
  1 sibling, 2 replies; 56+ messages in thread
From: Alex Elder @ 2012-12-26 21:36 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/26/2012 11:45 AM, Nick Bartos wrote:
> Here's a log with a hang on the updated branch:
> 
> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log

OK, new naming scheme.  Please try:  wip-nick-1

I added another simple fix, but then collapsed three commits
into one, and added one more (somewhat unrelated).

I've done simple testing with this and will subject it to
more rigorous testing shortly.  I wanted to make it available
to you quickly though.

					-Alex

> 
> On Thu, Dec 20, 2012 at 1:59 PM, Alex Elder <elder@inktank.com> wrote:
>> On 12/20/2012 11:48 AM, Nick Bartos wrote:
>>> Unfortunately, we still have a hang:
>>>
>>> https://gist.github.com/4347052/download
>>
>> The saga continues, and each time we get a little more
>> information.  Please try branch: "wip-nick-newerest"
>>
>> Thank you.
>>
>>                                         -Alex
>>
>>
>>> On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder <elder@inktank.com> wrote:
>>>> On 12/19/2012 03:25 PM, Alex Elder wrote:
>>>>> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>>>>>> I've added the output of "ps -ef" in addition to triggering a trace
>>>>>> when a hang is detected.  Not much is generally running at that point,
>>>>>> but you can have a look:
>>>>>>
>>>>>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
>>>>>
>>>>> This helped a lot.  I updated the bug with a little more info.
>>>>>
>>>>>     http://tracker.newdream.net/issues/3519
>>>>>
>>>>> I also think I have now found something that could explain what you
>>>>> are seeing, and am developing a fix.  I'll provide you an update
>>>>> as soon as I have tested what I come up with, almost certainly
>>>>> this afternoon.
>>>>
>>>> Nick, I have a new branch for you to try with a new fix in place.
>>>> As you might have predicted, it's named "wip-nick-newest".
>>>>
>>>> Please give it a try to see if it resolved the hang you've
>>>> been seeing and let me know how it goes.  If it continues
>>>> to hang, please provide the logs as you have before, it's
>>>> been very helpful.
>>>>
>>>> Thanks a lot.
>>>>
>>>>                                         -Alex
>>>>>
>>>>>                                       -Alex
>>>>>
>>>>>> Is it possible that there is some sort of deadlock going on?  We are
>>>>>> doing the rbd maps (and subsequent filesystem mounts) on the same
>>>>>> systems which are running the ceph-osd and ceph-mon processes.  To get
>>>>>> around the 'sync' deadlock problem, we are using a patch from Sage
>>>>>> which ignores system wide sync's on filesystems mounted with the
>>>>>> 'mand' option (and we mount the underlying osd filesystems with
>>>>>> 'mand').  However I am wondering if there is potential for other types
>>>>>> of deadlocks in this environment.
>>>>>>
>>>>>> Also, we recently saw an rbd hang in a much older version, running
>>>>>> kernel 3.5.3 with only the sync hack patch, along side ceph 0.48.1.
>>>>>> It's possible that this issue was around for some time, just the
>>>>>> recent patches made it happen more often (and thus more reproducible)
>>>>>> for us.
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>>>>>> Here's a log with the rbd debugging enabled:
>>>>>>>>
>>>>>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>>>>>
>>>>>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>>>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>>>>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>>>>>>> opens up.
>>>>>>>>>
>>>>>>>>> Excellent, thank you.   -Alex
>>>>>>>
>>>>>>> I looked through these debugging messages.  Looking only at the
>>>>>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>>>>>> the point the "hang" seems to start.  This suggests that the hang
>>>>>>> is not due to rbd itself, but rather whatever it is that might
>>>>>>> be responsible for using the rbd image once it has been mapped.
>>>>>>>
>>>>>>> Is that possible?  I don't know what process you have that is
>>>>>>> mapping the rbd image, and what is supposed to be the next thing
>>>>>>> it does.  (I realize this may not make a lot of sense, given
>>>>>>> a patch in rdb seems to have caused the hang to begin occurring.)
>>>>>>>
>>>>>>> Also note that the debugging information available (i.e., the
>>>>>>> lines in the code that can output debugging information) may
>>>>>>> well be incomplete.  So if you don't find anything it may be
>>>>>>> necessary to provide you with another update which might include
>>>>>>> more debugging.
>>>>>>>
>>>>>>> Anyway, could you provide a little more context about what
>>>>>>> is going on sort of *around* rbd when activity seems to stop?
>>>>>>>
>>>>>>> Thanks a lot.
>>>>>>>
>>>>>>>                                         -Alex
>>>>>
>>>>
>>



* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-26 21:36                                                                                       ` Alex Elder
@ 2012-12-27 17:33                                                                                         ` Nick Bartos
  2012-12-27 18:43                                                                                           ` Sage Weil
  2012-12-31 18:22                                                                                         ` Alex Elder
  1 sibling, 1 reply; 56+ messages in thread
From: Nick Bartos @ 2012-12-27 17:33 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

I have some exciting news.  After 215 test runs, no hung processes
were detected.  I think we may actually have it this time.  Thanks for
all your hard work!

-Nick
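A repro harness of the kind described (215 consecutive runs with no hang detected) might look like the following sketch; the run count, timeout, and pool/volume names are illustrative, and the actual test runs involved a full cluster bring-up rather than a bare map/unmap loop:

```shell
# Repeatedly map and unmap, treating any map that exceeds the timeout
# as a hang.
for i in $(seq 1 215); do
    if ! timeout 300 rbd map rbdvol --pool rbd; then
        echo "hang detected on run $i" >&2
        break
    fi
    rbd unmap /dev/rbd0
done
```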

On Wed, Dec 26, 2012 at 1:36 PM, Alex Elder <elder@inktank.com> wrote:
> On 12/26/2012 11:45 AM, Nick Bartos wrote:
>> Here's a log with a hang on the updated branch:
>>
>> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log
>
> OK, new naming scheme.  Please try:  wip-nick-1
>
> I added another simple fix, but then collapsed three commits
> into one, and added one more (somewhat unrelated).
>
> I've done simple testing with this and will subject it to
> more rigorous testing shortly.  I wanted to make it available
> to you quickly though.
>
>                                         -Alex
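For reference, a repro loop in the spirit of Nick's hang test can be sketched as follows. This is only a sketch: the real harness, image name, timeout, and run count are assumptions, and a harmless stand-in command is used in place of the actual "rbd map" so the fragment runs anywhere.

```shell
# Hypothetical hang-detection loop. In the real test MAP_CMD would be
# something like:  MAP_CMD="rbd map rbd/testvol"
MAP_CMD="sleep 1"   # harmless stand-in for the map command
TIMEOUT=5           # seconds before we declare the map hung
RUNS=3              # the real test ran 215 iterations

i=0
while [ "$i" -lt "$RUNS" ]; do
    i=$((i + 1))
    # timeout(1) exits non-zero (124) if the command exceeds TIMEOUT,
    # which is how a hung "rbd map" would be caught.
    if timeout "$TIMEOUT" $MAP_CMD; then
        echo "run $i: ok"
    else
        echo "run $i: map hung or failed" >&2
        exit 1
    fi
done
echo "all $RUNS runs passed"
```

The same pattern (a bounded number of map/unmap cycles, each guarded by a timeout) is what makes "215 runs with no hung processes" a meaningful result.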


* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-27 17:33                                                                                         ` Nick Bartos
@ 2012-12-27 18:43                                                                                           ` Sage Weil
  2012-12-27 19:41                                                                                             ` Alex Elder
  0 siblings, 1 reply; 56+ messages in thread
From: Sage Weil @ 2012-12-27 18:43 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Alex Elder, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On Thu, 27 Dec 2012, Nick Bartos wrote:
> I have some exciting news.  After 215 test runs, no hung processes
> were detected.  I think we may actually have it this time.  Thanks for
> all your hard work!

Sweet!  I think it was the new branch naming scheme that did it.

sage

> 
> -Nick
> 
> On Wed, Dec 26, 2012 at 1:36 PM, Alex Elder <elder@inktank.com> wrote:
> > On 12/26/2012 11:45 AM, Nick Bartos wrote:
> >> Here's a log with a hang on the updated branch:
> >>
> >> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log
> >
> > OK, new naming scheme.  Please try:  wip-nick-1
> >
> > I added another simple fix, but then collapsed three commits
> > into one, and added one more (somewhat unrelated).
> >
> > I've done simple testing with this and will subject it to
> > more rigorous testing shortly.  I wanted to make it available
> > to you quickly though.
> >
> >                                         -Alex
> 
> 


* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-27 18:43                                                                                           ` Sage Weil
@ 2012-12-27 19:41                                                                                             ` Alex Elder
  0 siblings, 0 replies; 56+ messages in thread
From: Alex Elder @ 2012-12-27 19:41 UTC (permalink / raw)
  To: Sage Weil
  Cc: Nick Bartos, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/27/2012 12:43 PM, Sage Weil wrote:
> On Thu, 27 Dec 2012, Nick Bartos wrote:
>> I have some exciting news.  After 215 test runs, no hung processes
>> were detected.  I think we may actually have it this time.  Thanks for
>> all your hard work!

This is great news, Nick, and I really appreciate your help testing so
we could get rid of this ugly thing.

> Sweet!  I think it was the new branch naming scheme that did it.

Obviously.

I'm going to blow away all the wip-nick-new* branches.

					-Alex

> 
> sage
> 
>>
>> -Nick
>>
>> On Wed, Dec 26, 2012 at 1:36 PM, Alex Elder <elder@inktank.com> wrote:
>>> On 12/26/2012 11:45 AM, Nick Bartos wrote:
>>>> Here's a log with a hang on the updated branch:
>>>>
>>>> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log
>>>
>>> OK, new naming scheme.  Please try:  wip-nick-1
>>>
>>> I added another simple fix, but then collapsed three commits
>>> into one, and added one more (somewhat unrelated).
>>>
>>> I've done simple testing with this and will subject it to
>>> more rigorous testing shortly.  I wanted to make it available
>>> to you quickly though.
>>>
>>>                                         -Alex
>>
>>



* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-26 21:36                                                                                       ` Alex Elder
  2012-12-27 17:33                                                                                         ` Nick Bartos
@ 2012-12-31 18:22                                                                                         ` Alex Elder
  2013-01-02 15:56                                                                                           ` Nick Bartos
  1 sibling, 1 reply; 56+ messages in thread
From: Alex Elder @ 2012-12-31 18:22 UTC (permalink / raw)
  To: Nick Bartos
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

On 12/26/2012 03:36 PM, Alex Elder wrote:
> On 12/26/2012 11:45 AM, Nick Bartos wrote:
>> Here's a log with a hang on the updated branch:
>>
>> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log
> 
> OK, new naming scheme.  Please try:  wip-nick-1

Now that we've got this resolved, I've created an updated
"stable" branch with ceph-related bug fixes, based on the
latest 3.5 stable branch, 3.5.7.  It contains a bunch of
other bug fixes that the branch you had been working with
did not have.

I'm starting my own testing with this branch now.  But it
would be great if you'd give it a try as well, since I
know you're a "real" user of this code base.

It's available as branch "linux-3.5.7-ceph" on the
ceph-client git repository.  Thanks a lot.

					-Alex

> 
> I added another simple fix, but then collapsed three commits
> into one, and added one more (somewhat unrelated).
> 
> I've done simple testing with this and will subject it to
> more rigorous testing shortly.  I wanted to make it available
> to you quickly though.
> 
> 					-Alex
> 
>>
>> On Thu, Dec 20, 2012 at 1:59 PM, Alex Elder <elder@inktank.com> wrote:
>>> On 12/20/2012 11:48 AM, Nick Bartos wrote:
>>>> Unfortunately, we still have a hang:
>>>>
>>>> https://gist.github.com/4347052/download
>>>
>>> The saga continues, and each time we get a little more
>>> information.  Please try branch: "wip-nick-newerest"
>>>
>>> Thank you.
>>>
>>>                                         -Alex
>>>
>>>
>>>> On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder <elder@inktank.com> wrote:
>>>>> On 12/19/2012 03:25 PM, Alex Elder wrote:
>>>>>> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>>>>>>> I've added the output of "ps -ef" in addition to triggering a trace
>>>>>>> when a hang is detected.  Not much is generally running at that point,
>>>>>>> but you can have a look:
>>>>>>>
>>>>>>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
>>>>>>
>>>>>> This helped a lot.  I updated the bug with a little more info.
>>>>>>
>>>>>>     http://tracker.newdream.net/issues/3519
>>>>>>
>>>>>> I also think I have now found something that could explain what you
>>>>>> are seeing, and am developing a fix.  I'll provide you an update
>>>>>> as soon as I have tested what I come up with, almost certainly
>>>>>> this afternoon.
>>>>>
>>>>> Nick, I have a new branch for you to try with a new fix in place.
>>>>> As you might have predicted, it's named "wip-nick-newest".
>>>>>
>>>>> Please give it a try to see if it resolved the hang you've
>>>>> been seeing and let me know how it goes.  If it continues
>>>>> to hang, please provide the logs as you have before, it's
>>>>> been very helpful.
>>>>>
>>>>> Thanks a lot.
>>>>>
>>>>>                                         -Alex
>>>>>>
>>>>>>                                       -Alex
>>>>>>
>>>>>>> Is it possible that there is some sort of deadlock going on?  We are
>>>>>>> doing the rbd maps (and subsequent filesystem mounts) on the same
>>>>>>> systems which are running the ceph-osd and ceph-mon processes.  To get
>>>>>>> around the 'sync' deadlock problem, we are using a patch from Sage
>>>>>>> which ignores system-wide syncs on filesystems mounted with the
>>>>>>> 'mand' option (and we mount the underlying osd filesystems with
>>>>>>> 'mand').  However I am wondering if there is potential for other types
>>>>>>> of deadlocks in this environment.
>>>>>>>
>>>>>>> Also, we recently saw an rbd hang in a much older version, running
>>>>>>> kernel 3.5.3 with only the sync hack patch, alongside ceph 0.48.1.
>>>>>>> It's possible that this issue was around for some time, just the
>>>>>>> recent patches made it happen more often (and thus more reproducible)
>>>>>>> for us.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>>>>>>> Here's a log with the rbd debugging enabled:
>>>>>>>>>
>>>>>>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>>>>>>
>>>>>>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>>>>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>>>>>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>>>>>>>> opens up.
>>>>>>>>>>
>>>>>>>>>> Excellent, thank you.   -Alex
>>>>>>>>
>>>>>>>> I looked through these debugging messages.  Looking only at the
>>>>>>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>>>>>>> the point the "hang" seems to start.  This suggests that the hang
>>>>>>>> is not due to rbd itself, but rather whatever it is that might
>>>>>>>> be responsible for using the rbd image once it has been mapped.
>>>>>>>>
>>>>>>>> Is that possible?  I don't know what process you have that is
>>>>>>>> mapping the rbd image, and what is supposed to be the next thing
>>>>>>>> it does.  (I realize this may not make a lot of sense, given
>>>>>>>> a patch in rbd seems to have caused the hang to begin occurring.)
>>>>>>>>
>>>>>>>> Also note that the debugging information available (i.e., the
>>>>>>>> lines in the code that can output debugging information) may
>>>>>>>> well be incomplete.  So if you don't find anything it may be
>>>>>>>> necessary to provide you with another update which might include
>>>>>>>> more debugging.
>>>>>>>>
>>>>>>>> Anyway, could you provide a little more context about what
>>>>>>>> is going on sort of *around* rbd when activity seems to stop?
>>>>>>>>
>>>>>>>> Thanks a lot.
>>>>>>>>
>>>>>>>>                                         -Alex
>>>>>>
>>>>>
>>>
> 



* Re: rbd map command hangs for 15 minutes during system start up
  2012-12-31 18:22                                                                                         ` Alex Elder
@ 2013-01-02 15:56                                                                                           ` Nick Bartos
  0 siblings, 0 replies; 56+ messages in thread
From: Nick Bartos @ 2013-01-02 15:56 UTC (permalink / raw)
  To: Alex Elder
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Mandell Degerness, ceph-devel

So far basic things are working fine, and my hang test is at 78 passes
and still going good.  I'll let you know if any problems crop up with
it.

On Mon, Dec 31, 2012 at 10:22 AM, Alex Elder <elder@inktank.com> wrote:
> On 12/26/2012 03:36 PM, Alex Elder wrote:
>> On 12/26/2012 11:45 AM, Nick Bartos wrote:
>>> Here's a log with a hang on the updated branch:
>>>
>>> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log
>>
>> OK, new naming scheme.  Please try:  wip-nick-1
>
> Now that we've got this resolved, I've created an updated
> "stable" branch with ceph-related bug fixes, based on the
> latest 3.5 stable branch, 3.5.7.  It contains a bunch of
> other bug fixes that the branch you had been working with
> did not have.
>
> I'm starting my own testing with this branch now.  But it
> would be great if you'd give it a try as well, since I
> know you're a "real" user of this code base.
>
> It's available as branch "linux-3.5.7-ceph" on the
> ceph-client git repository.  Thanks a lot.
>
>                                         -Alex
>
>>
>> I added another simple fix, but then collapsed three commits
>> into one, and added one more (somewhat unrelated).
>>
>> I've done simple testing with this and will subject it to
>> more rigorous testing shortly.  I wanted to make it available
>> to you quickly though.
>>
>>                                       -Alex
>>
>>>
>>> On Thu, Dec 20, 2012 at 1:59 PM, Alex Elder <elder@inktank.com> wrote:
>>>> On 12/20/2012 11:48 AM, Nick Bartos wrote:
>>>>> Unfortunately, we still have a hang:
>>>>>
>>>>> https://gist.github.com/4347052/download
>>>>
>>>> The saga continues, and each time we get a little more
>>>> information.  Please try branch: "wip-nick-newerest"
>>>>
>>>> Thank you.
>>>>
>>>>                                         -Alex
>>>>
>>>>
>>>>> On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder <elder@inktank.com> wrote:
>>>>>> On 12/19/2012 03:25 PM, Alex Elder wrote:
>>>>>>> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>>>>>>>> I've added the output of "ps -ef" in addition to triggering a trace
>>>>>>>> when a hang is detected.  Not much is generally running at that point,
>>>>>>>> but you can have a look:
>>>>>>>>
>>>>>>>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
>>>>>>>
>>>>>>> This helped a lot.  I updated the bug with a little more info.
>>>>>>>
>>>>>>>     http://tracker.newdream.net/issues/3519
>>>>>>>
>>>>>>> I also think I have now found something that could explain what you
>>>>>>> are seeing, and am developing a fix.  I'll provide you an update
>>>>>>> as soon as I have tested what I come up with, almost certainly
>>>>>>> this afternoon.
>>>>>>
>>>>>> Nick, I have a new branch for you to try with a new fix in place.
>>>>>> As you might have predicted, it's named "wip-nick-newest".
>>>>>>
>>>>>> Please give it a try to see if it resolved the hang you've
>>>>>> been seeing and let me know how it goes.  If it continues
>>>>>> to hang, please provide the logs as you have before, it's
>>>>>> been very helpful.
>>>>>>
>>>>>> Thanks a lot.
>>>>>>
>>>>>>                                         -Alex
>>>>>>>
>>>>>>>                                       -Alex
>>>>>>>
>>>>>>>> Is it possible that there is some sort of deadlock going on?  We are
>>>>>>>> doing the rbd maps (and subsequent filesystem mounts) on the same
>>>>>>>> systems which are running the ceph-osd and ceph-mon processes.  To get
>>>>>>>> around the 'sync' deadlock problem, we are using a patch from Sage
>>>>>>>> which ignores system-wide syncs on filesystems mounted with the
>>>>>>>> 'mand' option (and we mount the underlying osd filesystems with
>>>>>>>> 'mand').  However I am wondering if there is potential for other types
>>>>>>>> of deadlocks in this environment.
>>>>>>>>
>>>>>>>> Also, we recently saw an rbd hang in a much older version, running
>>>>>>>> kernel 3.5.3 with only the sync hack patch, alongside ceph 0.48.1.
>>>>>>>> It's possible that this issue was around for some time, just the
>>>>>>>> recent patches made it happen more often (and thus more reproducible)
>>>>>>>> for us.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>>>>>>>> Here's a log with the rbd debugging enabled:
>>>>>>>>>>
>>>>>>>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>>>>>>>> Yes I was only enabling debugging for libceph.  I'm adding debugging
>>>>>>>>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>>>>>>>>> opens up.
>>>>>>>>>>>
>>>>>>>>>>> Excellent, thank you.   -Alex
>>>>>>>>>
>>>>>>>>> I looked through these debugging messages.  Looking only at the
>>>>>>>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>>>>>>>> the point the "hang" seems to start.  This suggests that the hang
>>>>>>>>> is not due to rbd itself, but rather whatever it is that might
>>>>>>>>> be responsible for using the rbd image once it has been mapped.
>>>>>>>>>
>>>>>>>>> Is that possible?  I don't know what process you have that is
>>>>>>>>> mapping the rbd image, and what is supposed to be the next thing
>>>>>>>>> it does.  (I realize this may not make a lot of sense, given
>>>>>>>>> a patch in rbd seems to have caused the hang to begin occurring.)
>>>>>>>>>
>>>>>>>>> Also note that the debugging information available (i.e., the
>>>>>>>>> lines in the code that can output debugging information) may
>>>>>>>>> well be incomplete.  So if you don't find anything it may be
>>>>>>>>> necessary to provide you with another update which might include
>>>>>>>>> more debugging.
>>>>>>>>>
>>>>>>>>> Anyway, could you provide a little more context about what
>>>>>>>>> is going on sort of *around* rbd when activity seems to stop?
>>>>>>>>>
>>>>>>>>> Thanks a lot.
>>>>>>>>>
>>>>>>>>>                                         -Alex
>>>>>>>
>>>>>>
>>>>
>>
>


end of thread, other threads:[~2013-01-02 15:56 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-08 22:10 rbd map command hangs for 15 minutes during system start up Mandell Degerness
2012-11-09  1:43 ` Josh Durgin
2012-11-12 22:19   ` Nick Bartos
2012-11-12 23:16     ` Sage Weil
2012-11-16  0:21       ` Nick Bartos
2012-11-16  0:25         ` Sage Weil
2012-11-16 18:36           ` Nick Bartos
2012-11-16 19:16             ` Sage Weil
2012-11-16 22:01               ` Nick Bartos
2012-11-16 22:13                 ` Sage Weil
2012-11-16 22:16                   ` Nick Bartos
2012-11-16 22:21                     ` Sage Weil
2012-11-19 23:04                       ` Nick Bartos
2012-11-19 23:34                         ` Gregory Farnum
2012-11-20 21:53                           ` Nick Bartos
2012-11-21  1:31                             ` Nick Bartos
2012-11-21 16:50                               ` Sage Weil
2012-11-21 17:02                                 ` Nick Bartos
2012-11-21 17:34                                   ` Nick Bartos
2012-11-21 21:41                                     ` Nick Bartos
2012-11-22  4:47                                       ` Sage Weil
2012-11-22  5:49                                         ` Nick Bartos
2012-11-22 18:04                                           ` Nick Bartos
2012-11-29 20:37                                             ` Alex Elder
2012-11-30 18:49                                               ` Nick Bartos
2012-11-30 19:10                                                 ` Alex Elder
2012-11-30 19:31                                                   ` Sage Weil
2012-11-30 23:22                                               ` Alex Elder
2012-12-02  5:34                                                 ` Nick Bartos
2012-12-03  4:43                                                   ` Alex Elder
2012-12-10 21:57                                                     ` Alex Elder
2012-12-11 17:26                                                       ` Nick Bartos
2012-12-11 18:01                                                         ` Alex Elder
2012-12-11 19:44                                                           ` Alex Elder
2012-12-13  0:57                                                             ` Nick Bartos
2012-12-13 19:00                                                               ` Nick Bartos
2012-12-13 19:07                                                                 ` Alex Elder
2012-12-14 16:46                                                                 ` Alex Elder
2012-12-14 16:53                                                                   ` Nick Bartos
2012-12-14 18:03                                                                     ` Alex Elder
2012-12-17 17:12                                                                       ` Nick Bartos
2012-12-18 16:09                                                                         ` Alex Elder
2012-12-18 18:05                                                                           ` Nick Bartos
2012-12-19 21:25                                                                             ` Alex Elder
2012-12-19 22:42                                                                               ` Alex Elder
2012-12-20 17:48                                                                                 ` Nick Bartos
2012-12-20 21:59                                                                                   ` Alex Elder
2012-12-26 17:45                                                                                     ` Nick Bartos
2012-12-26 17:50                                                                                       ` Alex Elder
2012-12-26 21:36                                                                                       ` Alex Elder
2012-12-27 17:33                                                                                         ` Nick Bartos
2012-12-27 18:43                                                                                           ` Sage Weil
2012-12-27 19:41                                                                                             ` Alex Elder
2012-12-31 18:22                                                                                         ` Alex Elder
2013-01-02 15:56                                                                                           ` Nick Bartos
2012-11-16 22:23                     ` Gregory Farnum
