From: Haomai Wang <haomai@xsky.com>
To: Avner Ben Hanoch <avnerb@mellanox.com>
Cc: Marov Aleksey <Marov.A@raidix.com>, Sage Weil <sweil@redhat.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: ceph issue
Date: Fri, 2 Dec 2016 11:12:05 +0800
Message-ID: <CACJqLyaw8Oag6XN-HSyrRPonMhMxqpbXuQei6cFE0LdBaNxUqw@mail.gmail.com>
In-Reply-To: <DB3PR05MB079316AC35562CC422E44DC0A9B70@DB3PR05MB0793.eurprd05.prod.outlook.com>

On Wed, Nov 23, 2016 at 5:30 PM, Avner Ben Hanoch <avnerb@mellanox.com> wrote:
>
> I guess that like the rest of ceph, the new rdma code must also support multiple applications in parallel.
>
> I am also reproducing your error => 2 instances of fio can't run in parallel with ceph rdma.
>
> * ceph -s shows HEALTH_WARN (with "9 requests are blocked > 32 sec")
>
> * all OSDs print messages like "heartbeat_check: no reply from ..."
>
> * the log files contain errors:
>   $ grep error ceph-osd.0.log
>   2016-11-23 09:20:46.988154 7f9b26260700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
>   2016-11-23 09:20:54.090388 7f9b43951700  1 -- 36.0.0.2:6802/10634 >> 36.0.0.4:0/19587 conn(0x7f9b256a8000 :6802 s=STATE_OPEN pgs=1 cs=1 l=1).read_bulk reading from fd=139 : Unknown error -104
>   2016-11-23 09:20:58.411912 7f9b44953700  1 RDMAStack polling work request returned error for buffer(0x7f9b1fee21b0) status(12:RETRY_EXC_ERR
>   2016-11-23 09:20:58.411934 7f9b44953700  1 RDMAStack polling work request returned error for buffer(0x7f9b553d20d0) status(12:RETRY_EXC_ERR

The error is "IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter
Exceeded: The local transport timeout retry counter was exceeded while
trying to send this message. This means that the remote side didn't
send any Ack or Nack. If this happens when sending the first message,
usually this mean that the connection attributes are wrong or the
remote side isn't in a state that it can respond to messages. If this
happens after sending the first message, usually it means that the
remote QP isn't available anymore. Relevant for RC QPs."

We set the QP retry_cnt to 7 and timeout to 14:

  // How long to wait before retrying if packet lost or server dead.
  // Supposedly the timeout is 4.096us*2^timeout.  However, the actual
  // timeout appears to be 4.096us*2^(timeout+1), so the setting
  // below creates a 135ms timeout.
  qpa.timeout = 14;

  // How many times to retry after timeouts before giving up.
  qpa.retry_cnt = 7;

Does this mean the receiver side lacks memory, or is not polling work requests quickly enough?

>
>
>
> Command lines that I used:
>   ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1
>   ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g2 --name=1
>
> > -----Original Message-----
> > From: Marov Aleksey
> > Sent: Tuesday, November 22, 2016 17:59
> >
> > I didn't try this blocksize. But in my case fio crashed if I used more than one
> > job. With one job everything works fine. Is it worth investigating more deeply?

