* I/O Errors due to keepalive timeouts with NVMf RDMA
@ 2017-07-07  9:48 Johannes Thumshirn
  2017-07-08 18:14 ` Max Gurtovoy
  2017-07-10  7:06 ` Sagi Grimberg
  0 siblings, 2 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2017-07-07  9:48 UTC (permalink / raw)


Hi,

In my recent tests I'm facing I/O errors with nvme_rdma because of the
keepalive timer expiring.

This is easily reproducible on hfi1, but also on mlx4 with the following fio
job:

[global]
direct=1
rw=randrw
ioengine=libaio 
size=16g 
norandommap 
time_based
runtime=10m 
group_reporting 
bs=4k 
iodepth=128
numjobs=88

[NVMf-test]
filename=/dev/nvme0n1 
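
A rough sketch of the invocation (the job file name, subsystem NQN and
target address below are placeholders, not the exact values from my
setup):

nvme connect -t rdma -a <target-ip> -s 4420 -n <subsysnqn>
fio nvmf-test.fio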


This happens with libaio as well as psync as I/O engine (haven't checked
others yet).

here's the dmesg excerpt:
nvme nvme0: failed nvme_keep_alive_end_io error=-5
nvme nvme0: Reconnecting in 10 seconds...
blk_update_request: 31 callbacks suppressed
blk_update_request: I/O error, dev nvme0n1, sector 73391680
blk_update_request: I/O error, dev nvme0n1, sector 52827640
blk_update_request: I/O error, dev nvme0n1, sector 125050288
blk_update_request: I/O error, dev nvme0n1, sector 32099608
blk_update_request: I/O error, dev nvme0n1, sector 65805440
blk_update_request: I/O error, dev nvme0n1, sector 120114368
blk_update_request: I/O error, dev nvme0n1, sector 48812368
nvme0n1: detected capacity change from 68719476736 to -67549595420313600
blk_update_request: I/O error, dev nvme0n1, sector 0
buffer_io_error: 23 callbacks suppressed
Buffer I/O error on dev nvme0n1, logical block 0, async page read
blk_update_request: I/O error, dev nvme0n1, sector 0
Buffer I/O error on dev nvme0n1, logical block 0, async page read
blk_update_request: I/O error, dev nvme0n1, sector 0
Buffer I/O error on dev nvme0n1, logical block 0, async page read
ldm_validate_partition_table(): Disk read failed.
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 3, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
nvme0n1: unable to read partition table

I'm seeing this on stock v4.12 as well as on our backports.

My current hypothesis is that I saturate the RDMA link so the keepalives have
no chance to get to the target. Is there a way to prioritize the admin queue
somehow?

Thanks,
	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-07  9:48 I/O Errors due to keepalive timeouts with NVMf RDMA Johannes Thumshirn
@ 2017-07-08 18:14 ` Max Gurtovoy
  2017-07-10  7:59   ` Johannes Thumshirn
  2017-07-10  7:06 ` Sagi Grimberg
  1 sibling, 1 reply; 23+ messages in thread
From: Max Gurtovoy @ 2017-07-08 18:14 UTC (permalink / raw)




On 7/7/2017 12:48 PM, Johannes Thumshirn wrote:
> Hi,

Hi Johannes,

>
> In my recent tests I'm facing I/O errors with nvme_rdma because of the
> keepalive timer expiring.
>
> This is easily reproducible on hfi1, but also on mlx4 with the following fio
> job:

I need more info to repro.
What is the backing store at the target?
Are you using RoCE or IB link layer?
ConnectX-3 vs. ConnectX-3 B2B?
What is the FW on both target and host?
What is the KATO?
Can you increase it as a WA?
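
A quick way to gather most of the link and FW details on both host and
target, assuming the usual infiniband-diags/rdma-core tools are
installed:

ibstat           # port state, rate, link layer and FW version per CA
ibv_devinfo      # hca_id, fw_ver and link_layer per port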

>
> [global]
> direct=1
> rw=randrw
> ioengine=libaio
> size=16g
> norandommap
> time_based
> runtime=10m
> group_reporting
> bs=4k
> iodepth=128
> numjobs=88
>
> [NVMf-test]
> filename=/dev/nvme0n1
>
>
> This happens with libaio as well as psync as I/O engine (haven't checked
> others yet).
>
> here's the dmesg excerpt:
> nvme nvme0: failed nvme_keep_alive_end_io error=-5
> nvme nvme0: Reconnecting in 10 seconds...
> blk_update_request: 31 callbacks suppressed
> blk_update_request: I/O error, dev nvme0n1, sector 73391680
> blk_update_request: I/O error, dev nvme0n1, sector 52827640
> blk_update_request: I/O error, dev nvme0n1, sector 125050288
> blk_update_request: I/O error, dev nvme0n1, sector 32099608
> blk_update_request: I/O error, dev nvme0n1, sector 65805440
> blk_update_request: I/O error, dev nvme0n1, sector 120114368
> blk_update_request: I/O error, dev nvme0n1, sector 48812368
> nvme0n1: detected capacity change from 68719476736 to -67549595420313600
> blk_update_request: I/O error, dev nvme0n1, sector 0
> buffer_io_error: 23 callbacks suppressed
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> blk_update_request: I/O error, dev nvme0n1, sector 0
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> blk_update_request: I/O error, dev nvme0n1, sector 0
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> ldm_validate_partition_table(): Disk read failed.
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 3, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> nvme0n1: unable to read partition table
>
> I'm seeing this on stock v4.12 as well as on our backports.
>
> My current hypothesis is that I saturate the RDMA link so the keepalives have
> no chance to get to the target. Is there a way to prioritize the admin queue
> somehow?
>
> Thanks,
> 	Johannes
>


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-07  9:48 I/O Errors due to keepalive timeouts with NVMf RDMA Johannes Thumshirn
  2017-07-08 18:14 ` Max Gurtovoy
@ 2017-07-10  7:06 ` Sagi Grimberg
  2017-07-10  7:17   ` Hannes Reinecke
  1 sibling, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2017-07-10  7:06 UTC (permalink / raw)


Hey Johannes,

> I'm seeing this on stock v4.12 as well as on our backports.
> 
> My current hypothesis is that I saturate the RDMA link so the keepalives have
> no chance to get to the target.

Your observation seems correct to me, because we have no
way to guarantee that a keep-alive capsule will be prioritized higher
than normal I/O in the fabric layer (as you said, the link might be
saturated).

> Is there a way to prioritize the admin queue somehow?

Not really (at least for rdma). We made kato configurable,
perhaps we should give a higher default to not see it even
in extreme workloads?
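
For the record, KATO can already be raised per connection at connect
time; a sketch assuming a reasonably recent nvme-cli (the underlying
fabrics option is keep_alive_tmo, in seconds):

nvme connect -t rdma -a <target-ip> -s 4420 -n <subsysnqn> --keep-alive-tmo=120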

A couple of questions:
- Are you using RoCE (v2 or v1) or InfiniBand?
- Does it happen with mlx5 as well?
- Are host/target connected via a switch/router? If so, is flow-control
   on, and what are the host/target port speeds?
- Can you try turning on debug logging to see what is delayed (the
   keep-alive from host to target or the keep-alive response)?
- What KATO is required to not stumble on this?


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10  7:06 ` Sagi Grimberg
@ 2017-07-10  7:17   ` Hannes Reinecke
  2017-07-10  8:46     ` Max Gurtovoy
  2017-07-10  8:59     ` Jack Wang
  0 siblings, 2 replies; 23+ messages in thread
From: Hannes Reinecke @ 2017-07-10  7:17 UTC (permalink / raw)


On 07/10/2017 09:06 AM, Sagi Grimberg wrote:
> Hey Johannes,
> 
>> I'm seeing this on stock v4.12 as well as on our backports.
>>
>> My current hypothesis is that I saturate the RDMA link so the
>> keepalives have
>> no chance to get to the target.
> 
> Your observation seems correct to me, because we have no
> way to guarantee that a keep-alive capsule will be prioritized higher
> than normal I/O in the fabric layer (as you said, the link might be
> saturated).
> 
>> Is there a way to prioritize the admin queue somehow?
> 
> Not really (at least for rdma). We made kato configurable,
> perhaps we should give a higher default to not see it even
> in extreme workloads?
> 
> A couple of questions:
> - Are you using RoCE (v2 or v1) or InfiniBand?
> - Does it happen with mlx5 as well?
> - Are host/target connected via a switch/router? If so, is flow-control
>   on, and what are the host/target port speeds?
> - Can you try turning on debug logging to see what is delayed (the
>   keep-alive from host to target or the keep-alive response)?
> - What KATO is required to not stumble on this?
> 
Well, this sounds identical to the path_checker problem we're having
in multipathing (and hch complained about several times).
There's a rather easy solution to it: don't send keepalives if I/O is
running, but rather tack it on the most current I/O packet.
In the end, you only want to know if the link is alive; you don't have
to transfer any data as such.
So if you just add a flag (maybe on the RDMA layer) to the next command
to be sent you could easily simulate keepalive without having to send
additional commands.

(Will probably break all sorts of layering, but if you push it down far
enough maybe no-one will notice.)
(And if hch complains ... well .. he invented the thing, didn't he?)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-08 18:14 ` Max Gurtovoy
@ 2017-07-10  7:59   ` Johannes Thumshirn
  0 siblings, 0 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2017-07-10  7:59 UTC (permalink / raw)


On Sat, Jul 08, 2017 at 09:14:26PM +0300, Max Gurtovoy wrote:
> I need more info to repro.
> What is the backing store at the target?

Either null_blk or zram, doesn't really make a difference (the target-side setup is sketched below).

> Are you using RoCE or IB link layer?

ibstat says IB

> ConnectX-3 vs. ConnectX-3 B2B?
Mellanox Technologies MT26418

> What is the FW on both target and host?
> What is the KATO?
default

> Can you increase it as a WA?
Haven't tried yet.
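
For completeness, the target side is just the standard nvmet configfs
setup on top of null_blk, roughly like this (a sketch; subsystem name,
namespace device and address are examples, not the exact values used
here):

modprobe nvmet-rdma
modprobe null_blk
mkdir /sys/kernel/config/nvmet/subsystems/nvmf-test
echo 1 > /sys/kernel/config/nvmet/subsystems/nvmf-test/attr_allow_any_host
mkdir /sys/kernel/config/nvmet/subsystems/nvmf-test/namespaces/1
echo -n /dev/nullb0 > /sys/kernel/config/nvmet/subsystems/nvmf-test/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/nvmf-test/namespaces/1/enable
mkdir /sys/kernel/config/nvmet/ports/1
echo -n rdma > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo -n ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
echo -n 1.1.1.2 > /sys/kernel/config/nvmet/ports/1/addr_traddr
echo -n 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/nvmf-test \
      /sys/kernel/config/nvmet/ports/1/subsystems/nvmf-test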

Thanks,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10  7:17   ` Hannes Reinecke
@ 2017-07-10  8:46     ` Max Gurtovoy
  2017-07-10  9:10       ` Johannes Thumshirn
  2017-07-10  8:59     ` Jack Wang
  1 sibling, 1 reply; 23+ messages in thread
From: Max Gurtovoy @ 2017-07-10  8:46 UTC (permalink / raw)




On 7/10/2017 10:17 AM, Hannes Reinecke wrote:
> On 07/10/2017 09:06 AM, Sagi Grimberg wrote:
>> Hey Johannes,
>>
>>> I'm seeing this on stock v4.12 as well as on our backports.
>>>
>>> My current hypothesis is that I saturate the RDMA link so the
>>> keepalives have
>>> no chance to get to the target.
>>
>> Your observation seems correct to me, because we have no
>> way to guarantee that a keep-alive capsule will be prioritized higher
>> than normal I/O in the fabric layer (as you said, the link might be
>> saturated).
>>
>>> Is there a way to prioritize the admin queue somehow?
>>
>> Not really (at least for rdma). We made kato configurable,
>> perhaps we should give a higher default to not see it even
>> in extreme workloads?
>>
>> A couple of questions:
>> - Are you using RoCE (v2 or v1) or InfiniBand?
>> - Does it happen with mlx5 as well?
>> - Are host/target connected via a switch/router? If so, is flow-control
>>   on, and what are the host/target port speeds?
>> - Can you try turning on debug logging to see what is delayed (the
>>   keep-alive from host to target or the keep-alive response)?
>> - What KATO is required to not stumble on this?
>>

Sagi,
see some answers from Johannes to my questions earlier.


> Well, this sounds identical to the path_checker problem we're having
> in multipathing (and hch complained about several times).
> There's a rather easy solution to it: don't send keepalives if I/O is
> running, but rather tack it on the most current I/O packet.
> In the end, you only want to know if the link is alive; you don't have
> to transfer any data as such.
> So if you just add a flag (maybe on the RDMA layer) to the next command
> to be sent you could easily simulate keepalive without having to send
> additional commands.

Hannes,
This is a good solution and actually the way we work in iSCSI/iSER with 
nopin/nopout.
Don't you think it should be a ctrl attribute?

>
> (Will probably break all sorts of layering, but if you push it down far
> enough maybe no-one will notice.)
> (And if hch complains ... well .. he invented the thing, didn't he?)
>
> Cheers,
>
> Hannes
>


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10  7:17   ` Hannes Reinecke
  2017-07-10  8:46     ` Max Gurtovoy
@ 2017-07-10  8:59     ` Jack Wang
  1 sibling, 0 replies; 23+ messages in thread
From: Jack Wang @ 2017-07-10  8:59 UTC (permalink / raw)


2017-07-10 9:17 GMT+02:00 Hannes Reinecke <hare at suse.de>:
[snip]
> Well, this sounds identical to the path_checker problem we're having
> in multipathing (and hch complained about several times).
> There's a rather easy solution to it: don't send keepalives if I/O is
> running, but rather tack it on the most current I/O packet.
> In the end, you only want to know if the link is alive; you don't have
> to transfer any data as such.

We did exactly the same in IBTRS for our heartbeat. :)

Cheers,
Jack
> --
> Dr. Hannes Reinecke                Teamlead Storage & Networking
> hare at suse.de                                   +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10  8:46     ` Max Gurtovoy
@ 2017-07-10  9:10       ` Johannes Thumshirn
  2017-07-10 10:13         ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2017-07-10  9:10 UTC (permalink / raw)


On Mon, Jul 10, 2017 at 11:46:47AM +0300, Max Gurtovoy wrote:
> >>- What KATO is required to not stumble on this?

Tried up to 120 now, still broken.

> >Well, this sounds identical to the path_checker problem we're having
> >in multipathing (and hch complained about several times).
> >There's a rather easy solution to it: don't send keepalives if I/O is
> >running, but rather tack it on the most current I/O packet.
> >In the end, you only want to know if the link is alive; you don't have
> >to transfer any data as such.
> >So if you just add a flag (maybe on the RDMA layer) to the next command
> >to be sent you could easily simulate keepalive without having to send
> >additional commands.
> 
> Hannes,
> This is a good solution and actually the way we work in iSCSI/iSER with
> nopin/nopout.
> Don't you think it should be a ctrl attribute?

Let me see if I can come up with something.

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10  9:10       ` Johannes Thumshirn
@ 2017-07-10 10:13         ` Sagi Grimberg
  2017-07-10 10:20           ` Johannes Thumshirn
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2017-07-10 10:13 UTC (permalink / raw)



> Tried up to 120 now, still broken.

Then something else is broken. 120 seconds is literally forever
in the IB world. Can you please turn on pr_debug in
nvmet_execute_keep_alive() and check that the target sees it
in a timely manner?
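
One way to enable that pr_debug at runtime on the target, assuming
CONFIG_DYNAMIC_DEBUG is set and debugfs is mounted in the usual place:

echo 'func nvmet_execute_keep_alive +p' > /sys/kernel/debug/dynamic_debug/control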


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10 10:13         ` Sagi Grimberg
@ 2017-07-10 10:20           ` Johannes Thumshirn
  2017-07-10 11:04             ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2017-07-10 10:20 UTC (permalink / raw)


On Mon, Jul 10, 2017 at 01:13:35PM +0300, Sagi Grimberg wrote:
> 
> >Tried up to 120 now, still broken.
> 
> Then something else is broken. 120 seconds is literally forever
> in IB world. can you please turn on pr_debug in
> nvmet_execute_keep_alive() and check that the target sees it
> in a timely manner?

OK, running a test now. I have a local test patch that cancels and
re-schedules the kato work on every mq_ops->complete() for testing as well,
which I'd also like to check as a proof of my hypothesis, and then I'll report
back.

Thanks,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10 10:20           ` Johannes Thumshirn
@ 2017-07-10 11:04             ` Sagi Grimberg
  2017-07-10 11:33               ` Johannes Thumshirn
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2017-07-10 11:04 UTC (permalink / raw)



> OK, running a test now. I have a local test patch that cancels and
> re-schedules the kato work on every mq_ops->complete() for testing as well,
> which I'd also like to check as a proof of my hypothesis, and then I'll report
> back.

That won't work, as the target relies on getting a keep-alive every
kato + grace constant; otherwise it will tear down the controller.


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10 11:04             ` Sagi Grimberg
@ 2017-07-10 11:33               ` Johannes Thumshirn
  2017-07-10 11:41                 ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2017-07-10 11:33 UTC (permalink / raw)


On Mon, Jul 10, 2017 at 02:04:37PM +0300, Sagi Grimberg wrote:
> 
> >OK, running a test now. I have a local test patch that cancels and
> >re-schedules the kato work on every mq_ops->complete() for testing as well,
> >which I'd also like to check as a proof of my hypothesis, and then I'll report
> >back.
> 
> That won't work, as the target relies on getting a keep-alive every
> kato + grace constant; otherwise it will tear down the controller.

Damn, OK. I'll rethink my approach.

Anyway, here are my results:
Target:
[254069.431101] nvmet: adding queue 1 to ctrl 1.
[254069.446254] nvmet: adding queue 2 to ctrl 1.
[...]
[254070.017617] nvmet: adding queue 44 to ctrl 1.
[254190.693126] nvmet: ctrl 1 update keep-alive timer for 130 secs
[254311.910372] nvmet: ctrl 1 update keep-alive timer for 130 secs
[254444.269014] nvmet: ctrl 1 keep-alive timer (130 seconds) expired!
[254444.283809] nvmet: ctrl 1 fatal error occurred!
[254444.298315] nvmet_rdma: freeing queue 0
[254444.308572] nvmet_rdma: freeing queue 1
[...]
[254444.767472] nvmet_rdma: freeing queue 44

Host:
[353698.784927] nvme nvme0: creating 44 I/O queues.
[353699.572467] nvme nvme0: new ctrl: NQN
"nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82",
addr 1.1.1.2:4420
[353960.804750] nvme nvme0: SEND for CQE 0xffff88011c0cca58 failed with status
transport retry counter exceeded (12)
[353960.840895] nvme nvme0: Reconnecting in 10 seconds...
[353960.853582] blk_update_request: I/O error, dev nvme0n1, sector 14183280
[353960.869599] blk_update_request: I/O error, dev nvme0n1, sector 32251848
[353960.869601] blk_update_request: I/O error, dev nvme0n1, sector 3500872
[353960.869602] blk_update_request: I/O error, dev nvme0n1, sector 3266216
[353960.869603] blk_update_request: I/O error, dev nvme0n1, sector 12926288
[353960.869607] blk_update_request: I/O error, dev nvme0n1, sector 27661040
[353960.869609] blk_update_request: I/O error, dev nvme0n1, sector 32564280
[353960.869610] blk_update_request: I/O error, dev nvme0n1, sector 12912072
[353960.869611] blk_update_request: I/O error, dev nvme0n1, sector 16570728
[353960.869613] blk_update_request: I/O error, dev nvme0n1, sector 33096144
[353961.036738] nvme0n1: detected capacity change from 68719476736 to
-67526893324191744
[353961.055986] Buffer I/O error on dev nvme0n1, logical block 0, async page
read
[353961.073360] Buffer I/O error on dev nvme0n1, logical block 0, async page
read
[353961.090572] Buffer I/O error on dev nvme0n1, logical block 0, async page
read
[353961.090575] ldm_validate_partition_table(): Disk read failed.
[353961.090578] Buffer I/O error on dev nvme0n1, logical block 0, async page
read
[353961.090582] Buffer I/O error on dev nvme0n1, logical block 0, async page
read
[353961.090585] Buffer I/O error on dev nvme0n1, logical block 0, async page
read
[353961.090589] Buffer I/O error on dev nvme0n1, logical block 0, async page
read
[353961.090593] Buffer I/O error on dev nvme0n1, logical block 0, async page
read
[353961.090598] Buffer I/O error on dev nvme0n1, logical block 3, async page
read
[353961.090602] Buffer I/O error on dev nvme0n1, logical block 0, async page
read
[353961.090607]  nvme0n1: unable to read partition table
[353973.021283] nvme nvme0: rdma_resolve_addr wait failed (-104).
[353973.048717] nvme nvme0: Failed reconnect attempt 1
[353973.060073] nvme nvme0: Reconnecting in 10 seconds...
[353983.101337] nvme nvme0: rdma_resolve_addr wait failed (-104).
[353983.128739] nvme nvme0: Failed reconnect attempt 2
[353983.140280] nvme nvme0: Reconnecting in 10 seconds...
[353993.181354] nvme nvme0: rdma_resolve_addr wait failed (-104).
[353993.208714] nvme nvme0: Failed reconnect attempt 3
[353993.208716] nvme nvme0: Reconnecting in 10 seconds...
[354003.229292] nvme nvme0: rdma_resolve_addr wait failed (-104).
[354003.256712] nvme nvme0: Failed reconnect attempt 4
[354003.268189] nvme nvme0: Reconnecting in 10 seconds...
[354013.309211] nvme nvme0: rdma_resolve_addr wait failed (-104).
[354013.336695] nvme nvme0: Failed reconnect attempt 5
[354013.348043] nvme nvme0: Reconnecting in 10 seconds...
[354023.389262] nvme nvme0: rdma_resolve_addr wait failed (-104).
[354023.416682] nvme nvme0: Failed reconnect attempt 6
[354023.428021] nvme nvme0: Reconnecting in 10 seconds...


-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10 11:33               ` Johannes Thumshirn
@ 2017-07-10 11:41                 ` Sagi Grimberg
  2017-07-10 11:50                   ` Johannes Thumshirn
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2017-07-10 11:41 UTC (permalink / raw)


> Host:
> [353698.784927] nvme nvme0: creating 44 I/O queues.
> [353699.572467] nvme nvme0: new ctrl: NQN
> "nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82",
> addr 1.1.1.2:4420
> [353960.804750] nvme nvme0: SEND for CQE 0xffff88011c0cca58 failed with status
> transport retry counter exceeded (12)

Exhausted retries, wow... That is really strange...

The host sent the keep-alive and it never made it to the target; the HCA
retried 7+ times and gave up.

Are you running with a switch? Which one? Is the switch experiencing
higher ingress?

> [353960.840895] nvme nvme0: Reconnecting in 10 seconds...
> [353960.853582] blk_update_request: I/O error, dev nvme0n1, sector 14183280
> [353960.869599] blk_update_request: I/O error, dev nvme0n1, sector 32251848
> [353960.869601] blk_update_request: I/O error, dev nvme0n1, sector 3500872
> [353960.869602] blk_update_request: I/O error, dev nvme0n1, sector 3266216
> [353960.869603] blk_update_request: I/O error, dev nvme0n1, sector 12926288
> [353960.869607] blk_update_request: I/O error, dev nvme0n1, sector 27661040
> [353960.869609] blk_update_request: I/O error, dev nvme0n1, sector 32564280
> [353960.869610] blk_update_request: I/O error, dev nvme0n1, sector 12912072
> [353960.869611] blk_update_request: I/O error, dev nvme0n1, sector 16570728
> [353960.869613] blk_update_request: I/O error, dev nvme0n1, sector 33096144
> [353961.036738] nvme0n1: detected capacity change from 68719476736 to
> -67526893324191744
> [353961.055986] Buffer I/O error on dev nvme0n1, logical block 0, async page
> read
> [353961.073360] Buffer I/O error on dev nvme0n1, logical block 0, async page
> read
> [353961.090572] Buffer I/O error on dev nvme0n1, logical block 0, async page
> read
> [353961.090575] ldm_validate_partition_table(): Disk read failed.
> [353961.090578] Buffer I/O error on dev nvme0n1, logical block 0, async page
> read
> [353961.090582] Buffer I/O error on dev nvme0n1, logical block 0, async page
> read
> [353961.090585] Buffer I/O error on dev nvme0n1, logical block 0, async page
> read
> [353961.090589] Buffer I/O error on dev nvme0n1, logical block 0, async page
> read
> [353961.090593] Buffer I/O error on dev nvme0n1, logical block 0, async page
> read
> [353961.090598] Buffer I/O error on dev nvme0n1, logical block 3, async page
> read
> [353961.090602] Buffer I/O error on dev nvme0n1, logical block 0, async page
> read
> [353961.090607]  nvme0n1: unable to read partition table
> [353973.021283] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [353973.048717] nvme nvme0: Failed reconnect attempt 1
> [353973.060073] nvme nvme0: Reconnecting in 10 seconds...
> [353983.101337] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [353983.128739] nvme nvme0: Failed reconnect attempt 2
> [353983.140280] nvme nvme0: Reconnecting in 10 seconds...
> [353993.181354] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [353993.208714] nvme nvme0: Failed reconnect attempt 3
> [353993.208716] nvme nvme0: Reconnecting in 10 seconds...
> [354003.229292] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [354003.256712] nvme nvme0: Failed reconnect attempt 4
> [354003.268189] nvme nvme0: Reconnecting in 10 seconds...
> [354013.309211] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [354013.336695] nvme nvme0: Failed reconnect attempt 5
> [354013.348043] nvme nvme0: Reconnecting in 10 seconds...
> [354023.389262] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [354023.416682] nvme nvme0: Failed reconnect attempt 6
> [354023.428021] nvme nvme0: Reconnecting in 10 seconds...

And why aren't you able to reconnect?

Something smells mis-configured here...


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10 11:41                 ` Sagi Grimberg
@ 2017-07-10 11:50                   ` Johannes Thumshirn
  2017-07-10 12:04                     ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2017-07-10 11:50 UTC (permalink / raw)


On Mon, Jul 10, 2017 at 02:41:28PM +0300, Sagi Grimberg wrote:
> >Host:
> >[353698.784927] nvme nvme0: creating 44 I/O queues.
> >[353699.572467] nvme nvme0: new ctrl: NQN
> >"nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82",
> >addr 1.1.1.2:4420
> >[353960.804750] nvme nvme0: SEND for CQE 0xffff88011c0cca58 failed with status
> >transport retry counter exceeded (12)
> 
> Exhausted retries, wow... That is really strange...
> 
> The host sent the keep-alive and it never made it to the target; the HCA
> retried 7+ times and gave up.
> 
> Are you running with a switch? Which one? Is the switch experiencing
> higher ingress?

This (unfortunately) was the OmniPath setup, as I was only a guest on the IB
installation and the other team needed it back. Anyway, I did see this on IB
as well (on both SLE12-SP3 and v4.12 final). The switch is an Intel Edge
100 OmniPath switch.

[...]

> 
> And why aren't you able to reconnect?
> 
> Something smells mis-configured here...

I am, it just takes ages:
[354235.064586] nvme nvme0: Failed reconnect attempt 27
[354235.076054] nvme nvme0: Reconnecting in 10 seconds...
[354245.117100] nvme nvme0: rdma_resolve_addr wait failed (-104).
[354245.144574] nvme nvme0: Failed reconnect attempt 28
[354245.156097] nvme nvme0: Reconnecting in 10 seconds...
[354255.244008] nvme nvme0: creating 44 I/O queues.
[354255.877529] nvme nvme0: Successfully reconnected
[354255.900579] nvme0n1: detected capacity change from -67526893324191744 to
68719476736

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10 11:50                   ` Johannes Thumshirn
@ 2017-07-10 12:04                     ` Sagi Grimberg
  2017-07-11  8:52                       ` Johannes Thumshirn
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2017-07-10 12:04 UTC (permalink / raw)



>>> [353698.784927] nvme nvme0: creating 44 I/O queues.
>>> [353699.572467] nvme nvme0: new ctrl: NQN
>>> "nqn.2014-08.org.nvmexpress:NVMf:uuid:c36f2c23-354d-416c-95de-f2b8ec353a82",
>>> addr 1.1.1.2:4420
>>> [353960.804750] nvme nvme0: SEND for CQE 0xffff88011c0cca58 failed with status
>>> transport retry counter exceeded (12)
>>
>> Exhausted retries, wow... That is really strange...
>>
>> The host sent the keep-alive and it never made it to the target; the HCA
>> retried 7+ times and gave up.
>>
>> Are you running with a switch? Which one? Is the switch experiencing
>> higher ingress?
> 
> This (unfortunately) was the OmniPath setup, as I was only a guest on the IB
> installation and the other team needed it back. Anyway, I did see this on IB
> as well (on both SLE12-SP3 and v4.12 final). The switch is an Intel Edge
> 100 OmniPath switch.

Note that your keep-alive does not fail after 120 seconds; it is failed
by the HCA after 7 HCA retries (which is roughly around 35 seconds).

And if your keep-alive did not make it in 35 seconds, then it's an
indication that something is wrong... which is exactly what keep-alives
are designed to do... So I'm not at all sure that we need to compensate
for this in the driver at all; something is clearly wrong in your
fabric.


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-10 12:04                     ` Sagi Grimberg
@ 2017-07-11  8:52                       ` Johannes Thumshirn
  2017-07-11  9:19                         ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2017-07-11  8:52 UTC (permalink / raw)


On Mon, Jul 10, 2017 at 03:04:52PM +0300, Sagi Grimberg wrote:
> And if your keep-alive did not make it in 35 seconds, then it's an
> indication that something is wrong... which is exactly what keep-alives
> are designed to do... So I'm not at all sure that we need to compensate
> for this in the driver at all; something is clearly wrong in your
> fabric.

Not that I disagree with you, but two different (not connected) fabrics
(OmniPath and IB) are both broken, while I see no problems on IPoIB?

Not sure how likely that is.

But I'm still trying to figure out what's going on.

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-11  8:52                       ` Johannes Thumshirn
@ 2017-07-11  9:19                         ` Sagi Grimberg
  2017-07-11  9:21                           ` Johannes Thumshirn
  2017-07-14 11:25                           ` Johannes Thumshirn
  0 siblings, 2 replies; 23+ messages in thread
From: Sagi Grimberg @ 2017-07-11  9:19 UTC (permalink / raw)



> Not that I disagree with you, but two different (not connected) fabrics
> (OmniPath and IB) are both broken, while I see no problems on IPoIB?
> 
> Not sure how likely that is.

I didn't mean that the fabric is broken for sure, I was simply saying
that having a 64 byte send not making it through a switch port sounds
like a problem to me.


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-11  9:19                         ` Sagi Grimberg
@ 2017-07-11  9:21                           ` Johannes Thumshirn
  2017-07-14 11:25                           ` Johannes Thumshirn
  1 sibling, 0 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2017-07-11  9:21 UTC (permalink / raw)


On Tue, Jul 11, 2017 at 12:19:12PM +0300, Sagi Grimberg wrote:
 
> I didn't mean that the fabric is broken for sure, I was simply saying
> that having a 64 byte send not making it through a switch port sounds
> like a problem to me.

We're trying to set up a new test setup with RoCE today to rule out possible
OmniPath and old InfiniBand HW problems. Still, this is all a bit mysterious to
me at the moment.

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-11  9:19                         ` Sagi Grimberg
  2017-07-11  9:21                           ` Johannes Thumshirn
@ 2017-07-14 11:25                           ` Johannes Thumshirn
  2017-08-15 22:46                             ` Guilherme G. Piccoli
  1 sibling, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2017-07-14 11:25 UTC (permalink / raw)


On Tue, Jul 11, 2017 at 12:19:12PM +0300, Sagi Grimberg wrote:
> I didn't mean that the fabric is broken for sure, I was simply saying
> that having a 64 byte send not making it through a switch port sounds
> like a problem to me.

So JFTR I now have a 3rd setup with RoCE over mlx5 (and a Mellanox Switch) and
I can reproduce it again on this setup.

host# ibstat
CA 'mlx5_0'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.20.1010
	Hardware version: 0
	Node GUID: 0x248a070300554504
	System image GUID: 0x248a070300554504
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 56
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x04010000
		Port GUID: 0x268a07fffe554504
		Link layer: Ethernet

target# ibstat
CA 'mlx5_0'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.20.1010
	Hardware version: 0
	Node GUID: 0x248a070300937248
	System image GUID: 0x248a070300937248
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 25
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x04010000
		Port GUID: 0x268a07fffe937248
		Link layer: Ethernet


host# dmesg
nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 9.9.9.6:4420
nvme nvme0: creating 24 I/O queues.
nvme nvme0: new ctrl: NQN "nvmf-test", addr 9.9.9.6:4420
test start
nvme nvme0: failed nvme_keep_alive_end_io error=-5
nvme nvme0: Reconnecting in 10 seconds...
blk_update_request: I/O error, dev nvme0n1, sector 23000728
blk_update_request: I/O error, dev nvme0n1, sector 32385208
blk_update_request: I/O error, dev nvme0n1, sector 13965416
blk_update_request: I/O error, dev nvme0n1, sector 32825384
blk_update_request: I/O error, dev nvme0n1, sector 47701688
blk_update_request: I/O error, dev nvme0n1, sector 994584
blk_update_request: I/O error, dev nvme0n1, sector 26306816
blk_update_request: I/O error, dev nvme0n1, sector 27715008
blk_update_request: I/O error, dev nvme0n1, sector 32470064
blk_update_request: I/O error, dev nvme0n1, sector 29905512
nvme0n1: detected capacity change from 68719476736 to -67550056326088704
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
ldm_validate_partition_table(): Disk read failed.
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
Buffer I/O error on dev nvme0n1, logical block 3, async page read
Buffer I/O error on dev nvme0n1, logical block 0, async page read
nvme0n1: unable to read partition table

The fio command used was:
fio --name=test --iodepth=128 --numjobs=$(nproc) --size=23g --time_based \
    --runtime=15m --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
     --rw=randrw


-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-07-14 11:25                           ` Johannes Thumshirn
@ 2017-08-15 22:46                             ` Guilherme G. Piccoli
  2017-08-16  8:16                               ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Guilherme G. Piccoli @ 2017-08-15 22:46 UTC (permalink / raw)


On 07/14/2017 08:25 AM, Johannes Thumshirn wrote:
> On Tue, Jul 11, 2017 at 12:19:12PM +0300, Sagi Grimberg wrote:
>> I didn't mean that the fabric is broken for sure, I was simply saying
>> that having a 64 byte send not making it through a switch port sounds
>> like a problem to me.
> 
> So JFTR I now have a 3rd setup with RoCE over mlx5 (and a Mellanox Switch) and
> I can reproduce it again on this setup.

Hi Johannes, we are reproducing a similar stack trace in our
environment, with SR-IOV (Mellanox IB too).

Is there any news about this subject? Is the idea of changing the kato
proposed by Hannes feasible? Did you test with some experimental patch
to achieve this?

Thanks in advance. If there's some data I could collect to help further
discussion of this issue, I'd be glad to do so.

Cheers,


Guilherme


> 
> host# ibstat
> CA 'mlx5_0'
> 	CA type: MT4115
> 	Number of ports: 1
> 	Firmware version: 12.20.1010
> 	Hardware version: 0
> 	Node GUID: 0x248a070300554504
> 	System image GUID: 0x248a070300554504
> 	Port 1:
> 		State: Active
> 		Physical state: LinkUp
> 		Rate: 56
> 		Base lid: 0
> 		LMC: 0
> 		SM lid: 0
> 		Capability mask: 0x04010000
> 		Port GUID: 0x268a07fffe554504
> 		Link layer: Ethernet
> 
> target# ibstat
> CA 'mlx5_0'
> 	CA type: MT4117
> 	Number of ports: 1
> 	Firmware version: 14.20.1010
> 	Hardware version: 0
> 	Node GUID: 0x248a070300937248
> 	System image GUID: 0x248a070300937248
> 	Port 1:
> 		State: Down
> 		Physical state: Disabled
> 		Rate: 25
> 		Base lid: 0
> 		LMC: 0
> 		SM lid: 0
> 		Capability mask: 0x04010000
> 		Port GUID: 0x268a07fffe937248
> 		Link layer: Ethernet
> 
> 
> host# dmesg
> nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 9.9.9.6:4420
> nvme nvme0: creating 24 I/O queues.
> nvme nvme0: new ctrl: NQN "nvmf-test", addr 9.9.9.6:4420
> test start
> nvme nvme0: failed nvme_keep_alive_end_io error=-5
> nvme nvme0: Reconnecting in 10 seconds...
> blk_update_request: I/O error, dev nvme0n1, sector 23000728
> blk_update_request: I/O error, dev nvme0n1, sector 32385208
> blk_update_request: I/O error, dev nvme0n1, sector 13965416
> blk_update_request: I/O error, dev nvme0n1, sector 32825384
> blk_update_request: I/O error, dev nvme0n1, sector 47701688
> blk_update_request: I/O error, dev nvme0n1, sector 994584
> blk_update_request: I/O error, dev nvme0n1, sector 26306816
> blk_update_request: I/O error, dev nvme0n1, sector 27715008
> blk_update_request: I/O error, dev nvme0n1, sector 32470064
> blk_update_request: I/O error, dev nvme0n1, sector 29905512
> nvme0n1: detected capacity change from 68719476736 to -67550056326088704
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> ldm_validate_partition_table(): Disk read failed.
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> Buffer I/O error on dev nvme0n1, logical block 3, async page read
> Buffer I/O error on dev nvme0n1, logical block 0, async page read
> nvme0n1: unable to read partition table
> 
> The fio command used was:
> fio --name=test --iodepth=128 --numjobs=$(nproc) --size=23g --time_based \
>     --runtime=15m --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
>      --rw=randrw
> 
> 


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-08-15 22:46                             ` Guilherme G. Piccoli
@ 2017-08-16  8:16                               ` Christoph Hellwig
  2017-08-16 16:19                                 ` Guilherme G. Piccoli
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2017-08-16  8:16 UTC (permalink / raw)


We're having discussions in the working group to allow I/O commands
to reset the keepalive timer.  Stay tuned.


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-08-16  8:16                               ` Christoph Hellwig
@ 2017-08-16 16:19                                 ` Guilherme G. Piccoli
  2017-08-28 10:15                                   ` Guan Junxiong
  0 siblings, 1 reply; 23+ messages in thread
From: Guilherme G. Piccoli @ 2017-08-16 16:19 UTC (permalink / raw)


On 08/16/2017 05:16 AM, Christoph Hellwig wrote:
> We're having discussions in the working group to allow I/O commands
> to reset the keepalive timer.  Stay tuned.
> 

Cool, thanks Christoph!


* I/O Errors due to keepalive timeouts with NVMf RDMA
  2017-08-16 16:19                                 ` Guilherme G. Piccoli
@ 2017-08-28 10:15                                   ` Guan Junxiong
  0 siblings, 0 replies; 23+ messages in thread
From: Guan Junxiong @ 2017-08-28 10:15 UTC (permalink / raw)




On 2017/8/17 0:19, Guilherme G. Piccoli wrote:
> On 08/16/2017 05:16 AM, Christoph Hellwig wrote:
>> We're having discussions in the working group to allow I/O commands
>> to reset the keepalive timer.  Stay tuned.
>>
> 
> Cool, thanks Christoph!
Thanks too.
BTW, does "I/O commands" refer to any I/O capsule a host/target sends/receives,
or to adding new commands to the NVMf spec?

Thanks

