* NFSD generic R/W API (sendto path) performance results
@ 2016-11-15 18:45 Chuck Lever
       [not found] ` <9170C872-DEE1-4D96-B9D8-E9D2B3F91915-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2016-11-15 18:45 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg; +Cc: List Linux RDMA Mailing

I've built a prototype conversion of the in-kernel NFS server's sendto
path to use the new generic R/W API. This path handles NFS Replies, so
it is responsible for building and sending RDMA Writes carrying NFS
READ payloads, and for transmitting all NFS Replies.

I've published the prototype (against my for-4.10 server series) here:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/nfsd-rdma-rw-api

It's the very last patch in the series.


"iozone -i0 -i1 -s2g -r1m -I" with NFSv3, sec=sys, CX-3 on both sides,
FDR fabric, share is a tmpfs. This test writes and reads a 2GB file with
1MB direct writes and reads.

The client forms NFS requests with a single 1MB RDMA segment to catch
the NFS READ payload. Before the conversion, the server posts a series
of single Write WRs with 30 pages each, for each RDMA segment written
to the client. After the conversion, the server posts a single chain
of 30-page Write WRs for each RDMA segment written to the client.
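
To illustrate the shape of the change, here's a rough sketch (not the
actual svcrdma code; assuming 4KB pages, a 1MB segment works out to 9
such 30-page Write WRs):

    /* Before: each 30-page Write WR is built and posted on its own. */
    static int post_writes_one_at_a_time(struct ib_qp *qp,
                                         struct ib_rdma_wr *write_wr,
                                         int nwrs)
    {
            struct ib_send_wr *bad_wr;
            int i, ret;

            for (i = 0; i < nwrs; i++) {
                    write_wr[i].wr.next = NULL;
                    ret = ib_post_send(qp, &write_wr[i].wr, &bad_wr);
                    if (ret)
                            return ret;
            }
            return 0;
    }

    /* After: the Write WRs are linked into one chain and posted once. */
    static int post_writes_as_chain(struct ib_qp *qp,
                                    struct ib_rdma_wr *write_wr,
                                    int nwrs)
    {
            struct ib_send_wr *bad_wr;
            int i;

            for (i = 0; i < nwrs - 1; i++)
                    write_wr[i].wr.next = &write_wr[i + 1].wr;
            write_wr[nwrs - 1].wr.next = NULL;
            return ib_post_send(qp, &write_wr[0].wr, &bad_wr);
    }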

Before the API conversion: rdma_stat_post_send = 45097

After the API conversion: rdma_stat_post_send = 16411

That's what I expected to see. This shows the number of ib_post_send
calls is significantly lower after the conversion.
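(For the record, 45097 / 16411 works out to roughly a 2.7x reduction.)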


Unfortunately the throughput and latency numbers are worse (ignore
the write/rewrite numbers for now). Output is in kBytes/sec.

Before conversion, one iozone run:

              kB  reclen    write  rewrite    read    reread
         2097152    1024   772835   931267  1895922  1927848

READ:
    4098 ops (49%) 
    avg bytes sent per op: 140    avg bytes received per op: 1048704
    backlog wait: 0.006345     RTT: 0.321132     total execute time: 0.332113

After conversion:

              kB  reclen    write  rewrite    read    reread
         2097152    1024   703850   913824  1561682  1441448

READ:
    4098 ops (49%) 
    avg bytes sent per op: 140    avg bytes received per op: 1048704
    backlog wait: 0.010737     RTT: 0.469497     total execute time: 0.488043

That's 140us worse RTT per READ, in this run. The gap between before and
after was roughly the same for all runs.


To partially explain this, I captured traffic on the server using ibdump
during a similar iozone test. This removes fabric and client HCA latencies
from the picture.

This is a QD=1 test, so it's easy to analyze individual NFS READ operations
in each capture. I computed three latency numbers per READ transaction
based on the timestamps in the capture file, which should be accurate to
1 microsecond:

1. Call took: the time between when the server i/f sees the incoming RDMA
Send carrying the NFS READ Call, and when the server i/f sees the outgoing
RDMA Send carrying the NFS READ Reply.

2. Call-to-first-Write: the time between when the server i/f sees the
incoming RDMA Send carrying the NFS READ Call, and when the server i/f
sees the first outgoing RDMA Write request. Roughly how long it takes
the server to prepare and post the RDMA Writes.

3. First-to-last-Write: the time between when the server i/f sees the
first outgoing RDMA Write request, and when the server i/f sees the
last outgoing RDMA Write request. Roughly how long it takes the HCA
to transmit the RDMA Writes.


Averages over 5 NFS READ calls chosen at random, before conversion:
Call took 414us. Call-to-first-Write 85us. First-to-last-Write 327us

Averages over 5 NFS READ calls chosen at random, after conversion:
Call took 521us. Call-to-first-Write 160us. First-to-last-Write 360us

The before/after gap seen in these averages was consistent across the
individual NFS READ operations, not just in the aggregate.


There are two stories here:

1. Call-to-first-Write takes longer. My first guess is that the server
takes longer to build and DMA map a long Write WR chain than it does
to build, map, and post a single Write WR. Before the conversion, the
HCA can get started transmitting Writes sooner, and the server continues
building and posting the remaining Write WRs in parallel with the
on-the-wire activity.

2. First-to-last-Write takes longer. I don't have any explanation
for the HCA taking 10% longer to transmit the full 1MB payload.


--
Chuck Lever




* RE: NFSD generic R/W API (sendto path) performance results
       [not found] ` <9170C872-DEE1-4D96-B9D8-E9D2B3F91915-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-11-15 20:35   ` Steve Wise
  2016-11-16 19:45     ` Chuck Lever
  0 siblings, 1 reply; 11+ messages in thread
From: Steve Wise @ 2016-11-15 20:35 UTC (permalink / raw)
  To: 'Chuck Lever', 'Christoph Hellwig',
	'Sagi Grimberg'
  Cc: 'List Linux RDMA Mailing'

> 
> I've built a prototype conversion of the in-kernel NFS server's sendto
> path to use the new generic R/W API. This path handles NFS Replies, so
> it is responsible for building and sending RDMA Writes carrying NFS
> READ payloads, and for transmitting all NFS Replies.
> 
> I've published the prototype (against my for-4.10 server series) here:
> 
> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/nfsd-rdma-rw-api
> 
> It's the very last patch in the series.
> 
> 
> "iozone -i0 -i1 -s2g -r1m -I" with NFSv3, sec=sys, CX-3 on both sides,
> FDR fabric, share is a tmpfs. This test writes and reads a 2GB file with
> 1MB direct writes and reads.
> 
> The client forms NFS requests with a single 1MB RDMA segment to catch
> the NFS READ payload. Before the conversion, the server posts a series
> of single Write WRs with 30 pages each, for each RDMA segment written
> to the client. After the conversion, the server posts a single chain
> of 30-page Write WRs for each RDMA segment written to the client.
> 
> Before the API conversion: rdma_stat_post_send = 45097
> 
> After the API conversion: rdma_stat_post_send = 16411
> 
> That's what I expected to see. This shows the number of ib_post_send
> calls is significantly lower after the conversion.
> 
> 
> Unfortunately the throughput and latency numbers are worse (ignore
> the write/rewrite numbers for now). Output is in kBytes/sec.
> 
> Before conversion, one iozone run:
> 
>               kB  reclen    write  rewrite    read    reread
>          2097152    1024   772835   931267  1895922  1927848
> 
> READ:
>     4098 ops (49%)
>     avg bytes sent per op: 140    avg bytes received per op: 1048704
>     backlog wait: 0.006345     RTT: 0.321132     total execute time: 0.332113
> 
> After conversion:
> 
>               kB  reclen    write  rewrite    read    reread
>          2097152    1024   703850   913824  1561682  1441448
> 
> READ:
>     4098 ops (49%)
>     avg bytes sent per op: 140    avg bytes received per op: 1048704
>     backlog wait: 0.010737     RTT: 0.469497     total execute time: 0.488043
> 
> That's 140us worse RTT per READ, in this run. The gap between before and
> after was roughly the same for all runs.
> 
> 
> To partially explain this, I captured traffic on the server using ibdump
> during a similar iozone test. This removes fabric and client HCA latencies
> from the picture.
> 
> This is a QD=1 test, so it's easy to analyze individual NFS READ operations
> in each capture. I computed three latency numbers per READ transaction
> based on the timestamps in the capture file, which should be accurate to
> 1 microsecond:
> 
> 1. Call took: the time between when the server i/f sees the incoming RDMA
> Send carrying the NFS READ Call, and when the server i/f sees the outgoing
> RDMA Send carrying the NFS READ Reply.
> 
> 2. Call-to-first-Write: the time between when the server i/f sees the
> incoming RDMA Send carrying the NFS READ Call, and when the server i/f
> sees the first outgoing RDMA Write request. Roughly how long it takes
> the server to prepare and post the RDMA Writes.
> 
> 3. First-to-last-Write: the time between when the server i/f sees the
> first outgoing RDMA Write request, and when the server i/f sees the
> last outgoing RDMA Write request. Roughly how long it takes the HCA
> to transmit the RDMA Writes.
> 
> 
> Averages over 5 NFS READ calls chosen at random, before conversion:
> Call took 414us. Call-to-first-Write 85us. First-to-last-Write 327us
> 
> Averages over 5 NFS READ calls chosen at random, after conversion:
> Call took 521us. Call-to-first-Write 160us. First-to-last-Write 360us
> 
> The gap between before and after results was 100% consistent with
> the average results across the individual NFS READ operations.
> 
> 

Good work here! 

> There are two stories here:
> 
> 1. Call-to-first-Write takes longer. My first guess is that the server
> takes longer to build and DMA map a long Write WR chain than it does
> to build, map, and post a single Write WR. The HCA can get started
> transmitting Writes sooner, and the server continues working on
> posting Write WRs in parallel with the on-the-wire activity.
>

So perhaps the RDMA R/W API could have a threshold where it posts the list of
WRs accumulated so far once the chain exceeds that threshold, and then continues
chunking?  That threshold, by the way, is probably device-specific.
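
Roughly what I have in mind, as hypothetical pseudo-code (nothing like
this exists in rw.c today; post_threshold and chain_wrs() are made up
for illustration):

    /* Post partial chains as soon as they reach a device-specific
     * threshold, so the HCA can start on the early Writes while the
     * caller keeps building the rest of the chain.
     */
    nposted = 0;
    while (nposted < nwrs) {
            n = min(nwrs - nposted, post_threshold);
            chain_wrs(&write_wr[nposted], n); /* link .next, NULL-terminate */
            ret = ib_post_send(qp, &write_wr[nposted].wr, &bad_wr);
            if (ret)
                    break;
            nposted += n;
    }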
 
> 2. First-to-last-Write takes longer. I don't have any explanation
> for the HCA taking 10% longer to transmit the full 1MB payload.
>

Perhaps the single WR posts are hitting the device's fast path and lowering latency
vs a long chain post that must be DMAed by the device?  I'm not sure exactly how
the MLX devices work, but they do have a fast path that utilizes the CPU's
write-combining logic to send a WR over the bus as a single PCIe transaction.
But your WRs are probably large, since each has 30 pages in its SG list.  I'm not
sure what the threshold is for this fast-path logic for mlx.  For cxgb, it's 64B,
so the WR would have to fit in 64B to take advantage.

Steve.


 
 
> 
> --
> Chuck Lever
> 
> 
> 

* Re: NFSD generic R/W API (sendto path) performance results
  2016-11-15 20:35   ` Steve Wise
@ 2016-11-16 19:45     ` Chuck Lever
       [not found]       ` <BA9DC9F7-C893-428B-AFE5-EFCCD13C9F25-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2016-11-16 19:45 UTC (permalink / raw)
  To: Steve Wise; +Cc: Christoph Hellwig, Sagi Grimberg, List Linux RDMA Mailing


> On Nov 15, 2016, at 3:35 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
> 
>> 
>> I've built a prototype conversion of the in-kernel NFS server's sendto
>> path to use the new generic R/W API. This path handles NFS Replies, so
>> it is responsible for building and sending RDMA Writes carrying NFS
>> READ payloads, and for transmitting all NFS Replies.
>> 
>> I've published the prototype (against my for-4.10 server series) here:
>> 
>> 
>> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/nfsd-rdma-rw-api
>> 
>> It's the very last patch in the series.
>> 
>> 
>> "iozone -i0 -i1 -s2g -r1m -I" with NFSv3, sec=sys, CX-3 on both sides,
>> FDR fabric, share is a tmpfs. This test writes and reads a 2GB file with
>> 1MB direct writes and reads.
>> 
>> The client forms NFS requests with a single 1MB RDMA segment to catch
>> the NFS READ payload. Before the conversion, the server posts a series
>> of single Write WRs with 30 pages each, for each RDMA segment written
>> to the client. After the conversion, the server posts a single chain
>> of 30-page Write WRs for each RDMA segment written to the client.
>> 
>> Before the API conversion: rdma_stat_post_send = 45097
>> 
>> After the API conversion: rdma_stat_post_send = 16411
>> 
>> That's what I expected to see. This shows the number of ib_post_send
>> calls is significantly lower after the conversion.
>> 
>> 
>> Unfortunately the throughput and latency numbers are worse (ignore
>> the write/rewrite numbers for now). Output is in kBytes/sec.
>> 
>> Before conversion, one iozone run:
>> 
>>              kB  reclen    write  rewrite    read    reread
>>         2097152    1024   772835   931267  1895922  1927848
>> 
>> READ:
>>    4098 ops (49%)
>>    avg bytes sent per op: 140    avg bytes received per op: 1048704
>>    backlog wait: 0.006345     RTT: 0.321132     total execute time: 0.332113
>> 
>> After conversion:
>> 
>>              kB  reclen    write  rewrite    read    reread
>>         2097152    1024   703850   913824  1561682  1441448
>> 
>> READ:
>>    4098 ops (49%)
>>    avg bytes sent per op: 140    avg bytes received per op: 1048704
>>    backlog wait: 0.010737     RTT: 0.469497     total execute time: 0.488043
>> 
>> That's 140us worse RTT per READ, in this run. The gap between before and
>> after was roughly the same for all runs.
>> 
>> 
>> To partially explain this, I captured traffic on the server using ibdump
>> during a similar iozone test. This removes fabric and client HCA latencies
>> from the picture.
>> 
>> This is a QD=1 test, so it's easy to analyze individual NFS READ operations
>> in each capture. I computed three latency numbers per READ transaction
>> based on the timestamps in the capture file, which should be accurate to
>> 1 microsecond:
>> 
>> 1. Call took: the time between when the server i/f sees the incoming RDMA
>> Send carrying the NFS READ Call, and when the server i/f sees the outgoing
>> RDMA Send carrying the NFS READ Reply.
>> 
>> 2. Call-to-first-Write: the time between when the server i/f sees the
>> incoming RDMA Send carrying the NFS READ Call, and when the server i/f
>> sees the first outgoing RDMA Write request. Roughly how long it takes
>> the server to prepare and post the RDMA Writes.
>> 
>> 3. First-to-last-Write: the time between when the server i/f sees the
>> first outgoing RDMA Write request, and when the server i/f sees the
>> last outgoing RDMA Write request. Roughly how long it takes the HCA
>> to transmit the RDMA Writes.
>> 
>> 
>> Averages over 5 NFS READ calls chosen at random, before conversion:
>> Call took 414us. Call-to-first-Write 85us. First-to-last-Write 327us
>> 
>> Averages over 5 NFS READ calls chosen at random, after conversion:
>> Call took 521us. Call-to-first-Write 160us. First-to-last-Write 360us
>> 
>> The gap between before and after results was 100% consistent with
>> the average results across the individual NFS READ operations.
>> 
>> 
> 
> Good work here! 
> 
>> There are two stories here:
>> 
>> 1. Call-to-first-Write takes longer. My first guess is that the server
>> takes longer to build and DMA map a long Write WR chain than it does
>> to build, map, and post a single Write WR. The HCA can get started
>> transmitting Writes sooner, and the server continues working on
>> posting Write WRs in parallel with the on-the-wire activity.
>> 
> 
> So perhaps the RDMA R/W API can have a threshold where it will dump a list of
> WRs once it exceeds the threshold, and continue chunking?  That threshold, by
> the way, is probably device-specific.
> 
>> 2. First-to-last-Write takes longer. I don't have any explanation
>> for the HCA taking 10% longer to transmit the full 1MB payload.
>> 
> 
> Perhaps the single WR posts are hitting device's fast-path and lowering latency
> vs a long chain post that must be DMAed by the device?  I'm not sure exactly how
> the MLX devices work, but they do have a fast path that utilizes the CPU's
> write-combining logic to send a WR over the bus as a single PCIE transaction.
> But your WRs are probably large since they have 30 pages in the SGE.  I'm not
> sure what the threshold is for this fastpath logic for mlx.  For cxgb, its 64B,
> so the WR would have to fit in 64B to take advantage.

Out of curiosity, I hacked up my NFS client to limit the size of RDMA
segments to 30 pages (the server HCA's max_sge).

A 1MB NFS READ now takes 9 segments. That forces the after-conversion
server to build single-Write chains and use 9 post_send calls to
transmit the READ payload, just like the before-conversion server.
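(With 4KB pages, 1MB is 256 pages, and 256 / 30 rounds up to 9.)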

Performance of before- and after-conversion servers is now equivalent.

              kB  reclen    write  rewrite    read    reread
         2097152    1024  1061237  1141614  1961410  2000223                                                                                  

READ:
    4098 ops (49%) 
    avg bytes sent per op: 140    avg bytes received per op: 1048704
    backlog wait: 0.006345     RTT: 0.314300     total execute time: 0.325037

At 60-page segments (2 Write WRs per chain), I see about the same
throughput, and round-trip latency is a touch higher.

At 61-page segments (3 Write WRs per chain), throughput drops
significantly:

              kB  reclen    write  rewrite    read    reread
         2097152    1024   932665   976784  1627842  1627169                                                                                  

READ:
    4098 ops (49%) 
    avg bytes sent per op: 140    avg bytes received per op: 1048704
    backlog wait: 0.009761     RTT: 0.383358     total execute time: 0.398731

A couple of random samples of an ibdump capture show that most of the
latency increase is in the Call-to-first-Write gap (1. above).


--
Chuck Lever




* Re: NFSD generic R/W API (sendto path) performance results
       [not found]       ` <BA9DC9F7-C893-428B-AFE5-EFCCD13C9F25-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-11-17 12:46         ` Christoph Hellwig
       [not found]           ` <20161117124602.GA25821-jcswGhMUV9g@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2016-11-17 12:46 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Steve Wise, Christoph Hellwig, Sagi Grimberg, List Linux RDMA Mailing

On Wed, Nov 16, 2016 at 02:45:33PM -0500, Chuck Lever wrote:
> Out of curiosity, I hacked up my NFS client to limit the size of RDMA
> segments to 30 pages (the server HCA's max_sge).
> 
> A 1MB NFS READ now takes 9 segments. That forces the after-conversion
> server to build single-Write chains and use 9 post_send calls to
> transmit the READ payload, just like the before-conversion server.
> 
> Performance of before- and after-conversion servers is now equivalent.
> 
>               kB  reclen    write  rewrite    read    reread
>          2097152    1024  1061237  1141614  1961410  2000223                                                                                  

What HCA is this, btw?  Also did you try to always register for > max_sge
calls?  The code can already register all segments with the
rdma_rw_force_mr module option, so it would only need a small tweak for
that behavior.

* Re: NFSD generic R/W API (sendto path) performance results
       [not found]           ` <20161117124602.GA25821-jcswGhMUV9g@public.gmane.org>
@ 2016-11-17 15:04             ` Chuck Lever
       [not found]               ` <84B43CFF-EBF7-4758-8751-8C97102C5BCF-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2016-11-17 15:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Steve Wise, Sagi Grimberg, List Linux RDMA Mailing


> On Nov 17, 2016, at 7:46 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
> 
> On Wed, Nov 16, 2016 at 02:45:33PM -0500, Chuck Lever wrote:
>> Out of curiosity, I hacked up my NFS client to limit the size of RDMA
>> segments to 30 pages (the server HCA's max_sge).
>> 
>> A 1MB NFS READ now takes 9 segments. That forces the after-conversion
>> server to build single-Write chains and use 9 post_send calls to
>> transmit the READ payload, just like the before-conversion server.
>> 
>> Performance of before- and after-conversion servers is now equivalent.
>> 
>>              kB  reclen    write  rewrite    read    reread
>>         2097152    1024  1061237  1141614  1961410  2000223                                                                                  
> 
> What HCA is this, btw?

ConnectX-3 Pro, f/w 2.31.5050


> Also did you try to always register for > max_sge
> calls?  The code can already register all segments with the
> rdma_rw_force_mr module option, so it would only need a small tweak for
> that behavior.

For various reasons I decided the design should build one WR chain for
each RDMA segment provided by the client. Good clients expose just
one RDMA segment for the whole NFS READ payload.

Does force_mr make the generic API use FRWR with RDMA Write? I had
assumed it changed only the behavior with RDMA Read. I'll try that
too, if RDMA Write can easily be made to use FRWR.

But I'd like a better explanation for this result. Could be a bug
in my implementation, my design, or in the driver. Continuing to
investigate.


--
Chuck Lever




* Re: NFSD generic R/W API (sendto path) performance results
       [not found]               ` <84B43CFF-EBF7-4758-8751-8C97102C5BCF-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-11-17 19:20                 ` Chuck Lever
       [not found]                   ` <676323E9-2F30-4DB0-AEF8-CDE38E8A0715-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2016-11-17 19:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Steve Wise, Sagi Grimberg, List Linux RDMA Mailing


> On Nov 17, 2016, at 10:04 AM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> 
>> On Nov 17, 2016, at 7:46 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
>> Also did you try to always register for > max_sge
>> calls?  The code can already register all segments with the
>> rdma_rw_force_mr module option, so it would only need a small tweak for
>> that behavior.
> 
> For various reasons I decided the design should build one WR chain for
> each RDMA segment provided by the client. Good clients expose just
> one RDMA segment for the whole NFS READ payload.
> 
> Does force_mr make the generic API use FRWR with RDMA Write? I had
> assumed it changed only the behavior with RDMA Read. I'll try that
> too, if RDMA Write can easily be made to use FRWR.

Unfortunately, some RPC replies are formed from two or three
discontiguous buffers. The gap test in ib_sg_to_pages returns
a smaller number than sg_nents in this case, and rdma_rw_ctx_init
fails.

Thus with my current prototype I'm not able to test with FRWR.

I could fix this in my prototype, but it would be nicer for me if
rdma_rw_ctx_init handled this case the same way for FRWR as it does
for physical addressing, which doesn't seem to have any problem
with a discontiguous SGL.


> But I'd like a better explanation for this result. Could be a bug
> in my implementation, my design, or in the driver. Continuing to
> investigate.


--
Chuck Lever




* RE: NFSD generic R/W API (sendto path) performance results
       [not found]                   ` <676323E9-2F30-4DB0-AEF8-CDE38E8A0715-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-11-17 20:03                     ` Steve Wise
  2016-11-17 20:20                       ` Chuck Lever
  2016-11-17 20:20                     ` Sagi Grimberg
  1 sibling, 1 reply; 11+ messages in thread
From: Steve Wise @ 2016-11-17 20:03 UTC (permalink / raw)
  To: 'Chuck Lever', 'Christoph Hellwig'
  Cc: 'Sagi Grimberg', 'List Linux RDMA Mailing'

> 
> > On Nov 17, 2016, at 10:04 AM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> >
> >> On Nov 17, 2016, at 7:46 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
> >> Also did you try to always register for > max_sge
> >> calls?  The code can already register all segments with the
> >> rdma_rw_force_mr module option, so it would only need a small tweak for
> >> that behavior.
> >
> > For various reasons I decided the design should build one WR chain for
> > each RDMA segment provided by the client. Good clients expose just
> > one RDMA segment for the whole NFS READ payload.
> >
> > Does force_mr make the generic API use FRWR with RDMA Write? I had
> > assumed it changed only the behavior with RDMA Read. I'll try that
> > too, if RDMA Write can easily be made to use FRWR.
> 
> Unfortunately, some RPC replies are formed from two or three
> discontiguous buffers. The gap test in ib_sg_to_pages returns
> a smaller number than sg_nents in this case, and rdma_rw_init_ctx
> fails.
> 
> Thus with my current prototype I'm not able to test with FRWR.
> 
> I could fix this in my prototype, but it would be nicer for me if
> rdma_rw_init_ctx handled this case the same for FRWR as it does
> for physical addressing, which doesn't seem to have any problem
> with a discontiguous SGL.

Just to make sure I'm understanding you, for rdma-rw to handle this,  it would
have to use multiple REG_MR registrations, one for each contiguous area in the
scatter list.  

Right?


* Re: NFSD generic R/W API (sendto path) performance results
       [not found]                   ` <676323E9-2F30-4DB0-AEF8-CDE38E8A0715-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2016-11-17 20:03                     ` Steve Wise
@ 2016-11-17 20:20                     ` Sagi Grimberg
       [not found]                       ` <c6190e4c-9b8e-3937-ba38-7861eebeaaae-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  1 sibling, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2016-11-17 20:20 UTC (permalink / raw)
  To: Chuck Lever, Christoph Hellwig; +Cc: Steve Wise, List Linux RDMA Mailing


>>> Also did you try to always register for > max_sge
>>> calls?  The code can already register all segments with the
>>> rdma_rw_force_mr module option, so it would only need a small tweak for
>>> that behavior.
>>
>> For various reasons I decided the design should build one WR chain for
>> each RDMA segment provided by the client. Good clients expose just
>> one RDMA segment for the whole NFS READ payload.
>>
>> Does force_mr make the generic API use FRWR with RDMA Write? I had
>> assumed it changed only the behavior with RDMA Read. I'll try that
>> too, if RDMA Write can easily be made to use FRWR.
>
> Unfortunately, some RPC replies are formed from two or three
> discontiguous buffers. The gap test in ib_sg_to_pages returns
> a smaller number than sg_nents in this case, and rdma_rw_init_ctx
> fails.
>
> Thus with my current prototype I'm not able to test with FRWR.
>
> I could fix this in my prototype, but it would be nicer for me if
> rdma_rw_init_ctx handled this case the same for FRWR as it does
> for physical addressing, which doesn't seem to have any problem
> with a discontiguous SGL.
>
>
>> But I'd like a better explanation for this result. Could be a bug
>> in my implementation, my design, or in the driver. Continuing to
>> investigate.

Hi Chuck, sorry for the late reply (have been busy lately..)

I think that the Call-to-first-Write phenomenon you are seeing makes
perfect sense. The question is: is QD=1 latency for 1M transfers that
interesting? Did you see a positive effect on small (say 4k) transfers?
Both latency and IOPS scalability should be able to improve, especially
when serving multiple clients.

If indeed you feel that this is an interesting workload to optimize, I
think we can come up with something.

About the First-to-last-Write, that's weird, and sounds like a bug
somewhere. Maybe the Mellanox folks can tell us whether splitting 1M into
multiple Writes works better (although I cannot comprehend why it would).

Question, are the send and receive cqs still in IB_POLL_SOFTIRQ mode?

* Re: NFSD generic R/W API (sendto path) performance results
  2016-11-17 20:03                     ` Steve Wise
@ 2016-11-17 20:20                       ` Chuck Lever
  0 siblings, 0 replies; 11+ messages in thread
From: Chuck Lever @ 2016-11-17 20:20 UTC (permalink / raw)
  To: Steve Wise; +Cc: Christoph Hellwig, Sagi Grimberg, List Linux RDMA Mailing


> On Nov 17, 2016, at 3:03 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
> 
>>> 
>>> On Nov 17, 2016, at 10:04 AM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
>>> 
>>>> On Nov 17, 2016, at 7:46 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
>>>> Also did you try to always register for > max_sge
>>>> calls?  The code can already register all segments with the
>>>> rdma_rw_force_mr module option, so it would only need a small tweak for
>>>> that behavior.
>>> 
>>> For various reasons I decided the design should build one WR chain for
>>> each RDMA segment provided by the client. Good clients expose just
>>> one RDMA segment for the whole NFS READ payload.
>>> 
>>> Does force_mr make the generic API use FRWR with RDMA Write? I had
>>> assumed it changed only the behavior with RDMA Read. I'll try that
>>> too, if RDMA Write can easily be made to use FRWR.
>> 
>> Unfortunately, some RPC replies are formed from two or three
>> discontiguous buffers. The gap test in ib_sg_to_pages returns
>> a smaller number than sg_nents in this case, and rdma_rw_init_ctx
>> fails.
>> 
>> Thus with my current prototype I'm not able to test with FRWR.
>> 
>> I could fix this in my prototype, but it would be nicer for me if
>> rdma_rw_init_ctx handled this case the same for FRWR as it does
>> for physical addressing, which doesn't seem to have any problem
>> with a discontiguous SGL.
> 
> Just to make sure I'm understanding you, for rdma-rw to handle this,  it would
> have to use multiple REG_MR registrations, one for each contiguous area in the
> scatter list.  
> 
> Right?

Right, that's the approach the NFS client takes. See
net/sunrpc/xprtrdma/frwr_ops.c :: frwr_op_map.

If the passed-in memory list isn't contiguous, frwr_op_map stops
registering and returns to the caller, who allocates another
MR and calls in again with the remaining part of the list.
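
The shape of it is roughly this (a sketch of the idea only, not the
actual frwr_ops.c code; alloc_frwr_mr() and post_reg_wr() stand in for
the real helpers, and a flat sg array is assumed):

    /* Register as much of the SGL as ib_map_mr_sg will take; come back
     * with a fresh MR for the remainder whenever a gap stops the mapping.
     */
    while (nents) {
            struct ib_mr *mr = alloc_frwr_mr();
            int n = ib_map_mr_sg(mr, sg, nents, NULL, PAGE_SIZE);

            if (n <= 0)
                    return -EIO;
            post_reg_wr(mr);   /* FRWR REG_MR work request for this MR */
            sg += n;
            nents -= n;
    }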

I think this would not apply to SG_GAP MRs, which should
already be able to handle discontiguous SGLs?

Note this doesn't apply to most NFS READs, where just the data
payload is going via RDMA Write, and the payload is already in
a contiguous piece of memory. But Reply chunks, which are used
for READDIRs and other requests, can be built from discontiguous
memory.

I haven't looked closely at the RDMA Read logic, but I think
it always reads into a contiguous set of pages, then builds
the xdr_buf out of that. It shouldn't have the same problem
(and it is already known to work with FRWR ;-).


--
Chuck Lever




* Re: NFSD generic R/W API (sendto path) performance results
       [not found]                       ` <c6190e4c-9b8e-3937-ba38-7861eebeaaae-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2016-11-17 20:42                         ` Chuck Lever
       [not found]                           ` <EB5A41EB-53AB-4BC9-A5A3-893A9828A5C9-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2016-11-17 20:42 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Steve Wise, List Linux RDMA Mailing


> On Nov 17, 2016, at 3:20 PM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> 
>>>> Also did you try to always register for > max_sge
>>>> calls?  The code can already register all segments with the
>>>> rdma_rw_force_mr module option, so it would only need a small tweak for
>>>> that behavior.
>>> 
>>> For various reasons I decided the design should build one WR chain for
>>> each RDMA segment provided by the client. Good clients expose just
>>> one RDMA segment for the whole NFS READ payload.
>>> 
>>> Does force_mr make the generic API use FRWR with RDMA Write? I had
>>> assumed it changed only the behavior with RDMA Read. I'll try that
>>> too, if RDMA Write can easily be made to use FRWR.
>> 
>> Unfortunately, some RPC replies are formed from two or three
>> discontiguous buffers. The gap test in ib_sg_to_pages returns
>> a smaller number than sg_nents in this case, and rdma_rw_init_ctx
>> fails.
>> 
>> Thus with my current prototype I'm not able to test with FRWR.
>> 
>> I could fix this in my prototype, but it would be nicer for me if
>> rdma_rw_init_ctx handled this case the same for FRWR as it does
>> for physical addressing, which doesn't seem to have any problem
>> with a discontiguous SGL.
>> 
>> 
>>> But I'd like a better explanation for this result. Could be a bug
>>> in my implementation, my design, or in the driver. Continuing to
>>> investigate.
> 
> Hi Chuck, sorry for the late reply (have been busy lately..)
> 
> I think that the Call-to-first-Write phenomenon you are seeing makes
> perfect sense, the question is, is a QD=1 1M transfers latency that
> interesting? Did you see a positive effect on small (say 4k) transfers?
> both latency and iops scalability should be able to improve especially
> when serving multiple clients.
> 
> If indeed you feel that this is an interesting workload to optimize, I
> think we can come up with something.

I believe 1MB transfers are interesting: NFS is frequently used in
back-up scenarios, for example, and believe it or not, also for
non-linear editing and animation (4K video).

QD=1 exposes the individual components of latency. In this case, we
can clearly see the cost of preparing the data payload for transfer.
It's basically a tweak so we can debug the problem.

In the "Real World" I expect to see larger transfers, where several
1MB I/Os are dispatched in parallel. I don't reach fabric bandwidth
until 10 or more are in flight, which I think should be improved.

Wrt 4KB, I didn't see much change there, though I admit I didn't
expect much change since both cases have to DMA map a page, and post
one Write WR, so I haven't looked too closely. I'm already down
around 32us for a 4KB NFS READ, even without the server changes,
and 30us for 4KB NFS READ all-inline.

I agree, though, that there is a 3:1 reduction in ib_post_send calls
with the generic API in my test harness, and that can only be a good
thing.


> About the First-to-last-Write, thats weird, and sound like a bug
> somewhere. Maybe Mellanox folks can tell us if splitting 1M to multiple
> writes works better (although I cannot comprehend why).
> 
> Question, are the send and receive cqs still in IB_POLL_SOFTIRQ mode?

Yes, both are still in SOFTIRQ. A while back I played with using the
Work Queue mode, and noticed some slowdowns when the Send CQ was
changed to use Work Queue. I didn't explore it further.

Someday, these will need to be changed, so that the server runs
entirely in process context (as it does for TCP, AFAICT). That gets
rid of BH flapping, but adds heavy-weight context switching latencies.
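
For reference, the knob in question is just the poll context passed when
the CQ is allocated. Something along these lines (not the actual svcrdma
code) would move the Send CQ to a workqueue:

    cq = ib_alloc_cq(dev, NULL, nr_cqe, comp_vector, IB_POLL_WORKQUEUE);

versus IB_POLL_SOFTIRQ today.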


--
Chuck Lever




* Re: NFSD generic R/W API (sendto path) performance results
       [not found]                           ` <EB5A41EB-53AB-4BC9-A5A3-893A9828A5C9-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-11-23 15:01                             ` Chuck Lever
  0 siblings, 0 replies; 11+ messages in thread
From: Chuck Lever @ 2016-11-23 15:01 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Steve Wise, List Linux RDMA Mailing


> On Nov 17, 2016, at 3:42 PM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> 
>> 
>> On Nov 17, 2016, at 3:20 PM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>> 
>> 
>>>>> Also did you try to always register for > max_sge
>>>>> calls?  The code can already register all segments with the
>>>>> rdma_rw_force_mr module option, so it would only need a small tweak for
>>>>> that behavior.
>>>> 
>>>> For various reasons I decided the design should build one WR chain for
>>>> each RDMA segment provided by the client. Good clients expose just
>>>> one RDMA segment for the whole NFS READ payload.
>>>> 
>>>> Does force_mr make the generic API use FRWR with RDMA Write? I had
>>>> assumed it changed only the behavior with RDMA Read. I'll try that
>>>> too, if RDMA Write can easily be made to use FRWR.
>>> 
>>> Unfortunately, some RPC replies are formed from two or three
>>> discontiguous buffers. The gap test in ib_sg_to_pages returns
>>> a smaller number than sg_nents in this case, and rdma_rw_init_ctx
>>> fails.
>>> 
>>> Thus with my current prototype I'm not able to test with FRWR.
>>> 
>>> I could fix this in my prototype, but it would be nicer for me if
>>> rdma_rw_init_ctx handled this case the same for FRWR as it does
>>> for physical addressing, which doesn't seem to have any problem
>>> with a discontiguous SGL.
>>> 
>>> 
>>>> But I'd like a better explanation for this result. Could be a bug
>>>> in my implementation, my design, or in the driver. Continuing to
>>>> investigate.
>> 
>> Hi Chuck, sorry for the late reply (have been busy lately..)
>> 
>> I think that the Call-to-first-Write phenomenon you are seeing makes
>> perfect sense, the question is, is a QD=1 1M transfers latency that
>> interesting? Did you see a positive effect on small (say 4k) transfers?
>> both latency and iops scalability should be able to improve especially
>> when serving multiple clients.
>> 
>> If indeed you feel that this is an interesting workload to optimize, I
>> think we can come up with something.
> 
> I believe 1MB transfers are interesting: NFS is frequently used in
> back-up scenarios, for example, and believe it or not, also for
> non-linear editing and animation (4K video).
> 
> QD=1 exposes the individual components of latency. In this case, we
> can clearly see the cost of preparing the data payload for transfer.
> It's basically a tweak so we can debug the problem.
> 
> In the "Real World" I expect to see larger transfers, where several
> 1MB I/Os are dispatched in parallel. I don't reach fabric bandwidth
> until 10 or more are in flight, which I think should be improved.

I've found what looks like the problem.

After disabling DMA API and IOMMU debugging, the post-conversion
server shows 1MB NFS READ latency averaging about 403us, measured
via ibdump captured on the server. Pre-conversion, latency averages
about 397us in my set-up.

Post-conversion, it takes a little longer for the first RDMA Write
request after the NFS READ Call arrives (longer Call-to-first-Write),
but the RDMA Writes are transmitted on average a little faster with
the generic API (shorter First-to-last-Write).

But the "can't reach fabric bandwidth" issue appears to be a client
issue, not a server issue.

I compared a client-side and server-side ibdump capture taken during
the same benchmark run. The server emits an RDMA Write every
microsecond or two, like clockwork. The client, though, shows
occasional RDMA Write Middles arriving after a several hundred
microsecond pause. That's enough to slow down the time between NFS
READ Reply and the next NFS READ Call, and that impacts benchmarked
throughput.

ibqueryerrors shows the client's HCA and switch port are congested
(PortXmitWait). That counter goes up whenever the NFS workload
involves a significant number of RDMA Writes.

During benchmarking, I have used the default NFS rsize of 1MB.

If I mount with a smaller rsize, the ratio of RDMA Writes streamed
per NFS READ request goes down. At rsize=262144, iozone can reach
very close to fabric bandwidth (5.2GB/s) with 2MB and 4MB I/O, which
is about the best I can hope for.
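(Those runs were just remounts with a smaller rsize, something along the
lines of "-o vers=3,proto=rdma,rsize=262144".)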

This congestion issue also makes the iozone 1MB QD=1 benchmark
results highly variable.

With ConnectX-4 on the client, the congestion problem seems worse.
The good rsize=256KB result happens only when the client is using its
CX-3 Pro HCA.

Somehow I need to determine why the client's HCA gets hosed up during
heavy RDMA Write workloads. Could be the HCAs need to be in different
slots? Maybe it's NUMA effects? Perhaps the workload's ratio between
RDMA Write and RDMA Send?

Any advice is appreciated! ;-)


--
Chuck Lever


