* Remove single NFS client performance bottleneck: Only 4 nfsd active
@ 2020-01-26 23:41 Sven Breuner
  2020-01-27 14:06 ` Chuck Lever
From: Sven Breuner @ 2020-01-26 23:41 UTC (permalink / raw)
  To: linux-nfs

Hi,

I'm using the kernel NFS client/server and am trying to read as many small files 
per second as possible from a single NFS client, but seem to run into a bottleneck.

Maybe this is just a tunable that I am missing, because nothing looks saturated: the 
CPUs on client and server are mostly idle, the 100Gbit (RoCE) network links between 
them are mostly idle, and the NVMe drives in the server are mostly idle. (The server 
also has enough RAM to easily fit my test data set in the ext4/xfs page cache, but a 
2nd read of the data set from the RAM cache doesn't change the result much.)

This is my test case:
# Create 1.6M 10KB files through 128 mdtest processes in different directories...
$ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 
100 -z 1 -b 128 -L -u -w 10240 -e 10240 -C

# Read all the files through 128 mdtest processes (the case that matters 
primarily for my test)...
$ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 
100 -z 1 -b 128 -L -u -w 10240 -e 10240 -E

The result is about 20,000 file reads per sec, so only ~200MB/s network throughput.

I noticed in "top" that only 4 nfsd processes are active, so I'm wondering why the 
load is not spread across more of my 64 /proc/fs/nfsd/threads. Even the few nfsd 
processes that are active use less than 50% of a core each, and the CPUs are shown 
as >90% idle in "top" on client and server during the read phase.

I've tried:
* CentOS 7.5 and 7.6 kernels (3.10.0-...) on client and server; and Ubuntu 18 
with 4.18 kernel on server side
* TCP & RDMA
* Mounted as NFSv3/v4.1/v4.2
* Increased tcp_slot_table_entries to 1024

...but none of that changed the fact that only 4 nfsd processes are active on the 
server, and I get the same result even if /proc/fs/nfsd/threads is set to only 4 
instead of 64.
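
For completeness, this is roughly how I set the thread count and the slot table; 
the persistent config locations named in the comments are the usual ones but may 
differ between CentOS and Ubuntu:

# server: raise the nfsd thread count (persisted via RPCNFSDCOUNT in
# /etc/sysconfig/nfs on CentOS, or "threads=" in /etc/nfs.conf on newer distros)
$ echo 64 > /proc/fs/nfsd/threads

# client: raise the RPC slot table before mounting (can also be set with
# "options sunrpc tcp_slot_table_entries=1024" in /etc/modprobe.d/sunrpc.conf)
$ echo 1024 > /proc/sys/sunrpc/tcp_slot_table_entries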

Any pointer to how I can overcome this limit will be greatly appreciated.

Thanks in advance

Sven



* Re: Remove single NFS client performance bottleneck: Only 4 nfsd active
  2020-01-26 23:41 Remove single NFS client performance bottleneck: Only 4 nfsd active Sven Breuner
@ 2020-01-27 14:06 ` Chuck Lever
  2020-01-27 14:12   ` Chuck Lever
From: Chuck Lever @ 2020-01-27 14:06 UTC (permalink / raw)
  To: Sven Breuner; +Cc: Linux NFS Mailing List

Hi Sven-

> On Jan 26, 2020, at 6:41 PM, Sven Breuner <sven@excelero.com> wrote:
> 
> Hi,
> 
> I'm using the kernel NFS client/server and am trying to read as many small files per second as possible from a single NFS client, but seem to run into a bottleneck.
> 
> Maybe this is just a tunable that I am missing, because the CPUs on client and server side are mostly idle, the 100Gbit (RoCE) network links between client and server are also mostly idle and the NVMe drives in the server are also mostly idle (and the server has enough RAM to easily fit my test data set in the ext4/xfs page cache, but also a 2nd read of the data set from the RAM cache doesn't change the result much).
> 
> This is my test case:
> # Create 1.6M 10KB files through 128 mdtest processes in different directories...
> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -C
> 
> # Read all the files through 128 mdtest processes (the case that matters primarily for my test)...
> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -E
> 
> The result is about 20,000 file reads per sec, so only ~200MB/s network throughput.

What is the typical size of the NFS READ I/Os on the wire?

Are you sure your mpirun workload is generating enough parallelism?
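
One quick way to check both, assuming nfs-utils is installed on the client, is to 
watch the per-mount statistics while the test runs (the mount point here is just an 
example):

# average kB per on-the-wire READ and the op rate for that mount, every 5 seconds
$ nfsiostat 5 /mnt/nfs

# raw per-op counters: op counts, bytes sent/received, queue and RTT times
$ grep -A 20 'per-op statistics' /proc/self/mountstats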


> I noticed in "top" that only 4 nfsd processes are active, so I'm wondering why the load is not spread across more of my 64 /proc/fs/nfsd/threads, but even the few nfsd processes that are active use less than 50% of their core each. The CPUs are shown as >90% idle in "top" on client and server during the read phase.
> 
> I've tried:
> * CentOS 7.5 and 7.6 kernels (3.10.0-...) on client and server; and Ubuntu 18 with 4.18 kernel on server side
> * TCP & RDMA
> * Mounted as NFSv3/v4.1/v4.2
> * Increased tcp_slot_table_entries to 1024
> 
> ...but all that didn't change the fact that only 4 nfsd processes are active on the server and thus I'm getting the same result already if /proc/fs/nfsd/threads is set to only 4 instead of 64.
> 
> Any pointer to how I can overcome this limit will be greatly appreciated.
> 
> Thanks in advance
> 
> Sven
> 

--
Chuck Lever





* Re: Remove single NFS client performance bottleneck: Only 4 nfsd active
  2020-01-27 14:06 ` Chuck Lever
@ 2020-01-27 14:12   ` Chuck Lever
  2020-01-27 17:27     ` Sven Breuner
From: Chuck Lever @ 2020-01-27 14:12 UTC (permalink / raw)
  To: Sven Breuner; +Cc: Linux NFS Mailing List



> On Jan 27, 2020, at 9:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> Hi Sven-
> 
>> On Jan 26, 2020, at 6:41 PM, Sven Breuner <sven@excelero.com> wrote:
>> 
>> Hi,
>> 
>> I'm using the kernel NFS client/server and am trying to read as many small files per second as possible from a single NFS client, but seem to run into a bottleneck.
>> 
>> Maybe this is just a tunable that I am missing, because the CPUs on client and server side are mostly idle, the 100Gbit (RoCE) network links between client and server are also mostly idle and the NVMe drives in the server are also mostly idle (and the server has enough RAM to easily fit my test data set in the ext4/xfs page cache, but also a 2nd read of the data set from the RAM cache doesn't change the result much).
>> 
>> This is my test case:
>> # Create 1.6M 10KB files through 128 mdtest processes in different directories...
>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -C
>> 
>> # Read all the files through 128 mdtest processes (the case that matters primarily for my test)...
>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -E
>> 
>> The result is about 20,000 file reads per sec, so only ~200MB/s network throughput.
> 
> What is the typical size of the NFS READ I/Os on the wire?
> 
> Are you sure your mpirun workload is generating enough parallelism?

A couple of other thoughts:

What's the client hardware like? NUMA? Fast memory? CPU count?
Have you configured device interrupt affinity and used tuned
to disable CPU sleep states, etc?

Have you properly configured your 100GbE switch and cards?

I have a Mellanox SN2100 here and two hosts with CX-5 Ethernet.
The configuration of the cards and switch is critical to good
performance.
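
On the host side, these are the quick sanity checks I would run; the interface and 
device names below are examples, and mlnx_qos is only present if the Mellanox 
userspace packages are installed:

# current PFC/trust/QoS settings on the port
$ mlnx_qos -i enp129s0f0

# negotiated Ethernet speed and the RDMA port rate
$ ethtool enp129s0f0 | grep Speed
$ cat /sys/class/infiniband/mlx5_0/ports/1/rate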


>> I noticed in "top" that only 4 nfsd processes are active, so I'm wondering why the load is not spread across more of my 64 /proc/fs/nfsd/threads, but even the few nfsd processes that are active use less than 50% of their core each. The CPUs are shown as >90% idle in "top" on client and server during the read phase.
>> 
>> I've tried:
>> * CentOS 7.5 and 7.6 kernels (3.10.0-...) on client and server; and Ubuntu 18 with 4.18 kernel on server side
>> * TCP & RDMA
>> * Mounted as NFSv3/v4.1/v4.2
>> * Increased tcp_slot_table_entries to 1024
>> 
>> ...but all that didn't change the fact that only 4 nfsd processes are active on the server and thus I'm getting the same result already if /proc/fs/nfsd/threads is set to only 4 instead of 64.
>> 
>> Any pointer to how I can overcome this limit will be greatly appreciated.
>> 
>> Thanks in advance
>> 
>> Sven
>> 
> 
> --
> Chuck Lever

--
Chuck Lever





* Re: Remove single NFS client performance bottleneck: Only 4 nfsd active
  2020-01-27 14:12   ` Chuck Lever
@ 2020-01-27 17:27     ` Sven Breuner
  2020-01-27 17:45       ` Chuck Lever
From: Sven Breuner @ 2020-01-27 17:27 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Linux NFS Mailing List

Hi Chuck,

thanks for looking into this. (Answers inline...)

Chuck Lever wrote on 27.01.2020 15:12:
>
>> On Jan 27, 2020, at 9:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> Hi Sven-
>>
>>> On Jan 26, 2020, at 6:41 PM, Sven Breuner <sven@excelero.com> wrote:
>>>
>>> Hi,
>>>
>>> I'm using the kernel NFS client/server and am trying to read as many small files per second as possible from a single NFS client, but seem to run into a bottleneck.
>>>
>>> Maybe this is just a tunable that I am missing, because the CPUs on client and server side are mostly idle, the 100Gbit (RoCE) network links between client and server are also mostly idle and the NVMe drives in the server are also mostly idle (and the server has enough RAM to easily fit my test data set in the ext4/xfs page cache, but also a 2nd read of the data set from the RAM cache doesn't change the result much).
>>>
>>> This is my test case:
>>> # Create 1.6M 10KB files through 128 mdtest processes in different directories...
>>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -C
>>>
>>> # Read all the files through 128 mdtest processes (the case that matters primarily for my test)...
>>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -E
>>>
>>> The result is about 20,000 file reads per sec, so only ~200MB/s network throughput.
>> What is the typical size of the NFS READ I/Os on the wire?
The application is fetching each full 10KB file in a single read call (i.e. 
"read(fd, buf, 10240)") and NFS wsize/rsize is 512KB.
>> Are you sure your mpirun workload is generating enough parallelism?
Yes, MPI is only used to start the 128 processes and to aggregate the performance 
results at the end. For the actual file read phase, all 128 processes run 
completely independently without any communication/synchronization. Each process 
is working in its own subdir with its own set of 10KB files.
(Running the same test directly on the local xfs of the NFS server box results 
in ~350,000 10KB file reads per sec after a cache drop and >1 million 10KB file 
reads per sec from the page cache. Just mentioning this for completeness to show 
that the server side is not the limit.)
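
For reference, the local-xfs baseline was measured with the same mdtest read phase, 
just pointed at a directory on the server's local file system (the path below is an 
example):

$ echo 3 > /proc/sys/vm/drop_caches
$ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /data/xfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -E
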
> A couple of other thoughts:
>
> What's the client hardware like? NUMA? Fast memory? CPU count?

Client and server are dual socket Intel Xeon E5-2690 v4 @ 2.60GHz (14 cores per 
socket plus hyper threading), all 4 memory channels per socket populated with 
fastest possible DIMMs (DDR4 2400).

Also tried pool_mode auto/global/pernode on server side.

> Have you configured device interrupt affinity and used tuned
> to disable CPU sleep states, etc?

Yes, CPU power saving (frequency scaling) is disabled. Tried the tuned profiles 
latency-performance and throughput-performance. Also tried irqbalance and 
mlnx_affinity.

All without any significant effect unfortunately.
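
For reference, the profiles were switched and verified the standard way:

$ tuned-adm profile latency-performance
$ tuned-adm active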

> Have you properly configured your 100GbE switch and cards?
>
> I have a Mellanox SN2100 here and two hosts with CX-5 Ethernet.
> The configuration of the cards and switch is critical to good
> performance.
Yes, I can absolutely confirm that getting this part of the config right is 
critical for great performance :-) Everything is configured with PFC and ECN, and I 
double-checked that packets are tagged correctly and that the RoCE traffic is 
lossless. The topology is simple: client and server are connected to the same 
Mellanox switch, with nothing else happening on the switch.
>
>
>>> I noticed in "top" that only 4 nfsd processes are active, so I'm wondering why the load is not spread across more of my 64 /proc/fs/nfsd/threads, but even the few nfsd processes that are active use less than 50% of their core each. The CPUs are shown as >90% idle in "top" on client and server during the read phase.
>>>
>>> I've tried:
>>> * CentOS 7.5 and 7.6 kernels (3.10.0-...) on client and server; and Ubuntu 18 with 4.18 kernel on server side
>>> * TCP & RDMA
>>> * Mounted as NFSv3/v4.1/v4.2
>>> * Increased tcp_slot_table_entries to 1024
>>>
>>> ...but all that didn't change the fact that only 4 nfsd processes are active on the server and thus I'm getting the same result already if /proc/fs/nfsd/threads is set to only 4 instead of 64.
>>>
>>> Any pointer to how I can overcome this limit will be greatly appreciated.
>>>
>>> Thanks in advance
>>>
>>> Sven
>>>
>> --
>> Chuck Lever
> --
> Chuck Lever
>
>
>


* Re: Remove single NFS client performance bottleneck: Only 4 nfsd active
  2020-01-27 17:27     ` Sven Breuner
@ 2020-01-27 17:45       ` Chuck Lever
  2020-01-28 23:22         ` Sven Breuner
From: Chuck Lever @ 2020-01-27 17:45 UTC (permalink / raw)
  To: Sven Breuner; +Cc: Linux NFS Mailing List



> On Jan 27, 2020, at 12:27 PM, Sven Breuner <sven@excelero.com> wrote:
> 
> Hi Chuck,
> 
> thanks for looking into this. (Answers inline...)
> 
> Chuck Lever wrote on 27.01.2020 15:12:
>> 
>>> On Jan 27, 2020, at 9:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>> 
>>> Hi Sven-
>>> 
>>>> On Jan 26, 2020, at 6:41 PM, Sven Breuner <sven@excelero.com> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I'm using the kernel NFS client/server and am trying to read as many small files per second as possible from a single NFS client, but seem to run into a bottleneck.
>>>> 
>>>> Maybe this is just a tunable that I am missing, because the CPUs on client and server side are mostly idle, the 100Gbit (RoCE) network links between client and server are also mostly idle and the NVMe drives in the server are also mostly idle (and the server has enough RAM to easily fit my test data set in the ext4/xfs page cache, but also a 2nd read of the data set from the RAM cache doesn't change the result much).
>>>> 
>>>> This is my test case:
>>>> # Create 1.6M 10KB files through 128 mdtest processes in different directories...
>>>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -C
>>>> 
>>>> # Read all the files through 128 mdtest processes (the case that matters primarily for my test)...
>>>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -E
>>>> 
>>>> The result is about 20,000 file reads per sec, so only ~200MB/s network throughput.
>>> What is the typical size of the NFS READ I/Os on the wire?
> The application is fetching each full 10KB file in a single read op (so "read(fd, buf, 10240)" ) and NFS wsize/rsize is 512KB.

512KB is not going to matter if every file contains only 10KB.
This means 10KB READs. The I/O size is going to limit data
throughput. 20KIOPS seems low for RDMA, but is about right for
TCP.
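
Back of the envelope, your numbers are self-consistent: 20,000 reads per second at 
10KB per read is right around the 200MB/s you measured.

$ echo $((20000 * 10240 / 1000000))   # reads/sec * bytes/read, in MB/s
204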

If you see wire operations, then the client's page cache is not
being used at all?


>>> Are you sure your mpirun workload is generating enough parallelism?
> Yes, MPI is only used to start the 128 processes and aggregate the performance results in the end. For the actual file read phase, all 128 processes run completely independent without any communication/synchronization. Each process is working in its own subdir with its own set of 10KB files.
> (Running the same test directly on the local xfs of the NFS server box results in ~350,000 10KB file reads per sec after cache drop and >1 mio 10KB file reads per sec from page cache. Just mentioning this for the sake of completeness to show that this is not hitting a limit on the server side.)
>> A couple of other thoughts:
>> 
>> What's the client hardware like? NUMA? Fast memory? CPU count?
> 
> Client and server are dual socket Intel Xeon E5-2690 v4 @ 2.60GHz (14 cores per socket plus hyper threading), all 4 memory channels per socket populated with fastest possible DIMMs (DDR4 2400).

One recommendation: Disable HT in the BIOS.


> Also tried pool_mode auto/global/pernode on server side.

NUMA seems to matter more on the client than on the server.


>> Have you configured device interrupt affinity and used tuned
>> to disable CPU sleep states, etc?
> 
> Yes, CPU power saving (frequency scaling) disabled. Tried tuned profiles latency-performance and and throughput-performance. Also tried irqbalance and mlnx_affinity.
> 
> All without any significant effect unfortunately.

And lspci -vvv confirms you are getting the right PCIe link
settings on both systems?
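
That is, something like this on each host, checking that LnkSta matches LnkCap 
(the PCI address is just an example):

$ sudo lspci -vvv -s 81:00.0 | grep -E 'LnkCap:|LnkSta:'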


>> Have you properly configured your 100GbE switch and cards?
>> 
>> I have a Mellanox SN2100 here and two hosts with CX-5 Ethernet.
>> The configuration of the cards and switch is critical to good
>> performance.
> Yes, I can absolutely confirm that having this part of the config correct is critical for great performance :-) All configured with PFC and ECN and double-checked for packets to be tagged correctly and lossless in the RoCE case. The topology is simple: Client and server connected to same Mellanox switch, nothing else happening on the switch.
>> 
>> 
>>>> I noticed in "top" that only 4 nfsd processes are active, so I'm wondering why the load is not spread across more of my 64 /proc/fs/nfsd/threads, but even the few nfsd processes that are active use less than 50% of their core each. The CPUs are shown as >90% idle in "top" on client and server during the read phase.
>>>> 
>>>> I've tried:
>>>> * CentOS 7.5 and 7.6 kernels (3.10.0-...) on client and server; and Ubuntu 18 with 4.18 kernel on server side
>>>> * TCP & RDMA
>>>> * Mounted as NFSv3/v4.1/v4.2
>>>> * Increased tcp_slot_table_entries to 1024
>>>> 
>>>> ...but all that didn't change the fact that only 4 nfsd processes are active on the server and thus I'm getting the same result already if /proc/fs/nfsd/threads is set to only 4 instead of 64.
>>>> 
>>>> Any pointer to how I can overcome this limit will be greatly appreciated.
>>>> 
>>>> Thanks in advance
>>>> 
>>>> Sven
>>>> 
>>> --
>>> Chuck Lever
>> --
>> Chuck Lever

--
Chuck Lever





* Re: Remove single NFS client performance bottleneck: Only 4 nfsd active
  2020-01-27 17:45       ` Chuck Lever
@ 2020-01-28 23:22         ` Sven Breuner
  2020-01-29  0:43           ` Chuck Lever
From: Sven Breuner @ 2020-01-28 23:22 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Linux NFS Mailing List


Chuck Lever wrote on 27.01.2020 18:45:
>
>> On Jan 27, 2020, at 12:27 PM, Sven Breuner <sven@excelero.com> wrote:
>>
>> Hi Chuck,
>>
>> thanks for looking into this. (Answers inline...)
>>
>> Chuck Lever wrote on 27.01.2020 15:12:
>>>> On Jan 27, 2020, at 9:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>
>>>> Hi Sven-
>>>>
>>>>> On Jan 26, 2020, at 6:41 PM, Sven Breuner <sven@excelero.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm using the kernel NFS client/server and am trying to read as many small files per second as possible from a single NFS client, but seem to run into a bottleneck.
>>>>>
>>>>> Maybe this is just a tunable that I am missing, because the CPUs on client and server side are mostly idle, the 100Gbit (RoCE) network links between client and server are also mostly idle and the NVMe drives in the server are also mostly idle (and the server has enough RAM to easily fit my test data set in the ext4/xfs page cache, but also a 2nd read of the data set from the RAM cache doesn't change the result much).
>>>>>
>>>>> This is my test case:
>>>>> # Create 1.6M 10KB files through 128 mdtest processes in different directories...
>>>>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -C
>>>>>
>>>>> # Read all the files through 128 mdtest processes (the case that matters primarily for my test)...
>>>>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -E
>>>>>
>>>>> The result is about 20,000 file reads per sec, so only ~200MB/s network throughput.
>>>> What is the typical size of the NFS READ I/Os on the wire?
>> The application is fetching each full 10KB file in a single read op (so "read(fd, buf, 10240)" ) and NFS wsize/rsize is 512KB.
> 512KB is not going to matter if every file contains only 10KB.
> This means 10KB READs. The I/O size is going to limit data
> throughput. 20KIOPS seems low for RDMA, but is about right for
> TCP.
RDMA is about 30% faster (26K file reads per sec), but I was hoping for more like 
an order-of-magnitude increase.
By the way: is there an NFSoRDMA equivalent of tcp_slot_table_entries to increase, 
or is there no such limit for the RDMA transport?
> If you see wire operations, then the client's page cache is not
> being used at all?
I'm usually benchmarking after a fresh client mount or "echo 3 > 
/proc/sys/vm/drop_caches", because the actual production data set will have many 
more files (multiple terabytes) and thus won't fit in RAM.
>
>
>>>> Are you sure your mpirun workload is generating enough parallelism?
>> Yes, MPI is only used to start the 128 processes and aggregate the performance results in the end. For the actual file read phase, all 128 processes run completely independent without any communication/synchronization. Each process is working in its own subdir with its own set of 10KB files.
>> (Running the same test directly on the local xfs of the NFS server box results in ~350,000 10KB file reads per sec after cache drop and >1 mio 10KB file reads per sec from page cache. Just mentioning this for the sake of completeness to show that this is not hitting a limit on the server side.)
>>> A couple of other thoughts:
>>>
>>> What's the client hardware like? NUMA? Fast memory? CPU count?
>> Client and server are dual socket Intel Xeon E5-2690 v4 @ 2.60GHz (14 cores per socket plus hyper threading), all 4 memory channels per socket populated with fastest possible DIMMs (DDR4 2400).
> One recommendation: Disable HT in the BIOS.
Thanks for the recommendation. Tried, but no significant difference.
>> Also tried pool_mode auto/global/pernode on server side.
> NUMA seems to matter more on the client than on the server.
mpirun can also bind the mdtest processes to NUMA zones, but that also made no 
significant difference.
>
>
>>> Have you configured device interrupt affinity and used tuned
>>> to disable CPU sleep states, etc?
>> Yes, CPU power saving (frequency scaling) disabled. Tried tuned profiles latency-performance and and throughput-performance. Also tried irqbalance and mlnx_affinity.
>>
>> All without any significant effect unfortunately.
> And lspci -vvv confirms you are getting the right PCIe link
> settings on both systems?

Yes. I can also see >11GB/s network throughput between client and server through 
e.g. ib_send_bw.

Actually, the client and server each have two 100Gbit RoCE NICs, but it seems 
like there isn't a way currently to get the NFS client to spread the 
communication across two RDMA interfaces for a single mountpoint.

Assuming the hardware is not the limit, is there any software limit on parallelism 
in the NFS client? If an application on an NFS client generates 128 concurrent 
open/read/close sequences (from 128 threads) to 128 different files in the NFS 
mountpoint, will the NFS client actually send 128 concurrent open/read/close 
requests over the wire? Or will it limit itself for some reason to e.g. only 4 or 8 
concurrent requests, with the rest queued until one of the first requests has 
received a reply - for example because only 1 pending reply is allowed per 
connection and the client does not establish more than 4/8/... connections to the 
same server?
And the same question for the NFS server side: if a client sends 128 concurrent 
reads over the wire and knfsd has 128 threads, will it actually work on all 128 
reads in parallel? Or is there a limit in the server that e.g. maps requests from 
the same client to at most 4/8/... threads, no matter how many more threads knfsd 
has available?

Thanks a lot for your help

Sven

>
>
>>> Have you properly configured your 100GbE switch and cards?
>>>
>>> I have a Mellanox SN2100 here and two hosts with CX-5 Ethernet.
>>> The configuration of the cards and switch is critical to good
>>> performance.
>> Yes, I can absolutely confirm that having this part of the config correct is critical for great performance :-) All configured with PFC and ECN and double-checked for packets to be tagged correctly and lossless in the RoCE case. The topology is simple: Client and server connected to same Mellanox switch, nothing else happening on the switch.
>>>
>>>>> I noticed in "top" that only 4 nfsd processes are active, so I'm wondering why the load is not spread across more of my 64 /proc/fs/nfsd/threads, but even the few nfsd processes that are active use less than 50% of their core each. The CPUs are shown as >90% idle in "top" on client and server during the read phase.
>>>>>
>>>>> I've tried:
>>>>> * CentOS 7.5 and 7.6 kernels (3.10.0-...) on client and server; and Ubuntu 18 with 4.18 kernel on server side
>>>>> * TCP & RDMA
>>>>> * Mounted as NFSv3/v4.1/v4.2
>>>>> * Increased tcp_slot_table_entries to 1024
>>>>>
>>>>> ...but all that didn't change the fact that only 4 nfsd processes are active on the server and thus I'm getting the same result already if /proc/fs/nfsd/threads is set to only 4 instead of 64.
>>>>>
>>>>> Any pointer to how I can overcome this limit will be greatly appreciated.
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Sven
>>>>>
>>>> --
>>>> Chuck Lever
>>> --
>>> Chuck Lever
> --
> Chuck Lever
>
>
>


* Re: Remove single NFS client performance bottleneck: Only 4 nfsd active
  2020-01-28 23:22         ` Sven Breuner
@ 2020-01-29  0:43           ` Chuck Lever
From: Chuck Lever @ 2020-01-29  0:43 UTC (permalink / raw)
  To: Sven Breuner; +Cc: Linux NFS Mailing List



> On Jan 28, 2020, at 6:22 PM, Sven Breuner <sven@excelero.com> wrote:
> 
> 
> Chuck Lever wrote on 27.01.2020 18:45:
>> 
>>> On Jan 27, 2020, at 12:27 PM, Sven Breuner <sven@excelero.com> wrote:
>>> 
>>> Hi Chuck,
>>> 
>>> thanks for looking into this. (Answers inline...)
>>> 
>>> Chuck Lever wrote on 27.01.2020 15:12:
>>>>> On Jan 27, 2020, at 9:06 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>> 
>>>>> Hi Sven-
>>>>> 
>>>>>> On Jan 26, 2020, at 6:41 PM, Sven Breuner <sven@excelero.com> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I'm using the kernel NFS client/server and am trying to read as many small files per second as possible from a single NFS client, but seem to run into a bottleneck.
>>>>>> 
>>>>>> Maybe this is just a tunable that I am missing, because the CPUs on client and server side are mostly idle, the 100Gbit (RoCE) network links between client and server are also mostly idle and the NVMe drives in the server are also mostly idle (and the server has enough RAM to easily fit my test data set in the ext4/xfs page cache, but also a 2nd read of the data set from the RAM cache doesn't change the result much).
>>>>>> 
>>>>>> This is my test case:
>>>>>> # Create 1.6M 10KB files through 128 mdtest processes in different directories...
>>>>>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -C
>>>>>> 
>>>>>> # Read all the files through 128 mdtest processes (the case that matters primarily for my test)...
>>>>>> $ mpirun -hosts localhost -np 128 /path/to/mdtest -F -d /mnt/nfs/mdtest -i 1 -I 100 -z 1 -b 128 -L -u -w 10240 -e 10240 -E
>>>>>> 
>>>>>> The result is about 20,000 file reads per sec, so only ~200MB/s network throughput.
>>>>> What is the typical size of the NFS READ I/Os on the wire?
>>> The application is fetching each full 10KB file in a single read op (so "read(fd, buf, 10240)" ) and NFS wsize/rsize is 512KB.
>> 512KB is not going to matter if every file contains only 10KB.
>> This means 10KB READs. The I/O size is going to limit data
>> throughput. 20KIOPS seems low for RDMA, but is about right for
>> TCP.
> RDMA is about 30% faster (26K file reads per sec), but I was hoping for more like an order-of-magnitude increase.

Not an unreasonable hope. Here's a simple fio test I ran on my 100GbE systems with NFSv3:

fio-test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1024
...
fio-3.14
Starting 4 processes
fio-test: Laying out IO file (1 file / 1024MiB)
fio-test: Laying out IO file (1 file / 1024MiB)
fio-test: Laying out IO file (1 file / 1024MiB)
fio-test: Laying out IO file (1 file / 1024MiB)
Jobs: 4 (f=4): [r(4)][100.0%][r=1092MiB/s][r=280k IOPS][eta 00m:00s]
fio-test: (groupid=0, jobs=4): err= 0: pid=3075643: Tue Jan 28 19:27:10 2020
  read: IOPS=280k, BW=1094MiB/s (1147MB/s)(32.0GiB/30005msec)
    slat (nsec): min=1516, max=83293k, avg=7696.22, stdev=218022.20
    clat (usec): min=46, max=143098, avg=14618.22, stdev=11253.38
     lat (usec): min=50, max=143105, avg=14626.26, stdev=11256.38
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    5], 20.00th=[    7],
     | 30.00th=[    8], 40.00th=[   10], 50.00th=[   12], 60.00th=[   14],
     | 70.00th=[   17], 80.00th=[   21], 90.00th=[   28], 95.00th=[   36],
     | 99.00th=[   60], 99.50th=[   71], 99.90th=[   90], 99.95th=[   94],
     | 99.99th=[  108]
   bw (  MiB/s): min=  728, max= 1795, per=99.96%, avg=1093.10, stdev=55.40, samples=240
   iops        : min=186436, max=459578, avg=279833.92, stdev=14183.03, samples=240
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.03%, 500=0.08%, 750=0.08%
  lat (usec)   : 1000=0.08%
  lat (msec)   : 2=0.47%, 4=4.93%, 10=35.53%, 20=38.41%, 50=18.62%
  lat (msec)   : 100=1.75%, 250=0.03%
  cpu          : usr=10.66%, sys=17.51%, ctx=1627222, majf=0, minf=4136
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=8400192,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: bw=1094MiB/s (1147MB/s), 1094MiB/s-1094MiB/s (1147MB/s-1147MB/s), io=32.0GiB (34.4GB), run=30005-30005msec


It achieves better than a quarter-million 4KB read IOPS.

However, I do not believe there is an open/close with each I/O in this test; each 
thread opens one of the test files and then simply streams reads to it.
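
Reconstructed from the output above, the job was essentially the following; the 
--directory path stands in for whatever your NFS mount point is:

$ fio --name=fio-test --rw=randread --bs=4k --ioengine=libaio --iodepth=1024 \
      --numjobs=4 --size=1g --runtime=30 --time_based --group_reporting \
      --directory=/mnt/nfs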


> By the way: Is there an NFSoRDMA equivalent for tcp_slot_table_entries to increase or is there no such limit in case of RDMA transport?

[cel@morisot ~]$ cat /proc/sys/sunrpc/tcp_slot_table_entries
2
[cel@morisot ~]$ cat /proc/sys/sunrpc/tcp_max_slot_table_entries
65536
[cel@morisot ~]$ cat /proc/sys/sunrpc/rdma_slot_table_entries
128
[cel@morisot ~]$

The tcp_slot_table_entries file controls the number of pre-allocated
rpc_rqst's per transport. After 2 are in use, the transport uses
kmalloc and kfree for the others.

The tcp_max_slot_table_entries file caps the total number of rpc_rqst's
allowed per transport. The default setting is already quite large.

The rdma_slot_table_entries file caps the number of RPC-over-RDMA
credits the client supports. The Linux server grants at most 32 credits.
That number has increased to 64 credits in recent kernels.

The kernel also has a control on the server to set the number of
credits it will grant:

[cel@bazille linux]$ cat /proc/sys/sunrpc/svc_rdma/max_requests 
64
[cel@bazille linux]$

These two systems will use no more than 64 credits per connection.


>> If you see wire operations, then the client's page cache is not
>> being used at all?
> I'm usually benchmarking after fresh client mount or "echo 3 > /proc/sys/vm/drop_caches", because the actual production data set will have many more files (multiple Terabytes) and thus won't fit in RAM.

In the server's RAM? If that's the case, then your durable storage
will be the main bottleneck.


>>>>> Are you sure your mpirun workload is generating enough parallelism?
>>> Yes, MPI is only used to start the 128 processes and aggregate the performance results in the end. For the actual file read phase, all 128 processes run completely independent without any communication/synchronization. Each process is working in its own subdir with its own set of 10KB files.
>>> (Running the same test directly on the local xfs of the NFS server box results in ~350,000 10KB file reads per sec after cache drop and >1 mio 10KB file reads per sec from page cache. Just mentioning this for the sake of completeness to show that this is not hitting a limit on the server side.)
>>>> A couple of other thoughts:
>>>> 
>>>> What's the client hardware like? NUMA? Fast memory? CPU count?
>>> Client and server are dual socket Intel Xeon E5-2690 v4 @ 2.60GHz (14 cores per socket plus hyper threading), all 4 memory channels per socket populated with fastest possible DIMMs (DDR4 2400).
>> One recommendation: Disable HT in the BIOS.
> Thanks for the recommendation. Tried, but no significant difference.
>>> Also tried pool_mode auto/global/pernode on server side.
>> NUMA seems to matter more on the client than on the server.
> mpirun can also bind the mdtest processes to NUMA zones, but also no significant difference for that.

Keep the processes on the same node as the NIC and its interrupt vectors.
Again, not likely to give a 10x boost in IOPS, but it will help once you
figure out the main issue.


>>>> Have you configured device interrupt affinity and used tuned
>>>> to disable CPU sleep states, etc?
>>> Yes, CPU power saving (frequency scaling) disabled. Tried tuned profiles latency-performance and and throughput-performance. Also tried irqbalance and mlnx_affinity.
>>> 
>>> All without any significant effect unfortunately.
>> And lspci -vvv confirms you are getting the right PCIe link
>> settings on both systems?
> 
> Yes. I can also see >11GB/s network throughput between client and server through e.g. ib_send_bw.
> 
> Actually, the client and server each have two 100Gbit RoCE NICs, but it seems like there isn't a way currently to get the NFS client to spread the communication across two RDMA interfaces for a single mountpoint.

That's correct. You might try bonding the two NICs.

However, in general a single RDMA NIC should enable much better throughput
than a non-offloaded NIC.


> Assuming the hardware is not the limit, is there any software limit for parallelism in the NFS client?

No architected limit that I'm aware of. Lock contention can certainly be
a problem, though.

Also, if your kernel builds have Kernel Hacking options enabled, that can
have a considerable impact on throughput. Memory, lock, or data structure
debugging options will have an effect.

NFSv4.0 open owners maybe? Though you said your test results do not seem to
depend on the NFS version.

NFSv3 does not have on-the-wire OPEN and CLOSE, but IIUC it does force a
GETATTR operation to check the file's mtime at each open(2) and close(2).
I wonder if the "nocto" mount option would help the NFSv3 results?
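
That would look something like this (untested for your workload, just to show the 
options; the server path and mount point are placeholders, and actimeo additionally 
relaxes attribute caching, which is optional):

$ sudo mount -t nfs -o vers=3,proto=rdma,port=20049,nocto,actimeo=60 \
      server:/export /mnt/nfs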


> So if an application on an NFS client generates 128 concurrent open/read/close (from 128 threads) to 128 different files in the NFS mountpoint, will the NFS client then actually send 128 concurrent open/read/close over the wire or will it e.g. limit for some reason to only 4 or 8 concurrent requests and the rest will be queued until one of the first 4/8/... requests has received a reply - e.g. because there is only 1 pending reply allowed per connection and a client does not establish more than 4/8/... connections to the same server?

A connection can have as many concurrent pending replies as there are
outstanding calls.

With TCP, you can use tcpdump to look at wire behavior to see if there
are obvious problems.
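
For example (the interface name is a placeholder, and the tshark step assumes your 
build includes the ONC-RPC service response time statistics):

$ sudo tcpdump -i enp129s0f0 -s 256 -w /tmp/nfs.pcap port 2049
$ tshark -r /tmp/nfs.pcap -q -z rpc,srt,100003,3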


> And same question on the NFS server side: If a client sends 128 concurrent reads over the wire and the knfsd has 128 threads, will it actually work on all those 128 reads in parallel or is there a limit in the server that e.g. maps requests from the same client to a maximum of 4/8/... threads, no matter how many more threads the knfsd has available?

Each nfsd thread handles one RPC at a time. If there are 128 ingress NFS READs 
and 128 threads, the kernel will try to use all available threads.
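
One way to sanity-check that on the server is to sample the pool statistics before 
and after a run (the exact column set varies between kernel versions); the ratio of 
threads-woken to packets-arrived shows whether requests are really being handed to 
many different nfsd threads:

$ cat /proc/fs/nfsd/threads
$ cat /proc/fs/nfsd/pool_stats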


> Thanks a lot for your help
> 
> Sven
> 
>> 
>> 
>>>> Have you properly configured your 100GbE switch and cards?
>>>> 
>>>> I have a Mellanox SN2100 here and two hosts with CX-5 Ethernet.
>>>> The configuration of the cards and switch is critical to good
>>>> performance.
>>> Yes, I can absolutely confirm that having this part of the config correct is critical for great performance :-) All configured with PFC and ECN and double-checked for packets to be tagged correctly and lossless in the RoCE case. The topology is simple: Client and server connected to same Mellanox switch, nothing else happening on the switch.
>>>> 
>>>>>> I noticed in "top" that only 4 nfsd processes are active, so I'm wondering why the load is not spread across more of my 64 /proc/fs/nfsd/threads, but even the few nfsd processes that are active use less than 50% of their core each. The CPUs are shown as >90% idle in "top" on client and server during the read phase.
>>>>>> 
>>>>>> I've tried:
>>>>>> * CentOS 7.5 and 7.6 kernels (3.10.0-...) on client and server; and Ubuntu 18 with 4.18 kernel on server side
>>>>>> * TCP & RDMA
>>>>>> * Mounted as NFSv3/v4.1/v4.2
>>>>>> * Increased tcp_slot_table_entries to 1024
>>>>>> 
>>>>>> ...but all that didn't change the fact that only 4 nfsd processes are active on the server and thus I'm getting the same result already if /proc/fs/nfsd/threads is set to only 4 instead of 64.
>>>>>> 
>>>>>> Any pointer to how I can overcome this limit will be greatly appreciated.
>>>>>> 
>>>>>> Thanks in advance
>>>>>> 
>>>>>> Sven
>>>>>> 
>>>>> --
>>>>> Chuck Lever
>>>> --
>>>> Chuck Lever
>> --
>> Chuck Lever

--
Chuck Lever




