From: Dai Ngo <dai.ngo@oracle.com>
To: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified
Date: Thu, 19 Dec 2019 20:01:16 -0800 [thread overview]
Message-ID: <a41af3d6-8280-e315-fb65-a9285bad50ec@oracle.com> (raw)
In-Reply-To: <e04baa28-2460-4ced-e387-618ea32d827c@oracle.com>
Hi Anna, Trond,
I made a mistake with the 5.5 numbers. The VM that runs 5.5 has some
problems. There is no regression with 5.5, here are the new numbers:
Upstream Linux 5.5.0-rc1 [ORI] 93296: 3m10.917s 197891: 10m35.789s
Upstream Linux 5.5.0-rc1 [MOD] 98614: 1m59.649s 192801: 3m55.003s
My apologies for the mistake.
Now that there is no regression with 5.5, I'd like to get your
opinion on changing nfs_force_use_readdirplus back from
invalidate_mapping_pages to nfs_zap_mapping, to prevent the
current 'ls' from restarting READDIRPLUS3 from cookie 0. I'm
not quite sure about the intention of the earlier change from
nfs_zap_mapping to invalidate_mapping_pages, which is why I'm
seeking advice. Or do you have any other suggestions to achieve
the same result?
Thanks,
-Dai
On 12/17/19 4:34 PM, Dai Ngo wrote:
> Hi,
>
> I'd like to report an issue where 'ls -lrt' on an NFSv3 client takes
> a very long time to display the contents of a large directory
> (100k - 200k files) while the directory is being modified by
> another NFSv3 client.
>
> The problem can be reproduced using three systems: one serves
> as the NFS server, one runs the client doing the 'ls -lrt',
> and another runs the client that creates files on the server.
> Client1 creates files using this simple script:
>
>> #!/bin/sh
>>
>> if [ $# -lt 2 ]; then
>>     echo "Usage: $0 number_of_files base_filename"
>>     exit 1
>> fi
>>
>> nfiles=$1
>> fname=$2
>> echo "creating $nfiles files using filename[$fname]..."
>> i=0
>> while [ $i -lt $nfiles ]; do
>>     i=`expr $i + 1`
>>     echo "xyz" > $fname$i
>>     echo "$fname$i"
>> done
>
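As a side note, that loop forks `expr` once per file, which adds up at 100k+ files. A sketch of an equivalent loop using POSIX arithmetic expansion instead (the function name is my own) should generate the same workload with less client-side overhead:

```shell
#!/bin/sh
# Sketch: same file-creation workload as the script above, but using
# POSIX $((...)) arithmetic instead of forking `expr` per iteration.
create_files() {
    nfiles=$1
    fname=$2
    i=0
    while [ "$i" -lt "$nfiles" ]; do
        i=$((i + 1))
        echo "xyz" > "$fname$i"
    done
}
```

Called as e.g. `create_files 100000 /tmp/mnt/bd1/f` against the mount point.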
> Client2 runs 'time ls -lrt /tmp/mnt/bd1 |wc -l' in a loop.
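The Client2 side can be sketched as the loop below (the mount point /tmp/mnt/bd1 is from my setup; each iteration was wrapped in time(1) to get the elapsed-time numbers reported later):

```shell
#!/bin/sh
# Sketch of the Client2 measurement loop: repeatedly list the
# directory and count the lines of `ls -lrt` output.
run_ls_loop() {
    dir=$1
    iters=$2
    i=0
    while [ "$i" -lt "$iters" ]; do
        i=$((i + 1))
        ls -lrt "$dir" | wc -l
    done
}
run_ls_loop /tmp/mnt/bd1 3
```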
>
> The network traces and dtrace probes showed numerous READDIRPLUS3
> requests restarting from cookie 0, which seemed to indicate that the
> cached pages of the directory were invalidated, causing the pages
> to be refilled starting from cookie 0 up to the currently requested
> cookie. The cached page invalidation was traced to
> nfs_force_use_readdirplus(). To verify, I made the modification
> below, ran the test on various kernel versions, and captured the
> results shown further down.
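To put a rough number on that refill cost: this is only a toy model (plain shell arithmetic, not the NFS client code), but if reaching cookie i requires re-reading everything from cookie 0, then walking n cached pages costs on the order of n*(n+1)/2 page reads instead of n:

```shell
#!/bin/sh
# Toy model (not NFS code): if the directory's cached pages are
# invalidated between readdir calls, reaching page i means re-reading
# pages 1..i, so a full listing of n pages costs n*(n+1)/2 reads.
refill_cost() {
    n=$1
    total=0
    i=0
    while [ "$i" -lt "$n" ]; do
        i=$((i + 1))
        total=$((total + i))
    done
    echo "$total"
}
refill_cost 100   # prints 5050, vs. 100 reads without invalidation
```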
>
> The modification is:
>
>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> index a73e2f8bd8ec..5d4a64555fa7 100644
>> --- a/fs/nfs/dir.c
>> +++ b/fs/nfs/dir.c
>> @@ -444,7 +444,7 @@ void nfs_force_use_readdirplus(struct inode *dir)
>> if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
>> !list_empty(&nfsi->open_files)) {
>> set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
>> - invalidate_mapping_pages(dir->i_mapping, 0, -1);
>> + nfs_zap_mapping(dir, dir->i_mapping);
>> }
>> }
>
> Note that after this change, I did not see READDIRPLUS3 restarting
> with cookie 0 anymore.
>
> Below are the summary results of 'ls -lrt'. Each entry is the line
> count reported by 'wc -l' followed by the elapsed time. For each
> kernel version compared, there is one row for the original kernel
> [ORI] and one for the kernel with the above modification [MOD].
>
> I cloned dtrace-linux from here:
> github.com/oracle/dtrace-linux-kernel
>
> dtrace-linux 5.1.0-rc4 [ORI] 89191: 2m59.32s 193071: 6m7.810s
> dtrace-linux 5.1.0-rc4 [MOD] 98771: 1m55.900s 191322: 3m48.668s
>
> I cloned upstream Linux from here:
> git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> Upstream Linux 5.5.0-rc1 [ORI] 87891: 5m11.089s 160974: 14m4.384s
> Upstream Linux 5.5.0-rc1 [MOD] 87075: 5m2.057s 161421: 14m33.615s
>
> Please note that these are relative performance numbers and are used
> to illustrate the issue only.
>
> For reference, on the original dtrace-linux it takes about 9s for
> 'ls -lrt' to complete on a directory with 200k files if the directory
> is not modified while 'ls' is running.
>
> The numbers for the original upstream Linux are *really* bad, and
> the modification did not seem to have any effect. I'm not sure why;
> it could be that something else is going on here.
>
> The cache invalidation in nfs_force_use_readdirplus seems too
> drastic and might need to be reviewed. Even though this change
> helps, it did not get the 'ls' performance to where it's expected
> to be. I think that even though READDIRPLUS3 was used, the
> attribute cache was invalidated due to the directory modification,
> causing attribute cache misses that result in the calls to
> nfs_force_use_readdirplus shown in this stack trace:
>
> 0 17586 page_cache_tree_delete:entry
> vmlinux`remove_mapping+0x14
> vmlinux`invalidate_inode_page+0x7c
> vmlinux`invalidate_mapping_pages+0x1dd
> nfs`nfs_force_use_readdirplus+0x47
> nfs`__dta_nfs_lookup_revalidate_478+0x5dd
> vmlinux`d_revalidate.part.24+0x10
> vmlinux`lookup_fast+0x254
> vmlinux`walk_component+0x49
> vmlinux`path_lookupat+0x79
> vmlinux`filename_lookup+0xaf
> vmlinux`user_path_at_empty+0x36
> vmlinux`vfs_statx+0x77
> vmlinux`SYSC_newlstat+0x3d
> vmlinux`SyS_newlstat+0xe
> vmlinux`do_syscall_64+0x79
> vmlinux`entry_SYSCALL_64+0x18d
>
> Besides the overhead of refilling the page cache from cookie 0,
> I think the reason 'ls' still takes so long to complete is that
> the client has to send a bunch of additional LOOKUP/ACCESS requests
> over the wire to service the stat(2) calls from 'ls', due to the
> attribute cache misses.
>
> Please let me know what you think and whether any additional
> information is needed.
>
> Thanks,
> -Dai
>
>
Thread overview: 12+ messages
2019-12-18 0:34 'ls -lrt' performance issue on large dir while dir is being modified Dai Ngo
2019-12-20 4:01 ` Dai Ngo [this message]
2020-01-15 18:11 ` Dai Ngo
2020-01-15 18:54 ` Trond Myklebust
2020-01-15 19:06 ` Trond Myklebust
2020-01-15 19:28 ` Dai Ngo
2020-01-18 2:29 ` Dai Ngo
2020-01-18 15:58 ` Trond Myklebust
2020-01-18 17:26 ` Chuck Lever
2020-01-18 17:31 ` Trond Myklebust
2020-01-18 18:03 ` Dai Ngo
2020-01-20 20:52 ` Trond Myklebust