linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dai Ngo <dai.ngo@oracle.com>
To: Trond Myklebust <trondmy@hammerspace.com>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"Anna.Schumaker@netapp.com" <Anna.Schumaker@netapp.com>
Subject: Re: 'ls -lrt' performance issue on large dir while dir is being modified
Date: Sat, 18 Jan 2020 10:03:28 -0800	[thread overview]
Message-ID: <52079beb-1f27-35d8-c92a-6a6b430c7c8f@oracle.com> (raw)
In-Reply-To: <3456dea05ac1a2d82c077146e8638130e313edca.camel@hammerspace.com>


On 1/18/20 9:31 AM, Trond Myklebust wrote:
> On Sat, 2020-01-18 at 12:26 -0500, Chuck Lever wrote:
>>> On Jan 18, 2020, at 10:58 AM, Trond Myklebust <
>>> trondmy@hammerspace.com> wrote:
>>>
>>> On Fri, 2020-01-17 at 18:29 -0800, Dai Ngo wrote:
>>>> Hi Trond,
>>>>
>>>> On 1/15/20 11:06 AM, Trond Myklebust wrote:
>>>>> On Wed, 2020-01-15 at 18:54 +0000, Trond Myklebust wrote:
>>>>>> On Wed, 2020-01-15 at 10:11 -0800, Dai Ngo wrote:
>>>>>>> Hi Anna, Trond,
>>>>>>>
>>>>>>> Would you please let me know your opinion regarding
>>>>>>> reverting
>>>>>>> the
>>>>>>> change in
>>>>>>> nfs_force_use_readdirplus to call nfs_zap_mapping instead
>>>>>>> of
>>>>>>> invalidate_mapping_pages.
>>>>>>> This change is to prevent the cookie of the READDIRPLUS to
>>>>>>> be
>>>>>>> reset
>>>>>>> to 0 while
>>>>>>> an instance of 'ls' is running and the directory is being
>>>>>>> modified.
>>>>>>>
>>>>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index
>>>>>>>> a73e2f8bd8ec..5d4a64555fa7 100644 --- a/fs/nfs/dir.c +++
>>>>>>>> b/fs/nfs/dir.c @@ -444,7 +444,7 @@ void
>>>>>>>> nfs_force_use_readdirplus(struct inode *dir)      if
>>>>>>>> (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
>>>>>>>> !list_empty(&nfsi->open_files)) {
>>>>>>>> set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags); -
>>>>>>>> invalidate_mapping_pages(dir->i_mapping, 0, -1); +
>>>>>>>> nfs_zap_mapping(dir, dir->i_mapping);      }  }
>>>>>>> Thanks,
>>>>>>> -Dai
>>>>>>>
>>>>>>> On 12/19/19 8:01 PM, Dai Ngo wrote:
>>>>>>>> Hi Anna, Trond,
>>>>>>>>
>>>>>>>> I made a mistake with the 5.5 numbers. The VM that runs
>>>>>>>> 5.5
>>>>>>>> has
>>>>>>>> some
>>>>>>>> problems. There is no regression with 5.5, here are the
>>>>>>>> new
>>>>>>>> numbers:
>>>>>>>>
>>>>>>>> Upstream Linux 5.5.0-rc1 [ORI] 93296: 3m10.917s  197891:
>>>>>>>> 10m35.789s
>>>>>>>> Upstream Linux 5.5.0-rc1 [MOD] 98614: 1m59.649s  192801:
>>>>>>>> 3m55.003s
>>>>>>>>
>>>>>>>> My apologies for the mistake.
>>>>>>>>
>>>>>>>> Now there is no regression with 5.5, I'd like to get your
>>>>>>>> opinion
>>>>>>>> regarding the change to revert the call from
>>>>>>>> invalidate_mapping_pages
>>>>>>>> to nfs_zap_mapping in nfs_force_use_readdirplus to
>>>>>>>> prevent
>>>>>>>> the
>>>>>>>> current 'ls' from restarting the READDIRPLUS3 from cookie
>>>>>>>> 0.
>>>>>>>> I'm
>>>>>>>> not quite sure about the intention of the prior change
>>>>>>>> from
>>>>>>>> nfs_zap_mapping to invalidate_mapping_pages so that is
>>>>>>>> why
>>>>>>>> I'm
>>>>>>>> seeking advise. Or do you have any suggestions to achieve
>>>>>>>> the
>>>>>>>> same?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Dai
>>>>>>>>
>>>>>>>> On 12/17/19 4:34 PM, Dai Ngo wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'd like to report an issue with 'ls -lrt' on NFSv3
>>>>>>>>> client
>>>>>>>>> takes
>>>>>>>>> a very long time to display the content of a large
>>>>>>>>> directory
>>>>>>>>> (100k - 200k files) while the directory is being
>>>>>>>>> modified
>>>>>>>>> by
>>>>>>>>> another NFSv3 client.
>>>>>>>>>
>>>>>>>>> The problem can be reproduced using 3 systems. One
>>>>>>>>> system
>>>>>>>>> serves
>>>>>>>>> as the NFS server, one system runs as the client that
>>>>>>>>> doing
>>>>>>>>> the
>>>>>>>>> 'ls -lrt' and another system runs the client that
>>>>>>>>> creates
>>>>>>>>> files
>>>>>>>>> on the server.
>>>>>>>>>      Client1 creates files using this simple script:
>>>>>>>>>
>>>>>>>>>> #!/bin/sh
>>>>>>>>>>
>>>>>>>>>> if [ $# -lt 2 ]; then
>>>>>>>>>>          echo "Usage: $0 number_of_files
>>>>>>>>>> base_filename"
>>>>>>>>>>          exit
>>>>>>>>>> fi    nfiles=$1
>>>>>>>>>> fname=$2
>>>>>>>>>> echo "creating $nfiles files using
>>>>>>>>>> filename[$fname]..."
>>>>>>>>>> i=0         while [ i -lt $nfiles ] ;
>>>>>>>>>> do            i=`expr $i + 1`
>>>>>>>>>>          echo "xyz" > $fname$i
>>>>>>>>>>          echo "$fname$i" done
>>>>>>>>> Client2 runs 'time ls -lrt /tmp/mnt/bd1 |wc -l' in a
>>>>>>>>> loop.
>>>>>>>>>
>>>>>>>>> The network traces and dtrace probes showed numerous
>>>>>>>>> READDIRPLUS3
>>>>>>>>> requests restarting  from cookie 0 which seemed to
>>>>>>>>> indicate
>>>>>>>>> the
>>>>>>>>> cached pages of the directory were invalidated causing
>>>>>>>>> the
>>>>>>>>> pages
>>>>>>>>> to be refilled starting from cookie 0 until the current
>>>>>>>>> requested
>>>>>>>>> cookie.  The cached page invalidation were tracked to
>>>>>>>>> nfs_force_use_readdirplus().  To verify, I made the
>>>>>>>>> below
>>>>>>>>> modification, ran the test for various kernel versions
>>>>>>>>> and
>>>>>>>>> captured the results shown below.
>>>>>>>>>
>>>>>>>>> The modification is:
>>>>>>>>>
>>>>>>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>>>>>>>>> index a73e2f8bd8ec..5d4a64555fa7 100644
>>>>>>>>>> --- a/fs/nfs/dir.c
>>>>>>>>>> +++ b/fs/nfs/dir.c
>>>>>>>>>> @@ -444,7 +444,7 @@ void
>>>>>>>>>> nfs_force_use_readdirplus(struct
>>>>>>>>>> inode
>>>>>>>>>> *dir)
>>>>>>>>>>       if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS)
>>>>>>>>>> &&
>>>>>>>>>>           !list_empty(&nfsi->open_files)) {
>>>>>>>>>>           set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi-
>>>>>>>>>>> flags);
>>>>>>>>>> -        invalidate_mapping_pages(dir->i_mapping, 0,
>>>>>>>>>> -1);
>>>>>>>>>> +        nfs_zap_mapping(dir, dir->i_mapping);
>>>>>>>>>>       }
>>>>>>>>>>   }
>>>>>> This change is only reverting part of commit 79f687a3de9e. My
>>>>>> problem
>>>>>> with that is as follows:
>>>>>>
>>>>>> RFC1813 states that NFSv3 READDIRPLUS cookies and verifiers
>>>>>> must
>>>>>> match
>>>>>> those returned by previous READDIRPLUS calls, and READDIR
>>>>>> cookies
>>>>>> and
>>>>>> verifiers must match those returned by previous READDIR
>>>>>> calls. It
>>>>>> says
>>>>>> nothing about being able to assume cookies from READDIR and
>>>>>> READDIRPLUS
>>>>>> calls are interchangeable. So the only reason I can see for
>>>>>> the
>>>>>> invalidate_mapping_pages() is to ensure that we do separate
>>>>>> the
>>>>>> two
>>>>>> cookie caches.
>>>> If I understand your concern correctly that in NFSv3 the client
>>>> must
>>>> maintain valid cookies and cookie verifiers when switching
>>>> between
>>>> READDIR and READDIRPLUS, or vice sersa, then I think the current
>>>> client
>>>> code handles this condition ok.
>>>>
>>>> On the client, both READDIR and READDIRPLUS requests use the
>>>> cookie
>>>> values
>>>> from the same cached pages of the directory so I don't think they
>>>> can
>>>> be
>>>> out of sync when the client switches between READDIRPLUS and
>>>> READDIR
>>>> requests for different nfs_readdir calls.
>>>>
>>>> In fact, currently the first nfs_readdir uses READDIRPLUS's to
>>>> fill
>>>> read
>>>> the entries and if there is no LOOKUP/GETATTR on one of the
>>>> directory
>>>> entries then the client reverts to READDIR's for subsequent
>>>> nfs_readdir
>>>> calls without invalidating any cached pages of the directory. If
>>>> there
>>>> are LOOKUP/GETATTR done on one of the directory entries then
>>>> nfs_advise_use_readdirplus is called which forces the client to
>>>> use
>>>> READDIRPLUS again for the next nfs_readdir.
>>>>
>>>> Unless the user mounts the export with 'nordirplus' option then
>>>> the
>>>> client uses only READDIRs for all requests, no switching takes
>>>> place.
>>> I don't understand your point.
>> The original point was that the directory's page cache seems to
>> be cleared a little too often (quite apart from switching between
>> READDIRPLUS and READDIR).
>>
>> I think Dai is saying that cache clearing is appropriate to defer
>> when the directory's mtime has changed but the READ method remains
>> the same. Otherwise repeatedly adding a new file to a very large
>> directory that is being read can trigger a situation where the
>> reading getdents loop never completes.
>>
> Fair enough, but the section of code he was touching with his patch
> should be only about switching between READDIR/PLUS methods.

The code that I'd to patch, nfs_force_use_readdirplus, is called when
there is an attribute cache miss which causes the client to make sure
READDIRPLUS is being used on the directory, regardless which READ method
is currently being used using for the directory. So it's possible that
the call to nfs_force_use_readdirplus will cause the next getdent/nfs_readdir
to switch to from READDIR to READDIRPLUS.

>
> The mtime monitoring happens elsewhere and is already set up to
> invalidate the cache only on opendir()/rewinddir().
>
>> My two cents Euro.
>>
>>
>>> The issue is that
>>> nfs_advise_use_readdirplus() can cause the behaviour to switch
>>> between
>>> use of READDIRPLUS and use of READDIR from one syscall to
>>> getdents() to
>>> the next.
>>> If the client is using the same page cache, across those syscalls,
>>> then
>>> it will end up caching a mixture of cookies. Furthermore, since the
>>> cookie that is used as an argument to the next call to
>>> READDIR/READDIRPLUS is taken from that page cache, then we can end
>>> up
>>> calling READDIRPLUS with a cookie that came from READDIR and vice
>>> versa.
>>>
>>> As I said, I'm not convinced that is legal in RFC1813 (NFSv3).

I think this is the contention point: the spec did not explicitly
mention anything regarding mixing of cookies from READDIR & READDIRPLUS.

However, as I mentioned, the current client implementation already mixing
cookies between READDIRPLUS and READDIR, everytime the user does a simple
'ls' on a large directory, without invalidating any mapping.

Also, as Chuck mentioned, we're not aware of any server implementation
that has problems with this mixing of cookies.

-Dai

>>>
>>> That is why we want to clear the page cache when we swap between
>>> use of
>>> READDIR and use of READDIRPLUS for the case of NFSv3.
>> Just curious, are you aware of an NFSv3 server implementation that
>> would have a problem with a client that mixes the cookies?
>>
>>
>>>> Thanks,
>>>> -Dai
>>>>
>>>>>> OTOH, for NFSv4, there is no separate READDIRPLUS function,
>>>>>> so
>>>>>> there
>>>>>> really does not appear to be any reason to clear the page
>>>>>> cache
>>>>>> at
>>>>>> all
>>>>>> as we're switching between requesting attributes or not.
>>>>>>
>>>>> Sorry... To spell out my objection to this change more clearly:
>>>>> The
>>>>> call to nfs_zap_mapping() makes no sense in either case.
>>>>>   * It defers the cache invalidation until the next call to
>>>>>     rewinddir()/opendir(), so it does not address the NFSv3
>>>>> concern.
>>>>>   * It would appear to be entirely superfluous for the NFSv4
>>>>> case.
>>>>>
>>>>> So a change that might be acceptable would be to keep the
>>>>> existing
>>>>> call
>>>>> to invalidate_mapping_pages() for NFSv3, but to remove it for
>>>>> NFSv4.
>>>>>
>>> -- 
>>> Trond Myklebust
>>> Linux NFS client maintainer, Hammerspace
>>> trond.myklebust@hammerspace.com
>> --
>> Chuck Lever
>>
>>
>>

  reply	other threads:[~2020-01-18 18:04 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-18  0:34 'ls -lrt' performance issue on large dir while dir is being modified Dai Ngo
2019-12-20  4:01 ` Dai Ngo
2020-01-15 18:11   ` Dai Ngo
2020-01-15 18:54     ` Trond Myklebust
2020-01-15 19:06       ` Trond Myklebust
2020-01-15 19:28         ` Dai Ngo
2020-01-18  2:29         ` Dai Ngo
2020-01-18 15:58           ` Trond Myklebust
2020-01-18 17:26             ` Chuck Lever
2020-01-18 17:31               ` Trond Myklebust
2020-01-18 18:03                 ` Dai Ngo [this message]
2020-01-20 20:52                   ` Trond Myklebust

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52079beb-1f27-35d8-c92a-6a6b430c7c8f@oracle.com \
    --to=dai.ngo@oracle.com \
    --cc=Anna.Schumaker@netapp.com \
    --cc=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@hammerspace.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).