linux-nfs.vger.kernel.org archive mirror
From: Trond Myklebust <trondmy@hammerspace.com>
To: "dai.ngo@oracle.com" <dai.ngo@oracle.com>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>
Cc: "bfields@fieldses.org" <bfields@fieldses.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server
Date: Tue, 30 Nov 2021 04:08:29 +0000
Message-ID: <c8eef4ab9cb7acdf506d35b7910266daa9ded18c.camel@hammerspace.com>
In-Reply-To: <B42B0F9C-57E2-4F58-8DBD-277636B92607@oracle.com>

On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote:
> 
> > On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> > 
> > 
> > > On 11/29/21 1:10 PM, Chuck Lever III wrote:
> > > 
> > > > > On Nov 29, 2021, at 2:36 PM, Dai Ngo <dai.ngo@oracle.com>
> > > > > wrote:
> > > > 
> > > > 
> > > > On 11/29/21 11:03 AM, Chuck Lever III wrote:
> > > > > Hello Dai!
> > > > > 
> > > > > 
> > > > > > On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com>
> > > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > On 11/29/21 9:30 AM, J. Bruce Fields wrote:
> > > > > > > On Mon, Nov 29, 2021 at 09:13:16AM -0800,
> > > > > > > dai.ngo@oracle.com wrote:
> > > > > > > > Hi Bruce,
> > > > > > > > 
> > > > > > > > On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote:
> > > > > > > > > On 11/17/21 4:34 PM, J. Bruce Fields wrote:
> > > > > > > > > > On Wed, Nov 17, 2021 at 01:46:02PM -0800,
> > > > > > > > > > dai.ngo@oracle.com wrote:
> > > > > > > > > > > On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote:
> > > > > > > > > > > > On 11/17/21 6:14 AM, J. Bruce Fields wrote:
> > > > > > > > > > > > > On Tue, Nov 16, 2021 at 03:06:32PM -0800,
> > > > > > > > > > > > > dai.ngo@oracle.com wrote:
> > > > > > > > > > > > > > Just a reminder that this patch is still
> > > > > > > > > > > > > > waiting for your review.
> > > > > > > > > > > > > Yeah, I was procrastinating and hoping you'd
> > > > > > > > > > > > > figure out the pynfs failure for me....
> > > > > > > > > > > > Last time I ran the 4.0 OPEN18 test by itself
> > > > > > > > > > > > and it passed. I will run all OPEN tests
> > > > > > > > > > > > together with 5.15-rc7 to see if the problem
> > > > > > > > > > > > you've seen is still there.
> > > > > > > > > > > I ran all tests in nfsv4.1 and nfsv4.0 with
> > > > > > > > > > > courteous and non-courteous
> > > > > > > > > > > 5.15-rc7 server.
> > > > > > > > > > > 
> > > > > > > > > > > NFSv4.1 results are the same for both the
> > > > > > > > > > > courteous and the non-courteous server:
> > > > > > > > > > > > Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed
> > > > > > > > > > > Results of nfs4.0 with non-courteous server:
> > > > > > > > > > > > Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed
> > > > > > > > > > > test failed: LOCK24
> > > > > > > > > > > 
> > > > > > > > > > > Results of nfs4.0 with courteous server:
> > > > > > > > > > > > Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed
> > > > > > > > > > > tests failed: LOCK24, OPEN18, OPEN30
> > > > > > > > > > > 
> > > > > > > > > > > The OPEN18 and OPEN30 tests pass if each is run
> > > > > > > > > > > by itself.
> > > > > > > > > > Could well be a bug in the tests, I don't know.
> > > > > > > > > The reason OPEN18 failed is that the test timed out
> > > > > > > > > waiting for the reply of an OPEN call. The RPC
> > > > > > > > > connection used for the test was configured with a
> > > > > > > > > 15-second timeout. Note that OPEN18 only fails when
> > > > > > > > > the tests are run with the 'all' option; the test
> > > > > > > > > passes when run by itself.
> > > > > > > > > 
> > > > > > > > > With the courteous server, by the time OPEN18 runs
> > > > > > > > > there are about 1026 courtesy 4.0 clients on the
> > > > > > > > > server, and all of them have opened the same file X
> > > > > > > > > with WRITE access. These clients were created by the
> > > > > > > > > previous tests. Since 4.0 does not have sessions, the
> > > > > > > > > client states are not cleaned up immediately on the
> > > > > > > > > server after each test completes, and the clients are
> > > > > > > > > allowed to become courtesy clients.
> > > > > > > > > 
> > > > > > > > > When OPEN18 runs (about 20 minutes after the 1st test
> > > > > > > > > started), it sends an OPEN of file X with
> > > > > > > > > OPEN4_SHARE_DENY_WRITE, which causes the server to
> > > > > > > > > check for conflicts with courtesy clients. The loop
> > > > > > > > > that checks the 1026 courtesy clients for share/access
> > > > > > > > > conflicts took less than 1 sec. But it took about 55
> > > > > > > > > secs, on my VM, for the server to expire all 1026
> > > > > > > > > courtesy clients.
> > > > > > > > > 
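As a rough illustration of the pattern described above (this is not
actual nfsd code, and every name in it is invented): the scan for
share/deny conflicts is a cheap in-memory pass over the courtesy
clients, while each expiry typically includes a synchronous
stable-storage update, so the total cost grows roughly linearly with
the number of courtesy clients being expired.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a courtesy client and its open state. */
struct courtesy_client {
    unsigned long id;
    bool holds_write_open;              /* opened file X with WRITE access */
    struct courtesy_client *next;
};

/* Stand-in for the per-client stable-storage update done on expiry. */
static void record_expiry(const struct courtesy_client *clp)
{
    /* On a real server this would be a synchronous write (e.g. nfsdcld). */
    printf("expired courtesy client %lu\n", clp->id);
}

/* Resolve conflicts with an OPEN that uses OPEN4_SHARE_DENY_WRITE. */
static void resolve_deny_write_conflicts(struct courtesy_client *list)
{
    struct courtesy_client *clp;

    for (clp = list; clp != NULL; clp = clp->next) {
        /* Cheap in-memory check: well under a second for ~1000 clients. */
        if (!clp->holds_write_open)
            continue;
        /* Expensive part: one expiry (stable-storage write) per client. */
        record_expiry(clp);
    }
}

int main(void)
{
    struct courtesy_client c2 = { 2, true, NULL };
    struct courtesy_client c1 = { 1, true, &c2 };

    resolve_deny_write_conflicts(&c1);
    return 0;
}
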
> > > > > > > > > I modified pynfs to configure the 4.0 RPC connection
> > > > > > > > > with a 60-second timeout and OPEN18 now consistently
> > > > > > > > > passes. The 4.0 test results are now the same for the
> > > > > > > > > courteous and non-courteous server:
> > > > > > > > > 
> > > > > > > > > 8 Skipped, 1 Failed, 0 Warned, 577 Passed
> > > > > > > > > 
> > > > > > > > > Note that 4.1 tests do not suffer this timeout
> > > > > > > > > problem because the
> > > > > > > > > 4.1 clients and sessions are destroyed after each
> > > > > > > > > test completes.
> > > > > > > > Do you want me to send the patch to increase the
> > > > > > > > timeout for pynfs, or is there anything else you think
> > > > > > > > we should do?
> > > > > > > I don't know.
> > > > > > > 
> > > > > > > 55 seconds to clean up 1026 clients is about 50ms per
> > > > > > > client, which is
> > > > > > > pretty slow.  I wonder why.  I guess it's probably
> > > > > > > updating the stable
> > > > > > > storage information.  Is /var/lib/nfs/ on your server
> > > > > > > backed by a hard
> > > > > > > drive or an SSD or something else?
> > > > > > My server is a VirtualBox VM with 1 CPU, 4GB of RAM and a
> > > > > > 64GB hard disk. I think a production system that supports
> > > > > > this many clients should have faster CPUs and faster
> > > > > > storage.
> > > > > > 
> > > > > > > I wonder if that's an argument for limiting the number of
> > > > > > > courtesy
> > > > > > > clients.
> > > > > > I think we might want to treat 4.0 clients a bit
> > > > > > differently from 4.1 clients. With 4.0, every client will
> > > > > > become a courtesy client after the client is done with the
> > > > > > export and unmounts it.
> > > > > It should be safe for a server to purge a client's lease
> > > > > immediately
> > > > > if there is no open or lock state associated with it.
> > > > In this case, each client has opened files so there are open
> > > > states
> > > > associated with them.
> > > > 
> > > > > When an NFSv4.0 client unmounts, all files should be closed
> > > > > at that
> > > > > point,
> > > > I'm not sure pynfs does proper cleanup after each subtest; I
> > > > will check. There must be state associated with the client in
> > > > order for it to become a courtesy client.
> > > Makes sense. Then a synthetic client like pynfs can DoS a
> > > courteous
> > > server.
> > > 
> > > 
> > > > > so the server can wait for the lease to expire and purge it
> > > > > normally. Or am I missing something?
> > > > When a 4.0 client's lease expires and there are still states
> > > > associated with the client, the server allows this client to
> > > > become a courtesy client.
> > > I think the same thing happens if an NFSv4.1 client neglects to
> > > send
> > > DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is
> > > broken
> > > or malicious, but the server faces the same issue of protecting
> > > itself from a DoS attack.
> > > 
> > > IMO you should consider limiting the number of courteous clients
> > > the server can hold onto. Let's say that number is 1000. When the
> > > server wants to turn a 1001st client into a courteous client, it
> > > can simply expire and purge the oldest courteous client on its
> > > list. Otherwise, over time, the 24-hour expiry will reduce the
> > > set of courteous clients back to zero.
> > > 
> > > What do you think?
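
A rough sketch of that cap, with invented names and a plain doubly
linked list standing in for whatever structure nfsd would actually use:
courtesy clients are kept in LRU order, and once the limit is reached
the oldest one is expired before a new client is admitted.

#include <stddef.h>
#include <stdio.h>

#define MAX_COURTESY_CLIENTS 1000       /* the example figure from above */

/* Hypothetical courtesy-client record kept on an LRU list. */
struct courtesy_client {
    unsigned long id;
    struct courtesy_client *prev, *next;
};

struct courtesy_lru {
    struct courtesy_client *newest, *oldest;
    size_t count;
};

/* Stand-in for a full expiry: state teardown plus stable-storage update. */
static void expire_client(struct courtesy_lru *lru, struct courtesy_client *clp)
{
    if (clp->prev)
        clp->prev->next = clp->next;
    else
        lru->newest = clp->next;
    if (clp->next)
        clp->next->prev = clp->prev;
    else
        lru->oldest = clp->prev;
    lru->count--;
    printf("expired courtesy client %lu\n", clp->id);
}

/* Admit a new courtesy client, evicting the oldest one if over the cap. */
static void admit_courtesy_client(struct courtesy_lru *lru,
                                  struct courtesy_client *clp)
{
    if (lru->count >= MAX_COURTESY_CLIENTS && lru->oldest)
        expire_client(lru, lru->oldest);

    clp->prev = NULL;
    clp->next = lru->newest;
    if (lru->newest)
        lru->newest->prev = clp;
    lru->newest = clp;
    if (!lru->oldest)
        lru->oldest = clp;
    lru->count++;
}

int main(void)
{
    struct courtesy_lru lru = { NULL, NULL, 0 };
    struct courtesy_client a = { 1, NULL, NULL };
    struct courtesy_client b = { 2, NULL, NULL };

    admit_courtesy_client(&lru, &a);
    admit_courtesy_client(&lru, &b);
    printf("%zu courtesy clients held\n", lru.count);
    return 0;
}

The 24-hour expiry mentioned above would still run in the background;
the cap only bounds how many courtesy clients can pile up in between.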
> > 
> > Limiting the number of courteous clients to handle the case of
> > broken/malicious 4.1 clients seems reasonable as a last resort.
> > 
> > I think if a malicious 4.1 client could mount the server's export,
> > open a file (to create state) and repeat the same with a different
> > client id, then it seems like some basic security was already
> > broken: unauthorized clients are being allowed to mount the
> > server's exports.
> 
> You can do this today with AUTH_SYS. I consider it a genuine attack
> surface.
> 
> 
> > I think if we have to enforce a limit, then it's only for handling
> > seriously buggy 4.1 clients, which should not be the norm. The
> > issue with this is how to pick an optimal number that is suitable
> > for the running server, which can be a very slow or a very fast
> > server.
> > 
> > Note that even if we impose a limit, that does not completely solve
> > the problem with the pynfs 4.0 tests, since their RPC timeout is
> > configured as 15 secs, which is just enough to expire 277 clients
> > at 53ms per client, unless we limit it to ~270 clients, which I
> > think is too low.
> > 
> > This is what I plan to do:
> > 
> > 1. do not support 4.0 courteous clients, for sure.
> 
> Not supporting 4.0 isn't an option, IMHO. It is a fully supported
> protocol at this time, and the same exposure exists for 4.1; it's
> just a little harder to exploit.
> 
> If you submit the courteous server patch without support for 4.0, I
> think it needs to include a plan for how 4.0 will be added later.
> 
> > 

Why is there a problem here? The requirements are the same for 4.0 and
4.1 (or 4.2). If the lease under which the courtesy lock was
established has expired, then that courtesy lock must be released if
some other client requests a lock that conflicts with the cached lock
(unless the client breaks the courtesy framework by renewing that
original lease before the conflict occurs). Otherwise, it is completely
up to the server when it decides to actually release the lock.
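
Expressed as a minimal sketch (hypothetical types only; nothing here is
taken from the actual nfsd or client code), the decision at conflict
time is the same regardless of minor version:

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical record of the lease backing a cached (courtesy) lock. */
struct lease {
    time_t expiry;   /* when the lease ran out */
    bool renewed;    /* the client renewed it before the conflict arose */
};

enum lock_result { LOCK_GRANTED, LOCK_DENIED };

/*
 * Called when a new lock request conflicts with a lock cached under
 * @holder.  If the backing lease has expired and was not renewed, the
 * courtesy lock is released and the new request proceeds; otherwise the
 * request is denied exactly as it would be without courtesy locks.
 */
static enum lock_result resolve_conflict(const struct lease *holder, time_t now)
{
    if (now > holder->expiry && !holder->renewed) {
        /* ... release the courtesy lock(s) held under this lease ... */
        return LOCK_GRANTED;
    }
    return LOCK_DENIED;
}

int main(void)
{
    struct lease stale = { .expiry = 1000, .renewed = false };

    printf("conflict against an expired lease: %s\n",
           resolve_conflict(&stale, 2000) == LOCK_GRANTED ?
           "courtesy lock dropped, request granted" : "request denied");
    return 0;
}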

For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the
server when the client is actually done with the lease, making it easy
to determine when it is safe to release all the courtesy locks. However,
if the client does not send DESTROY_CLIENTID, then we're in the same
situation with 4.x (x>0) as we would be with bog-standard NFSv4.0. The
lease has expired, and so the courtesy locks are liable to be dropped.

At Hammerspace we have implemented courtesy locks, and our strategy is
that when a conflict occurs, we drop the entire set of courtesy locks
so that we don't have to deal with the "some locks were revoked"
scenario. The reason is that when we originally implemented courtesy
locks, the Linux NFSv4 client support for lock revocation was a lot
less sophisticated than today. My suggestion is that you might
therefore consider starting along this path, and then refining the
support to make revocation more nuanced once you are confident that the
coarser strategy is working as expected.
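
A minimal sketch of that coarser strategy, again with invented names
rather than the real Hammerspace or nfsd code: the first conflict
against any lock held by a courtesy client expires the whole client and
releases all of its courtesy state, so the client never has to cope
with a partially revoked set of locks.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-lock and per-client state for a courtesy client. */
struct courtesy_lock {
    unsigned long range_start, range_end;
    struct courtesy_lock *next;
};

struct courtesy_client {
    unsigned long id;
    struct courtesy_lock *locks;
};

/* Drop every courtesy lock the client holds and forget the client. */
static void expire_courtesy_client(struct courtesy_client *clp)
{
    struct courtesy_lock *lk = clp->locks;

    while (lk) {
        struct courtesy_lock *next = lk->next;
        free(lk);
        lk = next;
    }
    clp->locks = NULL;
    printf("expired courtesy client %lu and all of its locks\n", clp->id);
}

/* On any conflicting request, expire the whole client, not a single lock. */
static void handle_conflict(struct courtesy_client *clp)
{
    expire_courtesy_client(clp);
}

int main(void)
{
    struct courtesy_client clp = { .id = 7, .locks = NULL };
    struct courtesy_lock *lk = malloc(sizeof(*lk));

    if (!lk)
        return 1;
    lk->range_start = 0;
    lk->range_end = 100;
    lk->next = NULL;
    clp.locks = lk;

    handle_conflict(&clp);
    return 0;
}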

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



Thread overview: 53+ messages
2021-09-29  0:56 [PATCH RFC v5 0/2] nfsd: Initial implementation of NFSv4 Courteous Server Dai Ngo
2021-09-29  0:56 ` [PATCH RFC v5 1/2] fs/lock: add new callback, lm_expire_lock, to lock_manager_operations Dai Ngo
2021-09-29  0:56 ` [PATCH RFC v5 2/2] nfsd: Initial implementation of NFSv4 Courteous Server Dai Ngo
2021-10-01 20:53 ` [PATCH RFC v5 0/2] " J. Bruce Fields
2021-10-01 21:41   ` dai.ngo
2021-10-01 23:03     ` J. Bruce Fields
2021-11-16 23:06     ` dai.ngo
2021-11-17 14:14       ` J. Bruce Fields
2021-11-17 17:59         ` dai.ngo
2021-11-17 21:46           ` dai.ngo
2021-11-18  0:34             ` J. Bruce Fields
2021-11-22  3:04               ` dai.ngo
2021-11-29 17:13                 ` dai.ngo
2021-11-29 17:30                   ` J. Bruce Fields
2021-11-29 18:32                     ` dai.ngo
2021-11-29 19:03                       ` Chuck Lever III
2021-11-29 19:13                         ` Bruce Fields
2021-11-29 19:39                           ` dai.ngo
2021-11-29 19:36                         ` dai.ngo
2021-11-29 21:01                           ` dai.ngo
2021-11-29 21:10                           ` Chuck Lever III
2021-11-30  0:11                             ` dai.ngo
2021-11-30  1:42                               ` Chuck Lever III
2021-11-30  4:08                                 ` Trond Myklebust [this message]
2021-11-30  4:47                                   ` Chuck Lever III
2021-11-30  4:57                                     ` Trond Myklebust
2021-11-30  7:22                                       ` dai.ngo
2021-11-30 13:37                                         ` Trond Myklebust
2021-12-01  3:52                                           ` dai.ngo
2021-12-01 14:19                                             ` bfields
2021-11-30 15:36                                         ` Chuck Lever III
2021-11-30 16:05                                           ` Bruce Fields
2021-11-30 16:14                                             ` Trond Myklebust
2021-11-30 19:01                                               ` bfields
2021-11-30  7:13                                 ` dai.ngo
2021-11-30 15:32                                   ` Bruce Fields
2021-12-01  3:50                                     ` dai.ngo
2021-12-01 14:36                                       ` Bruce Fields
2021-12-01 14:51                                         ` Bruce Fields
2021-12-01 18:47                                           ` dai.ngo
2021-12-01 19:25                                             ` Bruce Fields
2021-12-02 17:53                                           ` Chuck Lever III
2021-12-01 17:42                                         ` Bruce Fields
2021-12-01 18:03                                           ` Bruce Fields
2021-12-01 19:50                                             ` Bruce Fields
2021-12-03 21:22                                               ` Bruce Fields
2021-12-03 21:55                                                 ` [PATCH] nfsdcld: use WAL journal for faster commits Bruce Fields
2021-12-03 22:07                                                   ` Chuck Lever III
2021-12-03 22:39                                                     ` Bruce Fields
2021-12-04  0:35                                                       ` Chuck Lever III
2021-12-04  1:24                                                         ` Bruce Fields
2021-12-06 15:46                                                           ` Chuck Lever III
2022-01-04 22:24                                                             ` Bruce Fields
