From: Scott Mayhew <smayhew@redhat.com>
To: linux-nfs@vger.kernel.org
Subject: [nfs-utils PATCH v3 0/2] Two rpc.gssd improvements
Date: Thu, 27 May 2021 15:11:00 -0400
Message-ID: <20210527191102.590275-1-smayhew@redhat.com>

Changes since v2:

- Cancellation of timed-out upcall threads is no longer the default.

Changes since v1:

- Replaced the upcall_thread_info.cancelled field with a flags field,
  to facilitate having the watchdog thread print an error message only
  once for each timed-out upcall thread.
- Removed the "created thread id" log message.
- Added missing break when parsing the "-C" option.
- Added some comments.

These patches provide the following improvements for rpc.gssd:
1) deal with failed thread creation
2) add a timeout for upcall threads

Both a failed thread creation and an upcall thread that never completes
can leave kernel mount processes hanging indefinitely.  A timeout was
originally proposed in the kernel
(https://lore.kernel.org/linux-nfs/20180618172542.45519-1-steved@redhat.com/)
but that approach was rejected by Trond:

    I'm saying that we can do this entirely in userland without any kernel
    changes. As long as that hasn't been attempted and proven to be flawed,
    then there is no reason to accept any kernel patches.

So this is my attempt at doing the timeout in userland.

The first patch was tested using a program that intercepts clone() and
changes the return code to -EAGAIN.

For the second patch, I have two different tests I've been running:

1) In an IPA domain in our lab, I have a server running 100 kerberized
nfsd containers.  The client has mount points for all 100 of those servers
defined in its /etc/fstab.  I run 'systemctl start remote-fs.target' to
kick off all those mounts in parallel, while running the following
systemtap script, which stalls every 100th gss_acquire_cred() call for
30 seconds to periodically mess with the mount processes:

---8<---
global i

probe begin { i=0 }

probe process("/lib64/libgssapi_krb5.so.2").function("gss_acquire_cred")
{
        if (++i % 100 == 0) {
                printf("delay (i=%d)\n", i)
                mdelay(30000)
        }
}
---8<---

I actually run the test in a loop... the driver script looks like this:

---8<---
#!/bin/bash
let i=1
while :; do
        echo "Round $i"
        echo "Mounting"
        systemctl start remote-fs.target
        echo -n "Waiting on mount.nfs processes to complete "
        while pgrep mount.nfs >/dev/null; do
                echo -n "."
                sleep 1
        done
        echo -e "\nNumber of nfs4 mounts: $(grep -c nfs4 /proc/mounts)"
        echo -e "Unmounting"
        umount -a -t nfs4
        if ! pgrep gssd >/dev/null; then
                echo "gssd is not running - check for crash"
                break
        fi
        echo "Sleeping 5 seconds"
        sleep 5
        let i=$i+1
done
---8<---
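
The inner "while pgrep mount.nfs" loop above waits for as long as any
mount.nfs process exists, so a mount that never completes would stall the
driver itself.  A minimal sketch of a bounded wait follows.  The helper name
wait_while and the 120-second limit in the usage comment are illustrative
assumptions, not part of the original driver script:

```shell
#!/bin/bash
# Hypothetical helper (not from the original driver script).
# Re-run the given command once per second for as long as it succeeds.
# Returns 0 once the command fails (the condition has cleared) and
# returns 1 if it is still succeeding after <limit> seconds.
wait_while() {
        local limit=$1
        shift
        local deadline=$(( SECONDS + limit ))
        while "$@" >/dev/null 2>&1; do
                if (( SECONDS >= deadline )); then
                        return 1
                fi
                sleep 1
        done
        return 0
}

# Possible use in the driver loop (the limit is an assumed value):
#   wait_while 120 pgrep mount.nfs || echo "mount.nfs still running"
```

With a bound in place, the driver could report a hung mount and stop the
round instead of printing dots forever.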

2) In an AD environment in our lab, I added 1000 test users.  On a
client machine I have all those users run a script that writes to files
on a NetApp SVM and while that script is running I trigger a LIF
migration on the filer.  That forces all those users to establish new
creds with the SVM.

That test looks basically like this:
# for i in `seq 1 1000`; do su - testuser$i -c "echo 'PASSWORD'|kinit"; done
# for i in `seq 1 1000`; do su - testuser$i -c "date >/mnt/t/tmp/testuser$i-testfile" & done
# for i in `seq 1 1000`; do su - testuser$i -c test.sh & done

where test.sh is a simple script that writes the date to a file in a
loop:

---8<---
#!/bin/bash
filename=/mnt/t/tmp/$(whoami)-testfile
for i in $(seq 1 300)
do
	date >$filename
	sleep 1
done
---8<---

While the test users are running the script I run one of the following
commands on the NetApp filer:

network interface migrate -vserver VSERVER -lif LIF -destination-node NODE
network interface revert -vserver VSERVER -lif LIF
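
The message does not say how a stalled writer is detected while the LIF
moves.  One possible check, sketched here as an assumption, is to list the
per-user test files whose modification time is older than some threshold,
since test.sh rewrites its file every second.  The helper name stale_files
and the 60-second threshold are hypothetical, and "stat -c %Y" assumes GNU
coreutils:

```shell
#!/bin/bash
# Hypothetical helper (not from the original message).
# Print every *-testfile under <dir> whose mtime is more than <max_age>
# seconds in the past.  Prints nothing if all files are fresh.
stale_files() {
        local dir=$1 max_age=$2
        local now f mtime
        now=$(date +%s)
        for f in "$dir"/*-testfile; do
                # Skip the unexpanded glob when no files match.
                [ -e "$f" ] || continue
                mtime=$(stat -c %Y "$f")
                if (( now - mtime > max_age )); then
                        echo "$f"
                fi
        done
}

# Possible use while test.sh is running (the threshold is an assumed value):
#   stale_files /mnt/t/tmp 60
```

Any path printed would point at a user whose writes stopped, which is worth
checking against the rpc.gssd log for a timed-out upcall.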

-Scott


Scott Mayhew (2):
  gssd: deal with failed thread creation
  gssd: add timeout for upcall threads

 nfs.conf               |   2 +
 utils/gssd/gssd.c      | 256 +++++++++++++++++++++++-----------
 utils/gssd/gssd.h      |  29 +++-
 utils/gssd/gssd.man    |  31 ++++-
 utils/gssd/gssd_proc.c | 306 ++++++++++++++++++++++++++++++++++-------
 5 files changed, 491 insertions(+), 133 deletions(-)

-- 
2.30.2

