From: Chuck Lever III <chuck.lever@oracle.com>
To: Mike Javorski <mike.javorski@gmail.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Neil Brown <neilb@suse.de>
Subject: Re: NFS server regression in kernel 5.13 (tested w/ 5.13.9)
Date: Mon, 16 Aug 2021 13:21:34 +0000	[thread overview]
Message-ID: <87777C39-BDDA-4E1E-83FA-5B46918A66D3@oracle.com> (raw)
In-Reply-To: <162907681945.1695.10796003189432247877@noble.neil.brown.name>



> On Aug 15, 2021, at 9:20 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Sun, 15 Aug 2021, Mike Javorski wrote:
>> I managed to get a cap with several discrete freezes in it, and I
>> included a chunk with 5 of them in the span of ~3000 frames. I added
>> packet comments at each frame that the tshark command reported as > 1
>> sec RPC wait. Just search for "Freeze" in (wire|t)shark in packet
>> details. This is with kernel 5.13.10 provided by Arch (See
>> https://github.com/archlinux/linux/compare/a37da2be8e6c85...v5.13.10-arch1
>> for diff vs mainline, nothing NFS/RPC related I can identify).
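
[Editor's note: the exact tshark invocation isn't shown above. A guarded
sketch of one way to flag such slow round-trips — rpc.time is Wireshark's
computed call-to-reply delay, and "capture.pcap" is a placeholder filename:]

```shell
# Sketch: list frames whose RPC reply arrived more than 1 second after
# the matching call. CAP is a placeholder path; requires tshark.
CAP=${CAP:-capture.pcap}
if command -v tshark >/dev/null 2>&1 && [ -f "$CAP" ]; then
    tshark -r "$CAP" -Y 'rpc.time > 1' -T fields -e frame.number -e rpc.time
fi
```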
>> 
>> I tried unsuccessfully to get any failures with the 5.12.15 kernel.
>> 
>> https://drive.google.com/file/d/1T42iX9xCdF9Oe4f7JXsnWqD8oJPrpMqV/view?usp=sharing
>> 
>> File should be downloadable anonymously.
> 
> Got it, thanks.
> 
> There are 3 RPC replies that arrive more than 1 second after the request.
> The replies are in frames 983, 1005, and 1777. These roughly correspond to
> where you added the "Freeze" annotations (I didn't know you could do that!).
> 
> 983:
>  This is the (Start of the) reply to a READ Request in frame 980.
>  It is a 128K read.  The whole reply has arrived by frame 1004, 2ms
>  later.
>  The request (Frame 980) is followed 13ms later by a TCP retransmit
>  of the request and then (20 usec later) a TCP ACK from the server.
> 
>  The fact that the client needed to retransmit seems a little odd,
>  but as it is the only retransmit in the whole capture, I don't think
>  we can read much into it.
> 
> 1005:
>  This is the reply to a 128K READ request in frame 793 - earlier than
>  the previous one.
>  So there were two READ requests, then a 2-second delay, then both
>  replies in reverse order.
> 
> 1777:
>  This is a similar READ reply - to 1761.
>  There were READs in 1760, 1761, and 1775
>  1760 is replied to almost immediately
>  1761 gets a reply in 4 seconds (1777)
>  1775 never gets a reply (in the available packet capture).
> 
> Looking at other delays ... most READs get a reply in under a millisec.
> There are about 20 where the reply is more than 1ms - the longest delay
> not already mentioned is 90ms with reply 1857.
> The pattern here is 
>   READ req (1)
>   GETATTR req
>   GETATTR reply
>   READ req (2)
>   READ reply (1)
>  pause
>   READ reply (2)
> 
> I suspect this is the same problem occurring, but it isn't so
> noticeable.
> 
> My first thought was that the reply might be getting stuck in the TCP
> transmit queue on the server, but checking the TSval in the TCP
> timestamp option shows that - for frame 983, which shows a 2-second delay
> - the TSval is also 2 seconds later than the previous packet.  So the
> delay happens before the TCP-level decision to create the packet.
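
[Editor's note: a guarded sketch of one way to dump those TSvals in bulk
rather than inspecting each packet — the field names are Wireshark's, and
"capture.pcap" is again a placeholder:]

```shell
# Sketch: print each frame's relative arrival time next to its TCP TSval,
# to see where the 2-second gap opens up. CAP is a placeholder path.
CAP=${CAP:-capture.pcap}
if command -v tshark >/dev/null 2>&1 && [ -f "$CAP" ]; then
    tshark -r "$CAP" -T fields \
        -e frame.number -e frame.time_relative -e tcp.options.timestamp.tsval
fi
```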
> 
> So I cannot see any real evidence to suggest a TCP-level problem.
> The time of 2 or 4 seconds - and maybe even 90ms - seem unlikely to be
> caused by an NFSd problem.
> 
> So my guess is that the delay comes from the filesystem.  Maybe.
> What filesystem are you exporting?
> 
> How can we check this? Probably by turning on nfsd tracing.
> There are a bunch of tracepoints that relate to reading:
> 
> 	trace_nfsd_read_start
> 	trace_nfsd_read_done
> 	trace_nfsd_read_io_done
> 	trace_nfsd_read_err
> 	trace_nfsd_read_splice
> 	trace_nfsd_read_vector
> 
> Enabling them might be useful, as you should be able to see whether the
> delay was within one read request or between two read requests.
> But I don't have much (if any) experience in enabling trace points.  I
> really should try that some day.  Maybe you can find guidance on using
> these tracepoints somewhere ... or maybe you already know how :-)

Try something like:

# trace-cmd record -e nfsd:nfsd_read_\*

When the test is done, ^C the trace-cmd program. The trace records
are stored in a file called trace.dat. You can view them with:

# trace-cmd report | less

The trace.dat file is portable. It carries the format specifiers
for all recorded events, so the trace records can be viewed on
other systems.
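
If trace-cmd isn't available, the same tracepoints can be flipped on by
hand through tracefs. A sketch, assuming tracefs is mounted at the usual
/sys/kernel/tracing (needs root; the TRACEFS variable is just a
convenience for pointing at a different mount):

```shell
# Sketch: enable every nfsd:nfsd_read_* tracepoint directly via tracefs.
# Assumes tracefs at /sys/kernel/tracing; override TRACEFS if mounted
# elsewhere. The guard skips quietly if the nfsd events aren't present.
TRACEFS=${TRACEFS:-/sys/kernel/tracing}
for ev in "$TRACEFS"/events/nfsd/nfsd_read_*/enable; do
    if [ -e "$ev" ]; then
        echo 1 > "$ev"
    fi
done
# Then stream records while reproducing the freeze:
#   cat "$TRACEFS"/trace_pipe
```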


--
Chuck Lever




Thread overview: 47+ messages
2021-08-08 22:37 NFS server regression in kernel 5.13 (tested w/ 5.13.9) Mike Javorski
2021-08-08 22:47 ` Chuck Lever III
2021-08-08 23:23   ` Mike Javorski
2021-08-09  0:01 ` NeilBrown
2021-08-09  0:28   ` Mike Javorski
2021-08-10  0:50     ` Mike Javorski
2021-08-10  1:28       ` NeilBrown
2021-08-10 11:54         ` Daire Byrne
2021-08-13  1:51         ` Mike Javorski
2021-08-13  2:39           ` NeilBrown
2021-08-13  2:53             ` Mike Javorski
2021-08-15  1:23               ` Mike Javorski
2021-08-16  1:20                 ` NeilBrown
2021-08-16 13:21                   ` Chuck Lever III [this message]
2021-08-16 16:25                     ` Mike Javorski
2021-08-16 23:01                       ` NeilBrown
2021-08-20  0:31                         ` NeilBrown
2021-08-20  0:52                           ` Mike Javorski
2021-08-22  0:17                             ` Mike Javorski
2021-08-22  3:41                               ` NeilBrown
2021-08-22  4:05                                 ` Mike Javorski
2021-08-22 22:00                                   ` NeilBrown
2021-08-26 19:34                                     ` Mike Javorski
2021-08-26 21:44                                       ` NeilBrown
2021-08-27  0:07                                         ` Mike Javorski
2021-08-27  5:27                                           ` NeilBrown
2021-08-27  6:11                                             ` Mike Javorski
2021-08-27  7:14                                               ` NeilBrown
2021-08-27 14:13                                                 ` Chuck Lever III
2021-08-27 17:07                                                   ` Mike Javorski
2021-08-27 22:00                                                     ` Mike Javorski
2021-08-27 23:49                                                       ` Chuck Lever III
2021-08-28  3:22                                                         ` Mike Javorski
2021-08-28 18:23                                                           ` Chuck Lever III
2021-08-29 22:28                                                             ` [PATCH] MM: clarify effort used in alloc_pages_bulk_*() NeilBrown
2021-08-30  9:11                                                               ` Mel Gorman
2021-08-29 22:36                                                             ` [PATCH] SUNRPC: don't pause on incomplete allocation NeilBrown
2021-08-30  9:12                                                               ` Mel Gorman
2021-08-30 20:46                                                               ` J. Bruce Fields
2021-09-04 17:41                                                             ` NFS server regression in kernel 5.13 (tested w/ 5.13.9) Mike Javorski
2021-09-05  2:02                                                               ` Chuck Lever III
2021-09-16  2:45                                                                 ` Mike Javorski
2021-09-16 18:58                                                                   ` Chuck Lever III
2021-09-16 19:21                                                                     ` Mike Javorski
2021-09-17 14:41                                                                       ` J. Bruce Fields
2021-08-16 16:09                   ` Mike Javorski
2021-08-16 23:04                     ` NeilBrown
