Re: [PATCH] cifs: retry lookup and readdir when EAGAIN is returned.

From: Thiago Rafael Becker <trbecker@gmail.com>
To: "Aurélien Aptel" <aaptel@suse.com>
Cc: ronnie sahlberg <ronniesahlberg@gmail.com>,
	linux-cifs <linux-cifs@vger.kernel.org>,
	Steve French <sfrench@samba.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] cifs: retry lookup and readdir when EAGAIN is returned.
Date: Wed, 16 Jun 2021 12:40:26 -0300	[thread overview]
Message-ID: <YMobapLxHv/BPzFP@nyarly.rlyeh.local> (raw)
In-Reply-To: <87zgvqqg09.fsf@suse.com>

Thanks for the comments, I'm working on them.

On Wed, Jun 16, 2021 at 12:07:50PM +0200, Aurélien Aptel wrote:

> I guess it looks ok as a quick fix for the issue at hand but:
> - should we check for more error codes (use is_retryable_error()?)
> - all syscalls will need something similar, the session can be closed at
>   any point
>   * either we add a loop everywhere
>   * or we change cifs_send_receive to retry (most?) calls
>   * might be worth looking at what nfs does here

Some syscall can return EAGAIN, so we would need to check which
operation is retryable and if the error is retryable. I don't know if
all Linux syscalls mappable to smb2 operations in the wire, but it may
be worth mapping.

NFS requeues the calls, and fails after a configurable timeout
for soft mounts, or issues an error message and requeues the request
for hard mounts (retrans and timeo mount options).

> - Should this be done for both soft (default) and hard mounts? I guess
>   for hard we would retry indefinitely

Good point, but the correct option is to retry on hard mounts until the
operation succeeds.

NFS hard mounts create issues like being unable to do a clean shutdown on
a server failure, because oustanding I/O blocks it. The NFS has atempted
to fix this by adding options to kill all outstanding I/O, but I'm not
current on the efforts/issues in this front.

So this would create the same issue with outstanding mounts on cifs 
shares, so a similar solution to NFS may be adapted/created in the future.

One thing that I forgot to mention in the patch is that this uncovered a
bug in compuond requests, where wait_for_compund_request will block the
readdir request with EDEADLK, if the share needs reconnect. I'm looking
into this patch, and will submmit it separatelly, as this looks like a
corner case uncovered by the patch and specific conditions in the tests.

> Cheers,
Best regards,
Thiago