* Potential leak in file put
@ 2023-07-19 12:10 Shyam Prasad N
  2023-07-20 14:30 ` Tom Talpey
  0 siblings, 1 reply; 5+ messages in thread
From: Shyam Prasad N @ 2023-07-19 12:10 UTC (permalink / raw)
  To: CIFS, Steve French, ronnie sahlberg, Bharath SM, Paulo Alcantara,
	Tom Talpey, Enzo Matsumiya

Hi all,

cifs.ko seems to be leaking handles occasionally when put under some
stress testing.
I was scanning the code for potential places where we could be
leaking, and one particular instance caught my attention.

In _cifsFileInfo_put, when the ref count of a cifs_file reaches 0, we
remove it from the lists and then send the close request over the
wire...
        if (!tcon->need_reconnect && !cifs_file->invalidHandle) {
                struct TCP_Server_Info *server = tcon->ses->server;
                unsigned int xid;

                xid = get_xid();
                if (server->ops->close_getattr)
                        server->ops->close_getattr(xid, tcon, cifs_file);
                else if (server->ops->close)
                        server->ops->close(xid, tcon, &cifs_file->fid);
                _free_xid(xid);
        }

But as you can see above, we do not get a return value from these handlers.
What would happen if the above close_getattr or close calls fail?
Particularly, what would happen if the failure happens even before the
request is put in flight?
In this stress testing, the server gives out fewer credits, so the
credit counters on the active connections are expected to stay low in
general.
I'm assuming that the above condition will leak handles.

I was thinking about changing the above handlers to return an rc rather
than void, and then checking the status. On failure, we could schedule
delayed work to close the remote handle. But I'm not convinced that
this is the right approach. I'd like to hear your comments on this.
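
For illustration, roughly the kind of change I have in mind (just a
sketch; it assumes the close/close_getattr op prototypes are changed to
return int, and cifs_queue_close_retry() is a hypothetical helper that
schedules the delayed work):

        if (!tcon->need_reconnect && !cifs_file->invalidHandle) {
                struct TCP_Server_Info *server = tcon->ses->server;
                unsigned int xid;
                int rc = 0;

                xid = get_xid();
                if (server->ops->close_getattr)
                        rc = server->ops->close_getattr(xid, tcon, cifs_file);
                else if (server->ops->close)
                        rc = server->ops->close(xid, tcon, &cifs_file->fid);
                _free_xid(xid);

                /* the close failed (e.g. we could not get credits);
                 * do not just drop the handle on the floor */
                if (rc)
                        cifs_queue_close_retry(tcon, cifs_file);
        }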

-- 
Regards,
Shyam


* Re: Potential leak in file put
  2023-07-19 12:10 Potential leak in file put Shyam Prasad N
@ 2023-07-20 14:30 ` Tom Talpey
  2023-07-21 11:13   ` Shyam Prasad N
  0 siblings, 1 reply; 5+ messages in thread
From: Tom Talpey @ 2023-07-20 14:30 UTC (permalink / raw)
  To: Shyam Prasad N, CIFS, Steve French, ronnie sahlberg, Bharath SM,
	Paulo Alcantara, Enzo Matsumiya

On 7/19/2023 8:10 AM, Shyam Prasad N wrote:
> Hi all,
> 
> cifs.ko seems to be leaking handles occasionally when put under some
> stress testing.
> I was scanning the code for potential places where we could be
> leaking, and one particular instance caught my attention.
> 
> In _cifsFileInfo_put, when the ref count of a cifs_file reaches 0, we
> remove it from the lists and then send the close request over the
> wire...
>          if (!tcon->need_reconnect && !cifs_file->invalidHandle) {
>                  struct TCP_Server_Info *server = tcon->ses->server;
>                  unsigned int xid;
> 
>                  xid = get_xid();
>                  if (server->ops->close_getattr)
>                          server->ops->close_getattr(xid, tcon, cifs_file);
>                  else if (server->ops->close)
>                          server->ops->close(xid, tcon, &cifs_file->fid);
>                  _free_xid(xid);
>          }
> 
> But as you can see above, we do not get a return value from these handlers.

Yeah, that's a problem. The most obvious is if the network becomes
partitioned and the close(s) fail. Are you injecting that kind of
error?

Still, the logic is going to grow some serious hair if we start
retrying every error. What will bound the retries, and what kind
of error(s) might be considered fatal? Tying up credits, message
id's, handles, etc can be super problematic.

Also, have you considered using some sort of laundromat? Or is it
critical that these closes happen before proceeding?

Tom.


* Re: Potential leak in file put
  2023-07-20 14:30 ` Tom Talpey
@ 2023-07-21 11:13   ` Shyam Prasad N
  2023-07-21 13:39     ` Tom Talpey
  0 siblings, 1 reply; 5+ messages in thread
From: Shyam Prasad N @ 2023-07-21 11:13 UTC (permalink / raw)
  To: Tom Talpey
  Cc: CIFS, Steve French, ronnie sahlberg, Bharath SM, Paulo Alcantara,
	Enzo Matsumiya

Hi Tom,

Thanks for these points.

On Thu, Jul 20, 2023 at 8:21 PM Tom Talpey <tom@talpey.com> wrote:
>
> On 7/19/2023 8:10 AM, Shyam Prasad N wrote:
> > Hi all,
> >
> > cifs.ko seems to be leaking handles occasionally when put under some
> > stress testing.
> > I was scanning the code for potential places where we could be
> > leaking, and one particular instance caught my attention.
> >
> > In _cifsFileInfo_put, when the ref count of a cifs_file reaches 0, we
> > remove it from the lists and then send the close request over the
> > wire...
> >          if (!tcon->need_reconnect && !cifs_file->invalidHandle) {
> >                  struct TCP_Server_Info *server = tcon->ses->server;
> >                  unsigned int xid;
> >
> >                  xid = get_xid();
> >                  if (server->ops->close_getattr)
> >                          server->ops->close_getattr(xid, tcon, cifs_file);
> >                  else if (server->ops->close)
> >                          server->ops->close(xid, tcon, &cifs_file->fid);
> >                  _free_xid(xid);
> >          }
> >
> > But as you can see above, we do not get a return value from these handlers.
>
> Yeah, that's a problem. The most obvious is if the network becomes
> partitioned and the close(s) fail. Are you injecting that kind of
> error?

So this was revealed with a stress testing setup where the mount was
done against a server that gave out only 512 credits per connection.
The connection was pretty much starved for credits, which threw up
out-of-credits errors occasionally.
I've confirmed that when this happens on a close, handles are leaked.

>
> Still, the logic is going to grow some serious hair if we start
> retrying every error. What will bound the retries, and what kind
> of error(s) might be considered fatal? Tying up credits, message
> id's, handles, etc can be super problematic.
>
> Also, have you considered using some sort of laundromat? Or is it
> critical that these closes happen before proceeding?
>

Steve and I discussed this yesterday. One option that came out was:
we could clean up the handle locally and then keep retrying the server
close as delayed work, a fixed number of times.
If that retry limit is exceeded, reconnect the session so that we start afresh.
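
Something along these lines, purely as a sketch (the struct, worker and
limit names below are made up, and it again assumes the close op is
changed to return an int):

#define MAX_CLOSE_RETRIES 5             /* made-up limit */

struct close_retry {
        struct delayed_work work;
        struct cifs_tcon *tcon;
        struct cifs_fid fid;
        int retries;
};

static void close_retry_worker(struct work_struct *work)
{
        struct close_retry *cr =
                container_of(to_delayed_work(work), struct close_retry, work);
        struct TCP_Server_Info *server = cr->tcon->ses->server;
        unsigned int xid = get_xid();
        int rc;

        /* local state is already cleaned up; just retry the server close */
        rc = server->ops->close(xid, cr->tcon, &cr->fid);
        _free_xid(xid);

        if (rc && ++cr->retries < MAX_CLOSE_RETRIES) {
                /* try again later */
                schedule_delayed_work(&cr->work, msecs_to_jiffies(5000));
                return;
        }
        if (rc) {
                /* retry limit exceeded: ask for a session reconnect so
                 * that we start afresh and the server drops the handle */
                cifs_signal_cifsd_for_reconnect(server, false);
        }
        kfree(cr);
}

The allocation side (kmalloc(), INIT_DELAYED_WORK() and the first
schedule_delayed_work() from _cifsFileInfo_put()) is omitted here.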
I guess this is what you meant by laundromat?




-- 
Regards,
Shyam


* Re: Potential leak in file put
  2023-07-21 11:13   ` Shyam Prasad N
@ 2023-07-21 13:39     ` Tom Talpey
  2023-07-26 17:49       ` Shyam Prasad N
  0 siblings, 1 reply; 5+ messages in thread
From: Tom Talpey @ 2023-07-21 13:39 UTC (permalink / raw)
  To: Shyam Prasad N
  Cc: CIFS, Steve French, ronnie sahlberg, Bharath SM, Paulo Alcantara,
	Enzo Matsumiya

On 7/21/2023 7:13 AM, Shyam Prasad N wrote:
> Hi Tom,
> 
> Thanks for these points.
> 
> On Thu, Jul 20, 2023 at 8:21 PM Tom Talpey <tom@talpey.com> wrote:
>>
>> On 7/19/2023 8:10 AM, Shyam Prasad N wrote:
>>> Hi all,
>>>
>>> cifs.ko seems to be leaking handles occasionally when put under some
>>> stress testing.
>>> I was scanning the code for potential places where we could be
>>> leaking, and one particular instance caught my attention.
>>>
>>> In _cifsFileInfo_put, when the ref count of a cifs_file reaches 0, we
>>> remove it from the lists and then send the close request over the
>>> wire...
>>>           if (!tcon->need_reconnect && !cifs_file->invalidHandle) {
>>>                   struct TCP_Server_Info *server = tcon->ses->server;
>>>                   unsigned int xid;
>>>
>>>                   xid = get_xid();
>>>                   if (server->ops->close_getattr)
>>>                           server->ops->close_getattr(xid, tcon, cifs_file);
>>>                   else if (server->ops->close)
>>>                           server->ops->close(xid, tcon, &cifs_file->fid);
>>>                   _free_xid(xid);
>>>           }
>>>
>>> But as you can see above, we do not get a return value from these handlers.
>>
>> Yeah, that's a problem. The most obvious is if the network becomes
>> partitioned and the close(s) fail. Are you injecting that kind of
>> error?
> 
> So this was revealed with a stress testing setup where the mount was
> done against a server that gave out only 512 credits per connection.
> The connection was pretty much starved for credits, which threw up
> out-of-credits errors occasionally.
> I've confirmed that when this happens on a close, handles are leaked.

Interesting. 512 credits doesn't seem like starvation, but it will
definitely stress a high workload. Good!

>> Still, the logic is going to grow some serious hair if we start
>> retrying every error. What will bound the retries, and what kind
>> of error(s) might be considered fatal? Tying up credits, message
>> id's, handles, etc can be super problematic.
>>
>> Also, have you considered using some sort of laundromat? Or is it
>> critical that these closes happen before proceeding?
>>
> 
> Steve and I discussed this yesterday. One option that came out was:
> we could clean up the handle locally and then keep retrying the server
> close as delayed work, a fixed number of times.
> If that retry limit is exceeded, reconnect the session so that we start afresh.

Sounds reasonable, but things might get a little tricky if the
server-side handle has a lease or some other state still attached.
A subsequent client create on the same file might encounter an
unexpected conflict? It's critical to think that through.

> I guess this is what you meant by laundromat?

So by laundromat I meant background processing to handle the
close. It would have some sort of queued work list, and it
would process the work items and wash-dry-fold the results.
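
Purely as an illustration (the pending_close/laundry names are made up,
and it again assumes the close op grows a return value), the worker
might look something like:

struct pending_close {
        struct list_head list;
        struct cifs_tcon *tcon;
        struct cifs_fid fid;
};

static LIST_HEAD(laundry_list);
static DEFINE_SPINLOCK(laundry_lock);

static void laundromat_worker(struct work_struct *work)
{
        struct pending_close *pc, *tmp;
        LIST_HEAD(batch);

        /* take the current load so new items can keep queueing */
        spin_lock(&laundry_lock);
        list_splice_init(&laundry_list, &batch);
        spin_unlock(&laundry_lock);

        list_for_each_entry_safe(pc, tmp, &batch, list) {
                unsigned int xid = get_xid();
                int rc = pc->tcon->ses->server->ops->close(xid, pc->tcon,
                                                           &pc->fid);

                _free_xid(xid);
                if (rc) {
                        /* wash failed; put it back for the next cycle */
                        spin_lock(&laundry_lock);
                        list_move_tail(&pc->list, &laundry_list);
                        spin_unlock(&laundry_lock);
                } else {
                        /* clean: dry, fold and discard */
                        list_del(&pc->list);
                        kfree(pc);
                }
        }
}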

The main motivation would be to release the thread that fell into the
refcnt == 0 case, so the calling application may continue.
Stealing the thread for a full server roundtrip might be
worth avoiding, in all cases.

OTOH, if the tricky stuff above is risky or wrong, then forget
the laundromat and don't return until the close is done. But
then, think about ^C!

Tom.


* Re: Potential leak in file put
  2023-07-21 13:39     ` Tom Talpey
@ 2023-07-26 17:49       ` Shyam Prasad N
  0 siblings, 0 replies; 5+ messages in thread
From: Shyam Prasad N @ 2023-07-26 17:49 UTC (permalink / raw)
  To: Tom Talpey
  Cc: CIFS, Steve French, ronnie sahlberg, Bharath SM, Paulo Alcantara,
	Enzo Matsumiya

On Fri, Jul 21, 2023 at 7:21 PM Tom Talpey <tom@talpey.com> wrote:
>
> On 7/21/2023 7:13 AM, Shyam Prasad N wrote:
> > Hi Tom,
> >
> > Thanks for these points.
> >
> > On Thu, Jul 20, 2023 at 8:21 PM Tom Talpey <tom@talpey.com> wrote:
> >>
> >> On 7/19/2023 8:10 AM, Shyam Prasad N wrote:
> >>> Hi all,
> >>>
> >>> cifs.ko seems to be leaking handles occasionally when put under some
> >>> stress testing.
> >>> I was scanning the code for potential places where we could be
> >>> leaking, and one particular instance caught my attention.
> >>>
> >>> In _cifsFileInfo_put, when the ref count of a cifs_file reaches 0, we
> >>> remove it from the lists and then send the close request over the
> >>> wire...
> >>>           if (!tcon->need_reconnect && !cifs_file->invalidHandle) {
> >>>                   struct TCP_Server_Info *server = tcon->ses->server;
> >>>                   unsigned int xid;
> >>>
> >>>                   xid = get_xid();
> >>>                   if (server->ops->close_getattr)
> >>>                           server->ops->close_getattr(xid, tcon, cifs_file);
> >>>                   else if (server->ops->close)
> >>>                           server->ops->close(xid, tcon, &cifs_file->fid);
> >>>                   _free_xid(xid);
> >>>           }
> >>>
> >>> But as you can see above, we do not get a return value from these handlers.
> >>
> >> Yeah, that's a problem. The most obvious is if the network becomes
> >> partitioned and the close(s) fail. Are you injecting that kind of
> >> error?
> >
> > So this was revealed with a stress testing setup where the mount was
> > done against a server that gave out only 512 credits per connection.
> > The connection was pretty much starved for credits, which threw up
> > out-of-credits errors occasionally.
> > I've confirmed that when this happens on a close, handles are leaked.
>
> Interesting. 512 credits doesn't seem like starvation, but it will
> definitely stress a high workload. Good!
>
> >> Still, the logic is going to grow some serious hair if we start
> >> retrying every error. What will bound the retries, and what kind
> >> of error(s) might be considered fatal? Tying up credits, message
> >> id's, handles, etc can be super problematic.
> >>
> >> Also, have you considered using some sort of laundromat? Or is it
> >> critical that these closes happen before proceeding?
> >>
> >
> > Steve and I discussed this yesterday. One option that came out was:
> > we could clean up the handle locally and then keep retrying the server
> > close as delayed work, a fixed number of times.
> > If that retry limit is exceeded, reconnect the session so that we start afresh.
>
> Sounds reasonable, but things might get a little tricky if the
> server-side handle has a lease or some other state still attached.
> A subsequent client create on the same file might encounter an
> unexpected conflict? It's critical to think that through.

You make a good point. I did think about this.
There are a couple of cases here.
1. If the subsequent open is from another client B: this would force
the server to send a lease break to this client A. But since the
client has already released all local state for the handle (only the
server close is pending), the lease break would go unrecognized. That
is not a good thing. So we'll need to keep such zombie handles lying
around (and in the list) until the file is actually closed on the
server. We just need to make sure we mark these handles as zombies so
that future opens of the file do not reuse them.
2. If the subsequent open is from the same client A: if the client
cannot find the handle, it can try an open on the server. However,
that open could fail depending on the share/access mode of the
earlier handle. So I think we could use the zombie-handle idea here
too, and fail the new opens with a retriable error (EAGAIN?).
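
To make the zombie idea a bit more concrete, a minimal sketch ("zombie"
here is a hypothetical new flag on struct cifsFileInfo meaning "local
users are gone but the server-side close is still pending"):

static bool inode_has_zombie_handle(struct cifsInodeInfo *cinode)
{
        struct cifsFileInfo *cfile;
        bool found = false;

        spin_lock(&cinode->open_file_lock);
        list_for_each_entry(cfile, &cinode->openFileList, flist) {
                if (cfile->zombie) {
                        found = true;
                        break;
                }
        }
        spin_unlock(&cinode->open_file_lock);
        return found;
}

For case 1, the lease break path would still find the zombie in
openFileList, since we never removed it. For case 2, the open path
could call something like the above and return -EAGAIN.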

Do you see a problem with this approach?




-- 
Regards,
Shyam

