All of lore.kernel.org
 help / color / mirror / Atom feed
* [BUG] iscsi hangs during login attempt
@ 2022-09-14 18:00 Paul Dagnelie
  2022-09-19 17:01 ` Maurizio Lombardi
  2022-09-20 15:49 ` Maurizio Lombardi
  0 siblings, 2 replies; 7+ messages in thread
From: Paul Dagnelie @ 2022-09-14 18:00 UTC (permalink / raw)
  To: target-devel; +Cc: Sumedh Bala, Sebastien Roy

Hi all,

We've encountered an issue a few times now that results in a hung
iscsi session that cannot be recovered, and ultimately leads to the
system being wedged. I've done some initial analysis on what the issue
might be, but I would appreciate some more experienced eyes on it and
suggestions about the right course of action.

The machines in question are running Linux 5.4 (plus some other
changes; it's a custom kernel built on top of Ubuntu 20.04, but I
don't see any relevant patches anywhere in the stack, so I decided to
come straight here). The issue presents with iscsi login attempts
timing out repeatedly. We took a core dump of the machine, and
analysis of the stack traces showed that the iscsi_target_login_thread
is waiting for another thread to finish with the np_login_sem. That
semaphore is being held by a thread in iscsi_target_do_login_rx, which
is currently waiting for a session to be reinstated.
iscsit_cause_connection_reinstatement signals the rx and tx threads,
and then waits for the conn_wait_comp completion, which is signalled
in iscsit_close_connection. That appears to be called by the tx and rx
threads when they exit.  After puzzling through the core for a bit, I
found the kernel threads in question, and they appear to be calmly
waiting in the normal blocking path waiting for IOs to come in for
them to respond to. I would think that if they were in that state when
the SIGINT came in they would have exited properly.

My theory, after examining the code, is if two connection requests
were received from one initiator in rapid succession, it seems like
the second one would use the connection reinstatement logic. It may be
possible that if a reinstatement happens fast enough after the initial
login, the rx and tx threads would not yet have marked themselves as
able to receive the SIGINT that the connection reinstatement logic
uses to prompt them to close the connection so a new one can be
created. As I understand how signals are processed for kernel threads,
this would result in the signal being dropped. After that happens, the
kernel threads would never exit. If that is possible, then this could
explain the issue we’re seeing.

We've enabled some of the pr_debug statements on the systems in
question, but the issue hasn't recurred on any system yet. In
addition, it's worth noting that this issue did present itself shortly
after patching the initiating windows systems, which would presumably
result in one or more connection reinstatements. Does this theory seem
plausible? We haven't managed to reproduce it in-house or with
debugging statements enabled yet, but if it is the root cause it seems
to me the best fix would be to add or use an existing flag that is set
during reconnection (before the signal is sent), and have the rx and
tx threads check it after enabling signals to close the window for the
race.

-- 
Paul Dagnelie

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] iscsi hangs during login attempt
  2022-09-14 18:00 [BUG] iscsi hangs during login attempt Paul Dagnelie
@ 2022-09-19 17:01 ` Maurizio Lombardi
  2022-09-20 15:49 ` Maurizio Lombardi
  1 sibling, 0 replies; 7+ messages in thread
From: Maurizio Lombardi @ 2022-09-19 17:01 UTC (permalink / raw)
  To: Paul Dagnelie; +Cc: target-devel, Sumedh Bala, Sebastien Roy

Hello,

st 14. 9. 2022 v 20:00 odesílatel Paul Dagnelie <pcd@delphix.com> napsal:
>
> iscsit_cause_connection_reinstatement signals the rx and tx threads,
> and then waits for the conn_wait_comp completion, which is signalled
> in iscsit_close_connection. That appears to be called by the tx and rx
> threads when they exit.  After puzzling through the core for a bit, I
> found the kernel threads in question, and they appear to be calmly
> waiting in the normal blocking path waiting for IOs to come in for
> them to respond to. I would think that if they were in that state when
> the SIGINT came in they would have exited properly.

I received a few bug reports and crash dumps during the past months that
appear to be about  the same problem.
Our QA is able to reproduce it so I will test if it's really the same issue
and if your hypothesis is correct, I am going to try to come up with a patch.

Thanks,
Maurizio


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] iscsi hangs during login attempt
  2022-09-14 18:00 [BUG] iscsi hangs during login attempt Paul Dagnelie
  2022-09-19 17:01 ` Maurizio Lombardi
@ 2022-09-20 15:49 ` Maurizio Lombardi
  2022-09-20 17:51   ` Paul Dagnelie
  1 sibling, 1 reply; 7+ messages in thread
From: Maurizio Lombardi @ 2022-09-20 15:49 UTC (permalink / raw)
  To: Paul Dagnelie; +Cc: target-devel, Sumedh Bala, Sebastien Roy

st 14. 9. 2022 v 20:00 odesílatel Paul Dagnelie <pcd@delphix.com> napsal:
> We haven't managed to reproduce it in-house or with
> debugging statements enabled yet, but if it is the root cause it seems
> to me the best fix would be to add or use an existing flag that is set
> during reconnection (before the signal is sent), and have the rx and
> tx threads check it after enabling signals to close the window for the
> race.

FYI,

we are going to test this patch:
http://bsdbackstore.eu/misc/0001-target-login-should-wait-until-tx-rx-threads-have-pr.patch

Maurizio


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] iscsi hangs during login attempt
  2022-09-20 15:49 ` Maurizio Lombardi
@ 2022-09-20 17:51   ` Paul Dagnelie
  2022-09-26 17:03     ` Maurizio Lombardi
  0 siblings, 1 reply; 7+ messages in thread
From: Paul Dagnelie @ 2022-09-20 17:51 UTC (permalink / raw)
  To: Maurizio Lombardi; +Cc: target-devel, Sumedh Bala, Sebastien Roy

That looks like a promising fix, thanks for getting on this so
quickly! If I can ask, what was the workload/test that you used that
reproduced it internally so quickly?

On Tue, Sep 20, 2022 at 8:50 AM Maurizio Lombardi <mlombard@redhat.com> wrote:
>
> st 14. 9. 2022 v 20:00 odesílatel Paul Dagnelie <pcd@delphix.com> napsal:
> > We haven't managed to reproduce it in-house or with
> > debugging statements enabled yet, but if it is the root cause it seems
> > to me the best fix would be to add or use an existing flag that is set
> > during reconnection (before the signal is sent), and have the rx and
> > tx threads check it after enabling signals to close the window for the
> > race.
>
> FYI,
>
> we are going to test this patch:
> http://bsdbackstore.eu/misc/0001-target-login-should-wait-until-tx-rx-threads-have-pr.patch
>
> Maurizio
>


-- 
Paul Dagnelie

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] iscsi hangs during login attempt
  2022-09-20 17:51   ` Paul Dagnelie
@ 2022-09-26 17:03     ` Maurizio Lombardi
       [not found]       ` <CAEU3=8GPcCagm_VW3p9miw1ALr-y7=SvONLxLBBeWft-R2o3YA@mail.gmail.com>
  0 siblings, 1 reply; 7+ messages in thread
From: Maurizio Lombardi @ 2022-09-26 17:03 UTC (permalink / raw)
  To: Paul Dagnelie; +Cc: target-devel, Sumedh Bala, Sebastien Roy

Hello,

út 20. 9. 2022 v 19:51 odesílatel Paul Dagnelie <pcd@delphix.com> napsal:
>
> That looks like a promising fix, thanks for getting on this so
> quickly! If I can ask, what was the workload/test that you used that
> reproduced it internally so quickly?

Sorry maybe I didn't explain myself clearly enough.
I wanted to say that some months ago our QA submitted some
crash reports that look very similar to the problem you described here.
They have a reproducer (I will ask for details about how it works and
I'll let you know)
but unfortunately the system crashes quite rarely; in other words,
it's not quick at all.

Right now they are trying to reproduce the crash with a recent kernel
so I can look at the crash dump again.

In the meanwhile, did you have any luck with reproducing it?

Maurizio


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] iscsi hangs during login attempt
       [not found]       ` <CAEU3=8GPcCagm_VW3p9miw1ALr-y7=SvONLxLBBeWft-R2o3YA@mail.gmail.com>
@ 2022-10-31 21:13         ` Paul Dagnelie
  2022-11-04 10:34           ` Maurizio Lombardi
  0 siblings, 1 reply; 7+ messages in thread
From: Paul Dagnelie @ 2022-10-31 21:13 UTC (permalink / raw)
  To: Sumedh Bala; +Cc: Maurizio Lombardi, target-devel, Sumedh Bala, Sebastien Roy

Hey Maurizio,

So far no luck here on reproducing the issue. Do you have any updates
on this issue?

On Tue, Oct 11, 2022 at 10:23 AM Sumedh Bala <sumedh.bala@delphix.com> wrote:
>
> We have seen this a few times but have not been able to reproduce it at will.
>
> -Sumedh
>
> On Mon, Sep 26, 2022 at 10:03 AM Maurizio Lombardi <mlombard@redhat.com> wrote:
>>
>> Hello,
>>
>> út 20. 9. 2022 v 19:51 odesílatel Paul Dagnelie <pcd@delphix.com> napsal:
>> >
>> > That looks like a promising fix, thanks for getting on this so
>> > quickly! If I can ask, what was the workload/test that you used that
>> > reproduced it internally so quickly?
>>
>> Sorry maybe I didn't explain myself clearly enough.
>> I wanted to say that some months ago our QA submitted some
>> crash reports that look very similar to the problem you described here.
>> They have a reproducer (I will ask for details about how it works and
>> I'll let you know)
>> but unfortunately the system crashes quite rarely; in other words,
>> it's not quick at all.
>>
>> Right now they are trying to reproduce the crash with a recent kernel
>> so I can look at the crash dump again.
>>
>> In the meanwhile, did you have any luck with reproducing it?
>>
>> Maurizio
>>


-- 
Paul Dagnelie

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] iscsi hangs during login attempt
  2022-10-31 21:13         ` Paul Dagnelie
@ 2022-11-04 10:34           ` Maurizio Lombardi
  0 siblings, 0 replies; 7+ messages in thread
From: Maurizio Lombardi @ 2022-11-04 10:34 UTC (permalink / raw)
  To: Paul Dagnelie; +Cc: Sumedh Bala, target-devel, Sumedh Bala, Sebastien Roy

Hi Paul,

po 31. 10. 2022 v 22:13 odesílatel Paul Dagnelie <pcd@delphix.com> napsal:
>
> Hey Maurizio,
>
> So far no luck here on reproducing the issue. Do you have any updates
> on this issue?

Sorry but no luck so far. Actually I did some experiments to try to
reproduce it but I have been
slowed down by another bug:
https://marc.info/?l=target-devel&m=166755512117326&w=2

Maurizio


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-11-04 10:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-14 18:00 [BUG] iscsi hangs during login attempt Paul Dagnelie
2022-09-19 17:01 ` Maurizio Lombardi
2022-09-20 15:49 ` Maurizio Lombardi
2022-09-20 17:51   ` Paul Dagnelie
2022-09-26 17:03     ` Maurizio Lombardi
     [not found]       ` <CAEU3=8GPcCagm_VW3p9miw1ALr-y7=SvONLxLBBeWft-R2o3YA@mail.gmail.com>
2022-10-31 21:13         ` Paul Dagnelie
2022-11-04 10:34           ` Maurizio Lombardi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.