Hello Bart

Firstly, let me start with this: you have always been kind, patient and helpful to me, and I have tried to be the same to you, so I am not keen to get in the middle of this.

But it's not true about Red Hat, because I work very hard on this and I very often find bugs you are not seeing, so Red Hat is adding value here.

I emailed you a number of times asking if you can provide me the exact steps, but not via your srp-test suite.

I have a setup that is not conducive to running your loop disconnects etc., and if you are seeing a stall on multiple loops of 02-mq I should be able to reproduce it without having to run your test suite.

Please let me know how I can help.

Laurence

On Thu, Jan 18, 2018 at 4:39 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 3:58P -0500,
> Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
>
> > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > For Bart's test the underlying scsi-mq driver is what is regularly
> > > hitting this case in __blk_mq_try_issue_directly():
> > >
> > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> >
> > These lockups were all triggered by incorrect handling of
> > .queue_rq() returning BLK_STS_RESOURCE.
>
> Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?

In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq().
With "incorrect" I meant that queue lockups are introduced that make user
space processes unkillable. That's a severe bug.

> Please try to do more work analyzing the test case that only you can
> easily run (due to srp_test being a PITA).

It is not correct that I'm the only one who is able to run that software.
Anyone who is willing to merge the latest SRP initiator and target driver
patches in his or her tree can run that software in any VM. I'm working hard
on getting the patches upstream that make it possible to run the srp-test
software on a setup that is not equipped with InfiniBand hardware.

> We have time to get this right, please stop hyperventilating about
> "regressions".

Sorry Mike but that's something I consider as an unfair comment. If Ming and
you work on patches together, it's your job to make sure that no regressions
are introduced. Instead of blaming me because I report these regressions you
should be grateful that I take the time and effort to report these regressions
early. And since you are employed by a large organization that sells Linux
support services, your employer should invest in developing test cases that
reach a higher coverage of the dm, SCSI and block layer code. I don't think
that it's normal that my tests discovered several issues that were not
discovered by Red Hat's internal test suite. That's something Red Hat has to
address.

Bart.