From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
MIME-Version: 1.0
In-Reply-To: <1516311554.2676.50.camel@wdc.com>
References: <20180118024124.8079-1-ming.lei@redhat.com>
 <20180118170353.GB19734@redhat.com>
 <1516296056.2676.23.camel@wdc.com>
 <20180118183039.GA20121@redhat.com>
 <1516301278.2676.35.camel@wdc.com>
 <20180118204856.GA31679@redhat.com>
 <1516309128.2676.38.camel@wdc.com>
 <20180118212327.GB31679@redhat.com>
 <1516311554.2676.50.camel@wdc.com>
From: Laurence Oberman
Date: Thu, 18 Jan 2018 16:45:15 -0500
Message-ID:
Subject: Re: [dm-devel] [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
To: Bart Van Assche
Cc: "snitzer@redhat.com", "dm-devel@redhat.com",
 "linux-kernel@vger.kernel.org", "hch@infradead.org",
 "linux-block@vger.kernel.org", "osandov@fb.com", "axboe@kernel.dk",
 "ming.lei@redhat.com"
Content-Type: multipart/alternative; boundary="001a1140642a267eba056313e166"
List-ID:

--001a1140642a267eba056313e166
Content-Type: text/plain; charset="UTF-8"

Hello Bart,

Firstly, let me start with this: you have always been kind, patient and
helpful to me, and I the same to you, so I am not keen to get in the
middle of this.

But it's not true about Red Hat, because I work very hard on this and I
very often find bugs you are not seeing, so Red Hat is adding value here.

I emailed you a number of times asking if you can provide me the exact
steps, but not via your srp-test suite.

I have a setup that is not conducive to running your loop disconnects
etc., and if you are seeing a stall on multiple loops of 02-mq I should
be able to reproduce it without having to run your test suite.

Please let me know how I can help.

Laurence

On Thu, Jan 18, 2018 at 4:39 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at  3:58P -0500,
> > Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> >
> > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > > For Bart's test the underlying scsi-mq driver is what is regularly
> > > > hitting this case in __blk_mq_try_issue_directly():
> > > >
> > > >         if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> > >
> > > These lockups were all triggered by incorrect handling of
> > > .queue_rq() returning BLK_STS_RESOURCE.
> >
> > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> > "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
>
> In what I wrote I was referring to both dm_mq_queue_rq() and
> scsi_queue_rq(). With "incorrect" I meant that queue lockups are
> introduced that make user space processes unkillable. That's a severe
> bug.
>
> > Please try to do more work analyzing the test case that only you can
> > easily run (due to srp_test being a PITA).
>
> It is not correct that I'm the only one who is able to run that
> software. Anyone who is willing to merge the latest SRP initiator and
> target driver patches in his or her tree can run that software in any
> VM. I'm working hard on getting the patches upstream that make it
> possible to run the srp-test software on a setup that is not equipped
> with InfiniBand hardware.
>
> > We have time to get this right, please stop hyperventilating about
> > "regressions".
>
> Sorry Mike but that's something I consider as an unfair comment. If
> Ming and you work on patches together, it's your job to make sure that
> no regressions are introduced. Instead of blaming me because I report
> these regressions you should be grateful that I take the time and
> effort to report these regressions early. And since you are employed
> by a large organization that sells Linux support services, your
> employer should invest in developing test cases that reach a higher
> coverage of the dm, SCSI and block layer code. I don't think that it's
> normal that my tests discovered several issues that were not
> discovered by Red Hat's internal test suite. That's something Red Hat
> has to address.
>
> Bart.

--001a1140642a267eba056313e166--