* Re: [PATCH] multipathd: check and cleanup zombie paths
       [not found]           ` <CEB9978CF3252343BE3C67AC9F0086A34295CF62@H3CMLB14-EX.srv.huawei-3com.com>
@ 2018-03-08 15:54             ` Xose Vazquez Perez
  2018-03-09  6:11               ` Chongyun Wu
       [not found]             ` <20180308154435.GB14513@octiron.msp.redhat.com>
  1 sibling, 1 reply; 18+ messages in thread
From: Xose Vazquez Perez @ 2018-03-08 15:54 UTC (permalink / raw)
  To: Chongyun Wu, Martin Wilck, Benjamin Marzinski,
	'Christophe Varoqui', 'Hannes Reinecke'
  Cc: Guozhonghua, Changwei Ge, Changlimin, device-mapper development

On 03/08/2018 09:03 AM, Chongyun Wu wrote:

[add dm-devel@redhat.com]

> 360002ac000000000000004f40001e2d7 dm-5 3PARdata,VV
> size=13G features='1 queue_if_no_path' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>    |- 3:0:0:3 sdk 8:160 active ready running
>    |- 4:0:0:3 sdn 8:208 active ready running
>    |- 3:0:0:6 sdo 8:224 failed faulty running
>    `- 4:0:0:6 sdp 8:240 failed faulty running
3PAR arrays are able to use ALUA, but with *all ports*
*across all controllers* in a *single Target Port Group*:

31000000000000a000000000000001000 dm-0 3PARdata,VV
size=50G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:0:0 sda 8:0   active ready running
  |- 0:0:1:0 sdf 8:80  active ready running
  |- 1:0:0:0 sdk 8:160 active ready running
  `- 1:0:1:0 sdp 8:240 active ready running

And this configuration is recommended by the manufacturer.


From the StoreServ Management Console, "Host:" should be changed to
"Generic-ALUA" ("Persona 2": UARepLun, SESLun, ALUA).

multipath-tools (*upstream*) already defaults to ALUA for 3PARdata
arrays, since commit c1b7f7f7: https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commitdiff;h=c1b7f7f7

Run "multipath -d -v3" to see the configuration applied per LUN.
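
For instance, to check the built-in defaults that will apply (a sketch;
"multipath -t" dumps the compiled-in configuration, and the exact output
varies between releases):

  multipath -t | grep -A 7 3PARdata
  # expect something like: vendor "3PARdata", product "VV",
  # hardware_handler "1 alua", prio "alua"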


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-08 15:54             ` [PATCH] multipathd: check and cleanup zombie paths Xose Vazquez Perez
@ 2018-03-09  6:11               ` Chongyun Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Chongyun Wu @ 2018-03-09  6:11 UTC (permalink / raw)
  To: Xose Vazquez Perez, Martin Wilck, Benjamin Marzinski,
	'Christophe Varoqui', 'Hannes Reinecke'
  Cc: Guozhonghua, Changwei Ge, Changlimin, device-mapper development

On 2018/3/8 23:54, Xose Vazquez Perez wrote:
> On 03/08/2018 09:03 AM, Chongyun Wu wrote:
> 
> [add dm-devel@redhat.com]
> 
>> 360002ac000000000000004f40001e2d7 dm-5 3PARdata,VV
>> size=13G features='1 queue_if_no_path' hwhandler='0' wp=rw
>> `-+- policy='round-robin 0' prio=1 status=active
>>     |- 3:0:0:3 sdk 8:160 active ready running
>>     |- 4:0:0:3 sdn 8:208 active ready running
>>     |- 3:0:0:6 sdo 8:224 failed faulty running
>>     `- 4:0:0:6 sdp 8:240 failed faulty running
> 3PAR arrays are able to use ALUA, but with *all ports*
> *across all controllers* in a *single Target Port Group*:
> 
> 31000000000000a000000000000001000 dm-0 3PARdata,VV
> size=50G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> `-+- policy='service-time 0' prio=50 status=active
>    |- 0:0:0:0 sda 8:0   active ready running
>    |- 0:0:1:0 sdf 8:80  active ready running
>    |- 1:0:0:0 sdk 8:160 active ready running
>    `- 1:0:1:0 sdp 8:240 active ready running
> 
> And this configuration is recommended by the manufacturer.
> 
> 
> From the StoreServ Management Console, "Host:" should be changed to
> "Generic-ALUA" ("Persona 2": UARepLun, SESLun, ALUA).
> 
> multipath-tools (*upstream*) already defaults to ALUA for 3PARdata
> arrays, since commit c1b7f7f7: https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commitdiff;h=c1b7f7f7
> 
> Run "multipath -d -v3" to see the configuration applied per LUN.
> 
Hi Xose Vazquez Perez,

Thanks for the reminder, I will check it. Thanks again!

Regards,
Chongyun


* Re: [PATCH] multipathd: check and cleanup zombie paths
       [not found]             ` <20180308154435.GB14513@octiron.msp.redhat.com>
@ 2018-03-09  6:47               ` Chongyun Wu
  2018-03-09 10:47                 ` Xose Vazquez Perez
  2018-03-09 16:22                 ` Benjamin Marzinski
  0 siblings, 2 replies; 18+ messages in thread
From: Chongyun Wu @ 2018-03-09  6:47 UTC (permalink / raw)
  To: Benjamin Marzinski
  Cc: 'Xose Vazquez Perez',
	Guozhonghua, dm-devel, Changwei Ge, Changlimin, Martin Wilck

On 2018/3/8 23:45, Benjamin Marzinski wrote:
> On Thu, Mar 08, 2018 at 08:03:50AM +0000, Chongyun Wu wrote:
>> On 2018/3/7 20:45, Martin Wilck wrote:
>>> On Wed, 2018-03-07 at 01:45 +0000, Chongyun Wu wrote:
>>>>
>>>> Hi Martin,
>>>> Your analysis is correct. Do you have any good ideas to deal with
>>>> this issue?
>>>
>>> Could you maybe explain what was causing the issue in the first place?
>>> Did you reconfigure the storage in any particular way?
>>>
>>> If yes, I think "multipathd reconfigure" would be the correct way to
>>> deal with the problem. It re-reads everything, so it should get rid of
>>> the stale paths.
>>>
>>> Regards
>>> Martin
>>>
>>
>> I have used "multipathd reconfigure", but the zombie (or stale) paths
>> are still here; even restarting multipath-tools can't clean them up.
>>
>> Issue reproduction steps:
>> (1) export the LUN (LUN1) to the server (host1) with LUN number *6* on
>> the storage array;
>> (2) scan LUN1 on host1 and create the multipath map;
>> (3) delete the multipath map on host1;
>> (4) unexport LUN1 from host1 on the storage array;
>> (5) export the LUN (LUN1) to the server (host1) with LUN number *3* on
>> the storage array;
>> (6) scan LUN1 on host1 and create the multipath map; you will see
>> zombie paths like below:
>> 360002ac000000000000004f40001e2d7 dm-5 3PARdata,VV
>> size=13G features='1 queue_if_no_path' hwhandler='0' wp=rw
>> `-+- policy='round-robin 0' prio=1 status=active
>>     |- 3:0:0:3 sdk 8:160 active ready running
>>     |- 4:0:0:3 sdn 8:208 active ready running
>>     |- 3:0:0:6 sdo 8:224 failed faulty running
>>     `- 4:0:0:6 sdp 8:240 failed faulty running
>> Those zombie paths are actually caused by cancelling the old export
>> relation on the storage array and changing to a new export relation
>> (given a different LUN number, the kernel creates a new device for it);
>> the old device stays in the system, which I call a zombie or stale path.
>>
>> I'm sorry that my first description wasn't clear and could be
>> misleading. The statement *a LUN can't be exported to a host with a
>> different LUN number at the same time* was actually not how the zombie
>> paths are found. I have tested that the storage has no such restriction:
>> we can export one LUN to a server with different LUN numbers at the same
>> time. But my patch doesn't care about that scenario, because a path that
>> is exported multiple times with different LUN numbers on the storage
>> array at the same time will have the same path status (either failed or
>> active).
> 
> If there are multiple routes to the storage, some of them can be down
> even if everything is fine on the storage.  This will cause some paths
> to be up and some to be down, regardless of the state of the LUN. In
> every other multipath case but this one, there is just one LUN, and not
> all the paths have the same state.
> 
> Ideally, there would be a way to determine if a path is a zombie, simply
> by looking at it alone.  The additional sense code "LOGICAL UNIT NOT
> SUPPORTED" that you posted earlier isn't one that I recall seeing for
> failed multipathd paths.  I'll check around more, but a quick look makes
> it appear that this code is only used when you are accessing a LUN that
> really isn't there. It's possible that the TUR checker could return a
> special path state for this, that would cause multipathd to remove the
> device.  Also, even if that additional sense code is only supposed to be
> used for this condition, we should still make removing a device that
> returns it configurable, because I can almost guarantee that there will
> be a SCSI device that doesn't follow the standard for this.
> 
Hi Ben,
You just mentioned *the TUR checker could return a special path state
for this*; what is that special path state? Thanks!

> -Ben
>   
>> My previous patch uses three conditions to find those paths:
>> (1) the path status is failed;
>> (2) a path can be found that has the same wwid but a different LUN
>> number (pp->sg_id.lun) than the failed path;
>> (3) that found path's status is active.
>>
>> Based on your analysis of support for all devices, I want to restrict
>> the cleanup to SCSI devices only.
>>
>> Above are my test results and reconsideration after your reply. Thanks a lot!
>>
>> Regards,
>> Chongyun
> 


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-09  6:47               ` Chongyun Wu
@ 2018-03-09 10:47                 ` Xose Vazquez Perez
  2018-03-09 16:22                 ` Benjamin Marzinski
  1 sibling, 0 replies; 18+ messages in thread
From: Xose Vazquez Perez @ 2018-03-09 10:47 UTC (permalink / raw)
  To: Chongyun Wu, Benjamin Marzinski
  Cc: Guozhonghua, dm-devel, Changwei Ge, Changlimin, Martin Wilck

On 03/09/2018 07:47 AM, Chongyun Wu wrote:

> You just mentioned *the TUR checker could return a special path state
> for this*; what is that special path state? Thanks!

To follow up on this bug, you should post:
- distribution
- kernel release
- multipath-tools release
- /etc/multipath.conf
- and relevant system logs (multipath -v3 -d, journalctl, dmesg, messages, ...)


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-09  6:47               ` Chongyun Wu
  2018-03-09 10:47                 ` Xose Vazquez Perez
@ 2018-03-09 16:22                 ` Benjamin Marzinski
  2018-03-19 21:42                   ` Martin Wilck
  1 sibling, 1 reply; 18+ messages in thread
From: Benjamin Marzinski @ 2018-03-09 16:22 UTC (permalink / raw)
  To: Chongyun Wu
  Cc: 'Xose Vazquez Perez',
	Guozhonghua, dm-devel, Changwei Ge, Changlimin, Martin Wilck

On Fri, Mar 09, 2018 at 06:47:30AM +0000, Chongyun Wu wrote:
> On 2018/3/8 23:45, Benjamin Marzinski wrote:
> > On Thu, Mar 08, 2018 at 08:03:50AM +0000, Chongyun Wu wrote:
> >> On 2018/3/7 20:45, Martin Wilck wrote:
> >>> On Wed, 2018-03-07 at 01:45 +0000, Chongyun Wu wrote:
> >>>>
> >>>> Hi Martin,
> >>>> Your analysis is correct. Do you have any good ideas to deal with
> >>>> this issue?
> >>>
> >>> Could you maybe explain what was causing the issue in the first place?
> >>> Did you reconfigure the storage in any particular way?
> >>>
> >>> If yes, I think "multipathd reconfigure" would be the correct way to
> >>> deal with the problem. It re-reads everything, so it should get rid of
> >>> the stale paths.
> >>>
> >>> Regards
> >>> Martin
> >>>
> >>
> >> I have used "multipathd reconfigure", but the zombie (or stale) paths
> >> are still here; even restarting multipath-tools can't clean them up.
> >>
> >> Issue reproduction steps:
> >> (1) export the LUN (LUN1) to the server (host1) with LUN number *6* on
> >> the storage array;
> >> (2) scan LUN1 on host1 and create the multipath map;
> >> (3) delete the multipath map on host1;
> >> (4) unexport LUN1 from host1 on the storage array;
> >> (5) export the LUN (LUN1) to the server (host1) with LUN number *3* on
> >> the storage array;
> >> (6) scan LUN1 on host1 and create the multipath map; you will see
> >> zombie paths like below:
> >> 360002ac000000000000004f40001e2d7 dm-5 3PARdata,VV
> >> size=13G features='1 queue_if_no_path' hwhandler='0' wp=rw
> >> `-+- policy='round-robin 0' prio=1 status=active
> >>     |- 3:0:0:3 sdk 8:160 active ready running
> >>     |- 4:0:0:3 sdn 8:208 active ready running
> >>     |- 3:0:0:6 sdo 8:224 failed faulty running
> >>     `- 4:0:0:6 sdp 8:240 failed faulty running
> >> Those zombie paths are actually caused by cancelling the old export
> >> relation on the storage array and changing to a new export relation
> >> (given a different LUN number, the kernel creates a new device for it);
> >> the old device stays in the system, which I call a zombie or stale path.
> >>
> >> I'm sorry that my first description wasn't clear and could be
> >> misleading. The statement *a LUN can't be exported to a host with a
> >> different LUN number at the same time* was actually not how the zombie
> >> paths are found. I have tested that the storage has no such restriction:
> >> we can export one LUN to a server with different LUN numbers at the same
> >> time. But my patch doesn't care about that scenario, because a path that
> >> is exported multiple times with different LUN numbers on the storage
> >> array at the same time will have the same path status (either failed or
> >> active).
> > 
> > If there are multiple routes to the storage, some of them can be down
> > even if everything is fine on the storage.  This will cause some paths
> > to be up and some to be down, regardless of the state of the LUN. In
> > every other multipath case but this one, there is just one LUN, and not
> > all the paths have the same state.
> > 
> > Ideally, there would be a way to determine if a path is a zombie, simply
> > by looking at it alone.  The additional sense code "LOGICAL UNIT NOT
> > SUPPORTED" that you posted earlier isn't one that I recall seeing for
> > failed multipathd paths.  I'll check around more, but a quick look makes
> > it appear that this code is only used when you are accessing a LUN that
> > really isn't there. It's possible that the TUR checker could return a
> > special path state for this, that would cause multipathd to remove the
> > device.  Also, even if that additional sense code is only supposed to be
> > used for this condition, we should still make removing a device that
> > returns it configurable, because I can almost guarantee that there will
> > be a SCSI device that doesn't follow the standard for this.
> > 
> Hi Ben,
> You just mentioned *the TUR checker could return a special path state
> for this*; what is that special path state? Thanks!
> 

We would have to add a new state, like PATH_NOT_SUPPORTED, that the TUR
checker could return in this case.  multipathd could be configured to
remove the path if it returned this state. If it wasn't configured to do
so, multipathd would just change the state to PATH_DOWN.
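
To see which sense data a suspected zombie path actually returns, you
can issue a TEST UNIT READY by hand with sg3_utils (a sketch; the device
name is taken from the example earlier in the thread, and the output is
abridged):

  sg_turs -v /dev/sdo
  # on a zombie path this should fail with sense key ILLEGAL REQUEST,
  # asc/ascq 0x25/0x00 ("LOGICAL UNIT NOT SUPPORTED")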

> > -Ben
> >   
> >> My previous patch uses three conditions to find those paths:
> >> (1) the path status is failed;
> >> (2) a path can be found that has the same wwid but a different LUN
> >> number (pp->sg_id.lun) than the failed path;
> >> (3) that found path's status is active.
> >>
> >> Based on your analysis of support for all devices, I want to restrict
> >> the cleanup to SCSI devices only.
> >>
> >> Above are my test results and reconsideration after your reply. Thanks a lot!
> >>
> >> Regards,
> >> Chongyun
> > 
> 
> 


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-09 16:22                 ` Benjamin Marzinski
@ 2018-03-19 21:42                   ` Martin Wilck
  2018-03-20  3:19                     ` Chongyun Wu
  0 siblings, 1 reply; 18+ messages in thread
From: Martin Wilck @ 2018-03-19 21:42 UTC (permalink / raw)
  To: Benjamin Marzinski, Chongyun Wu
  Cc: 'Xose Vazquez Perez',
	Guozhonghua, dm-devel, Changwei Ge, Changlimin

On Fri, 2018-03-09 at 10:22 -0600, Benjamin Marzinski wrote:
> On Fri, Mar 09, 2018 at 06:47:30AM +0000, Chongyun Wu wrote:
> > On 2018/3/8 23:45, Benjamin Marzinski wrote:
> > > 
> > > If there are multiple routes to the storage, some of them can be
> > > down even if everything is fine on the storage.  This will cause
> > > some paths to be up and some to be down, regardless of the state of
> > > the LUN. In every other multipath case but this one, there is just
> > > one LUN, and not all the paths have the same state.
> > > 
> > > Ideally, there would be a way to determine if a path is a zombie,
> > > simply by looking at it alone.  The additional sense code "LOGICAL
> > > UNIT NOT SUPPORTED" that you posted earlier isn't one that I recall
> > > seeing for failed multipathd paths.  I'll check around more, but a
> > > quick look makes it appear that this code is only used when you are
> > > accessing a LUN that really isn't there. It's possible that the TUR
> > > checker could return a special path state for this, that would
> > > cause multipathd to remove the device.  Also, even if that
> > > additional sense code is only supposed to be used for this
> > > condition, we should still make removing a device that returns it
> > > configurable, because I can almost guarantee that there will be a
> > > SCSI device that doesn't follow the standard for this.
> > > 
> > 
> > Hi Ben,
> > You just mentioned *the TUR checker could return a special path state
> > for this*; what is that special path state? Thanks!
> > 
> 
> We would have to add a new state, like PATH_NOT_SUPPORTED, that the TUR
> checker could return in this case.  multipathd could be configured to
> remove the path if it returned this state. If it wasn't configured to
> do so, multipathd would just change the state to PATH_DOWN.

Is it really multipathd's job to remove devices that return "LOGICAL
UNIT NOT SUPPORTED"? To me it sounds like a misconfiguration on the
SCSI/storage level, and I'm unsure if that's a thing multipathd should
mess with.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-19 21:42                   ` Martin Wilck
@ 2018-03-20  3:19                     ` Chongyun Wu
  2018-03-20  7:36                       ` Martin Wilck
  2018-03-20 14:58                       ` Bart Van Assche
  0 siblings, 2 replies; 18+ messages in thread
From: Chongyun Wu @ 2018-03-20  3:19 UTC (permalink / raw)
  To: Martin Wilck, Benjamin Marzinski
  Cc: 'Xose Vazquez Perez',
	Guozhonghua, dm-devel, Changwei Ge, Changlimin

On 2018/3/20 5:42, Martin Wilck wrote:
> On Fri, 2018-03-09 at 10:22 -0600, Benjamin Marzinski wrote:
>> On Fri, Mar 09, 2018 at 06:47:30AM +0000, Chongyun Wu wrote:
>>> On 2018/3/8 23:45, Benjamin Marzinski wrote:
>>>>
>>>> If there are multiple routes to the storage, some of them can be
>>>> down even if everything is fine on the storage.  This will cause
>>>> some paths to be up and some to be down, regardless of the state of
>>>> the LUN. In every other multipath case but this one, there is just
>>>> one LUN, and not all the paths have the same state.
>>>>
>>>> Ideally, there would be a way to determine if a path is a zombie,
>>>> simply by looking at it alone.  The additional sense code "LOGICAL
>>>> UNIT NOT SUPPORTED" that you posted earlier isn't one that I recall
>>>> seeing for failed multipathd paths.  I'll check around more, but a
>>>> quick look makes it appear that this code is only used when you are
>>>> accessing a LUN that really isn't there. It's possible that the TUR
>>>> checker could return a special path state for this, that would
>>>> cause multipathd to remove the device.  Also, even if that
>>>> additional sense code is only supposed to be used for this
>>>> condition, we should still make removing a device that returns it
>>>> configurable, because I can almost guarantee that there will be a
>>>> SCSI device that doesn't follow the standard for this.
>>>>
>>>
>>> Hi Ben,
>>> You just mentioned *the TUR checker could return a special path state
>>> for this*; what is that special path state? Thanks!
>>>
>>
>> We would have to add a new state, like PATH_NOT_SUPPORTED, that the TUR
>> checker could return in this case.  multipathd could be configured to
>> remove the path if it returned this state. If it wasn't configured to
>> do so, multipathd would just change the state to PATH_DOWN.
> 
> Is it really multipathd's job to remove devices that return "LOGICAL
> UNIT NOT SUPPORTED"? To me it sounds like a misconfiguration on the
> SCSI/storage level, and I'm unsure if that's a thing multipathd should
> mess with.
> 
> Martin
> 
Actually there are two scenarios:
(1) Export the LUN to a server with different LUN numbers at the same
time. As you mentioned, this scenario can be considered a
misconfiguration, which we might not care about.
(2) Export the LUN to a server with different LUN numbers, but not at
the same time. This operation may be legitimate; the customer just
wants to reassign the export relations on the storage. But the former
export leaves a residual device in the system, which gets adopted by
the newly exported device's multipath map. Also, there is a lot of
syslog noise for the former device, which actually no longer exists
(at least the customer doesn't think it exists; the customer wants
only the newly exported device to exist).

Regards,
Chongyun


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-20  3:19                     ` Chongyun Wu
@ 2018-03-20  7:36                       ` Martin Wilck
  2018-03-20 14:58                       ` Bart Van Assche
  1 sibling, 0 replies; 18+ messages in thread
From: Martin Wilck @ 2018-03-20  7:36 UTC (permalink / raw)
  To: Chongyun Wu, Benjamin Marzinski
  Cc: 'Xose Vazquez Perez',
	Guozhonghua, dm-devel, Changwei Ge, Changlimin

On Tue, 2018-03-20 at 03:19 +0000, Chongyun Wu wrote:
> On 2018/3/20 5:42, Martin Wilck wrote:
> > On Fri, 2018-03-09 at 10:22 -0600, Benjamin Marzinski wrote:
> > > On Fri, Mar 09, 2018 at 06:47:30AM +0000, Chongyun Wu wrote:
> > > > On 2018/3/8 23:45, Benjamin Marzinski wrote:
> > > > > 
> > > > > If there are multiple routes to the storage, some of them can
> > > > > be down even if everything is fine on the storage.  This will
> > > > > cause some paths to be up and some to be down, regardless of
> > > > > the state of the LUN. In every other multipath case but this
> > > > > one, there is just one LUN, and not all the paths have the same
> > > > > state.
> > > > >
> > > > > Ideally, there would be a way to determine if a path is a
> > > > > zombie, simply by looking at it alone.  The additional sense
> > > > > code "LOGICAL UNIT NOT SUPPORTED" that you posted earlier isn't
> > > > > one that I recall seeing for failed multipathd paths.  I'll
> > > > > check around more, but a quick look makes it appear that this
> > > > > code is only used when you are accessing a LUN that really
> > > > > isn't there. It's possible that the TUR checker could return a
> > > > > special path state for this, that would cause multipathd to
> > > > > remove the device.  Also, even if that additional sense code is
> > > > > only supposed to be used for this condition, we should still
> > > > > make removing a device that returns it configurable, because I
> > > > > can almost guarantee that there will be a SCSI device that
> > > > > doesn't follow the standard for this.
> > > > > 
> > > > 
> > > > Hi Ben,
> > > > You just mentioned *the TUR checker could return a special path
> > > > state for this*; what is that special path state? Thanks!
> > > > 
> > > 
> > > We would have to add a new state, like PATH_NOT_SUPPORTED, that the
> > > TUR checker could return in this case.  multipathd could be
> > > configured to remove the path if it returned this state. If it
> > > wasn't configured to do so, multipathd would just change the state
> > > to PATH_DOWN.
> > 
> > Is it really multipathd's job to remove devices that return "LOGICAL
> > UNIT NOT SUPPORTED"? To me it sounds like a misconfiguration on the
> > SCSI/storage level, and I'm unsure if that's a thing multipathd
> > should mess with.
> > 
> > Martin
> > 
> 
> Actually there are two scenarios:
> (1) Export the LUN to a server with different LUN numbers at the same
> time. As you mentioned, this scenario can be considered a
> misconfiguration, which we might not care about.
> (2) Export the LUN to a server with different LUN numbers, but not at
> the same time. This operation may be legitimate; the customer just
> wants to reassign the export relations on the storage. But the former
> export leaves a residual device in the system, which gets adopted by
> the newly exported device's multipath map. Also, there is a lot of
> syslog noise for the former device, which actually no longer exists
> (at least the customer doesn't think it exists; the customer wants
> only the newly exported device to exist).

I agree that the "residual device" should be removed from the system.
But I don't think that it's multipathd's job to detect and remove such
devices. Well, detecting and spitting out a message - maybe; but
removing - rather not. multipathd is for managing (dm-)multipath
devices, not for taking care of arbitrary problems on the storage
layer. That said, I'd be OK with a PATH_NOT_SUPPORTED state that would
result in the paths being treated like orphans or blacklisted devices.

Regards,
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-20  3:19                     ` Chongyun Wu
  2018-03-20  7:36                       ` Martin Wilck
@ 2018-03-20 14:58                       ` Bart Van Assche
  2018-03-20 15:12                         ` Xose Vazquez Perez
  2018-03-21  1:17                         ` Chongyun Wu
  1 sibling, 2 replies; 18+ messages in thread
From: Bart Van Assche @ 2018-03-20 14:58 UTC (permalink / raw)
  To: bmarzins, wu.chongyun, mwilck
  Cc: guozhonghua, dm-devel, changlimin, xose.vazquez, ge.changwei

On Tue, 2018-03-20 at 03:19 +0000, Chongyun Wu wrote:
> Actually there are two scenario:
> (1)Export the LUN to a server at the same time using different LUN nubmer.
> As you mentioned this scenario can be considered a misconfiguration 
> which we might not care about it.
> (2)Export the LUN to a server not at the same time using different LUN 
> number.
> This scenario's operation may be right, the customer just want to 
> reassignment the export relations in the storage.
> But the former export operation leave a residual device in the system 
> which will been adopted by the latter exported device's multipath. Also 
> there are lots of syslog for the former device which actually not 
> exist(at lest customer don't think it exists, the customer want only the 
> new exported device exist)

Hello Chongyun,

It is on purpose that the SCSI core does not remove stale SCSI device nodes.
If you want these stale SCSI device nodes to be removed automatically,
two possible approaches are (there might be other approaches):
* Write a new user space daemon that periodically checks for stale devices
  (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
   grep -v running) and that triggers a SCSI rescan if any stale devices are
  found.
* Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
  and that triggers a SCSI rescan if this event is triggered by the kernel
  (a sketch follows below).
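
A minimal sketch of the udev-rule approach (the rule and helper paths
are made up for illustration; this assumes the kernel emits the SDEV_UA
change uevent described above, and whether a rescan alone also drops
the stale nodes depends on the transport and kernel version):

  # /etc/udev/rules.d/90-lun-change-rescan.rules (hypothetical file name)
  ACTION=="change", SUBSYSTEM=="scsi", \
    ENV{SDEV_UA}=="REPORTED_LUNS_DATA_HAS_CHANGED", \
    RUN+="/usr/local/sbin/lun-rescan.sh"

  # /usr/local/sbin/lun-rescan.sh (hypothetical helper)
  #!/bin/sh
  # Ask every SCSI host to rescan all channels, targets and LUNs.
  for h in /sys/class/scsi_host/host*/scan; do
          echo "- - -" > "$h"
  done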

Bart.


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-20 14:58                       ` Bart Van Assche
@ 2018-03-20 15:12                         ` Xose Vazquez Perez
  2018-03-20 15:14                           ` Bart Van Assche
  2018-03-21  1:17                         ` Chongyun Wu
  1 sibling, 1 reply; 18+ messages in thread
From: Xose Vazquez Perez @ 2018-03-20 15:12 UTC (permalink / raw)
  To: Bart Van Assche, bmarzins, wu.chongyun, mwilck
  Cc: guozhonghua, dm-devel, changlimin, ge.changwei

On 03/20/2018 03:58 PM, Bart Van Assche wrote:

> It is on purpose that the SCSI core does not remove stale SCSI device nodes.
> If you want these stale SCSI device nodes to be removed automatically,
> two possible approaches are (there might be other approaches):
> * Write a new user space daemon that periodically checks for stale devices
>   (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
>    grep -v running) and that triggers a SCSI rescan if any stale devices are
>   found.
> * Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
>   and that triggers a SCSI rescan if this event is triggered by the kernel.

There are some "remove" flags in rescan-scsi-bus.sh:
https://github.com/hreinecke/sg3_utils/blob/d4dbbede04db21c206e4c2acc1cf766117f003c3/scripts/rescan-scsi-bus.sh#L1080

-r      enables removing of devices        [default: disabled]
--forceremove:   Remove stale devices (DANGEROUS)
--forcerescan:   Remove and readd existing devices (DANGEROUS)
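
For example, to limit a stale-device sweep to a single host (a sketch;
the flags are from the script's help text, and the DANGEROUS warnings
above apply):

  rescan-scsi-bus.sh -r --hosts=3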


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-20 15:12                         ` Xose Vazquez Perez
@ 2018-03-20 15:14                           ` Bart Van Assche
  2018-03-20 15:19                             ` Martin Wilck
                                               ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Bart Van Assche @ 2018-03-20 15:14 UTC (permalink / raw)
  To: xose.vazquez, bmarzins, wu.chongyun, mwilck
  Cc: guozhonghua, dm-devel, changlimin, ge.changwei

On Tue, 2018-03-20 at 16:12 +0100, Xose Vazquez Perez wrote:
> On 03/20/2018 03:58 PM, Bart Van Assche wrote:
> 
> > It is on purpose that the SCSI core does not remove stale SCSI device nodes.
> > If you want these stale SCSI device nodes to be removed automatically,
> > two possible approaches are (there might be other approaches):
> > * Write a new user space daemon that periodically checks for stale devices
> >   (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
> >    grep -v running) and that triggers a SCSI rescan if any stale devices are
> >   found.
> > * Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
> >   and that triggers a SCSI rescan if this event is triggered by the kernel.
> 
> There are some "remove" flags in rescan-scsi-bus.sh:
> https://github.com/hreinecke/sg3_utils/blob/d4dbbede04db21c206e4c2acc1cf766117f003c3/scripts/rescan-scsi-bus.sh#L1080
> 
> -r      enables removing of devices        [default: disabled]
> --forceremove:   Remove stale devices (DANGEROUS)
> --forcerescan:   Remove and readd existing devices (DANGEROUS)

Last time I checked, the rescan-scsi-bus.sh script relied on the SCSI sysfs
delete attribute to remove stale devices. That is the mechanism that can
trigger a deadlock in the kernel.

Bart.


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-20 15:14                           ` Bart Van Assche
@ 2018-03-20 15:19                             ` Martin Wilck
  2018-03-21  1:54                             ` Chongyun Wu
  2018-03-22  3:40                             ` Chongyun Wu
  2 siblings, 0 replies; 18+ messages in thread
From: Martin Wilck @ 2018-03-20 15:19 UTC (permalink / raw)
  To: Bart Van Assche, xose.vazquez, bmarzins, wu.chongyun
  Cc: guozhonghua, dm-devel, changlimin, ge.changwei

On Tue, 2018-03-20 at 15:14 +0000, Bart Van Assche wrote:
> On Tue, 2018-03-20 at 16:12 +0100, Xose Vazquez Perez wrote:
> > 
> > There are some "remove" flags in rescan-scsi-bus.sh:
> > https://github.com/hreinecke/sg3_utils/blob/d4dbbede04db21c206e4c2acc1cf766117f003c3/scripts/rescan-scsi-bus.sh#L1080
> > 
> > -r      enables removing of devices        [default: disabled]
> > --forceremove:   Remove stale devices (DANGEROUS)
> > --forcerescan:   Remove and readd existing devices (DANGEROUS)
> 
> Last time I checked, the rescan-scsi-bus.sh script relied on the SCSI
> sysfs delete attribute to remove stale devices. That is the mechanism
> that can trigger a deadlock in the kernel.

Is there an alternative the script could use? I'm not aware of any.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-20 14:58                       ` Bart Van Assche
  2018-03-20 15:12                         ` Xose Vazquez Perez
@ 2018-03-21  1:17                         ` Chongyun Wu
  1 sibling, 0 replies; 18+ messages in thread
From: Chongyun Wu @ 2018-03-21  1:17 UTC (permalink / raw)
  To: Bart Van Assche, bmarzins, mwilck
  Cc: Guozhonghua, dm-devel, Changlimin, xose.vazquez, Changwei Ge

On 2018/3/20 22:58, Bart Van Assche wrote:
> On Tue, 2018-03-20 at 03:19 +0000, Chongyun Wu wrote:
>> Actually there are two scenarios:
>> (1) Export the LUN to a server with different LUN numbers at the same
>> time. As you mentioned, this scenario can be considered a
>> misconfiguration, which we might not care about.
>> (2) Export the LUN to a server with different LUN numbers, but not at
>> the same time. This operation may be legitimate; the customer just
>> wants to reassign the export relations on the storage. But the former
>> export leaves a residual device in the system, which gets adopted by
>> the newly exported device's multipath map. Also, there is a lot of
>> syslog noise for the former device, which actually no longer exists
>> (at least the customer doesn't think it exists; the customer wants
>> only the newly exported device to exist).
> 
> Hello Chongyun,
> 
> It is on purpose that the SCSI core does not remove stale SCSI device nodes.
> If you want these stale SCSI device nodes to be removed automatically,
> two possible approaches are (there might be other approaches):
> * Write a new user space daemon that periodically checks for stale devices
>    (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
>     grep -v running) and that triggers a SCSI rescan if any stale devices are
>    found.
> * Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
>    and that triggers a SCSI rescan if this event is triggered by the kernel.
> 
> Bart.

Hi Bart,

Thank you very much for your advice. The two approaches are new ways
for me to clean up stale devices; I will give them a try.

Regards,
Chongyun


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-20 15:14                           ` Bart Van Assche
  2018-03-20 15:19                             ` Martin Wilck
@ 2018-03-21  1:54                             ` Chongyun Wu
  2018-03-21 19:56                               ` Bart Van Assche
  2018-03-22  3:40                             ` Chongyun Wu
  2 siblings, 1 reply; 18+ messages in thread
From: Chongyun Wu @ 2018-03-21  1:54 UTC (permalink / raw)
  To: Bart Van Assche, xose.vazquez, bmarzins, mwilck
  Cc: Guozhonghua, dm-devel, Changlimin, Changwei Ge

On 2018/3/20 23:14, Bart Van Assche wrote:
> On Tue, 2018-03-20 at 16:12 +0100, Xose Vazquez Perez wrote:
>> On 03/20/2018 03:58 PM, Bart Van Assche wrote:
>>
>>> It is on purpose that the SCSI core does not remove stale SCSI device nodes.
>>> If you want these stale SCSI device nodes to be removed automatically,
>>> two possible approaches are (there might be other approaches):
>>> * Write a new user space daemon that periodically checks for stale devices
>>>    (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
>>>     grep -v running) and that triggers a SCSI rescan if any stale devices are
>>>    found.
>>> * Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
>>>    and that triggers a SCSI rescan if this event is triggered by the kernel.
>>
>> There are some "remove" flags in rescan-scsi-bus.sh:
>> https://github.com/hreinecke/sg3_utils/blob/d4dbbede04db21c206e4c2acc1cf766117f003c3/scripts/rescan-scsi-bus.sh#L1080
>>
>> -r      enables removing of devices        [default: disabled]
>> --forceremove:   Remove stale devices (DANGEROUS)
>> --forcerescan:   Remove and readd existing devices (DANGEROUS)
> 
> Last time I checked, the rescan-scsi-bus.sh script relied on the SCSI sysfs
> delete attribute to remove stale devices. That is the mechanism that can
> trigger a deadlock in the kernel.
> 
> Bart.
> 

Hi Bart,

Are there any special operations or conditions needed to reproduce the
deadlock? I used the SCSI sysfs delete attribute to remove stale devices
in my previous patch and tested many times, but I haven't encountered
any deadlock problems.
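
For reference, by the SCSI sysfs delete attribute I mean something like
the following (the device name is taken from the earlier example):

  # remove the stale sd device node via sysfs
  echo 1 > /sys/block/sdo/device/delete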

Regards,
Chongyun


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-21  1:54                             ` Chongyun Wu
@ 2018-03-21 19:56                               ` Bart Van Assche
  2018-03-22  1:58                                 ` Chongyun Wu
  0 siblings, 1 reply; 18+ messages in thread
From: Bart Van Assche @ 2018-03-21 19:56 UTC (permalink / raw)
  To: xose.vazquez, wu.chongyun, bmarzins, mwilck
  Cc: guozhonghua, dm-devel, changlimin, ge.changwei

On Wed, 2018-03-21 at 01:54 +0000, Chongyun Wu wrote:
> Are there any special operations or conditions needed to reproduce the
> deadlock? I used the SCSI sysfs delete attribute to remove stale devices
> in my previous patch and tested many times, but I haven't encountered
> any deadlock problems.

Hello Chongyun,

In my tests I enabled the lock validator (CONFIG_LOCKDEP) in order not only
to obtain detailed information about the cause of actual deadlocks but also
to obtain information about potential deadlocks.
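
(For reference, a minimal sketch of enabling that when building a test
kernel; CONFIG_LOCKDEP is selected by PROVE_LOCKING, and this assumes
you build your own kernel:)

  scripts/config --enable PROVE_LOCKING
  make olddefconfig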

I'm not sure how to make it more likely to trigger this deadlock. But I
think you should be aware that some SCSI transports remove the SCSI host
if a transport failure is detected, while other SCSI transports don't. I
ran my tests with the SRP protocol, and the SRP transport removes the
SCSI host if a transport failure persists long enough. Maybe triggering
SCSI host removal often helps to trigger that deadlock.

Bart.


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-21 19:56                               ` Bart Van Assche
@ 2018-03-22  1:58                                 ` Chongyun Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Chongyun Wu @ 2018-03-22  1:58 UTC (permalink / raw)
  To: Bart Van Assche, xose.vazquez, bmarzins, mwilck
  Cc: Guozhonghua, dm-devel, Changlimin, Changwei Ge

On 2018/3/22 3:56, Bart Van Assche wrote:
> On Wed, 2018-03-21 at 01:54 +0000, Chongyun Wu wrote:
>> Are there any special operations or conditions needed to reproduce the
>> deadlock? I used the SCSI sysfs delete attribute to remove stale devices
>> in my previous patch and tested many times, but I haven't encountered
>> any deadlock problems.
> 
> Hello Chongyun,
> 
> In my tests I enabled the lock validator (CONFIG_LOCKDEP) in order not only
> to obtain detailed information about the cause of actual deadlocks but also
> to obtain information about potential deadlocks.
> 
> I'm not sure how to make it more likely to trigger this deadlock. But I
> think you should be aware that some SCSI transports remove the SCSI host
> if a transport failure is detected, while other SCSI transports don't. I
> ran my tests with the SRP protocol, and the SRP transport removes the
> SCSI host if a transport failure persists long enough. Maybe triggering
> SCSI host removal often helps to trigger that deadlock.
> 
> Bart.
> 
> 
> 
Hi Bart,

Thanks. As you mentioned, maybe we use a different SCSI transport, which
would explain why we haven't hit the deadlock so far. I hope the issue
you mentioned, *Avoid that SCSI device removal through sysfs triggers a
deadlock*, can be resolved at its root. We often encounter problems with
residual devices causing issues, and we want to find a safe and
effective method to clean them up.

Regards,
Chongyun


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-20 15:14                           ` Bart Van Assche
  2018-03-20 15:19                             ` Martin Wilck
  2018-03-21  1:54                             ` Chongyun Wu
@ 2018-03-22  3:40                             ` Chongyun Wu
  2018-03-22 15:18                               ` Bart Van Assche
  2 siblings, 1 reply; 18+ messages in thread
From: Chongyun Wu @ 2018-03-22  3:40 UTC (permalink / raw)
  To: Bart Van Assche, xose.vazquez, bmarzins, mwilck
  Cc: Guozhonghua, dm-devel, Changlimin, Changwei Ge

On 2018/3/20 23:14, Bart Van Assche wrote:
> On Tue, 2018-03-20 at 16:12 +0100, Xose Vazquez Perez wrote:
>> On 03/20/2018 03:58 PM, Bart Van Assche wrote:
>>
>>> It is on purpose that the SCSI core does not remove stale SCSI device nodes.
>>> If you want these stale SCSI device nodes to be removed automatically,
>>> two possible approaches are (there might be other approaches):
>>> * Write a new user space daemon that periodically checks for stale devices
>>>    (e.g. by running grep -aH . /sys/class/scsi_device/*/*/state |
>>>     grep -v running) and that triggers a SCSI rescan if any stale devices are
>>>    found.
>>> * Write a udev rule that listens for SDEV_UA=REPORTED_LUNS_DATA_HAS_CHANGED
>>>    and that triggers a SCSI rescan if this event is triggered by the kernel.
>>
>> There are some "remove" flags in rescan-scsi-bus.sh:
>> https://github.com/hreinecke/sg3_utils/blob/d4dbbede04db21c206e4c2acc1cf766117f003c3/scripts/rescan-scsi-bus.sh#L1080
>>
>> -r      enables removing of devices        [default: disabled]
>> --forceremove:   Remove stale devices (DANGEROUS)
>> --forcerescan:   Remove and readd existing devices (DANGEROUS)
> 
> Last time I checked, the rescan-scsi-bus.sh script relied on the SCSI sysfs
> delete attribute to remove stale devices. That is the mechanism that can
> trigger a deadlock in the kernel.
> 
> Bart.
> 
> 
Hi Bart,

I did a test. The command below can remove the residual device:
*echo "scsi remove-single-device 3 0 0 3" > /proc/scsi/scsi*
Is it safe?

Regards,
Chongyun


* Re: [PATCH] multipathd: check and cleanup zombie paths
  2018-03-22  3:40                             ` Chongyun Wu
@ 2018-03-22 15:18                               ` Bart Van Assche
  0 siblings, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2018-03-22 15:18 UTC (permalink / raw)
  To: xose.vazquez, wu.chongyun, bmarzins, mwilck
  Cc: guozhonghua, dm-devel, changlimin, ge.changwei

On Thu, 2018-03-22 at 03:40 +0000, Chongyun Wu wrote:
> I did a test. The command below can remove the residual device:
> *echo "scsi remove-single-device 3 0 0 3" > /proc/scsi/scsi*
> Is it safe?

Hello Chongyun,

Are you aware of the linux-scsi mailing list? I think this question would be
more appropriate for that mailing list. Regarding your question, I think that
you should be aware of the following comment above the proc write method that
implements that functionality: "this provides a legacy mechanism to add or
remove devices".

Bart.



Thread overview: 18+ messages
     [not found] <CEB9978CF3252343BE3C67AC9F0086A34295C462@H3CMLB14-EX.srv.huawei-3com.com>
     [not found] ` <1520325779.4131.4.camel@suse.com>
     [not found]   ` <CEB9978CF3252343BE3C67AC9F0086A34295C9D0@H3CMLB14-EX.srv.huawei-3com.com>
     [not found]     ` <1520349519.4131.20.camel@suse.com>
     [not found]       ` <CEB9978CF3252343BE3C67AC9F0086A34295CA7D@H3CMLB14-EX.srv.huawei-3com.com>
     [not found]         ` <1520426679.11340.5.camel@suse.com>
     [not found]           ` <CEB9978CF3252343BE3C67AC9F0086A34295CF62@H3CMLB14-EX.srv.huawei-3com.com>
2018-03-08 15:54             ` [PATCH] multipathd: check and cleanup zombie paths Xose Vazquez Perez
2018-03-09  6:11               ` Chongyun Wu
     [not found]             ` <20180308154435.GB14513@octiron.msp.redhat.com>
2018-03-09  6:47               ` Chongyun Wu
2018-03-09 10:47                 ` Xose Vazquez Perez
2018-03-09 16:22                 ` Benjamin Marzinski
2018-03-19 21:42                   ` Martin Wilck
2018-03-20  3:19                     ` Chongyun Wu
2018-03-20  7:36                       ` Martin Wilck
2018-03-20 14:58                       ` Bart Van Assche
2018-03-20 15:12                         ` Xose Vazquez Perez
2018-03-20 15:14                           ` Bart Van Assche
2018-03-20 15:19                             ` Martin Wilck
2018-03-21  1:54                             ` Chongyun Wu
2018-03-21 19:56                               ` Bart Van Assche
2018-03-22  1:58                                 ` Chongyun Wu
2018-03-22  3:40                             ` Chongyun Wu
2018-03-22 15:18                               ` Bart Van Assche
2018-03-21  1:17                         ` Chongyun Wu
