* [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
@ 2023-03-14 14:58 Li Xiao Keng
  2023-03-14 15:04 ` Martin Wilck
  2023-03-14 16:40 ` Coly Li
  0 siblings, 2 replies; 14+ messages in thread
From: Li Xiao Keng @ 2023-03-14 14:58 UTC (permalink / raw)
  To: Jes Sorensen, Martin Wilck, Paul Menzel, Coly Li, linux-raid
  Cc: linfeilong, louhongxiang, liuzhiqiang (I), miaoguanqin

Hi,
   We have run into a problem here: when we add a new disk to a RAID
array, the add may fail with -EBUSY.
   The main flow of --add (for example, adding sdf to md0) is:
       1.dev_open(sdf)
       2.add_to_super
       3.write_init_super
       4.fsync(fd)
       5.close(fd)
       6.ioctl(ADD_NEW_DISK).
   However, a udev "change" event is generated for sdf after step 5. Then
"/usr/sbin/mdadm --incremental --export $devnode --offroot $env{DEVLINKS}"
is run by the udev rule, and sdf is added to md0. After that, step 6 returns
-EBUSY.
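   To make the timing clearer, here is a rough sketch of the sequence in C
(simplified, not the real mdadm code; the helper name and device numbers
are only an example), marking where the udev event fires:

        /* Sketch only -- not the real mdadm code; error handling omitted. */
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/raid/md_u.h>

        int add_disk_sketch(const char *md_node, const char *member_node)
        {
                int dfd = open(member_node, O_RDWR | O_EXCL); /* 1. dev_open(sdf) */
                /* 2./3. add_to_super + write_init_super: write the member superblock */
                fsync(dfd);                                   /* 4. flush metadata to disk */
                close(dfd);                                   /* 5. close-after-write: udevd sees
                                                               *    this via inotify, emits a
                                                               *    "change" event for sdf and runs
                                                               *    "mdadm --incremental ..." */
                int mdfd = open(md_node, O_RDWR);
                mdu_disk_info_t info = { .major = 8, .minor = 80 };  /* sdf, for example */
                int ret = ioctl(mdfd, ADD_NEW_DISK, &info);   /* 6. may now return EBUSY even
                                                               *    though sdf ends up in md0 */
                close(mdfd);
                return ret;
        }
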
   This is a problem for the user: the first --add does not report
success, yet the disk is actually added. I have no good idea how to deal
with this. Any advice would be appreciated.

Regards,
Li Xiao Keng

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-14 14:58 [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export" Li Xiao Keng
@ 2023-03-14 15:04 ` Martin Wilck
  2023-03-14 15:59   ` Mariusz Tkaczyk
  2023-03-14 16:40 ` Coly Li
  1 sibling, 1 reply; 14+ messages in thread
From: Martin Wilck @ 2023-03-14 15:04 UTC (permalink / raw)
  To: Li Xiao Keng, Jes Sorensen, Paul Menzel, Coly Li, linux-raid
  Cc: linfeilong, louhongxiang, liuzhiqiang (I), miaoguanqin

On Tue, 2023-03-14 at 22:58 +0800, Li Xiao Keng wrote:
> Hi,
>    Here we meet a question. When we add a new disk to a raid, it may
> return
> -EBUSY.
>    The main process of --add(for example md0, sdf):
>        1.dev_open(sdf)
>        2.add_to_super
>        3.write_init_super
>        4.fsync(fd)
>        5.close(fd)
>        6.ioctl(ADD_NEW_DISK).
>    However, there will be some udev(change of sdf) event after step5.
> Then
> "/usr/sbin/mdadm --incremental --export $devnode --offroot
> $env{DEVLINKS}"
> will be run, and the sdf will be added to md0. After that, step6 will
> return
> -EBUSY.
>    It is a problem to user. First time adding disk does not return
> success
> but disk is actually added. And I have no good idea to deal with it.
> Please
> give some great advice.

I haven't looked at the code in detail, but off the top of my head, it
should help to execute step 5 after step 6. The close() in step 5
triggers the uevent via inotify; doing it after the ioctl should avoid
the above problem.

Another obvious workaround in mdadm would be to check the state of the
array in the EBUSY case and find out that the disk had already been
added.
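
Something along these lines (an untested sketch, not actual mdadm code;
the helper is made up, but md does expose member devices under
/sys/block/<md>/md/dev-<member>):

        /* Untested sketch: treat EBUSY as "check whether udev already added it". */
        #include <errno.h>
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/raid/md_u.h>

        /* Returns 0 if the disk is (or ended up) in the array, -1 on real failure. */
        static int add_new_disk_tolerant(int mdfd, const char *md_name,
                                         const char *member_name, mdu_disk_info_t *info)
        {
                if (ioctl(mdfd, ADD_NEW_DISK, info) == 0)
                        return 0;
                if (errno != EBUSY)
                        return -1;
                /* EBUSY: maybe the udev-triggered "mdadm --incremental" got there first */
                char path[256];
                snprintf(path, sizeof(path), "/sys/block/%s/md/dev-%s",
                         md_name, member_name);
                return access(path, F_OK) == 0 ? 0 : -1;
        }

A short retry or delay might be needed as well, since the incremental add
could still be in flight when the check runs.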

But again, this was just a high-level guess.

Martin


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-14 15:04 ` Martin Wilck
@ 2023-03-14 15:59   ` Mariusz Tkaczyk
  2023-03-14 16:11     ` Martin Wilck
  0 siblings, 1 reply; 14+ messages in thread
From: Mariusz Tkaczyk @ 2023-03-14 15:59 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Li Xiao Keng, Jes Sorensen, Paul Menzel, Coly Li, linux-raid,
	linfeilong, louhongxiang, liuzhiqiang (I),
	miaoguanqin

On Tue, 14 Mar 2023 16:04:23 +0100
Martin Wilck <mwilck@suse.com> wrote:

> On Tue, 2023-03-14 at 22:58 +0800, Li Xiao Keng wrote:
> > Hi,
> >    Here we meet a question. When we add a new disk to a raid, it may
> > return
> > -EBUSY.
> >    The main process of --add(for example md0, sdf):
> >        1.dev_open(sdf)
> >        2.add_to_super
> >        3.write_init_super
> >        4.fsync(fd)
> >        5.close(fd)
> >        6.ioctl(ADD_NEW_DISK).
> >    However, there will be some udev(change of sdf) event after step5.
> > Then
> > "/usr/sbin/mdadm --incremental --export $devnode --offroot
> > $env{DEVLINKS}"
> > will be run, and the sdf will be added to md0. After that, step6 will
> > return
> > -EBUSY.
> >    It is a problem to user. First time adding disk does not return
> > success
> > but disk is actually added. And I have no good idea to deal with it.
> > Please
> > give some great advice.  
> 
> I haven't looked at the code in detail, but off the top of my head, it
> should help to execute step 5 after step 6. The close() in step 5
> triggers the uevent via inotify; doing it after the ioctl should avoid
> the above problem.
Hi,
That will result in EBUSY every time: mdadm will still be holding the
descriptor, and the kernel will refuse to add the drive.

> 
> Another obvious workaround in mdadm would be to check the state of the
> array in the EBUSY case and find out that the disk had already been
> added.
> 
> But again, this was just a high-level guess.
> 
> Martin
> 

Hmm... I'm not a native-metadata expert, but why can't we write the metadata
after adding the drive to the array? Why can't the kernel handle that?

Ideally, we should lock the device and block udev. I know there is a
flock()-based API to do that, but I'm not sure whether flock() wouldn't
cause the same problem.
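
For reference, the flock()-based convention I have in mind looks roughly
like this (untested sketch; udevd itself tries to take a shared lock on
the whole-disk node before running rules, so holding LOCK_EX should make
it back off until we are done):

        /* Untested sketch of the udev block-device locking convention. */
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/file.h>

        int with_member_locked(const char *whole_disk)      /* e.g. "/dev/sdf" */
        {
                int fd = open(whole_disk, O_RDONLY | O_CLOEXEC);
                if (fd < 0)
                        return -1;
                if (flock(fd, LOCK_EX) < 0) {                /* keeps udev rules from running */
                        close(fd);
                        return -1;
                }
                /* ... write metadata, ADD_NEW_DISK, etc. ... */
                flock(fd, LOCK_UN);                          /* or simply close(fd) */
                close(fd);
                return 0;
        }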

There is also something like "udev-md-raid-creating.rules". Maybe we can reuse
it?

Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-14 15:59   ` Mariusz Tkaczyk
@ 2023-03-14 16:11     ` Martin Wilck
  2023-03-15 10:10       ` Mariusz Tkaczyk
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Wilck @ 2023-03-14 16:11 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: Li Xiao Keng, Jes Sorensen, Paul Menzel, Coly Li, linux-raid,
	linfeilong, louhongxiang, liuzhiqiang (I),
	miaoguanqin

On Tue, 2023-03-14 at 16:59 +0100, Mariusz Tkaczyk wrote:
> On Tue, 14 Mar 2023 16:04:23 +0100
> Martin Wilck <mwilck@suse.com> wrote:
> 
> > On Tue, 2023-03-14 at 22:58 +0800, Li Xiao Keng wrote:
> > > Hi,
> > >    Here we meet a question. When we add a new disk to a raid, it
> > > may
> > > return
> > > -EBUSY.
> > >    The main process of --add(for example md0, sdf):
> > >        1.dev_open(sdf)
> > >        2.add_to_super
> > >        3.write_init_super
> > >        4.fsync(fd)
> > >        5.close(fd)
> > >        6.ioctl(ADD_NEW_DISK).
> > >    However, there will be some udev(change of sdf) event after
> > > step5.
> > > Then
> > > "/usr/sbin/mdadm --incremental --export $devnode --offroot
> > > $env{DEVLINKS}"
> > > will be run, and the sdf will be added to md0. After that, step6
> > > will
> > > return
> > > -EBUSY.
> > >    It is a problem to user. First time adding disk does not
> > > return
> > > success
> > > but disk is actually added. And I have no good idea to deal with
> > > it.
> > > Please
> > > give some great advice.  
> > 
> > I haven't looked at the code in detail, but off the top of my head,
> > it
> > should help to execute step 5 after step 6. The close() in step 5
> > triggers the uevent via inotify; doing it after the ioctl should
> > avoid
> > the above problem.
> Hi,
> That will result in EBUSY in everytime. mdadm will always handle
> descriptor and kernel will refuse to add the drive.

Why would it cause EBUSY? Please elaborate. My suggestion would avoid
the race described by Li Xiao Keng. I only suggested to postpone the
close(), nothing else. The fsync() would still be done before the
ioctl, so the metadata should be safely on disk when the ioctl is run.

This is a recurring pattern. Tools that manipulate block devices must
be aware that close-after-write triggers an uevent, and should make
sure that they don't close() such files prematurely.

> > 
> > Another obvious workaround in mdadm would be to check the state of
> > the
> > array in the EBUSY case and find out that the disk had already been
> > added.
> > 
> > But again, this was just a high-level guess.
> > 
> > Martin
> > 
> 
> Hmm... I'm not a native expert but why we cannot write metadata after
> adding
> drive to array? Why kernel can't handle that?
> 
> Ideally, we should lock device and block udev- I know that there is
> flock
> based API to do that but I'm not sure if flock() won't cause the same
> problem.

That doesn't work reliably, at least not in general. The mechanism is
disabled for dm devices (e.g. multipath), for example. See
https://github.com/systemd/systemd/blob/a5c0ad9a9a2964079a19a1db42f79570a3582bee/src/udev/udevd.c#L483


> There is also something like "udev-md-raid-creating.rules". Maybe we
> can reuse
> it?
> 

Unless I am mistaken, these rules are exactly those that cause the
issue we are discussing. Since these rules are also part of the mdadm
package, it might be possible to set some flag under /run that would
indicate to the rules that auto-assembly should be skipped. But that
might be racy, too.

Martin

> Thanks,
> Mariusz


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-14 14:58 [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export" Li Xiao Keng
  2023-03-14 15:04 ` Martin Wilck
@ 2023-03-14 16:40 ` Coly Li
  2023-03-15  2:25   ` Li Xiao Keng
  1 sibling, 1 reply; 14+ messages in thread
From: Coly Li @ 2023-03-14 16:40 UTC (permalink / raw)
  To: Li Xiao Keng
  Cc: Jes Sorensen, Martin Wilck, Paul Menzel, linux-raid, linfeilong,
	louhongxiang, liuzhiqiang (I),
	miaoguanqin



> On 14 Mar 2023, at 22:58, Li Xiao Keng <lixiaokeng@huawei.com> wrote:
> 
> Hi,
>   Here we meet a question. When we add a new disk to a raid, it may return
> -EBUSY.
>   The main process of --add(for example md0, sdf):
>       1.dev_open(sdf)
>       2.add_to_super
>       3.write_init_super
>       4.fsync(fd)
>       5.close(fd)
>       6.ioctl(ADD_NEW_DISK).
>   However, there will be some udev(change of sdf) event after step5. Then
> "/usr/sbin/mdadm --incremental --export $devnode --offroot $env{DEVLINKS}"
> will be run, and the sdf will be added to md0. After that, step6 will return
> -EBUSY.
>   It is a problem to user. First time adding disk does not return success
> but disk is actually added. And I have no good idea to deal with it. Please
> give some great advice.

BTW, may I ask what is the mdadm version used in the above scenario?

Thanks.

Coly Li


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-14 16:40 ` Coly Li
@ 2023-03-15  2:25   ` Li Xiao Keng
  0 siblings, 0 replies; 14+ messages in thread
From: Li Xiao Keng @ 2023-03-15  2:25 UTC (permalink / raw)
  To: Coly Li
  Cc: Jes Sorensen, Martin Wilck, Paul Menzel, linux-raid, linfeilong,
	louhongxiang, liuzhiqiang (I),
	miaoguanqin



On 2023/3/15 0:40, Coly Li wrote:
> 
> 
>> On 14 Mar 2023, at 22:58, Li Xiao Keng <lixiaokeng@huawei.com> wrote:
>>
>> Hi,
>>   Here we meet a question. When we add a new disk to a raid, it may return
>> -EBUSY.
>>   The main process of --add(for example md0, sdf):
>>       1.dev_open(sdf)
>>       2.add_to_super
>>       3.write_init_super
>>       4.fsync(fd)
>>       5.close(fd)
>>       6.ioctl(ADD_NEW_DISK).
>>   However, there will be some udev(change of sdf) event after step5. Then
>> "/usr/sbin/mdadm --incremental --export $devnode --offroot $env{DEVLINKS}"
>> will be run, and the sdf will be added to md0. After that, step6 will return
>> -EBUSY.
>>   It is a problem to user. First time adding disk does not return success
>> but disk is actually added. And I have no good idea to deal with it. Please
>> give some great advice.
> 
> BTW, may I ask what is the mdadm version used in the above scenario?
> 
The mdadm version is 4.1. After reading the latest upstream code, I think
the problem still exists there.

> Thanks.
> 
> Coly Li
> 
> .
> 

Regards
Li Xiao Keng

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-14 16:11     ` Martin Wilck
@ 2023-03-15 10:10       ` Mariusz Tkaczyk
  2023-03-15 13:10         ` Li Xiao Keng
  0 siblings, 1 reply; 14+ messages in thread
From: Mariusz Tkaczyk @ 2023-03-15 10:10 UTC (permalink / raw)
  To: Martin Wilck, Song Liu
  Cc: Li Xiao Keng, Jes Sorensen, Paul Menzel, Coly Li, linux-raid,
	linfeilong, louhongxiang, liuzhiqiang (I),
	miaoguanqin

On Tue, 14 Mar 2023 17:11:06 +0100
Martin Wilck <mwilck@suse.com> wrote:

> On Tue, 2023-03-14 at 16:59 +0100, Mariusz Tkaczyk wrote:
> > On Tue, 14 Mar 2023 16:04:23 +0100
> > Martin Wilck <mwilck@suse.com> wrote:
> >   
> > > On Tue, 2023-03-14 at 22:58 +0800, Li Xiao Keng wrote:  
> > > > Hi,
> > > >    Here we meet a question. When we add a new disk to a raid, it
> > > > may
> > > > return
> > > > -EBUSY.
> > > >    The main process of --add(for example md0, sdf):
> > > >        1.dev_open(sdf)
> > > >        2.add_to_super
> > > >        3.write_init_super
> > > >        4.fsync(fd)
> > > >        5.close(fd)
> > > >        6.ioctl(ADD_NEW_DISK).
> > > >    However, there will be some udev(change of sdf) event after
> > > > step5.
> > > > Then
> > > > "/usr/sbin/mdadm --incremental --export $devnode --offroot
> > > > $env{DEVLINKS}"
> > > > will be run, and the sdf will be added to md0. After that, step6
> > > > will
> > > > return
> > > > -EBUSY.
> > > >    It is a problem to user. First time adding disk does not
> > > > return
> > > > success
> > > > but disk is actually added. And I have no good idea to deal with
> > > > it.
> > > > Please
> > > > give some great advice.    
> > > 
> > > I haven't looked at the code in detail, but off the top of my head,
> > > it
> > > should help to execute step 5 after step 6. The close() in step 5
> > > triggers the uevent via inotify; doing it after the ioctl should
> > > avoid
> > > the above problem.  
> > Hi,
> > That will result in EBUSY in everytime. mdadm will always handle
> > descriptor and kernel will refuse to add the drive.  
> 
> Why would it cause EBUSY? Please elaborate. My suggestion would avoid
> the race described by Li Xiao Keng. I only suggested to postpone the
> close(), nothing else. The fsync() would still be done before the
> ioctl, so the metadata should be safely on disk when the ioctl is run.

Because the device will be claimed in userspace by mdadm. MD may check
whether the device is claimed. I checked bind_rdev_to_array() and didn't
find an obvious answer, so I could be wrong here.

Maybe someone with more kernel experience can weigh in here? Song, could
you take a look?

I think the descriptor opened by Incremental blocks the kernel from adding
the device to the array, but that same Incremental run is able to add it
later, because the metadata is already on the device. There is no retry in
the Manage_add() flow, so the disk must end up being added by Incremental.
The question is whether, at that point, it is already in the array or the
disk is merely claimed, and that is the problem.

In any case, you can test your proposal; that should give us an answer.

> 
> This is a recurring pattern. Tools that manipulate block devices must
> be aware that close-after-write triggers an uevent, and should make
> sure that they don't close() such files prematurely.

Agreed. mdadm has this problem; descriptors are opened and closed many times.
> 
> > > 
> > > Another obvious workaround in mdadm would be to check the state of
> > > the
> > > array in the EBUSY case and find out that the disk had already been
> > > added.
> > > 
> > > But again, this was just a high-level guess.
> > > 
> > > Martin
> > >   
> > 
> > Hmm... I'm not a native expert but why we cannot write metadata after
> > adding
> > drive to array? Why kernel can't handle that?
> > 

OK, there is an assumption in the kernel that the device MUST already have
metadata.

> > Ideally, we should lock device and block udev- I know that there is
> > flock
> > based API to do that but I'm not sure if flock() won't cause the same
> > problem.  
> 
> That doesn't work reliably. At least not in general. The mechanmism is
> disabled for for dm devices (e.g. multipath), for example. See
> https://github.com/systemd/systemd/blob/a5c0ad9a9a2964079a19a1db42f79570a3582bee/src/udev/udevd.c#L483
> 
Yeah, I know, but udev processes the disk, not the MD device, so the locking
should work. But if we cannot trust it, we shouldn't follow this idea.
>  
> > There is also something like "udev-md-raid-creating.rules". Maybe we
> > can reuse
> > it?
> >   
> 
> Unless I am mistaken, these rules are exactly those that cause the
> issue we are discussing. Since these rules are also part of the mdadm
> package, it might be possible to set some flag under /run that would
> indicate to the rules that auto-assembly should be skipped. But that
> might be racy, too.

Yeah, bad idea. Agree.

New day, new ideas:
- Why can't we let udev add the device? Just write the metadata and finish;
  udev should handle the rest, because the metadata is there.

- Incremental uses map_lock() to prevent concurrent updates; it seems to be
  missing for --add. That could be the key to preventing this behavior.
  Incremental checks whether it can lock the map file; if not, it finishes
  gracefully. To lock the array we need to read the metadata first, so it
  comes back to the first question: is it a problem that Incremental has an
  open descriptor?

Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-15 10:10       ` Mariusz Tkaczyk
@ 2023-03-15 13:10         ` Li Xiao Keng
  2023-03-15 14:14           ` Martin Wilck
  0 siblings, 1 reply; 14+ messages in thread
From: Li Xiao Keng @ 2023-03-15 13:10 UTC (permalink / raw)
  To: Mariusz Tkaczyk, Martin Wilck, Song Liu
  Cc: Jes Sorensen, Paul Menzel, Coly Li, linux-raid, linfeilong,
	louhongxiang, liuzhiqiang (I),
	miaoguanqin



On 2023/3/15 18:10, Mariusz Tkaczyk wrote:
> On Tue, 14 Mar 2023 17:11:06 +0100
> Martin Wilck <mwilck@suse.com> wrote:
> 
>> On Tue, 2023-03-14 at 16:59 +0100, Mariusz Tkaczyk wrote:
>>> On Tue, 14 Mar 2023 16:04:23 +0100
>>> Martin Wilck <mwilck@suse.com> wrote:
>>>   
>>>> On Tue, 2023-03-14 at 22:58 +0800, Li Xiao Keng wrote:  
>>>>> Hi,
>>>>>    Here we meet a question. When we add a new disk to a raid, it
>>>>> may
>>>>> return
>>>>> -EBUSY.
>>>>>    The main process of --add(for example md0, sdf):
>>>>>        1.dev_open(sdf)
>>>>>        2.add_to_super
>>>>>        3.write_init_super
>>>>>        4.fsync(fd)
>>>>>        5.close(fd)
>>>>>        6.ioctl(ADD_NEW_DISK).
>>>>>    However, there will be some udev(change of sdf) event after
>>>>> step5.
>>>>> Then
>>>>> "/usr/sbin/mdadm --incremental --export $devnode --offroot
>>>>> $env{DEVLINKS}"
>>>>> will be run, and the sdf will be added to md0. After that, step6
>>>>> will
>>>>> return
>>>>> -EBUSY.
>>>>>    It is a problem to user. First time adding disk does not
>>>>> return
>>>>> success
>>>>> but disk is actually added. And I have no good idea to deal with
>>>>> it.
>>>>> Please
>>>>> give some great advice.    
>>>>
>>>> I haven't looked at the code in detail, but off the top of my head,
>>>> it
>>>> should help to execute step 5 after step 6. The close() in step 5
>>>> triggers the uevent via inotify; doing it after the ioctl should
>>>> avoid
>>>> the above problem.  
>>> Hi,
>>> That will result in EBUSY in everytime. mdadm will always handle
>>> descriptor and kernel will refuse to add the drive.  
>>
>> Why would it cause EBUSY? Please elaborate. My suggestion would avoid
>> the race described by Li Xiao Keng. I only suggested to postpone the
>> close(), nothing else. The fsync() would still be done before the
>> ioctl, so the metadata should be safely on disk when the ioctl is run.
> 
> Because device will be claimed in userspace by mdadm. MD may check if device
> is not claimed. I checked bind_rdev_to_array() and I don't find an obvious
> answer, so I could be wrong here.
> 
  I tested moving close() after ioctl(). The reason for EBUSY is that mdadm
opens sdf with O_EXCL, so the fd has to be closed before the ioctl. When I
remove O_EXCL, the ioctl() returns success. I guess MD checks whether the
device is claimed in md_import_device()->lock_rdev()->blkdev_get_by_dev()->blkdev_get().
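
The conflict can be reproduced with a small test like this (a sketch of
what I tried, not the exact code; the device paths and numbers are for
sdf/md0 in my setup):

        /* Sketch: while sdf is held with O_EXCL, ADD_NEW_DISK on md0 returns
         * EBUSY, because the kernel also claims the member exclusively. */
        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/raid/md_u.h>

        int main(void)
        {
                int dfd = open("/dev/sdf", O_RDWR | O_EXCL);        /* userspace claim */
                int mdfd = open("/dev/md0", O_RDWR);
                mdu_disk_info_t info = { .major = 8, .minor = 80 }; /* sdf */
                if (ioctl(mdfd, ADD_NEW_DISK, &info) < 0)
                        perror("ADD_NEW_DISK");             /* EBUSY while dfd is open */
                close(dfd);                                 /* after this, a retry works */
                close(mdfd);
                return 0;
        }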

> Maybe someone more kernel experienced can speak here? Song, could you look?
> 
> I think that the descriptor opened by incremental block kernel from adding the
> device to the array but also the same incremental is able to add it later
> because metadata is on device. There is no retry in Manage_add() flow so it
> must be added by Incremental so the question is if it is already in
> array or disk is just claimed and that is the problem.
> 
> Eventually, you can test your proposal that should gives us an answer.
> 
>>
>> This is a recurring pattern. Tools that manipulate block devices must
>> be aware that close-after-write triggers an uevent, and should make
>> sure that they don't close() such files prematurely.
> 
> Agree. Mdadm has this problem, descriptors are opened and closed may times.

Yes, descriptors are closed in write_init_super1(), Kill() and
tst->ss->free_super(tst) between the fsync() and the ioctl().

>>
>>>>
>>>> Another obvious workaround in mdadm would be to check the state of
>>>> the
>>>> array in the EBUSY case and find out that the disk had already been
>>>> added.
>>>>
>>>> But again, this was just a high-level guess.
>>>>
>>>> Martin
>>>>   
>>>
>>> Hmm... I'm not a native expert but why we cannot write metadata after
>>> adding
>>> drive to array? Why kernel can't handle that?
>>>
> 
> Ok, there is an assumption in kernel that device MUST have metadata.
> 
>>> Ideally, we should lock device and block udev- I know that there is
>>> flock
>>> based API to do that but I'm not sure if flock() won't cause the same
>>> problem.  
>>
>> That doesn't work reliably. At least not in general. The mechanmism is
>> disabled for for dm devices (e.g. multipath), for example. See
>> https://github.com/systemd/systemd/blob/a5c0ad9a9a2964079a19a1db42f79570a3582bee/src/udev/udevd.c#L483
>>
> Yeah, I know but udev processes the disk, not MD device so the locking
> should work. But if we cannot trust it, we shouldn't follow this idea.
>>  
>>> There is also something like "udev-md-raid-creating.rules". Maybe we
>>> can reuse
>>> it?
>>>   
>>
>> Unless I am mistaken, these rules are exactly those that cause the
>> issue we are discussing. Since these rules are also part of the mdadm
>> package, it might be possible to set some flag under /run that would
>> indicate to the rules that auto-assembly should be skipped. But that
>> might be racy, too.
> 
> Yeah, bad idea. Agree.
> 
> New day, new ideas:
> - why we cannot let udev to add the device? just write metadata and finish,
>   udev should handle that because metadata is there.
  If we let udev add the device, we cannot determine from the --add return
value whether the disk was added successfully. For example, writing the
metadata might succeed while the udev-triggered add fails.
> 
> - incremental uses map_lock() to prevent concurrent updates, I seems to b
>   missed for --add. That could be a key to prevent the behavior.Incremental is
>   checking if it can lock the map file. If not, it finishes gracefully. To lock
>   array we need to read metadata first, so it goes to the first question- is it
>   a problem that incremental has opened descriptor?
> 
> Thanks,
> Mariusz
> .
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-15 13:10         ` Li Xiao Keng
@ 2023-03-15 14:14           ` Martin Wilck
  2023-03-15 14:57             ` Li Xiao Keng
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Wilck @ 2023-03-15 14:14 UTC (permalink / raw)
  To: Li Xiao Keng, Mariusz Tkaczyk, Song Liu
  Cc: Jes Sorensen, Paul Menzel, Coly Li, linux-raid, linfeilong,
	louhongxiang, liuzhiqiang (I),
	miaoguanqin

On Wed, 2023-03-15 at 21:10 +0800, Li Xiao Keng wrote:
> > 
>   I test move close() after ioctl(). The reason of EBUSY is that
> mdadm
> open(sdf) with O_EXCL. So fd should be closed before ioctl. When I
> remove
> O_EXCL, ioctl() will return success.

This makes sense. I suppose mdadm must use O_EXCL if it modifies RAID
metadata, otherwise data corruption is just too likely. It is also
impossible to drop the O_EXCL flag with fcntl() without closing the fd.

So, if mdadm must close the fd before calling ioctl(), the race can
hardly be avoided. The close() will cause a uevent, and nothing
prevents the udev rules from running before the ioctl() returns.

Unless we want to explore the flag-in-tmpfs idea, I think mdadm must
expect this to happen, and deal with -EBUSY accordingly.

But first we should get an answer to Coly's question, which version of
mdadm is in use?

Martin


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-15 14:14           ` Martin Wilck
@ 2023-03-15 14:57             ` Li Xiao Keng
  2023-03-15 15:01               ` Martin Wilck
  0 siblings, 1 reply; 14+ messages in thread
From: Li Xiao Keng @ 2023-03-15 14:57 UTC (permalink / raw)
  To: Martin Wilck, Mariusz Tkaczyk, Song Liu
  Cc: Jes Sorensen, Paul Menzel, Coly Li, linux-raid, linfeilong,
	louhongxiang, liuzhiqiang (I),
	miaoguanqin



On 2023/3/15 22:14, Martin Wilck wrote:
> On Wed, 2023-03-15 at 21:10 +0800, Li Xiao Keng wrote:
>>>
>>   I test move close() after ioctl(). The reason of EBUSY is that
>> mdadm
>> open(sdf) with O_EXCL. So fd should be closed before ioctl. When I
>> remove
>> O_EXCL, ioctl() will return success.
> 
> This makes sense. I suppose mdadm must use O_EXCL if it modifies RAID
> meta data, otherwise data corruption is just too likely. It is also
> impossible to drop the O_EXCL flag with fcntl() without closing the fd.
> 
> So, if mdadm must close the fd before calling ioctl(), the race can
> hardly be avoided. The close() will cause a uevent, and nothing
> prevents the udev rules from running before the ioctl() returns.
> 
  Now I find that close() causes a "change" udev event. Is it necessary to
run "mdadm --incremental --export" on every "change" event? Can we ignore it?

> Unless we want to explore the flag-in-tmpfs idea, I think mdadm must
> expect this to happen, and deal with -EBUSY accordingly.
> 
> But first we should get an answer to Coly's question, which version of
> mdadm is in use?
> 
  I use mdadm 4.1 and kernel 5.10.
> Martin
> 
> .
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-15 14:57             ` Li Xiao Keng
@ 2023-03-15 15:01               ` Martin Wilck
  2023-03-16 10:44                 ` Mariusz Tkaczyk
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Wilck @ 2023-03-15 15:01 UTC (permalink / raw)
  To: Li Xiao Keng, Mariusz Tkaczyk, Song Liu
  Cc: Jes Sorensen, Paul Menzel, Coly Li, linux-raid, linfeilong,
	louhongxiang, liuzhiqiang (I),
	miaoguanqin

On Wed, 2023-03-15 at 22:57 +0800, Li Xiao Keng wrote:
> 
> 
> On 2023/3/15 22:14, Martin Wilck wrote:
> > On Wed, 2023-03-15 at 21:10 +0800, Li Xiao Keng wrote:
> > > > 
> > >   I test move close() after ioctl(). The reason of EBUSY is that
> > > mdadm
> > > open(sdf) with O_EXCL. So fd should be closed before ioctl. When
> > > I
> > > remove
> > > O_EXCL, ioctl() will return success.
> > 
> > This makes sense. I suppose mdadm must use O_EXCL if it modifies
> > RAID
> > meta data, otherwise data corruption is just too likely. It is also
> > impossible to drop the O_EXCL flag with fcntl() without closing the
> > fd.
> > 
> > So, if mdadm must close the fd before calling ioctl(), the race can
> > hardly be avoided. The close() will cause a uevent, and nothing
> > prevents the udev rules from running before the ioctl() returns.
> > 
>   Now I find that close() cause a change udev. Is it necessary to
> import
> "mdadm --incremental --export" when change udev cause? Can we ignore
> it?

Normally this is what you want to happen when a change uevent for an MD
member device is processed.

The case you're looking at is the exception, as another instance of
mdadm is handling the device right at this point in time.

Martin


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-15 15:01               ` Martin Wilck
@ 2023-03-16 10:44                 ` Mariusz Tkaczyk
  2023-03-20 15:36                   ` Martin Wilck
  0 siblings, 1 reply; 14+ messages in thread
From: Mariusz Tkaczyk @ 2023-03-16 10:44 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Li Xiao Keng, Song Liu, Jes Sorensen, Paul Menzel, Coly Li,
	linux-raid, linfeilong, louhongxiang, liuzhiqiang (I),
	miaoguanqin

On Wed, 15 Mar 2023 16:01:26 +0100
Martin Wilck <mwilck@suse.com> wrote:

> On Wed, 2023-03-15 at 22:57 +0800, Li Xiao Keng wrote:
> > 
> > 
> > On 2023/3/15 22:14, Martin Wilck wrote:  
> > > On Wed, 2023-03-15 at 21:10 +0800, Li Xiao Keng wrote:  
> > > > >   
> > > >   I test move close() after ioctl(). The reason of EBUSY is that
> > > > mdadm
> > > > open(sdf) with O_EXCL. So fd should be closed before ioctl. When
> > > > I
> > > > remove
> > > > O_EXCL, ioctl() will return success.  
> > > 
> > > This makes sense. I suppose mdadm must use O_EXCL if it modifies
> > > RAID
> > > meta data, otherwise data corruption is just too likely. It is also
> > > impossible to drop the O_EXCL flag with fcntl() without closing the
> > > fd.
> > > 
> > > So, if mdadm must close the fd before calling ioctl(), the race can
> > > hardly be avoided. The close() will cause a uevent, and nothing
> > > prevents the udev rules from running before the ioctl() returns.
> > >   
> >   Now I find that close() cause a change udev. Is it necessary to
> > import
> > "mdadm --incremental --export" when change udev cause? Can we ignore
> > it?  
> 
> Normally this is what you want to happen if a change uevent for a MD
> member device is processed.
> 
> The case you're looking at is the exception, as another instance of
> mdamn is handling the device right at this point in time.
> 
> Martin
> 
Hi,
Code snippet from Incremental; devname seems to be our disk:

	/* 4/ Check if array exists.
	 */
	if (map_lock(&map))
		pr_err("failed to get exclusive lock on mapfile\n");
	/* Now check we can get O_EXCL.  If not, probably "mdadm -A" has
	 * taken over
	 */
	dfd = dev_open(devname, O_RDONLY|O_EXCL);

Taking map_lock in --add should resolve your issue. It seems to be the
simplest option. If we cannot lock udev effectively, then it seems to be
the best one. We need to control adding the device, because external
metadata is quite different; for example, in IMSM the initial metadata
doesn't point to the container the user wants. I hope this clarifies all
objections here.
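
In other words, Manage_add() could take the same lock around the metadata
write and the ioctl, roughly (pseudo-code only, untested, reusing the
names from the snippet above):

	struct map_ent *map = NULL;

	if (map_lock(&map))
		pr_err("failed to get exclusive lock on mapfile\n");
	fd = dev_open(devname, O_RDWR|O_EXCL);
	/* add_to_super / write_init_super / fsync / close */
	/* ioctl(ADD_NEW_DISK) */
	map_unlock(&map);	/* only now let a concurrent --incremental run */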

I don't see any blocker to using the locking mechanism for --add.

Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-16 10:44                 ` Mariusz Tkaczyk
@ 2023-03-20 15:36                   ` Martin Wilck
  2023-03-20 15:51                     ` Mariusz Tkaczyk
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Wilck @ 2023-03-20 15:36 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: Li Xiao Keng, Song Liu, Jes Sorensen, Paul Menzel, Coly Li,
	linux-raid, linfeilong, louhongxiang, liuzhiqiang (I),
	miaoguanqin

On Thu, 2023-03-16 at 11:44 +0100, Mariusz Tkaczyk wrote:
> On Wed, 15 Mar 2023 16:01:26 +0100
> Martin Wilck <mwilck@suse.com> wrote:
> > 
> > Normally this is what you want to happen if a change uevent for a
> > MD
> > member device is processed.
> > 
> > The case you're looking at is the exception, as another instance of
> > mdamn is handling the device right at this point in time.
> > 
> > Martin
> > 
> Hi,
> Code snipped from Incremental, devname seems to be our disk:
> 
>         /* 4/ Check if array exists.
>          */
>         if (map_lock(&map))
>                 pr_err("failed to get exclusive lock on mapfile\n");
>         /* Now check we can get O_EXCL.  If not, probably "mdadm -A"
> has
>          * taken over
>          */
>         dfd = dev_open(devname, O_RDONLY|O_EXCL);
> 
> The map_lock in --add should resolve your issue. It seems to be
> simplest
> option. I we cannot lock udev effectively then it seems to be the
> best one.
> We need to control adding the device because external metadata is
> quite
> different and for example in IMSM the initial metadata doesn't point
> to
> container wanted by user. Hope it clarifies all objections here.
> 
> I don't see any blocker from using locking mechanism for --add.


AFAICS it would only help if the code snippet above didn't just
pr_err(), but exited when it can't get the exclusive lock.

Martin


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
  2023-03-20 15:36                   ` Martin Wilck
@ 2023-03-20 15:51                     ` Mariusz Tkaczyk
  0 siblings, 0 replies; 14+ messages in thread
From: Mariusz Tkaczyk @ 2023-03-20 15:51 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Li Xiao Keng, Song Liu, Jes Sorensen, Paul Menzel, Coly Li,
	linux-raid, linfeilong, louhongxiang, liuzhiqiang (I),
	miaoguanqin

On Mon, 20 Mar 2023 16:36:44 +0100
Martin Wilck <mwilck@suse.com> wrote:

> On Thu, 2023-03-16 at 11:44 +0100, Mariusz Tkaczyk wrote:
> > On Wed, 15 Mar 2023 16:01:26 +0100
> > Martin Wilck <mwilck@suse.com> wrote:  
> > > 
> > > Normally this is what you want to happen if a change uevent for a
> > > MD
> > > member device is processed.
> > > 
> > > The case you're looking at is the exception, as another instance of
> > > mdamn is handling the device right at this point in time.
> > > 
> > > Martin
> > >   
> > Hi,
> > Code snipped from Incremental, devname seems to be our disk:
> > 
> >         /* 4/ Check if array exists.
> >          */
> >         if (map_lock(&map))
> >                 pr_err("failed to get exclusive lock on mapfile\n");
> >         /* Now check we can get O_EXCL.  If not, probably "mdadm -A"
> > has
> >          * taken over
> >          */
> >         dfd = dev_open(devname, O_RDONLY|O_EXCL);
> > 
> > The map_lock in --add should resolve your issue. It seems to be
> > simplest
> > option. I we cannot lock udev effectively then it seems to be the
> > best one.
> > We need to control adding the device because external metadata is
> > quite
> > different and for example in IMSM the initial metadata doesn't point
> > to
> > container wanted by user. Hope it clarifies all objections here.
> > 
> > I don't see any blocker from using locking mechanism for --add.  
> 
> 
> AFAICS it would only help if the code snipped above did not only
> pr_err() but exit if it can't get an exclusive lock.
> 
> Martin
> 

Indeed, I missed that... Thanks for catching it. My idea is no longer valid.

Mariusz

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-03-20 16:03 UTC | newest]

Thread overview: 14+ messages
2023-03-14 14:58 [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export" Li Xiao Keng
2023-03-14 15:04 ` Martin Wilck
2023-03-14 15:59   ` Mariusz Tkaczyk
2023-03-14 16:11     ` Martin Wilck
2023-03-15 10:10       ` Mariusz Tkaczyk
2023-03-15 13:10         ` Li Xiao Keng
2023-03-15 14:14           ` Martin Wilck
2023-03-15 14:57             ` Li Xiao Keng
2023-03-15 15:01               ` Martin Wilck
2023-03-16 10:44                 ` Mariusz Tkaczyk
2023-03-20 15:36                   ` Martin Wilck
2023-03-20 15:51                     ` Mariusz Tkaczyk
2023-03-14 16:40 ` Coly Li
2023-03-15  2:25   ` Li Xiao Keng
