* Detecting that an array has been stopped
From: Ian Pilcher @ 2013-09-24 20:20 UTC
  To: linux-raid

I've successfully gotten my NAS monitoring program to check the status
of my RAID arrays by parsing /proc/mdstat.  (Definitely a PITA, but I
did get to learn about RAID 10 layouts and POSIX regular expressions.)
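
For illustration, here is a minimal Python sketch of that kind of /proc/mdstat parsing. The post shows no code, so the language, the function name `parse_mdstat`, and the returned dict layout are all assumptions. It only handles array header lines and the `[n/m] [UU]` health field; resync/recovery progress and bitmap lines are ignored.

```python
import re

# Array header, e.g. "md9 : active raid1 sda13[0] sdb13[1]" or
# "md126 : inactive dm-13[0] dm-12[4](F) ...". Inactive arrays show no
# personality, so the level group is optional.
HEADER_RE = re.compile(
    r"^(md\w+)\s*:\s*(active|inactive)"
    r"(?:\s+\((?:read-only|auto-read-only)\))?"
    r"(?:\s+(raid\d+|linear|multipath|faulty))?"
    r"\s*(.*)$"
)
# Member device with its slot number and optional flags, e.g. "dm-12[4](F)".
MEMBER_RE = re.compile(r"(\S+?)\[(\d+)\]((?:\([A-Z]\))*)")
# Health summary on a following line, e.g. "[2/2] [UU]".
STATUS_RE = re.compile(r"\[\d+/\d+\]\s+\[([U_]+)\]")


def parse_mdstat(text):
    """Return {name: {"state", "level", "members", "status"}} per array."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            name, state, level, rest = header.groups()
            members = [
                (dev, int(slot), "(F)" in flags)  # (device, slot, faulty?)
                for dev, slot, flags in MEMBER_RE.findall(rest)
            ]
            current = {"state": state, "level": level,
                       "members": members, "status": None}
            arrays[name] = current
        elif not line.strip():
            current = None  # a blank line ends the current array's block
        elif current is not None:
            status = STATUS_RE.search(line)
            if status:
                current["status"] = status.group(1)
    return arrays
```

Run against the md126/md9 listing quoted later in this thread, it reports md126 as inactive with no level and dm-12 and dm-9 flagged (F), and md9 as an active raid1 with status "UU".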

I'm now thinking about how to make the program robust in the situation
where the array names (in /proc/mdstat) aren't necessarily stable.  For
example, a couple of arrays might be stopped for some sort of
maintenance activity and "swap" names when they are reassembled.

The obvious answer is to use mdadm to check the UUIDs of the arrays, but
I don't want to do that every time I check the RAID status (currently
every 30 seconds).  So my plan is to only read the UUID of an array
when it first appears in /proc/mdstat (i.e. it wasn't there the last
time I read the file).
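
A sketch of the "only look up the UUID for new names" idea, assuming the lookup is done with `mdadm --detail --export`, which prints KEY=VALUE lines including `MD_UUID`. The helper names `newly_appeared`, `parse_mdadm_export` and `read_uuid` are hypothetical.

```python
import subprocess


def newly_appeared(previous_names, current_names):
    """Array names present in this scan but absent from the previous one."""
    return set(current_names) - set(previous_names)


def parse_mdadm_export(output):
    """Extract the MD_UUID value from `mdadm --detail --export` output."""
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "MD_UUID":
            return value.strip()
    return None


def read_uuid(device):
    """Run mdadm once for a newly seen array (typically needs root)."""
    out = subprocess.run(
        ["mdadm", "--detail", "--export", device],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_mdadm_export(out)
```

Only names returned by `newly_appeared` would be passed to `read_uuid`, so mdadm runs once per array appearance rather than on every 30-second poll.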

This will work as long as the program notices that an array has been
stopped before a (possibly different) array appears with the same name.
So it would be nice if there were a simple way to reliably detect that
a particular array has been stopped -- even if a different array has
since been started with the same name.  It appears that I can do this
pretty easily with sysfs.

From my initial testing, it looks like I can open each array's
array_state file when I first detect the array, and lseek/read will
fail with ENODEV if the array is ever stopped -- even if the array is
restarted (with the same or a different name) or if a different array is
started with the same name.
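
A minimal sketch of the check being described, assuming the standard md sysfs path `/sys/block/<name>/md/array_state`. The ENODEV behaviour is what the post reports from testing, not something this code guarantees; `open_array_state` and `still_running` are illustrative names.

```python
import errno
import os


def open_array_state(name):
    """Open (and keep open) the sysfs array_state file for e.g. "md9"."""
    return os.open(f"/sys/block/{name}/md/array_state", os.O_RDONLY)


def still_running(fd):
    """Return False if the held descriptor now fails with ENODEV.

    Relies on the observed behaviour that a descriptor opened before the
    array was stopped keeps failing, even if the same name is reused.
    """
    try:
        os.lseek(fd, 0, os.SEEK_SET)  # rewind so each check reads from offset 0
        os.read(fd, 1)                # one byte is enough to probe the file
    except OSError as e:
        if e.errno == errno.ENODEV:
            return False
        raise                         # any other error is unexpected
    return True
```

The descriptor would be opened once, when the name first appears, and each poll would then call `still_running(fd)` on it.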

It seems almost too easy.

Is there any reason that this approach won't work?

Thanks!

-- 
========================================================================
Ian Pilcher                                         arequipeno@gmail.com
Sometimes there's nothing left to do but crash and burn...or die trying.
========================================================================



* Re: Detecting that an array has been stopped
From: CoolCold @ 2013-09-27 23:00 UTC
  To: Ian Pilcher; +Cc: Linux RAID

Hello!
I'm a bit confused about what you mean by "swap names". If you have a
proper mdadm.conf, you will get consistent array names even after a
stop/start cycle. Keeping mdadm.conf within the initrd (many distros do
this by default) will make you happy in case of a reboot too.
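
To make the point concrete: each ARRAY line in mdadm.conf ties a device name to a UUID. The sketch below (hypothetical helper `array_uuids`) extracts those pairs. It is deliberately simplified: it handles `#` comments, continuation lines (lines starting with whitespace, per mdadm.conf(5)) and the `UUID=` keyword only, not the other identity tags.

```python
import re


def array_uuids(conf_text):
    """Map each ARRAY line's device name to its UUID= value."""
    logical = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0]          # drop comments (simplified)
        if not line.strip():
            continue
        if line[0].isspace() and logical:    # continuation of previous line
            logical[-1] += " " + line.strip()
        else:
            logical.append(line.strip())
    result = {}
    for line in logical:
        fields = line.split()
        if fields[0] == "ARRAY" and len(fields) > 1:
            m = re.search(r"\bUUID=(\S+)", line)
            if m:
                result[fields[1]] = m.group(1)
    return result
```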

Hope this info will be useful for you.

-- 
Best regards,
[COOLCOLD-RIPN]


* Re: Detecting that an array has been stopped
From: CoolCold @ 2013-09-27 23:03 UTC
  To: Ian Pilcher; +Cc: Linux RAID

Also, you may be interested in the check_raid Nagios plugin: "check_raid
(rev1.119): plugin to check sw/hw RAID status. The plugin looks for any
known types of RAID configurations, and checks them all." -
https://apps.ubuntu.com/cat/applications/nagios-plugins-contrib/



* Can running array device name change? - Was: Detecting that an array has been stopped
From: Ian Pilcher @ 2013-10-11  4:48 UTC
  To: linux-raid

TL;DR: Is there any way that the kernel device name of a running MD RAID
       array (as shown in /proc/mdstat) can be changed without stopping
       the array?

On 09/27/2013 06:00 PM, CoolCold wrote:
> I'm a bit confused by what you mean with "swap names" - if you have
> proper mdadm.conf , you will get consistent array names even after
> stop/start cycle . Keeping mdadm.conf within initrd (many distros do
> this by default), will make you happy in case of reboot too.

Sorry for the delay in responding to this.  I've been off on a journey
to the darkest depths of pthreads, signals, and child processes.  Uugh!

Anyway ... The background to my original question is that I want my
monitoring daemon to deal as intelligently as is (reasonably) possible
with whatever the OS throws at it.  So while it's true that arrays that
are listed in mdadm.conf won't suffer from "unstable" device names, it's
also possible that not every array will always be listed.

In fact, my workstation currently shows the following in /proc/mdstat:

Personalities : [raid1] [raid10]
md126 : inactive dm-13[0] dm-12[4](F) dm-11[3] dm-10[2] dm-9[1](F)
      6288384 blocks super 1.2

md9 : active raid1 sda13[0] sdb13[1]
      81234872 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
...

It turns out that my host OS auto-detected the RAID array from my test
VM (which is why the component devices are logical volumes).  It
initially showed up as md127, and I've since changed its name by
stopping and reassembling it.

It's also possible that I (or someone else who wants to use the program)
may decide that worrying about array device numbers is simply
unnecessary.  Everything is on LVM anyway, so why do I really care what
device name is assigned to a particular array?

Here's what I've come up with:

1. My program maintains a list of known arrays, including each array's
   UUID (always) and last known kernel device name (if any).

2. At program startup, I read the UUIDs of "static" arrays from
   /etc/mdadm.conf.  These "static" arrays are supposed to always be running;
   the program will alert if a static array is not present in
   /proc/mdstat.

3. When parsing an array in /proc/mdstat, I search the list for an array
   with a matching device name.  (The first time through, I obviously
   won't find any because I only read the UUIDs from mdadm.conf.)

4. If no array with a matching device name is found, I do the following:

   a. Open a file descriptor to the array's array_state file in sysfs.
   b. Use mdadm to read the UUID of the array.
   c. Search the list for an array with a matching UUID.
   d. If I find an array with a matching UUID, I update its device name
      and replace its old file descriptor with the new one.
   e. If I don't find an array with a matching UUID, I add the new
      array to my list, marking it as "transient" (i.e. the program
      won't alert if it goes away).

5. If I found an array with a matching device name in step #3, I use
   lseek/read on its associated file descriptor to attempt to read the
   first byte of its array_state file.  I have found that the read will
   result in an ENODEV if the array has been stopped at any time since
   the file descriptor was originally opened (even if the same array or
   a different array has subsequently been started with the same device
   name).

6. If the read in step #5 succeeds, then I know that the array has not
   been stopped since I opened the file descriptor.  AFAIK, this means
   that its kernel device name has not changed, since I do not believe
   that there is any way that the device name of a running array can change.

7. If the read in step #5 causes an ENODEV error, then I know that the
   array that was using this device name at my last scan of /proc/mdstat
   has been stopped.  The array that is now using the device name may or
   may not be the same array, so I do the following:

   a. Close its (now useless) file descriptor.
   b. Clear the array's device name in my list.  (The UUID is the
      canonical identifier, not the name.)
   c. Treat the array as if its device name were newly encountered;
      goto step 4a.
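
The procedure above, condensed into a Python sketch. The sysfs open, the mdadm UUID lookup, the ENODEV liveness test and close are passed in as callables (assumed helpers) so the bookkeeping can be exercised without real arrays; `KnownArray` and `ArrayTracker` are hypothetical names. It opens the descriptor (4a) before reading the UUID (4b), so a stop that happens in between is still caught on the next scan.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class KnownArray:
    uuid: str                   # canonical identity (step 1)
    name: Optional[str] = None  # last known kernel device name, e.g. "md9"
    fd: Optional[int] = None    # descriptor for its array_state file
    static: bool = False        # from mdadm.conf: alert if it is missing


class ArrayTracker:
    """Bookkeeping for steps 1-7; all I/O goes through injected callables.

    open_state(name) -> fd, read_uuid(name) -> str, still_running(fd) -> bool
    and close(fd) are assumed helpers (e.g. os.open on the sysfs
    array_state file, an mdadm lookup, an lseek/read ENODEV test, os.close).
    """

    def __init__(self, static_uuids: Iterable[str], open_state, read_uuid,
                 still_running, close):
        self.arrays: List[KnownArray] = [
            KnownArray(uuid, static=True) for uuid in static_uuids]
        self._open = open_state
        self._read_uuid = read_uuid
        self._alive = still_running
        self._close = close

    def _by_name(self, name):
        return next((a for a in self.arrays if a.name == name), None)

    def _by_uuid(self, uuid):
        return next((a for a in self.arrays if a.uuid == uuid), None)

    def observe(self, name: str) -> bool:
        """Process one array seen in /proc/mdstat.

        Returns True if the name <-> UUID mapping had to be (re)established.
        """
        known = self._by_name(name)                  # step 3
        if known is not None:
            if self._alive(known.fd):                # steps 5-6: unchanged
                return False
            self._close(known.fd)                    # step 7a
            known.name = known.fd = None             # step 7b
        fd = self._open(name)                        # step 4a (before 4b)
        uuid = self._read_uuid(name)                 # step 4b
        entry = self._by_uuid(uuid)                  # step 4c
        if entry is None:                            # step 4e: transient
            entry = KnownArray(uuid)
            self.arrays.append(entry)
        elif entry.fd is not None:                   # step 4d: drop old fd
            self._close(entry.fd)
        entry.name, entry.fd = name, fd
        return True

    def missing_static(self, seen_names: Iterable[str]) -> List[str]:
        """UUIDs of static arrays not currently present under any name."""
        seen = set(seen_names)
        return [a.uuid for a in self.arrays
                if a.static and a.name not in seen]
```

In the swapped-names scenario from the first message, both old descriptors report the stop, and the two arrays end up re-mapped by UUID rather than by name.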

At a higher level, this allows me to do the following:

1. Read /proc/mdstat.

2. Parse /proc/mdstat, checking for any potential device name changes or
   new arrays since the last scan.

3. If any potential name changes or new arrays are encountered, go back
   to step #1.

Once this loop completes (which will almost always be in a single pass),
I can be confident that the /proc/mdstat contents that I'm parsing
correspond to a known device name <--> UUID mapping.
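
The outer loop might look like this, again as a sketch with assumed callables: `observe` would be something like the per-array reconciliation of steps 3-7, returning True when the name/UUID mapping changed. The `max_passes` cap is an addition not in the post, to avoid spinning forever if names keep changing.

```python
def scan_until_stable(read_mdstat, parse_names, observe, max_passes=5):
    """Re-read /proc/mdstat until a full pass sees no mapping changes.

    read_mdstat() -> str, parse_names(text) -> iterable of array names and
    observe(name) -> bool are assumed callables.
    """
    for _ in range(max_passes):
        text = read_mdstat()
        changed = False
        for name in parse_names(text):
            if observe(name):   # call observe for every name, no short-circuit
                changed = True
        if not changed:
            return text         # consistent with the known name <-> UUID map
    raise RuntimeError("array names kept changing; giving up for this cycle")
```

Returning the text from the stable pass lets the caller parse array status from exactly the snapshot that was validated.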

Of course this all only works if my assumption that the kernel device
name of an array can't change while the array is running is correct.



