* RE: Bitmap did not survive reboot
       [not found] <4AF9ABAA.1020407@redhat.com>
@ 2009-11-11  3:34 ` Leslie Rhorer
  2009-11-11  3:46   ` Leslie Rhorer
                     ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-11  3:34 UTC (permalink / raw)
  To: linux-raid


> >> Notice that bringing up all raids normally happens before mounting
> >> filesystems.  So, with your bitmaps on a partition that likely isn't
> >> mounted when the raids are brought up, how would it ever work?
> >
> > 	It's possible it never has, since as I say I don't ordinarily reboot
> > these systems.  How can I work around this?  As you can see from my
> original
> > post, the boot and root file systems are reiserfs, and we are cautioned
> not
> > to put the bitmap on anything other than an ext2 or ext3 file system,
> which
> > is why I created the small /dev/hd4 partition.
> 
> Sure, that makes sense, but given that usually filesystems are on raid
> devices and not the other way around, it makes sense that the raids are
> brought up first ;-)

	I see your point.  It's certainly a simpler approach.

> > 	I suppose I could put a script in /etc/init.d et al in order to grow
> > the array with a bitmap every time the system boots, but that's clumsy,
> and
> > I could imagine it could create a problem if the array ever comes up
> > unclean.
> >
> > 	I guess I could also drop to runlevel 0, wipe the root partition,
> > recreate it as an ext3, and copy all the files back, but I don't relish
> the
> > idea.  Is there any way to convert a reiserfs partition to an ext3 on
> the
> > fly?
> >
> > 	Oh, and just BTW, Debian does not employ rc.sysinit.
> 
> I don't grok debian unfortunately :-(  However, short of essentially
> going in and hand-editing the startup files to either A) do the bitmap
> mount early or B) both start and mount md0 late, you won't get it to
> work.

	Well, it's supposed to be pretty simple, but I just ran across
something very odd.  Instead of using an rc.sysinit file, Debian maintains a
directory in /etc for each runlevel named rcN.d, where N is the runlevel,
plus one named rcS.d and a file named rc.local.  The rc.local is run after
exiting any multi-user runlevel, and normally does nothing but quit with an
exit code of 0.  Generally, the files in the rcN.d and rcS.d directories are
all just symlinks to scripts in /etc/init.d.  The convention is the link
names are of the form Sxxyyyy or Kxxyyyy, where xx is a number between 01
and 99 and yyyy is just some mnemonic text.  Any link with a leading "K" is
taken to be disabled and is thus ignored by the system.

	The scripts in rcS.d are executed during a system boot, before
entering any runlevel, including a single user runlevel.  In addition to
running everything in rcS.d at boot time, whenever entering runlevel N, all
the files in rcN.d are executed.  Each file is executed in order by its
name.  Thus, all the S01 - S10 scripts are run before S20, etc.  By the time
any S40xxx script runs in rcS.d, all the local file systems should be
mounted, networking should be available, and all device drivers should be
initialized.  By the time any S60xxx script is run, the system clock should
be set, any NFS file systems should be mounted (unless they depend upon the
automounter), and any file system cleaning should be done.  The first RAID
script in rcS.d is S25mdadm-raid and the first call to the mount script is
S35mountall.sh.  Thus, as you say, the RAID systems are loaded before the
system attempts to mount anything other than /.  The default runlevel in
Debian is 2, so during ordinary booting, everything in rcS.d should run
followed by everything in rc2.d.
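
	As an aside, that ordering is easy to verify directly.  A quick
sketch, using the standard Debian paths (adjust to your own system):

# Boot-time scripts, listed in the order init will run them
ls /etc/rcS.d/
# Confirm that the RAID startup link sorts before the local mount link
ls -l /etc/rcS.d/S25mdadm-raid /etc/rcS.d/S35mountall.sh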

	Here's what's weird, and it can't really be correct... I don't
think.  In both rcS.d and rc2.d (and no doubt others), there are two
scripts:

RAID-Server:/etc# ll rcS.d/*md*
lrwxrwxrwx 1 root root 20 2008-11-21 22:35 rcS.d/S25mdadm-raid ->
../init.d/mdadm-raid
lrwxrwxrwx 1 root root 20 2008-12-27 18:35 rcS.d/S99mdadm_monitor ->
../init.d/mdadm-raid
RAID-Server:/etc# ll rc2.d/*md*
lrwxrwxrwx 1 root root 15 2008-11-21 22:35 rc2.d/S25mdadm -> ../init.d/mdadm
lrwxrwxrwx 1 root root 20 2008-12-27 18:36 rc2.d/S99mdadm_monitor ->
../init.d/mdadm-raid

	Note both S99mdadm_monitor links point to /etc/init.d/mdadm-raid,
and so does the S25mdadm-raid script in rc2.d, while the /etc/rc2.d/S25mdadm
script points to /etc/init.d/mdadm.  The mdadm-raid script starts up the
RAID process, and the mdadm script runs the monitor.  It seems to me the
only link which is really correct is the rcS.d/S25mdadm.  At the very least
I would think both the S99mdadm_monitor links should point to init.d/mdadm
(which, after all is the script which starts the monitor) and that
rc2.d/S25mdadm-raid would point to init.d/mdadm, just as the
rcS.d/S25mdadm-raid link does.  Of course, since the RAID startup script
does get called before any of the others, and since the script only shuts
down RAID for runlevel 0 (halt) or runlevel 6 (reboot) and not for runlevel
1 - 5 or S, it still works OK, but I don't think it's really correct.  Can
someone else comment?

	Getting back to my dilemma, however, I suppose I could simply create
an /etc/rcS.d/S24mounthda4 script that explicitly mounts /dev/hda4 to
/etc/mdadm/bitmap, or I could modify the init.d/mdadm-raid script to
explicitly mount the /dev/hda4 partition if it is not already mounted.
Editing the init.d/mdadm-raid script is a bit cleaner and perhaps clearer,
but any update to mdadm is liable to wipe out the modifications to the
startup script.
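
	The second of those options would amount to only a couple of lines
near the top of the start action in init.d/mdadm-raid, along these lines (a
sketch only, not the actual Debian script):

# Make sure the filesystem holding the write-intent bitmap is mounted
# before any array is assembled
if ! grep -q ' /etc/mdadm/bitmap ' /proc/mounts; then
	mount -t ext2 /dev/hda4 /etc/mdadm/bitmap
fi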

>  If the array isn't super performance critical, I would use mdadm
> to delete the bitmap, then grow an internal bitmap with a nice high
> chunk size and just go from there.  It can't be worse than what you've
> got going on now.

	I really dislike that option.  Doing it manually every time I boot
would be a pain.  Writing a script to do it automatically is no more trouble
(or really much different) than writing a script to mount the partition
explicitly prior to running mdadm, but it avoids any issues of which I am
unaware (but can imagine) with, say, trying to grow a bitmap on an array
that is other than clean.  I'd rather have mdadm take care of such details.

	What do you (and the other members of the list) think?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-11  3:34 ` Bitmap did not survive reboot Leslie Rhorer
@ 2009-11-11  3:46   ` Leslie Rhorer
  2009-11-11  5:22     ` Majed B.
  2009-11-11 15:19   ` Gabor Gombas
  2009-11-11 20:32   ` Doug Ledford
  2 siblings, 1 reply; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-11  3:46 UTC (permalink / raw)
  To: linux-raid

> The mdadm-raid script starts up the
> RAID process, and the mdadm script runs the monitor.  It seems to me the
> only link which is really correct is the rcS.d/S25mdadm.  At the very
> least
> I would think both the S99mdadm_monitor links should point to init.d/mdadm
> (which, after all is the script which starts the monitor) and that
> rc2.d/S25mdadm-raid would point to init.d/mdadm, just as the
> rcS.d/S25mdadm-raid link does.

	'Sorry, I meant to say, "rc2.d/S25mdadm-raid would point to
init.d/mdadm-raid, just as the rcS.d/S25mdadm-raid link does."  The naming
convention here is definitely confusing.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11  3:46   ` Leslie Rhorer
@ 2009-11-11  5:22     ` Majed B.
  2009-11-11  8:13       ` Leslie Rhorer
  2009-11-11  9:16       ` Robin Hill
  0 siblings, 2 replies; 30+ messages in thread
From: Majed B. @ 2009-11-11  5:22 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: linux-raid

Hello Leslie,

If you have a temporary space for your data, I'd suggest you move it
out and go for an internal bitmap solution. It certainly beats the
patch work you're going to have to do on the startup scripts (and
every time you update mdadm, or the distro).

On Wed, Nov 11, 2009 at 6:46 AM, Leslie Rhorer <lrhorer@satx.rr.com> wrote:
>> The mdadm-raid script starts up the
>> RAID process, and the mdadm script runs the monitor.  It seems to me the
>> only link which is really correct is the rcS.d/S25mdadm.  At the very
>> least
>> I would think both the S99mdadm_monitor links should point to init.d/mdadm
>> (which, after all is the script which starts the monitor) and that
>> rc2.d/S25mdadm-raid would point to init.d/mdadm, just as the
>> rcS.d/S25mdadm-raid link does.
>
>        'Sorry, I meant to say, "rc2.d/S25mdadm-raid would point to
> init.d/mdadm-raid, just as the rcS.d/S25mdadm-raid link does."  The naming
> convention here is definitely confusing.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
       Majed B.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-11  5:22     ` Majed B.
@ 2009-11-11  8:13       ` Leslie Rhorer
  2009-11-11  8:19         ` Michael Evans
  2009-11-11  9:31         ` Majed B.
  2009-11-11  9:16       ` Robin Hill
  1 sibling, 2 replies; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-11  8:13 UTC (permalink / raw)
  To: linux-raid

> If you have a temporary space for your data, I'd suggest you move it
> out and go for an internal bitmap solution. It certainly beats the

	For 8 Terabytes of data?  No, I don't.  I'm also not really keen on
interrupting the system ( in whatever fashion ) for six to eight days while
I copy the data out and back or taking the primary copy offline while I
re-do the array just so I can implement an internal bitmap.  It's much
easier to handle the external situation one way or the other.

> patch work you're going to have to do on the startup scripts (and
> every time you update mdadm, or the distro).

	That's why I am leaning strongly toward the lower-numbered script,
which in fact I have already done.  Of course, it's also trivial to disable
it.  Updating mdadm or the distro won't affect the mount script.  At most I
would only have to rename the link, and then only if the mdadm startup link
gets re-numbered, which is unlikely.  Creating the following script and one
symlink to it are hardly "patchwork" in any significant sense:

#! /bin/sh
# Explicitly mount /dev/hda4 prior to running mdadm so the write-intent
# bitmap will be available to mdadm

echo Mounting RAID bitmap...
mount -t ext2 /dev/hda4 /etc/mdadm/bitmap
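
	For completeness, assuming the script above were saved as
/etc/init.d/mountbitmap (the name is arbitrary), installing it is just a
matter of marking it executable and linking it in ahead of S25mdadm-raid:

chmod +x /etc/init.d/mountbitmap
ln -s ../init.d/mountbitmap /etc/rcS.d/S24mountbitmap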




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11  8:13       ` Leslie Rhorer
@ 2009-11-11  8:19         ` Michael Evans
  2009-11-11  8:53           ` Leslie Rhorer
  2009-11-11  9:31         ` Majed B.
  1 sibling, 1 reply; 30+ messages in thread
From: Michael Evans @ 2009-11-11  8:19 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: linux-raid

That kind of exception is one of the main areas where I think major dists
fail: they make it so absolutely difficult to insert administratively
known requirements at 'odd' points in the boot order.  When I last used
Debian it was easy with that S## / K## linking system.  Arch is
another dist that has a way of doing that, except it's based in a core
config file.  I like Debian's method more because then you can use
shell scripts to easily slice/dice/add things at given points.  Arch
is more than sufficient for normal tasks though.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-11  8:19         ` Michael Evans
@ 2009-11-11  8:53           ` Leslie Rhorer
  2009-11-11  9:31             ` John Robinson
  0 siblings, 1 reply; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-11  8:53 UTC (permalink / raw)
  To: linux-raid

> That kind of exception is one of the main areas where I think major dists
> fail: they make it so absolutely difficult to insert administratively
> known requirements at 'odd' points in the boot order.  When I last used
> Debian it was easy with that S## / K## linking system.

	I agree.  It may not be the fastest or the most "sexy" method, but
it is solid, simple, and easy to manage, the oddness I discovered this
evening notwithstanding.  (The backup server doesn't have the issue, only
the main video server did.  I simply erased the "duplicate" links, so now
RAID is started at boot and the monitor is started on entry to runlevel 2 -
5.)

>  Arch is
> another dist that has a way of doing that, except it's based in a core
> config file.  I like Debian's method more because then you can use
> shell scripts to easily slice/dice/add things at given points.  Arch
> is more than sufficient for normal tasks though.

	The main reason I like Debian is it stays well away from the
bleeding edge.  Most distros have a stable and an unstable version, and some
have a testing version.  Debian has experimental, testing, unstable, and
stable, and its testing version is more like what most distros call their
stable version.  I do really like the approach Debian takes to its booting.
It's really easy to troubleshoot a booting issue.  Making changes to the
boot sequence often doesn't even require editing any files.  One can simply
rename a link to a higher or lower number to move it about in the boot
sequence, or rename it from Sxxyyy to Kxxyyy to disable it.
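
	For instance, a sketch using the links from the listing earlier in
the thread:

# Run the monitor earlier at runlevel 2 by renumbering the link
mv /etc/rc2.d/S99mdadm_monitor /etc/rc2.d/S90mdadm_monitor
# Or take it out of the sequence, per the S -> K convention described above
mv /etc/rc2.d/S99mdadm_monitor /etc/rc2.d/K99mdadm_monitor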

	It took me less than 3 minutes total to implement the explicit mount
routine on both servers, and unless someone can give me either a much better
solution or a solid reason I should not take this approach, I think I'm
going to stay with it until such time as I either re-format the root
partition or else re-format the array on either respective system.  I don't
expect the former on either system any time soon.  I expect to do the latter
on one of the arrays some time in the next three or four months, and the
other within a year or so, at which point I may choose the internal bitmap.
I don't know, though.  I think I rather prefer the external bitmap.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11  5:22     ` Majed B.
  2009-11-11  8:13       ` Leslie Rhorer
@ 2009-11-11  9:16       ` Robin Hill
  2009-11-11 15:01         ` Leslie Rhorer
  1 sibling, 1 reply; 30+ messages in thread
From: Robin Hill @ 2009-11-11  9:16 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 775 bytes --]

On Wed Nov 11, 2009 at 08:22:26AM +0300, Majed B. wrote:

> Hello Leslie,
> 
> If you have a temporary space for your data, I'd suggest you move it
> out and go for an internal bitmap solution. It certainly beats the
> patch work you're going to have to do on the startup scripts (and
> every time you update mdadm, or the distro).
> 
There should be no need to move the data off - you can add an internal
bitmap using the --grow option.  An internal bitmap does have more of an
overhead than an external one though.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11  8:53           ` Leslie Rhorer
@ 2009-11-11  9:31             ` John Robinson
  2009-11-11 14:52               ` Leslie Rhorer
  0 siblings, 1 reply; 30+ messages in thread
From: John Robinson @ 2009-11-11  9:31 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: linux-raid

On 11/11/2009 08:53, Leslie Rhorer wrote:
[...]
> 	It took me less than 3 minutes total to implement the explicit mount
> routine on both servers, and unless someone can give me either a much better
> solution or a solid reason I should not take this approach

The only problem I can see is that your bitmap is not on a RAID device 
so if the disc goes you lose the bitmap. I guess you're accepting the 
risk of downtime anyway because your filesystem root is on a non-RAID 
device, but while you can (as I think you've said before) replace the 
boot disc quickly, you're exposing yourself to a long, slow resync which 
will increase your downtime by perhaps days...

Cheers,

John.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11  8:13       ` Leslie Rhorer
  2009-11-11  8:19         ` Michael Evans
@ 2009-11-11  9:31         ` Majed B.
  2009-11-11 14:54           ` Leslie Rhorer
  1 sibling, 1 reply; 30+ messages in thread
From: Majed B. @ 2009-11-11  9:31 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: linux-raid

> mount -t ext2 /dev/hda4 /etc/mdadm/bitmap

I would suggest you mount the partition using its UUID, just in case one
day the disk decides to change its name, like what happened to me a
while back.
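
A minimal sketch of what that looks like (the UUID below is just a
placeholder; blkid prints the real one):

# Find the filesystem UUID of the bitmap partition
blkid /dev/hda4
# Then mount by UUID rather than by device name
mount UUID=1b2c3d4e-0000-4000-8000-0123456789ab /etc/mdadm/bitmap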

On Wed, Nov 11, 2009 at 11:13 AM, Leslie Rhorer <lrhorer@satx.rr.com> wrote:
>> If you have a temporary space for your data, I'd suggest you move it
>> out and go for an internal bitmap solution. It certainly beats the
>
>        For 8 Terabytes of data?  No, I don't.  I'm also not really keen on
> interrupting the system ( in whatever fashion ) for six to eight days while
> I copy the data out and back or taking the primary copy offline while I
> re-do the array just so I can implement an internal bitmap.  It's much
> easier to handle the external situation one way or the other.
>
>> patch work you're going to have to do on the startup scripts (and
>> every time you update mdadm, or the distro).
>
>        That's why I am leaning strongly toward the lower-numbered script,
> which in fact I have already done.  Of course, it's also trivial to disable
> it.  Updating mdadm or the distro won't affect the mount script.  At most I
> would only have to rename the link, and then only if the mdadm startup link
> gets re-numbered, which is unlikely.  Creating the following script and one
> symlink to it are hardly "patchwork" in any significant sense:
>
> #! /bin/sh
> # Explicitly mount /dev/hda4 prior to running mdadm so the write-intent
> # bitmap will be available to mdadm
>
> echo Mounting RAID bitmap...
> mount -t ext2 /dev/hda4 /etc/mdadm/bitmap
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
       Majed B.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-11  9:31             ` John Robinson
@ 2009-11-11 14:52               ` Leslie Rhorer
  2009-11-11 16:02                 ` John Robinson
  0 siblings, 1 reply; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-11 14:52 UTC (permalink / raw)
  To: 'John Robinson'; +Cc: linux-raid

> On 11/11/2009 08:53, Leslie Rhorer wrote:
> [...]
> > 	It took me less than 3 minutes total to implement the explicit mount
> > routine on both servers, and unless someone can give me either a much
> better
> > solution or a solid reason I should not take this approach
> 
> The only problem I can see is that your bitmap is not on a RAID device
> so if the disc goes you lose the bitmap. I guess you're accepting the

	True, but since the OS is also on the drive, if it goes I will need
to shut down the system anyway.

> risk of downtime anyway because your filesystem root is on a non-RAID
> device, but while you can (as I think you've said before) replace the
> boot disc quickly, you're exposing yourself to a long, slow resync which
> will increase your downtime by perhaps days...

	That presumes a loss of the boot drive *AND* one or two RAID drives.
That's pretty unlikely, unless something really nasty happens.  The boot
drives are not on the same controller or in the same enclosure.  They aren't
even the same type of drive.  The boot drives are PATA drives inside the
same enclosure as the respective motherboards.  The RAID drives are SATA
drives in external RAID enclosures.  Right now, a non-bitmap resync takes
about a day and a half, if I limit array access.

	


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-11  9:31         ` Majed B.
@ 2009-11-11 14:54           ` Leslie Rhorer
  0 siblings, 0 replies; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-11 14:54 UTC (permalink / raw)
  To: 'Majed B.'; +Cc: linux-raid

> > mount -t ext2 /dev/hda4 /etc/mdadm/bitmap
> 
> I would suggest you mount the partition using UUID, just in case one
> day the disk decided to change its name, like what happened to me a
> while back.

	Yeah, I've had that happen with SATA drives quite a bit.  That's not
a bad idea.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-11  9:16       ` Robin Hill
@ 2009-11-11 15:01         ` Leslie Rhorer
  2009-11-11 15:53           ` Robin Hill
  2009-11-11 20:35           ` Doug Ledford
  0 siblings, 2 replies; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-11 15:01 UTC (permalink / raw)
  To: linux-raid


> > If you have a temporary space for your data, I'd suggest you move it
> > out and go for an internal bitmap solution. It certainly beats the
> > patch work you're going to have to do on the startup scripts (and
> > every time you update mdadm, or the distro).
> >
> There should be no need to move the data off - you can add an internal
> bitmap using the --grow option.  An internal bitmap does have more of an
> overhead than an external one though.

	I thought I remembered reading in the man page that an internal
bitmap could only be added when the array was created?  Is that incorrect?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11  3:34 ` Bitmap did not survive reboot Leslie Rhorer
  2009-11-11  3:46   ` Leslie Rhorer
@ 2009-11-11 15:19   ` Gabor Gombas
  2009-11-11 16:48     ` Leslie Rhorer
  2009-11-11 20:32   ` Doug Ledford
  2 siblings, 1 reply; 30+ messages in thread
From: Gabor Gombas @ 2009-11-11 15:19 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: linux-raid

On Tue, Nov 10, 2009 at 09:34:09PM -0600, Leslie Rhorer wrote:

> RAID-Server:/etc# ll rcS.d/*md*
> lrwxrwxrwx 1 root root 20 2008-11-21 22:35 rcS.d/S25mdadm-raid ->
> ../init.d/mdadm-raid
> lrwxrwxrwx 1 root root 20 2008-12-27 18:35 rcS.d/S99mdadm_monitor ->
> ../init.d/mdadm-raid
> RAID-Server:/etc# ll rc2.d/*md*
> lrwxrwxrwx 1 root root 15 2008-11-21 22:35 rc2.d/S25mdadm -> ../init.d/mdadm
> lrwxrwxrwx 1 root root 20 2008-12-27 18:36 rc2.d/S99mdadm_monitor ->
> ../init.d/mdadm-raid

What Debian version do you have? There are no mdadm_monitor links in
lenny, and I do not have etch systems anymore to check.

> 	Getting back to my dilemma, however, I suppose I could simply create
> an /etc/rcS.d/S24mounthda4 script that explicitly mounts /dev/hda4 to
> /etc/mdadm/bitmap, or I could modify the init.d/mdadm-raid script to
> explicitly mount the /dev/hda4 partition if it is not already mounted.
> Editing the init.d/mdadm-raid script is a bit cleaner and perhaps clearer,
> but any update to mdadm is liable to wipe out the modifications to the
> startup script.

In Debian, modifications to init scripts are preserved during upgrade
unless you explicitly request them to be overwritten.

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11 15:01         ` Leslie Rhorer
@ 2009-11-11 15:53           ` Robin Hill
  2009-11-11 20:35           ` Doug Ledford
  1 sibling, 0 replies; 30+ messages in thread
From: Robin Hill @ 2009-11-11 15:53 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1230 bytes --]

On Wed Nov 11, 2009 at 09:01:46AM -0600, Leslie Rhorer wrote:

> 
> > > If you have a temporary space for your data, I'd suggest you move it
> > > out and go for an internal bitmap solution. It certainly beats the
> > > patch work you're going to have to do on the startup scripts (and
> > > every time you update mdadm, or the distro).
> > >
> > There should be no need to move the data off - you can add an internal
> > bitmap using the --grow option.  An internal bitmap does have more of an
> > overhead than an external one though.
> 
> 	I thought I remembered reading in the man page that an internal
> bitmap could only be added when the array was created?  Is that incorrect?
> 
I've certainly done this with mdadm 2.6.8 - I guess older versions may
not be able to, though.  It's only able to use a limited amount of space
(whatever's left between the metadata and the data/end of disk),
so you don't get as much (if any) control of the chunk size.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11 14:52               ` Leslie Rhorer
@ 2009-11-11 16:02                 ` John Robinson
  0 siblings, 0 replies; 30+ messages in thread
From: John Robinson @ 2009-11-11 16:02 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: linux-raid

On 11/11/2009 14:52, Leslie Rhorer wrote:
>> On 11/11/2009 08:53, Leslie Rhorer wrote:
[...]
>> risk of downtime anyway because your filesystem root is on a non-RAID
>> device, but while you can (as I think you've said before) replace the
>> boot disc quickly, you're exposing yourself to a long, slow resync which
>> will increase your downtime by perhaps days...
> 
> 	That presumes a loss of the boot drive *AND* one or two RAID drives.
> That's pretty unlikely, unless something really nasty happens.

I don't think it does - even if none of the RAID drives goes, the system 
crashing because of the root drive going AWOL might well leave the 
system with the RAID array marked "dirty" because it wasn't shut down 
cleanly, which without the bitmap would mean you'd get a full resync 
when you rebooted.

Cheers,

John.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-11 15:19   ` Gabor Gombas
@ 2009-11-11 16:48     ` Leslie Rhorer
  0 siblings, 0 replies; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-11 16:48 UTC (permalink / raw)
  To: 'Gabor Gombas'; +Cc: linux-raid

> > RAID-Server:/etc# ll rcS.d/*md*
> > lrwxrwxrwx 1 root root 20 2008-11-21 22:35 rcS.d/S25mdadm-raid ->
> > ../init.d/mdadm-raid
> > lrwxrwxrwx 1 root root 20 2008-12-27 18:35 rcS.d/S99mdadm_monitor ->
> > ../init.d/mdadm-raid
> > RAID-Server:/etc# ll rc2.d/*md*
> > lrwxrwxrwx 1 root root 15 2008-11-21 22:35 rc2.d/S25mdadm ->
> ../init.d/mdadm
> > lrwxrwxrwx 1 root root 20 2008-12-27 18:36 rc2.d/S99mdadm_monitor ->
> > ../init.d/mdadm-raid
> 
> What Debian version do you have? There are no mdadm_monitor links in
> lenny, and I do not have etch systems anymore to check.

	Lenny.  I'm not sure where they came from.  It might even have been
me playing around at some point, and I just forgot to delete them.  I did a
lot of fiddling with mdadm about a year ago.

> > 	Getting back to my dilemma, however, I suppose I could simply create
> > an /etc/rcS.d/S24mounthda4 script that explicitly mounts /dev/hda4 to
> > /etc/mdadm/bitmap, or I could modify the init.d/mdadm-raid script to
> > explicitly mount the /dev/hda4 partition if it is not already mounted.
> > Editing the init.d/mdadm-raid script is a bit cleaner and perhaps clearer,
> > but any update to mdadm is liable to wipe out the modifications to the
> > startup script.
> 
> In Debian, modifications to init scripts are preserved during upgrade
> unless you explicitly request them to be overwritten.

	I didn't know that, but even so, it's probably at least somewhat
better not to modify a script unless it's necessary, because the new package
might have some important differences in the script.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11  3:34 ` Bitmap did not survive reboot Leslie Rhorer
  2009-11-11  3:46   ` Leslie Rhorer
  2009-11-11 15:19   ` Gabor Gombas
@ 2009-11-11 20:32   ` Doug Ledford
  2 siblings, 0 replies; 30+ messages in thread
From: Doug Ledford @ 2009-11-11 20:32 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2544 bytes --]

On 11/10/2009 10:34 PM, Leslie Rhorer wrote:
>>  If the array isn't super performance critical, I would use mdadm
>> to delete the bitmap, then grow an internal bitmap with a nice high
>> chunk size and just go from there.  It can't be worse than what you've
>> got going on now.
> 
> 	I really dislike that option.  Doing it manually every time I boot
> would be a pain.  Writing a script to do it automatically is no more trouble
> (or really much different) than writing a script to mount the partition
> explicitly prior to running mdadm, but it avoids any issues of which I am
> unaware (but can imagine) with, say, trying to grow a bitmap on an array
> that is other than clean.  I'd rather have mdadm take care of such details.

I think you are overestimating the difficulty of this solution.  It's as
simple as:

mdadm -G /dev/md0 --bitmap=none
mdadm -G /dev/md0 --bitmap=internal --bitmap-chunk=32768 (or even higher)

and you are done.  It won't need to resync the entire device as the
device is already clean and it won't create a bitmap that's too large
for the free space that currently exists between the superblock and the
start of your data.  You can see the free space available for a bitmap
by running mdadm -E on one of the block devices and interpreting the
data start/data offset/superblock offset fields (sorry there isn't a
simply field to look at, but the math changes depending on what
superblock version you use and I can't remember if I've ever known what
superblock you happen to have).  No need to copy stuff around, no need
to take things down, all done in place, and the issue is solved
permanently with no need to muck around in your system init scripts as
from now on when you boot up the bitmap is internal to the array and
will be used from the second the array is assembled.  The only reason I
mentioned anything about performance is because an internal bitmap does
slightly slow down random access to an array (although not so much
streaming access), but that slowdown is mitigated by using a nice high
bitmap chunk size (and for most people a big bitmap chunk is preferable
anyway).  As I recall, you are serving video files, so your access
pattern is large streaming I/O, and that means the bitmap really
shouldn't be noticeable in your performance.
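
For what it's worth, the change is easy to sanity-check afterwards.  A
sketch, assuming /dev/sda is one of the array's member devices, as in the
--examine output elsewhere in this thread:

# The bitmap line in /proc/mdstat shows the pages in use and the chunk size
grep bitmap /proc/mdstat
# Examine the on-disk bitmap superblock on one member device
mdadm --examine-bitmap /dev/sda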

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11 15:01         ` Leslie Rhorer
  2009-11-11 15:53           ` Robin Hill
@ 2009-11-11 20:35           ` Doug Ledford
  2009-11-11 21:46             ` Ben DJ
  2009-11-12  0:23             ` Leslie Rhorer
  1 sibling, 2 replies; 30+ messages in thread
From: Doug Ledford @ 2009-11-11 20:35 UTC (permalink / raw)
  To: Leslie Rhorer, Linux RAID Mailing List

[-- Attachment #1: Type: text/plain, Size: 1294 bytes --]

On 11/11/2009 10:01 AM, Leslie Rhorer wrote:
> 
>>> If you have a temporary space for your data, I'd suggest you move it
>>> out and go for an internal bitmap solution. It certainly beats the
>>> patch work you're going to have to do on the startup scripts (and
>>> every time you update mdadm, or the distro).
>>>
>> There should be no need to move the data off - you can add an internal
>> bitmap using the --grow option.  An internal bitmap does have more of an
>> overhead than an external one though.
> 
> 	I thought I remembered reading in the man page that an internal
> bitmap could only be added when the array was created?  Is that incorrect?

Yes, very incorrect.  You can use grow to add an internal bitmap later,
the only limitation is that the bitmap must be small enough to fit in
the reserved space around the superblock.  It's only in the case where you
want to create some super huge, absolutely insanely fine-grained bitmap
that it must be done at raid device creation time, and that's only so it
can reserve sufficient space for the bitmap.


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11 20:35           ` Doug Ledford
@ 2009-11-11 21:46             ` Ben DJ
  2009-11-11 22:10               ` Robin Hill
  2009-11-12  1:35               ` Doug Ledford
  2009-11-12  0:23             ` Leslie Rhorer
  1 sibling, 2 replies; 30+ messages in thread
From: Ben DJ @ 2009-11-11 21:46 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Leslie Rhorer, Linux RAID Mailing List

Hi,

On Wed, Nov 11, 2009 at 12:35 PM, Doug Ledford <dledford@redhat.com> wrote:
> Yes, very incorrect.  You can use grow to add an internal bitmap later,

Is that true for RAID-10, as well?  I understood "--grow" with RAID-10
wasn't fully capable -- yet.

BenDJ
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11 21:46             ` Ben DJ
@ 2009-11-11 22:10               ` Robin Hill
  2009-11-12  1:35               ` Doug Ledford
  1 sibling, 0 replies; 30+ messages in thread
From: Robin Hill @ 2009-11-11 22:10 UTC (permalink / raw)
  To: Linux RAID Mailing List

[-- Attachment #1: Type: text/plain, Size: 699 bytes --]

On Wed Nov 11, 2009 at 01:46:50PM -0800, Ben DJ wrote:

> Hi,
> 
> On Wed, Nov 11, 2009 at 12:35 PM, Doug Ledford <dledford@redhat.com> wrote:
> > Yes, very incorrect.  You can use grow to add an internal bitmap later,
> 
> Is that true for RAID-10, as well?  I understood "--grow" with RAID-10
> wasn't fully capable -- yet.
> 
It's true for RAID-10, yes.  You can't physically grow the array, but
you can definitely add/remove the bitmap.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-11 20:35           ` Doug Ledford
  2009-11-11 21:46             ` Ben DJ
@ 2009-11-12  0:23             ` Leslie Rhorer
  2009-11-12  1:34               ` Doug Ledford
  1 sibling, 1 reply; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-12  0:23 UTC (permalink / raw)
  To: 'Doug Ledford', 'Linux RAID Mailing List'

> > would be a pain.  Writing a script to do it automatically is no more
> trouble
> > (or really much different) than writing a script to mount the partition
> > explicitly prior to running mdadm, but it avoids any issues of which I
> am
> > unaware (but can imagine) with, say, trying to grow a bitmap on an array
> > that is other than clean.  I'd rather have mdadm take care of such
> details.
> I think you are overestimating the difficulty of this solution.  It's as
> simple as:
> 
> mdadm -G /dev/md0 --bitmap=none
> mdadm -G /dev/md0 --bitmap=internal --bitmap-chunk=32768 (or even higher)

	No, I was referring to a script which grew an external bitmap on a
mounted file system after mdadm had already done its magic.  What I was
mis-remembering was:

> >>> If you have a temporary space for your data, I'd suggest you move it
> >>> out and go for an internal bitmap solution. It certainly beats the
> >>> patch work you're going to have to do on the startup scripts (and
> >>> every time you update mdadm, or the distro).
> >>>
> >> There should be no need to move the data off - you can add an internal
> >> bitmap using the --grow option.  An internal bitmap does have more of
> an
> >> overhead than an external one though.
> >
> > 	I thought I remembered reading in the man page that an internal
> > bitmap could only be added when the array was created?  Is that
> > incorrect?
> 
> Yes, very incorrect.  You can use grow to add an internal bitmap later,

	I guess I skimmed over the manual rather quickly back then, and I
was dealing with serious RAID issues at the time, so I must have improperly
read the section of the man page which says, "Note that if you add a bitmap
stored in a file which is in a filesystem that is on the raid array being
affected, the system will deadlock.  The bitmap must be on a separate
filesystem" as saying something more like, "Note that if you add a
bitmap ...  the bitmap must be on a separate filesystem."

> the only limitation is that the bitmap must be small enough to fit in
> the reserved space around the superblock.  It's in the case that you
> want to create some super huge, absolutely insanely fine grained bitmap
> that it must be done at raid device creation time and that's only so it
> can reserve sufficient space for the bitmap.

	How can I know how much space is available?  I tried adding the
internal bitmap without specifying anything, and it seems to have worked
fine.  When I created the bitmap in an external file (without specifying the
size), it was around 100K, which seems rather small.  Both of these systems
use un-partitioned disks with XFS mounted directly on the RAID array.  One
is a 7 drive RAID5 array on 1.5 TB disks and the other is a 10 drive RAID6
array on 1.0TB disks.  Both are using a version 1.2 superblock.  The only
thing which jumps out at me is --examine, but it doesn't seem to tell me
much:

RAID-Server:/usr/share/pyTivo# mdadm --examine /dev/sda
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 5ff10d73:a096195f:7a646bba:a68986ca
           Name : RAID-Server:0  (local to host RAID-Server)
  Creation Time : Sat Apr 25 01:17:12 2009
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 1953524896 (931.51 GiB 1000.20 GB)
     Array Size : 15628197888 (7452.11 GiB 8001.64 GB)
  Used Dev Size : 1953524736 (931.51 GiB 1000.20 GB)
    Data Offset : 272 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : d40c9255:cef0739f:966d448d:e549ada8

Internal Bitmap : 2 sectors from superblock
    Update Time : Wed Nov 11 18:17:26 2009
       Checksum : 9a4cc480 - correct
         Events : 488380

     Chunk Size : 256K

    Array Slot : 0 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
   Array State : Uuuuuuuuuu


Backup:/etc/gadmin-rsync# mdadm --examine /dev/sda
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 940ae4e4:04057ffc:5e92d2fb:63e3efb7
           Name : 'Backup':0
  Creation Time : Sun Jul 12 20:44:02 2009
     Raid Level : raid5
   Raid Devices : 7

 Avail Dev Size : 2930276896 (1397.26 GiB 1500.30 GB)
     Array Size : 17581661184 (8383.59 GiB 9001.81 GB)
  Used Dev Size : 2930276864 (1397.26 GiB 1500.30 GB)
    Data Offset : 272 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 6156794f:00807e1b:306ed20d:b81914de

Internal Bitmap : 2 sectors from superblock
    Update Time : Wed Nov 11 11:52:43 2009
       Checksum : 12afc60a - correct
         Events : 10100

         Layout : left-symmetric
     Chunk Size : 256K

    Array Slot : 0 (0, 1, 2, 3, 4, 5, 6)
   Array State : Uuuuuuu


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-12  0:23             ` Leslie Rhorer
@ 2009-11-12  1:34               ` Doug Ledford
  2009-11-12  4:55                 ` Leslie Rhorer
  0 siblings, 1 reply; 30+ messages in thread
From: Doug Ledford @ 2009-11-12  1:34 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: 'Linux RAID Mailing List'

[-- Attachment #1: Type: text/plain, Size: 4506 bytes --]

On 11/11/2009 07:23 PM, Leslie Rhorer wrote:

> 	I guess I skimmed over the manual rather quickly back then, and I
> was dealing with serious RAID issues at the time, so I must have improperly
> read the section of the man page which says, "Note that if you add a bitmap
> stored in a file which is in a filesystem that is on the raid array being
> affected, the system will deadlock.  The bitmap must be on a separate
> filesystem" as saying something more like, "Note that if you add a
> bitmap ...  the bitmap must be on a separate filesystem."

Understandable, and now corrected, so no biggie ;-)

>> the only limitation is that the bitmap must be small enough to fit in
>> the reserved space around the superblock.  It's in the case that you
>> want to create some super huge, absolutely insanely fine grained bitmap
>> that it must be done at raid device creation time and that's only so it
>> can reserve sufficient space for the bitmap.
> 
> 	How can I know how much space is available?  I tried adding the
> internal bitmap without specifying anything, and it seems to have worked
> fine.  When I created the bitmap in an external file (without specifying the
> size), it was around 100K, which seems rather small.

100k is a huge bitmap.  For my 2.5TB array, and a bitmap chunk size of
32768KB, I get the entire in-memory bitmap in 24k (as I recall, the
in-memory bitmap is larger than the on-disk bitmap as the on-disk bitmap
only stores a dirty/clean bit per chunk, whereas the in-memory bitmap
also includes a counter per chunk so it knows when all outstanding
writes complete and it needs to transition to clean, but I could be
mis-remembering that).

>  Both of these systems
> use un-partitioned disks with XFS mounted directly on the RAID array.  One
> is a 7 drive RAID5 array on 1.5 TB disks and the other is a 10 drive RAID6
> array on 1.0TB disks.  Both are using a version 1.2 superblock.  The only
> thing which jumps out at me is --examine, but it doesn't seem to tell me
> much:
> 
> RAID-Server:/usr/share/pyTivo# mdadm --examine /dev/sda
> /dev/sda:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x1
>      Array UUID : 5ff10d73:a096195f:7a646bba:a68986ca
>            Name : RAID-Server:0  (local to host RAID-Server)
>   Creation Time : Sat Apr 25 01:17:12 2009
>      Raid Level : raid6
>    Raid Devices : 10
> 
>  Avail Dev Size : 1953524896 (931.51 GiB 1000.20 GB)
>      Array Size : 15628197888 (7452.11 GiB 8001.64 GB)
>   Used Dev Size : 1953524736 (931.51 GiB 1000.20 GB)
>     Data Offset : 272 sectors
>    Super Offset : 8 sectors

The above two items are what you need for both version 1.1 and 1.2
superblocks in order to figure things out.  The data, aka the filesystem
itself, starts at the Data Offset which is 272 sectors.  The superblock
itself is 8 sectors in from the front of the disk because you have
version 1.2 superblocks.  So, 272 - 8 - size of the superblock, which is
only a sector or two, is how much internal space you have.  So, in your
case, you have about 132k of space for the bitmap.  Version 1.0
superblocks are a little different in that you need to know the actual
size of the device and you need the super offset and possibly the used
dev size.  There will be free space between the end of the data and the
superblock (super offset - used dev size) and free space after the
superblock (actual dev size as given by fdisk (either the size of the
device itself on whole disk devices or the size of the partition you are
using) - super offset - size of superblock).  I don't know which is used
by the bitmap, but I seem to recall the bitmap wants to be between the
superblock and the end of the data, so I think the used dev size and
super offset are the important numbers there.
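
Spelled out as arithmetic for the v1.2 case above (a sketch; the superblock
itself is taken to be 2 sectors):

# (Data Offset - Super Offset - superblock) sectors, times 512 bytes, in KB
echo $(( (272 - 8 - 2) * 512 / 1024 ))   # prints 131, i.e. roughly the 132k above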

You mentioned that you used the defaults when creating the bitmap.
That's likely to hurt your performance.  The default bitmap chunk is too
small.  I would redo it with a larger bitmap chunk.  If you look in
/proc/mdstat, it should tell you the current bitmap chunk.  Given that
you stream large sequential files, you could go with an insanely large
bitmap chunk and be fine.  Something like 65536 or 131072 should be good.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-11 21:46             ` Ben DJ
  2009-11-11 22:10               ` Robin Hill
@ 2009-11-12  1:35               ` Doug Ledford
  1 sibling, 0 replies; 30+ messages in thread
From: Doug Ledford @ 2009-11-12  1:35 UTC (permalink / raw)
  To: Ben DJ; +Cc: Leslie Rhorer, Linux RAID Mailing List

[-- Attachment #1: Type: text/plain, Size: 715 bytes --]

On 11/11/2009 04:46 PM, Ben DJ wrote:
> Hi,
> 
> On Wed, Nov 11, 2009 at 12:35 PM, Doug Ledford <dledford@redhat.com> wrote:
>> Yes, very incorrect.  You can use grow to add an internal bitmap later,
> 
> Is that true for RAID-10, as well?  I understood "--grow" with RAID-10
> wasn't fully capable -- yet.

I don't know.  I never heard anything about raid-10 and grow not being
compatible.  I'd just set up a couple fakes devices using loopback,
create a raid-10, and then try it ;-)


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-12  1:34               ` Doug Ledford
@ 2009-11-12  4:55                 ` Leslie Rhorer
  2009-11-12  5:22                   ` Doug Ledford
  0 siblings, 1 reply; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-12  4:55 UTC (permalink / raw)
  To: 'Doug Ledford'; +Cc: 'Linux RAID Mailing List'

> >     Data Offset : 272 sectors
> >    Super Offset : 8 sectors
> 
> The above two items are what you need for both version 1.1 and 1.2
> superblocks in order to figure things out.  The data, aka the filesystem
> itself, starts at the Data Offset which is 272 sectors.  The superblock
> itself is 8 sectors in from the front of the disk because you have
> version 1.2 superblocks.  So, 272 - 8 - size of the superblock, which is
> only a sector or two, is how much internal space you have.  So, in your
> case, you have about 132k of space for the bitmap.

	OK.  The 10 drive system shows:

bitmap: 0/466 pages [0KB], 1024KB chunk

	The 7 drive system shows:

bitmap: 0/350 pages [0KB], 2048KB chunk

So you think I should remove both and replace them with 

mdadm -G /dev/md0 --bitmap=internal --bitmap-chunk=65536

?

	While most of the files are large video files, there are a fair
number which are smaller data files such as those of the IMAP server and
Quicken.  I don't want performance to be too terrible for them, either.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-12  4:55                 ` Leslie Rhorer
@ 2009-11-12  5:22                   ` Doug Ledford
  2009-11-14 21:48                     ` Leslie Rhorer
  0 siblings, 1 reply; 30+ messages in thread
From: Doug Ledford @ 2009-11-12  5:22 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: 'Linux RAID Mailing List'

[-- Attachment #1: Type: text/plain, Size: 2554 bytes --]

On 11/11/2009 11:55 PM, Leslie Rhorer wrote:
>>>     Data Offset : 272 sectors
>>>    Super Offset : 8 sectors
>>
>> The above two items are what you need for both version 1.1 and 1.2
>> superblocks in order to figure things out.  The data, aka the filesystem
>> itself, starts at the Data Offset which is 272 sectors.  The superblock
>> itself is 8 sectors in from the front of the disk because you have
>> version 1.2 superblocks.  So, 272 - 8 - size of the superblock, which is
>> only a sector or two, is how much internal space you have.  So, in your
>> case, you have about 132k of space for the bitmap.
> 
> 	OK.  The 10 drive system shows:
> 
> bitmap: 0/466 pages [0KB], 1024KB chunk
> 
> 	The 7 drive system shows:
> 
> bitmap: 0/350 pages [0KB], 2048KB chunk
> 
> So you think I should remove both and replace them with 
> 
> mdadm -G /dev/md0 --bitmap=internal --bitmap-chunk=65536
> 
> ?
> 
> 	While most of the files are large video files, there are a fair
> number which are smaller data files such as those of the IMAP server and
> Quicken.  I don't want performance to be too terrible for them, either.

Oh yeah, those chunk sizes are waaayyyy too small.  Definitely replace
them.  If it will make you feel better, you can do some performance
testing before and after to see why I say so ;-)  I would recommend
running these tests to check the performance change for yourself:

dbench -t 300 -D $mpoint --clients-per-process=4 16 | tail -19 >> $log_file
mkdir $mpoint/bonnie
chown nobody.nobody $mpoint/bonnie
bonnie++ -u nobody:nobody -d $mpoint/bonnie -f -m \
    RAID${lvl}-${num}Disk-${chunk}k -n 64:65536:1024:16 >>$log_file 2>/dev/null
tiotest -f 1024 -t 6 -r 1000 -d $mpoint -b 4096 >> $log_file
tiotest -f 1024 -t 6 -r 1000 -d $mpoint -b 16384 >> $log_file

Obviously, I pulled these tests out of a script I use where all these
various variables are defined.  Just replace the variables with
something sensible for accessing your array, run them, save off the
results, run again with a different chunk size, then please post the
results back here as I imagine they would be very informative.
Especially the dbench results as I think they are likely to benefit the
most from the change.  Note: dbench, bonnie++, and tiotest should all be
available in the debian repos.
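
Purely as an illustration of those variables, hypothetical values
(substitute your own mount point and array details):

mpoint=/Backup                   # where the filesystem on the array is mounted
log_file=/root/bitmap-bench.log  # where the results get collected
lvl=5; num=7; chunk=256          # only used to label the bonnie++ run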

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-12  5:22                   ` Doug Ledford
@ 2009-11-14 21:48                     ` Leslie Rhorer
  2009-11-15 11:01                       ` Doug Ledford
  0 siblings, 1 reply; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-14 21:48 UTC (permalink / raw)
  To: 'Doug Ledford'; +Cc: 'Linux RAID Mailing List'

> dbench -t 300 -D $mpoint --clients-per-process=4 16 | tail -19 >>
> $log_file
> mkdir $mpoint/bonnie
> chown nobody.nobody $mpoint/bonnie
> bonnie++ -u nobody:nobody -d $mpoint/bonnie -f -m
> RAID${lvl}-${num}Disk-${chunk}k -n 64:65536:1024:16 >>$log_file
> 2>/dev/null
> tiotest -f 1024 -t 6 -r 1000 -d $mpoint -b 4096 >> $log_file
> tiotest -f 1024 -t 6 -r 1000 -d $mpoint -b 16384 >> $log_file
> 
> Obviously, I pulled these tests out of a script I use where all these
> various variables are defined.  Just replace the variables with
> something sensible for accessing your array, run them, save off the
> results, run again with a different chunk size, then please post the
> results back here as I imagine they would be very informative.
> Especially the dbench results as I think they are likely to benefit the
> most from the change.  Note: dbench, bonnie++, and tiotest should all be
> available in the debian repos.

	I could not find tiotest.  Also, the version of dbench in the distro
does not support the --clients-per-process switch.  I'll post the results
from the backup system here, and from the primary system in the next post.

Backup with bitmap-chunk 65M:

  16    363285   109.22 MB/sec  execute 286 sec
  16    364067   109.05 MB/sec  execute 287 sec
  16    365960   109.26 MB/sec  execute 288 sec
  16    366880   109.13 MB/sec  execute 289 sec
  16    368850   109.35 MB/sec  execute 290 sec
  16    370444   109.45 MB/sec  execute 291 sec
  16    372360   109.64 MB/sec  execute 292 sec
  16    373973   109.74 MB/sec  execute 293 sec
  16    374821   109.61 MB/sec  execute 294 sec
  16    376967   109.88 MB/sec  execute 295 sec
  16    377813   109.77 MB/sec  execute 296 sec
  16    379422   109.87 MB/sec  execute 297 sec
  16    381197   110.05 MB/sec  execute 298 sec
  16    382029   109.92 MB/sec  execute 299 sec
  16    383868   110.10 MB/sec  cleanup 300 sec
  16    383868   109.74 MB/sec  cleanup 301 sec
  16    383868   109.37 MB/sec  cleanup 302 sec

Throughput 110.101 MB/sec 16 procs
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RAID5-7Disk-6 3520M           66779  23 46588  22           127821  29 334.7   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
   64:65536:1024/16    35   1   113   2   404   9    36   1    54   1   193   5


Backup with default (2048K) bitmap-chunk


  16    306080    91.68 MB/sec  execute 286 sec
  16    307214    91.69 MB/sec  execute 287 sec
  16    307922    91.61 MB/sec  execute 288 sec
  16    308828    91.53 MB/sec  execute 289 sec
  16    310653    91.78 MB/sec  execute 290 sec
  16    311926    91.82 MB/sec  execute 291 sec
  16    313569    92.01 MB/sec  execute 292 sec
  16    314478    91.96 MB/sec  execute 293 sec
  16    315578    91.99 MB/sec  execute 294 sec
  16    317416    92.18 MB/sec  execute 295 sec
  16    318576    92.25 MB/sec  execute 296 sec
  16    320391    92.39 MB/sec  execute 297 sec
  16    321309    92.40 MB/sec  execute 298 sec
  16    322461    92.42 MB/sec  execute 299 sec
  16    324486    92.70 MB/sec  cleanup 300 sec
  16    324486    92.39 MB/sec  cleanup 301 sec
  16    324486    92.17 MB/sec  cleanup 302 sec

Throughput 92.6969 MB/sec 16 procs
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RAID5-7Disk-2 3520M           38751  14 31738  15           114481  28 279.0   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
   64:65536:1024/16    30   1   104   2   340   8    30   1    64   1   160   4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bitmap did not survive reboot
  2009-11-14 21:48                     ` Leslie Rhorer
@ 2009-11-15 11:01                       ` Doug Ledford
  2009-11-15 19:27                         ` Leslie Rhorer
  0 siblings, 1 reply; 30+ messages in thread
From: Doug Ledford @ 2009-11-15 11:01 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: 'Linux RAID Mailing List'

[-- Attachment #1: Type: text/plain, Size: 6277 bytes --]

On 11/14/2009 04:48 PM, Leslie Rhorer wrote:
>> dbench -t 300 -D $mpoint --clients-per-process=4 16 | tail -19 >>
>> $log_file
>> mkdir $mpoint/bonnie
>> chown nobody.nobody $mpoint/bonnie
>> bonnie++ -u nobody:nobody -d $mpoint/bonnie -f -m
>> RAID${lvl}-${num}Disk-${chunk}k -n 64:65536:1024:16 >>$log_file
>> 2>/dev/null
>> tiotest -f 1024 -t 6 -r 1000 -d $mpoint -b 4096 >> $log_file
>> tiotest -f 1024 -t 6 -r 1000 -d $mpoint -b 16384 >> $log_file
>>
>> Obviously, I pulled these tests out of a script I use where all these
>> various variables are defined.  Just replace the variables with
>> something sensible for accessing your array, run them, save off the
>> results, run again with a different chunk size, then please post the
>> results back here as I imagine they would be very informative.
>> Especially the dbench results as I think they are likely to benefit the
>> most from the change.  Note: dbench, bonnie++, and tiotest should all be
>> available in the debian repos.
> 
> 	I could not find tiotest.  Also, the version of dbench in the distro
> does not support the --clients-per-process switch.  I'll post the results
> from the backup system here, and from the primary system in the next post.
> 
> Backup with bitmap-chunk 65M:
> 
>   16    363285   109.22 MB/sec  execute 286 sec
>   16    364067   109.05 MB/sec  execute 287 sec
>   16    365960   109.26 MB/sec  execute 288 sec
>   16    366880   109.13 MB/sec  execute 289 sec
>   16    368850   109.35 MB/sec  execute 290 sec
>   16    370444   109.45 MB/sec  execute 291 sec
>   16    372360   109.64 MB/sec  execute 292 sec
>   16    373973   109.74 MB/sec  execute 293 sec
>   16    374821   109.61 MB/sec  execute 294 sec
>   16    376967   109.88 MB/sec  execute 295 sec
>   16    377813   109.77 MB/sec  execute 296 sec
>   16    379422   109.87 MB/sec  execute 297 sec
>   16    381197   110.05 MB/sec  execute 298 sec
>   16    382029   109.92 MB/sec  execute 299 sec
>   16    383868   110.10 MB/sec  cleanup 300 sec
>   16    383868   109.74 MB/sec  cleanup 301 sec
>   16    383868   109.37 MB/sec  cleanup 302 sec


Hmmm...interesting.  This is not the output I expected.  This is the
second-by-second update from the app, not the final results.  The tail
-19 should have grabbed the final summary, which looks something like this:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------

 NTCreateX    3712699     0.186   297.432
 Close        2726300     0.013   168.654
 Rename        157340     0.149   161.108
 Unlink        750442     0.317   274.044
 Qpathinfo    3367128     0.054   297.590
 Qfileinfo     586968     0.011   148.788
 Qfsinfo       617376     0.921   373.536
 Sfileinfo     302636     0.028   151.030
 Find         1301556     0.121   309.603
 WriteX       1834128     0.125   341.075
 ReadX        5825192     0.047   239.368
 LockX          12088     0.006    24.543
 UnlockX        12088     0.006    23.540
 Flush         260391     7.149   520.703

Throughput 385.585 MB/sec  64 clients  16 procs  max_latency=661.232 ms

This allows comparison of not just the final throughput but also the
various activities.  Regardless, though, 109 MB/sec average versus 92
MB/sec average tells a very clear story: that's roughly an 18% performance
difference, which is *HUGE*.
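
For anyone who wants to try the same comparison, the bitmap chunk can be
changed on a live array by dropping the bitmap and re-adding it with a
larger chunk.  Roughly like this - a sketch only, with the device and
bitmap path as placeholders and the chunk given in KiB (exact syntax
varies a bit between mdadm versions):

  # drop the existing write-intent bitmap
  mdadm --grow /dev/md0 --bitmap=none
  # re-add an external bitmap with a ~64 MiB chunk (65536 KiB per bit)
  mdadm --grow /dev/md0 --bitmap=/etc/mdadm/bitmap/md0.map --bitmap-chunk=65536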

> Throughput 110.101 MB/sec 16 procs
> Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> RAID5-7Disk-6 3520M           66779  23 46588  22           127821  29 334.7   2
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>    64:65536:1024/16    35   1   113   2   404   9    36   1    54   1   193   5


Ditto with these bonnie++ numbers: a *HUGE* difference.  66MB/s versus
38MB/s on sequential block output, 46MB/s versus 31MB/s on rewrite, and
127MB/s versus 114MB/s on sequential block input.  The random numbers are
all low enough that I'm not sure I trust them (the random numbers in my
test setup are in the thousands, not the hundreds).

> 
> Backup with default (2048K) bitmap-chunk
> 
> 
>   16    306080    91.68 MB/sec  execute 286 sec
>   16    307214    91.69 MB/sec  execute 287 sec
>   16    307922    91.61 MB/sec  execute 288 sec
>   16    308828    91.53 MB/sec  execute 289 sec
>   16    310653    91.78 MB/sec  execute 290 sec
>   16    311926    91.82 MB/sec  execute 291 sec
>   16    313569    92.01 MB/sec  execute 292 sec
>   16    314478    91.96 MB/sec  execute 293 sec
>   16    315578    91.99 MB/sec  execute 294 sec
>   16    317416    92.18 MB/sec  execute 295 sec
>   16    318576    92.25 MB/sec  execute 296 sec
>   16    320391    92.39 MB/sec  execute 297 sec
>   16    321309    92.40 MB/sec  execute 298 sec
>   16    322461    92.42 MB/sec  execute 299 sec
>   16    324486    92.70 MB/sec  cleanup 300 sec
>   16    324486    92.39 MB/sec  cleanup 301 sec
>   16    324486    92.17 MB/sec  cleanup 302 sec
> 
> Throughput 92.6969 MB/sec 16 procs
> Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> RAID5-7Disk-2 3520M           38751  14 31738  15           114481  28 279.0   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>    64:65536:1024/16    30   1   104   2   340   8    30   1    64   1   160   4


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-15 11:01                       ` Doug Ledford
@ 2009-11-15 19:27                         ` Leslie Rhorer
  0 siblings, 0 replies; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-15 19:27 UTC (permalink / raw)
  To: 'Doug Ledford'; +Cc: 'Linux RAID Mailing List'

> This allows comparison of not just the final throughput but also the
> various activities.  Regardless though, 109 average versus 92 average is
> a very telling story.  That's an 18% performance difference and amounts
> to a *HUGE* factor.

	Well, not so much.  Remember, there is only one link for ingress /
egress on these machines - a single Gig-E link - so getting much over 90 MB/s
across the wire would be a challenge.  Really, the only significant workload
on this machine is an rsync job that runs at 04:00 every morning, and I
don't much care if the rsync takes an extra 10 minutes or so.  Of course, in
the event I ever have to copy the entire data set back after an array
failure, any extra performance would be welcome, but I'm not really
concerned about it.  Now, if this were one of my commercial production
servers, it would be a different matter, but this is for my house, and it is
only a backup unit.  That doesn't mean I am going to revert to the smaller
bitmap chunk, though.
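
Back of the envelope, treating the Gig-E link as the ceiling (rough
figures only):

  1 Gb/s / 8 bits per byte              ~ 125 MB/s raw line rate
  less Ethernet/IP/TCP framing (~5-6%)  ~ 117 MB/s of payload, best case

In practice a sustained rsync run over a single Gig-E link tends to land
somewhere in the 90-110 MB/s range at best, so the array is not really
the bottleneck here.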


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Bitmap did not survive reboot
  2009-11-10  1:44 Leslie Rhorer
@ 2009-11-10  1:58 ` Leslie Rhorer
  0 siblings, 0 replies; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-10  1:58 UTC (permalink / raw)
  To: linux-raid

> 	I had to reboot one of my Linux systems a few days ago (November 2)
> because something was a little unstable, although RAID was AFAIK working
> just fine.  This is not an online production system, so rather than try to
> run down the culprit, I just rebooted the box.  Everything seemed to come
> back up just fine, so I really didn't spend too much time checking
> everything out.  Today one of the drives in the RAID5 array was kicked
> out,
> so I removed it and added it back.  It wasn't until I added the drive back
> that I noticed the array no longer had a write-intent bitmap.  The array
> had
> an external bitmap, but it is no longer there, and I presume for some
> reason
> it was not registered when the box rebooted.  I don't see anything which

	Oh, hey, I just looked at one of my other Linux systems which was
shut down during a protracted power outage 16 days ago, and it, too, is
missing its bitmap, presumably since it was brought back up.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Bitmap did not survive reboot
@ 2009-11-10  1:44 Leslie Rhorer
  2009-11-10  1:58 ` Leslie Rhorer
  0 siblings, 1 reply; 30+ messages in thread
From: Leslie Rhorer @ 2009-11-10  1:44 UTC (permalink / raw)
  To: linux-raid


	I had to reboot one of my Linux systems a few days ago (November 2)
because something was a little unstable, although RAID was AFAIK working
just fine.  This is not an online production system, so rather than try to
run down the culprit, I just rebooted the box.  Everything seemed to come
back up just fine, so I really didn't spend too much time checking
everything out.  Today one of the drives in the RAID5 array was kicked out,
so I removed it and added it back.  It wasn't until I added the drive back
that I noticed the array no longer had a write-intent bitmap.  The array had
an external bitmap, but it is no longer there, and I presume for some reason
it was not registered when the box rebooted.  I don't see anything which
looks like a failure related to md in the logs.  The external bitmap is in
an ext2 file system in a partition of the boot drive, so the file should be
available during boot prior to building the RAID array.  What could be
causing the bitmap to drop out?  This isn't the first time it has happened.
I searched /var/log for the string "md" to find the messages related to
activity on the array.  Is there some other string for which I should
search?  Here is /etc/fstab :

Backup:/var/log# cat /etc/fstab
# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>     <options>       <dump>  <pass>
proc          /proc             proc       defaults        0       0
/dev/hda2     /                 reiserfs   defaults        0       1
/dev/hda1     /boot             reiserfs   notail          0       2
/dev/hda4     /etc/mdadm/bitmap ext2       defaults        0       1
/dev/hda5     none              swap       sw              0       0
/dev/hdb      /media/cdrom0     udf,iso9660  user,noauto    0       0
/dev/md0      /Backup           xfs        defaults        0       2

	Here is /etc/mdadm/mdadm.conf

Backup:/etc/mdadm# cat mdadm.conf
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR lrhorer@satx.rr.com

# definitions of existing MD arrays

# This file was auto-generated on Thu, 14 May 2009 20:25:57 -0500
# by mkconf $Id$
PROGRAM /usr/bin/mdadm_notify
DEVICE /dev/sd[a-g]
ARRAY /dev/md0 level=raid5 metadata=1.2 num-devices=7
   UUID=940ae4e4:04057ffc:5e92d2fb:63e3efb7 name='Backup':0
   bitmap=/etc/mdadm/bitmap/md0.map
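
For reference, these are the sorts of commands that show whether the
bitmap is still attached after a boot (device and bitmap path as in the
config above):

  # a live bitmap shows up as a "bitmap:" line under md0
  cat /proc/mdstat
  # the array's own view of its bitmap
  mdadm --detail /dev/md0 | grep -i bitmap
  # inspect the external bitmap file itself
  mdadm --examine-bitmap /etc/mdadm/bitmap/md0.map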


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2009-11-15 19:27 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <4AF9ABAA.1020407@redhat.com>
2009-11-11  3:34 ` Bitmap did not survive reboot Leslie Rhorer
2009-11-11  3:46   ` Leslie Rhorer
2009-11-11  5:22     ` Majed B.
2009-11-11  8:13       ` Leslie Rhorer
2009-11-11  8:19         ` Michael Evans
2009-11-11  8:53           ` Leslie Rhorer
2009-11-11  9:31             ` John Robinson
2009-11-11 14:52               ` Leslie Rhorer
2009-11-11 16:02                 ` John Robinson
2009-11-11  9:31         ` Majed B.
2009-11-11 14:54           ` Leslie Rhorer
2009-11-11  9:16       ` Robin Hill
2009-11-11 15:01         ` Leslie Rhorer
2009-11-11 15:53           ` Robin Hill
2009-11-11 20:35           ` Doug Ledford
2009-11-11 21:46             ` Ben DJ
2009-11-11 22:10               ` Robin Hill
2009-11-12  1:35               ` Doug Ledford
2009-11-12  0:23             ` Leslie Rhorer
2009-11-12  1:34               ` Doug Ledford
2009-11-12  4:55                 ` Leslie Rhorer
2009-11-12  5:22                   ` Doug Ledford
2009-11-14 21:48                     ` Leslie Rhorer
2009-11-15 11:01                       ` Doug Ledford
2009-11-15 19:27                         ` Leslie Rhorer
2009-11-11 15:19   ` Gabor Gombas
2009-11-11 16:48     ` Leslie Rhorer
2009-11-11 20:32   ` Doug Ledford
2009-11-10  1:44 Leslie Rhorer
2009-11-10  1:58 ` Leslie Rhorer
