* creating a new 80 TB XFS
@ 2012-02-24 12:52 Richard Ems
  2012-02-24 14:08 ` Emmanuel Florac
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Richard Ems @ 2012-02-24 12:52 UTC (permalink / raw)
  To: xfs

Hi list,

I am not a storage expert, so sorry in advance for probably some *naive*
questions or proposals from me. 8)

*INTRO*
We are getting new hardware soon and I wanted to check with you my plans
for creating and mounting this XFS.

The storage system is from EUROstor,
http://eurostor.com/en/products/raid-sas-host/es-6600-sassas-toploader.html
.

We are getting now 32 x 3 TB Hitachi SATA HDDs.
I plan to configure them in a single RAID 6 set with one or two
hot-standby discs. The raw storage space will then be 28 x 3 TB = 84 TB.
On this one RAID set I will create only one volume.
Any thoughts on this?

This storage will be used as secondary storage for backups. We use
dirvish (www.dirvish.org, which uses rsync) to run our daily backups.
dirvish heavily uses hard links. It compares all files one by one,
synchronizes all new or changed files with rsync into the current daily
dir YYYY-MM-DD, and creates hard links for all unchanged files from the
previous day's backup (YYYY-MM-(DD-1)) into the current YYYY-MM-DD
directory.
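
(For reference, each dirvish run boils down to an rsync invocation
roughly like the following, with purely illustrative paths:

  rsync -aH --delete --link-dest=/backup/host/2012-02-23 \
        host:/data/ /backup/host/2012-02-24/

Unchanged files become hard links into the previous day's tree, so only
new or changed files consume additional space.)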


*MKFS*
We also heavily use ACLs for almost all of our files. Christoph Hellwig
suggested in a previous mail to use "-i size=512" on XFS creation, so my
mkfs.xfs would look something like:

mkfs.xfs -i size=512 -d su=stripe_size,sw=28 -L Backup_2 /dev/sdX1


*MOUNT*
On mount I will use the options

mount -o noatime,nobarrier,nofail,logbufs=8,logbsize=256k,inode64
/dev/sdX1 /mount_point

What about the largeio mount option? In which cases would it be useful?


Do you have any other/better suggestions or comments?


Many thanks,
Richard






-- 
Richard Ems       mail: Richard.Ems@Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924
http://www.cape-horn-eng.com


* Re: creating a new 80 TB XFS
  2012-02-24 12:52 creating a new 80 TB XFS Richard Ems
@ 2012-02-24 14:08 ` Emmanuel Florac
  2012-02-24 15:43   ` Richard Ems
  2012-02-24 14:52 ` Peter Grandi
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Emmanuel Florac @ 2012-02-24 14:08 UTC (permalink / raw)
  To: xfs

Le Fri, 24 Feb 2012 13:52:40 +0100
Richard Ems <richard.ems@cape-horn-eng.com> écrivait:

> Hi list,
> 
> We are getting now 32 x 3 TB Hitachi SATA HDDs.
> I plan to configure them in a single RAID 6 set with one or two
> hot-standby discs. The raw storage space will then be 28 x 3 TB = 84
> TB. On this one RAID set I will create only one volume.
> Any thoughts on this?

If you'd rather go for more safety you could build two 16-drive RAID-6
arrays instead. I'd be somewhat reluctant to make a 30-drive array,
though current drives are quite safe apparently.

> 
> *MKFS*
> We also heavily use ACLs for almost all of our files. Christoph
> Hellwig suggested in a previous mail to use "-i size=512" on XFS
> creation, so my mkfs.xfs would look something like:
> 
> mkfs.xfs -i size=512 -d su=stripe_size,sw=28 -L Backup_2 /dev/sdX1

Looks OK to me.
 
> 
> *MOUNT*
> On mount I will use the options
> 
> mount -o noatime,nobarrier,nofail,logbufs=8,logbsize=256k,inode64
> /dev/sdX1 /mount_point

I think the logbufs/logbsize options match the defaults here. Use
delaylog if applicable. See the XFS FAQ.
 
> What about the largeio mount option? In which cases would it be
> useful?
> 

If you're mostly writing/reading really large files (several megabytes
and more).

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: creating a new 80 TB XFS
  2012-02-24 12:52 creating a new 80 TB XFS Richard Ems
  2012-02-24 14:08 ` Emmanuel Florac
@ 2012-02-24 14:52 ` Peter Grandi
  2012-02-24 14:57 ` Michael Weissenbacher
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Peter Grandi @ 2012-02-24 14:52 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

> We are getting now 32 x 3 TB Hitachi SATA HDDs. I plan to
> configure them in a single RAID 6 set with one or two
> hot-standby discs. The raw storage space will then be 28 x 3
> TB = 84 TB.  On this one RAID set I will create only one
> volume.  Any thoughts on this?

Well, many storage experts would be impressed by and support
such an audacious plan...

But I think that wide RAID6 sets and large RAID6 stripes are a
phenomenally bad idea, that large filetrees are also strikingly
bad, and that the two combined are close to the worst possible
setup. It is also remarkably brave to use 32 identical drives in
a RAID set. But all this is very popular because in the beginning
"it works" and it is really cheap.

The proposed setup has only 7% redundancy, RMW issues with large
stripe sizes, and 'fsck' time and space issues with large trees.

Consider this series of blog notes:

  http://www.sabi.co.uk/blog/12-two.html#120218
  http://www.sabi.co.uk/blog/12-two.html#120127
  http://www.sabi.co.uk/blog/1104Apr.html#110401
  http://groups.google.com/group/linux.debian.ports.x86-64/msg/fd2b4d46a4c294b5

> This storage will be used as secondary storage for backups. We
> use dirvish (www.dirvish.org, which uses rsync) to run our
> daily backups.

So it will be lots and lots of metadata (mostly directory)
updates. Not a very good match there, especially considering
that you will almost always be only writing to it, even for the
data, and presumably from multiple hosts concurrently. You may
benefit considerably from putting the XFS log on a separate
disk, and, if you use Linux MD for RAID, the write-intent
bitmaps on a separate disk as well.
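
A sketch of what that could look like (device names and sizes here
are purely hypothetical):

  # external XFS log on a separate, fast device
  mkfs.xfs -i size=512 -d su=stripe_size,sw=28 \
           -l logdev=/dev/sdY1,size=128m -L Backup_2 /dev/sdX
  mount -o logdev=/dev/sdY1,noatime,inode64 /dev/sdX /mount_point

  # with Linux MD, the write-intent bitmap can live as a file on
  # another filesystem
  mdadm --grow /dev/md0 --bitmap=/var/lib/md0.bitmap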

> *MKFS* We also heavily use ACLs for almost all of our files.

That's a daring choice.

> [ ... ] "-i size=512" on XFS creation, so my mkfs.xfs would look
> something like: mkfs.xfs -i size=512 -d su=stripe_size,sw=28
> -L Backup_2 /dev/sdX1

As a rule I specify a sector size of 4096, and in your case
perhaps an inode size of 2048 might be appropriate to raise the
chance of ACLs and directories being stored entirely in the
inode, which seems particularly important for your workload.
Something like:

  -s size=4096 -b size=4096 -i size=2048,attr=2
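
Illustratively, combining that with the stripe settings from your
original command (su is still a placeholder for the controller's
real stripe unit):

  mkfs.xfs -s size=4096 -b size=4096 -i size=2048,attr=2 \
           -d su=stripe_size,sw=28 -L Backup_2 /dev/sdX1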

> mount -o noatime,nobarrier,nofail,logbufs=8,logbsize=256k,inode64
> /dev/sdX1 /mount_point

'nobarrier' seems rather optimistic unless you are very very
sure there won't be failures.

There are many other details to look into, from readahead to
flusher frequency.


* Re: creating a new 80 TB XFS
  2012-02-24 12:52 creating a new 80 TB XFS Richard Ems
  2012-02-24 14:08 ` Emmanuel Florac
  2012-02-24 14:52 ` Peter Grandi
@ 2012-02-24 14:57 ` Michael Weissenbacher
  2012-02-24 16:05   ` Richard Ems
  2012-02-24 15:17 ` Eric Sandeen
  2012-02-27 11:56 ` Michael Monnerie
  4 siblings, 1 reply; 21+ messages in thread
From: Michael Weissenbacher @ 2012-02-24 14:57 UTC (permalink / raw)
  To: xfs

Hi Richard!
> 
> mkfs.xfs -i size=512 -d su=stripe_size,sw=28 -L Backup_2 /dev/sdX1
> 
This suggests that you plan to create partitions like /dev/sdX1 on the
RAID-6. If you really do (which is not a good idea IMHO because it buys
you nothing) you will have to be extra-careful to not mess up the stripe
alignment. I think the better option is to create the XFS directly on
/dev/sdX without partitions.

cheers,
Michael


* Re: creating a new 80 TB XFS
  2012-02-24 12:52 creating a new 80 TB XFS Richard Ems
                   ` (2 preceding siblings ...)
  2012-02-24 14:57 ` Michael Weissenbacher
@ 2012-02-24 15:17 ` Eric Sandeen
  2012-10-01 14:28   ` Richard Ems
  2012-02-27 11:56 ` Michael Monnerie
  4 siblings, 1 reply; 21+ messages in thread
From: Eric Sandeen @ 2012-02-24 15:17 UTC (permalink / raw)
  To: Richard Ems; +Cc: xfs

On 2/24/12 6:52 AM, Richard Ems wrote:
> Hi list,
> 
> I am not a storage expert, so sorry in advance for probably some *naive*
> questions or proposals from me. 8)
> 
> *INTRO*
> We are getting new hardware soon and I wanted to check with you my plans
> for creating and mounting this XFS.
> 
> The storage system is from EUROstor,
> http://eurostor.com/en/products/raid-sas-host/es-6600-sassas-toploader.html
> .
> 
> We are getting now 32 x 3 TB Hitachi SATA HDDs.
> I plan to configure them in a single RAID 6 set with one or two
> hot-standby discs. The raw storage space will then be 28 x 3 TB = 84 TB.
> On this one RAID set I will create only one volume.
> Any thoughts on this?
> 
> This storage will be used as secondary storage for backups. We use
> dirvish (www.dirvish.org, which uses rsync) to run our daily backups.
> dirvish heavily uses hard links. It compares all files, one by one, and
> synchronizes all new or changed files with rsync to the current daily
> dir YYYY-MM-DD, and creates hard links for all not changed files from
> the last previous backup on YYYY-MM-(DD-1) to the current YYYY-MM-DD
> directory.
> 
> 
> *MKFS*
> We also heavily use ACLs for almost all of our files. Christoph Hellwig
> suggested in a previous mail to use "-i size=512" on XFS creation, so my
> mkfs.xfs would look something like:
> 
> mkfs.xfs -i size=512 -d su=stripe_size,sw=28 -L Backup_2 /dev/sdX1

Be sure the stripe geometry matches the way the raid controller is
set up.
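
For example, if the controller were using a hypothetical 128 KiB chunk
across 28 data spindles, the matching geometry would be:

  mkfs.xfs -i size=512 -d su=128k,sw=28 -L Backup_2 /dev/sdX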

You know the size of your acls, so you can probably do some testing
to find out how well 512-byte inodes keep ACLs in-line.

As others mentioned, if sdX1 means you've partitioned your 80T
device, that's probably unnecessary.

> *MOUNT*
> On mount I will use the options
> 
> mount -o noatime,nobarrier,nofail,logbufs=8,logbsize=256k,inode64
> /dev/sdX1 /mount_point

Understand what nobarrier means, and convince yourself that it's safe
before you turn barriers off.  Then convince yourself again.
You'll want to know if your raid controller has a write-back cache,
whether it disables the disks' write-back caches, whether any active
caches are battery-backed, etc.
 
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/writebarr.html

You are restating defaults for logbufs.  Your logbsize value is bigger than default.
"The trade off for this increase in metadata performance is that more operations may
be "missing" after recovery if the system crashes while actively making modifications. "

inode64 is a good idea.

also, why nofail?

> What about the largeio mount option? In which cases would it be useful?

probably none in your case.  It changes what stat reports in st_blksize,
so it depends on what (if anything) your userspace does with that.
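
One way to see what userspace is actually told (path is illustrative):

  stat -c %o /mount_point/somefile   # GNU stat: prints st_blksize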

> Do you have any other/better suggestions or comments?

http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E

-Eric

> 
> Many thanks,
> Richard
> 
> 
> 
> 
> 
> 


* Re: creating a new 80 TB XFS
  2012-02-24 14:08 ` Emmanuel Florac
@ 2012-02-24 15:43   ` Richard Ems
  2012-02-24 16:20     ` Martin Steigerwald
                       ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Richard Ems @ 2012-02-24 15:43 UTC (permalink / raw)
  To: xfs



On 02/24/2012 03:08 PM, Emmanuel Florac wrote:
> Le Fri, 24 Feb 2012 13:52:40 +0100
> Richard Ems <richard.ems@cape-horn-eng.com> écrivait:
> 
>> Hi list,
>>
>> We are getting now 32 x 3 TB Hitachi SATA HDDs.
>> I plan to configure them in a single RAID 6 set with one or two
>> hot-standby discs. The raw storage space will then be 28 x 3 TB = 84
>> TB. On this one RAID set I will create only one volume.
>> Any thoughts on this?
> 
> If you'd rather go for more safety you could build 2 16 drives RAID-6
> arrays instead. I'd be somewhat reluctant to make a 30 drives array
> --though current drives are quite safe apparently.

Thanks, yes, this sounds good, but I chose to have only one
volume/partition/XFS for simplicity of administering the backups. At
some point one of the two volumes would become nearly full, and moving
dirs from one partition to the other won't be that easy with our backup
system ...

> 
>>
>> *MKFS*
>> We also heavily use ACLs for almost all of our files. Christoph
>> Hellwig suggested in a previous mail to use "-i size=512" on XFS
>> creation, so my mkfs.xfs would look something like:
>>
>> mkfs.xfs -i size=512 -d su=stripe_size,sw=28 -L Backup_2 /dev/sdX1
> 
> Looks OK to me.
>  
>>
>> *MOUNT*
>> On mount I will use the options
>>
>> mount -o noatime,nobarrier,nofail,logbufs=8,logbsize=256k,inode64
>> /dev/sdX1 /mount_point
> 
> I think that the logbufs/logbsize option matches the default here. Use
> delaylog if applicable. See the xfs FAQ.

Yes, if I trust the mount manual page, it states "The default value is 8
buffers for any recent kernel.". I suppose 3.2.6 is "a recent kernel",
so this could be omitted, but having it explicitly on the mount line
does not hurt, does it?
And for logbsize: "The default value for any recent kernel is 32768."

But then the end of the manual page for mount says "December 2004", so
how current is this information? Can the default mount values be shown
by running mount with some verbose and dry-run parameters?

>  
>> What about the largeio mount option? In which cases would it be
>> useful?
>>
> 
> If you're mostly writing/reading large files. Like really large
> (several megabytes and more).
> 

Ok, thanks.

Richard


-- 
Richard Ems       mail: Richard.Ems@Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924
http://www.cape-horn-eng.com


* Re: creating a new 80 TB XFS
  2012-02-24 14:57 ` Michael Weissenbacher
@ 2012-02-24 16:05   ` Richard Ems
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Ems @ 2012-02-24 16:05 UTC (permalink / raw)
  To: xfs

On 02/24/2012 03:57 PM, Michael Weissenbacher wrote:
> Hi Richard!
>>
>> mkfs.xfs -i size=512 -d su=stripe_size,sw=28 -L Backup_2 /dev/sdX1
>>
> This suggests that you plan to create partitions like /dev/sdX1 on the
> RAID-6. If you really do (which is not a good idea IMHO because it buys
> you nothing) you will have to be extra-careful to not mess up the stripe
> alignment. I think the better option is to create the XFS directly on
> /dev/sdX without partitions.

Ok, thanks. I think I will go then to /dev/sdX , no partitions.

Richard

-- 
Richard Ems       mail: Richard.Ems@Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924
http://www.cape-horn-eng.com


* Re: creating a new 80 TB XFS
  2012-02-24 15:43   ` Richard Ems
@ 2012-02-24 16:20     ` Martin Steigerwald
  2012-02-24 16:51       ` Stan Hoeppner
  2012-02-24 16:58     ` Roger Willcocks
  2012-02-25 21:57     ` Peter Grandi
  2 siblings, 1 reply; 21+ messages in thread
From: Martin Steigerwald @ 2012-02-24 16:20 UTC (permalink / raw)
  To: xfs

Am Freitag, 24. Februar 2012 schrieb Richard Ems:
> >> MOUNT
> >> On mount I will use the options
> >> 
> >> mount -o noatime,nobarrier,nofail,logbufs=8,logbsize=256k,inode64
> >> /dev/sdX1 /mount_point
> >
> > 
> >
> > I think that the logbufs/logbsize option matches the default here.
> > Use delaylog if applicable. See the xfs FAQ.
> 
> Yes, if I trust the mount manual page, it states "The default value is
> 8 buffers for any recent kernel." . I suppose 3.2.6 is "a recent
> kernel", so this could be avoided, but having it explicitly on the
> mkfs.xfs line does not hurt, or?
> And for logbsize: "The default value for any recent kernel is 32768."
> 
> But then at the end of the manual page for mount it says "December
> 2004", so how actual is this information? Can the default mount values
> be shown by running mount with some verbose and dry-run parameters?

Does cat /proc/mounts show them? /proc/mounts is more detailed than mount 
or mount -l.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: creating a new 80 TB XFS
  2012-02-24 16:20     ` Martin Steigerwald
@ 2012-02-24 16:51       ` Stan Hoeppner
  2012-02-25 10:59         ` Martin Steigerwald
  0 siblings, 1 reply; 21+ messages in thread
From: Stan Hoeppner @ 2012-02-24 16:51 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: xfs

On 2/24/2012 10:20 AM, Martin Steigerwald wrote:
> Am Freitag, 24. Februar 2012 schrieb Richard Ems:
>>>> MOUNT
>>>> On mount I will use the options
>>>>
>>>> mount -o noatime,nobarrier,nofail,logbufs=8,logbsize=256k,inode64
>>>> /dev/sdX1 /mount_point
>>>
>>>
>>>
>>> I think that the logbufs/logbsize option matches the default here.
>>> Use delaylog if applicable. See the xfs FAQ.
>>
>> Yes, if I trust the mount manual page, it states "The default value is
>> 8 buffers for any recent kernel." . I suppose 3.2.6 is "a recent
>> kernel", so this could be avoided, but having it explicitly on the
>> mkfs.xfs line does not hurt, or?
>> And for logbsize: "The default value for any recent kernel is 32768."
>>
>> But then at the end of the manual page for mount it says "December
>> 2004", so how actual is this information? Can the default mount values
>> be shown by running mount with some verbose and dry-run parameters?
> 
> Does cat /proc/mounts show them? /proc/mounts is more detailed than mount 
> or mount -l.

Vanilla kernel.org 3.2.6:

~$ cat /proc/mounts
/dev/sda7 /samba xfs rw,relatime,attr2,delaylog,noquota 0 0

It doesn't show the default logbufs and logbsize values.  I asked about
this specific issue over a year ago, because the documentation is
inconsistent, and you can't get the default values out of a running
system.  If you can I don't know how.  If someone stated a method, I
can't recall it. :(

I do recall Dave, IIRC, saying something to the effect of 'just use the
defaults, as they are 8 and 256K in recent kernels anyway'.  That's not
a direct quote, but my recollection.

-- 
Stan



* Re: creating a new 80 TB XFS
  2012-02-24 15:43   ` Richard Ems
  2012-02-24 16:20     ` Martin Steigerwald
@ 2012-02-24 16:58     ` Roger Willcocks
  2012-02-25 21:57     ` Peter Grandi
  2 siblings, 0 replies; 21+ messages in thread
From: Roger Willcocks @ 2012-02-24 16:58 UTC (permalink / raw)
  To: Richard Ems; +Cc: xfs


On 24 Feb 2012, at 15:43, Richard Ems wrote:

> 
> 
> On 02/24/2012 03:08 PM, Emmanuel Florac wrote:
>> Le Fri, 24 Feb 2012 13:52:40 +0100
>> Richard Ems <richard.ems@cape-horn-eng.com> écrivait:
>> 
>>> Hi list,
>>> 
>>> We are getting now 32 x 3 TB Hitachi SATA HDDs.
>>> I plan to configure them in a single RAID 6 set with one or two
>>> hot-standby discs. The raw storage space will then be 28 x 3 TB = 84
>>> TB. On this one RAID set I will create only one volume.
>>> Any thoughts on this?
>> 
>> If you'd rather go for more safety you could build 2 16 drives RAID-6
>> arrays instead. I'd be somewhat reluctant to make a 30 drives array
>> --though current drives are quite safe apparently.
> 
> Thanks, yes, this sounds good, but it's a matter of administration
> simplicity doing the backups why I chose to have only one
> volume/partition/XFS. At some point one of both drives will become near
> to full and moving dirs from one partition to the other won't be that
> easy with out backup system ...

You might consider making a software raid0 from the two raid-6 arrays.
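
A minimal sketch, with hypothetical device names for the two hardware
RAID-6 LUNs:

  mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=512 \
        /dev/sdX /dev/sdY

mkfs.xfs can then be run on /dev/md0, and should pick up the md stripe
geometry automatically, though it is worth double-checking it against
the underlying arrays.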

--
Roger


* Re: creating a new 80 TB XFS
  2012-02-24 16:51       ` Stan Hoeppner
@ 2012-02-25 10:59         ` Martin Steigerwald
  0 siblings, 0 replies; 21+ messages in thread
From: Martin Steigerwald @ 2012-02-25 10:59 UTC (permalink / raw)
  To: xfs, stan

Am Freitag, 24. Februar 2012 schrieb Stan Hoeppner:
> On 2/24/2012 10:20 AM, Martin Steigerwald wrote:
> > Am Freitag, 24. Februar 2012 schrieb Richard Ems:
> >>>> MOUNT
> >>>> On mount I will use the options
> >>>> 
> >>>> mount -o noatime,nobarrier,nofail,logbufs=8,logbsize=256k,inode64
> >>>> /dev/sdX1 /mount_point
> >>> 
> >>> I think that the logbufs/logbsize option matches the default here.
> >>> Use delaylog if applicable. See the xfs FAQ.
> >> 
> >> Yes, if I trust the mount manual page, it states "The default value
> >> is 8 buffers for any recent kernel." . I suppose 3.2.6 is "a recent
> >> kernel", so this could be avoided, but having it explicitly on the
> >> mkfs.xfs line does not hurt, or?
> >> And for logbsize: "The default value for any recent kernel is
> >> 32768."
> >> 
> >> But then at the end of the manual page for mount it says "December
> >> 2004", so how actual is this information? Can the default mount
> >> values be shown by running mount with some verbose and dry-run
> >> parameters?
> > 
> > Does cat /proc/mounts show them? /proc/mounts is more detailed than
> > mount or mount -l.
> 
> Vanilla kernel.org 3.2.6:
> 
> ~$ cat /proc/mounts
> /dev/sda7 /samba xfs rw,relatime,attr2,delaylog,noquota 0 0
> 
> It doesn't show the default logbufs and logbsize values.  I asked about
> this specific issue over a year ago, because the documentation is
> inconsistent, and you can't get the default values out of a running
> system.  If you can I don't know how.  If someone stated a method, I
> can't recall it. :(
> 
> I do recall Dave, IIRC, saying something to the effect of 'just use the
> defaults, as they are 8 and 256K in recent kernels anyway'.  That's not
> a direct quote, but my recollection.

As I wrote that, I suspected that for XFS the options might not be
displayed, because I remember having seen something similar quite some
time ago.

With NFS this works quite well, and it seems to with Ext4 and vfat as
well.

I think it would be good to include default options in /proc/mounts.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: creating a new 80 TB XFS
  2012-02-24 15:43   ` Richard Ems
  2012-02-24 16:20     ` Martin Steigerwald
  2012-02-24 16:58     ` Roger Willcocks
@ 2012-02-25 21:57     ` Peter Grandi
  2012-02-26  2:57       ` Stan Hoeppner
  2 siblings, 1 reply; 21+ messages in thread
From: Peter Grandi @ 2012-02-25 21:57 UTC (permalink / raw)
  To: Linux fs XFS

>>> We are getting now 32 x 3 TB Hitachi SATA HDDs. I plan to
>>> configure them in a single RAID 6 set with one or two
>>> hot-standby discs. The raw storage space will then be 28 x 3
>>> TB = 84 TB.  On this one RAID set I will create only one
>>> volume.  Any thoughts on this?

>> Well, many storage experts would be impressed by and support
>> such an audacious plan...

> Audacious?

Please remember that the experts reading or responding to this
thread have not objected to the (very) aggressive aspects of
your setup, so obviously it seems mostly fine to them. It is
just me pointing out the risks, plus the one person who thinks
that 16 drives per set would be preferable.

> Why? Too many discs together? What would be your recommended
> maximum?

The links below explain. In general I am uncomfortable with storage
redundancy of less than 30% and very worried when it is less than
20%, especially given correlated chances of failure due to strong
common modes such as all disks being of the same type and make in
the same box. Fortunately there is a significant report that the
Hitachi 3TB drive has so far been particularly reliable:

  http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/

But consider that several large-scale studies report that most drives
have a failure rate of 3-5% per year, and in a population of 28
drives with common modes that gives a chance of 3 overlapping
failures that I am not comfortable with.
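
A rough back-of-envelope, optimistically assuming independent failures
at a 4% annual rate (common modes make the real picture worse):

  echo '28 * 0.04' | bc -l     # expected drive failures per year, ~1.1
  echo '1 - 0.96^28' | bc -l   # chance of at least one failure per year, ~0.68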

> We are running our actual backup (remember, this is for
> backups!) on one RAID 6 set on 24 HDDs (21 data + 2 RAID6
> parity + 1 hot-spare) and as you already wrote "it works".

The managers of Lehman and Fukushima also said "it works" until
it did not :-).

>> [ ... ] It is also remarkably brave to use 32 identical
>> drives in a RAID set. But all this is very popular because in
>> the beginning "it works" and is really cheap.

> Yes, costs are an important factor. We could have gone with
> more secure/sophisticated/professional setups, but we would
> have got 1/2 ot 1/4 of the capacity for the same price.

If only it were a cost-free saving... But the saving is upfront
and visible and the cost is in the fat tail and invisible.

However you might want to consider something like a RAID0 of 2+1
RAID5s perhaps.

> But since we need that capacity for the backups we had no
> other choice. As said before, our previous setup with 24 HDDs
> in one RAID 6 worked flawlessly for 5 years. And it still works.

Risk is not a certainty...

>> The proposed setup has only 7% redundancy, RMW issues with
>> large stripe sizes, and 'fsck' time and space issues with
>> large trees.

> 7% ? 2/28 ?  fsck time? and space? Time won't be a problem, as
> long as we are not talking about days.

It could be weeks to months if the filetree is damaged.

> Remember this is a system for storing backups.

And therefore, since it is based on RSYNC'ing, it is one that does
vast metadata scans and reads, and quite a few metadata updates.

> How can I estimate the time needed? And what do you mean with
> "space" ?  Memory issues while running fsck?

The time is hard to estimate beyond the time needed to check an
undamaged or very lightly damaged filetree. As to space, you
might need several dozen GiB (depending on metadata size) as per
the link below.

>> Consider this series of blog notes:

>> http://www.sabi.co.uk/blog/12-two.html#120218
>> http://www.sabi.co.uk/blog/12-two.html#120127
>> http://www.sabi.co.uk/blog/1104Apr.html#110401
>> http://groups.google.com/group/linux.debian.ports.x86-64/msg/fd2b4d46a4c294b5

>> [ ... ] presumably from multiple hosts concurrently. You may
>> benefit considerably from putting the XFS log on a separate
>> disk, and if you use Linux MD for RAID the bitmaps on a
>> separate disk.

> No, not concurrently, we run the backups from multiple hosts
> one after another.

Then you have a peculiar situation for such a large capacity
backup system.

>>> *MKFS* We also heavily use ACLs for almost all of our files.

>> That's a daring choice.

> Is there a better way of giving different access rights per user to
> files and directories? Complicated group setups?

Probably yes, and they would not be that complicated. Or really
simple ACLs, but you seem to have complicated ones, and you
don't seem to work for the NSA :-).

>> 'nobarrier' seems rather optimistic unless you are very very
>> sure there won't be failures.

> There are always failures. But again, this is a backup system.

Sure, but the last thing you want is for your backup system to
fail. People often do silly things with "main" systems because
they are confident in there being backups, and then when they try
to get at those backups they are not there, because after all the
backup system was designed with the idea that it is just a backup
system...

> And the controller will be battery backed up, and it's
> connected to an UPS that gives about 30 min power in case of a
> power failure.

That's good, but there are also hardware failures and kernel
crashes.


* Re: creating a new 80 TB XFS
  2012-02-25 21:57     ` Peter Grandi
@ 2012-02-26  2:57       ` Stan Hoeppner
  2012-02-26 16:08         ` Emmanuel Florac
  0 siblings, 1 reply; 21+ messages in thread
From: Stan Hoeppner @ 2012-02-26  2:57 UTC (permalink / raw)
  To: xfs

On 2/25/2012 3:57 PM, Peter Grandi wrote:

>> There are always failures. But again, this is a backup system.
> 
> Sure, but the last thing you want is for your backup system to
> fail.

Putting an exclamation point on Peter's wisdom requires nothing more
than browsing the list archive:

Subject: xfs_repair of critical volume
Date: Sun, 31 Oct 2010 00:54:13 -0700
To: xfs@oss.sgi.com

I have a large XFS filesystem (60 TB) that is composed of 5 hardware
RAID 6 volumes. One of those volumes had several drives fail in a very
short time and we lost that volume. However, four of the volumes seem
OK. We are in a worse state because our backup unit failed a week later
when four drives simultaneously went offline. So we are in a bad very state.
[...]


This saga is available in these two XFS list threads:
http://oss.sgi.com/archives/xfs/2010-07/msg00077.html
http://oss.sgi.com/archives/xfs/2010-10/msg00373.html

Lessons:
1.  Don't use cheap hardware for a backup server
2.  Make sure your backup system is reliable
    Do test restore operations regularly


I suggest you get the dual active/active controller configuration and
use two PCIe SAS HBAs, one connected to each controller, and use SCSI
multipath.  This prevents a dead HBA from leaving you dead in the water
until a replacement arrives.  How long does it take, and at what cost to
operations, if your single HBA fails during a critical restore?

Get the battery backed cache option.  Verify the controllers disable the
drive write caches.

Others have recommended stitching 2 small arrays together with mdadm and
using a single XFS on the volume instead of one big array and one XFS.
I suggest using two XFS, one on each small array.  This ensures you can
still access some of your backups in the event of a problem with one
array or one filesystem.

As others mentioned, an xfs_[check|repair] can take many hours or even
days on a multi-terabyte filesystem with huge amounts of metadata.  If
you need to do a restore during that period you're out of luck.  With
two filesystems, and if you duplicate critical images/files on each,
you're still in business.

-- 
Stan


* Re: creating a new 80 TB XFS
  2012-02-26  2:57       ` Stan Hoeppner
@ 2012-02-26 16:08         ` Emmanuel Florac
  2012-02-26 16:55           ` Joe Landman
  0 siblings, 1 reply; 21+ messages in thread
From: Emmanuel Florac @ 2012-02-26 16:08 UTC (permalink / raw)
  To: stan; +Cc: xfs

Le Sat, 25 Feb 2012 20:57:05 -0600 vous écriviez:

> As others mentioned, an xfs_[check|repair] can take many hours or even
> days on a multi-terabyte huge metadata filesystem. 

Just nitpicking, but I never had such a problem. I've run quite a lot
of xfs_repair on 40TB+ filesystems, and it rarely took longer than 10 to
20 minutes. The important part is having enough RAM (if the system hits
swap it makes the check much slower).

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: creating a new 80 TB XFS
  2012-02-26 16:08         ` Emmanuel Florac
@ 2012-02-26 16:55           ` Joe Landman
  0 siblings, 0 replies; 21+ messages in thread
From: Joe Landman @ 2012-02-26 16:55 UTC (permalink / raw)
  To: xfs

On 02/26/2012 11:08 AM, Emmanuel Florac wrote:
> Le Sat, 25 Feb 2012 20:57:05 -0600 vous écriviez:
>
>> As others mentioned, an xfs_[check|repair] can take many hours or even
>> days on a multi-terabyte huge metadata filesystem.
>
> Just nitpicking, but I never had such a problem. I've run quite a lot
> of xfs_repair on 40TB+ filesystems, and it rarely was longer than 10 to
> 20 minutes. The important part is having enough RAM if the system hits
> swap it makes the check much slower).

We've found that adding the -m X and -P options seems to fix many of
the longer-running issues for large, nearly full, multi-TB file systems.
The biggest one we've repaired was 108TB at ~80% utilization of the
underlying file system, and it took a few hours.

I don't know if the sparse file bit we reported last year (with more
data reported to the list in January this year) has had much attention
(hard to reproduce, I would imagine).  But apart from this, repair seems
to work reasonably quickly.  I've not seen an instance of this after
using the -m X -P options, nor a repair taking "days", even on heavily
fragmented file systems.  Possibly Peter has seen this, and he might
describe his observations in this regard.

Repair time is important.  There's no doubt of that.  To some degree,
repair performance will be related to the speed of accessing the data on
the drives, so if your best-case IO speeds are low, repair won't be
terribly fast either.  Memory size is also important ... we've had some
repairs start swapping (not good).  Hence the -m X option (for suitable
values of X).
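
For example (the memory figure is illustrative and should be sized to
the machine's actual RAM):

  xfs_repair -m 16384 -P /dev/sdX   # cap repair at ~16 GiB, disable prefetch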

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


* Re: creating a new 80 TB XFS
  2012-02-24 12:52 creating a new 80 TB XFS Richard Ems
                   ` (3 preceding siblings ...)
  2012-02-24 15:17 ` Eric Sandeen
@ 2012-02-27 11:56 ` Michael Monnerie
  2012-02-27 12:20   ` Richard Ems
  4 siblings, 1 reply; 21+ messages in thread
From: Michael Monnerie @ 2012-02-27 11:56 UTC (permalink / raw)
  To: xfs; +Cc: Richard Ems


Am Freitag, 24. Februar 2012, 13:52:40 schrieb Richard Ems:
> The raw storage space will then be 28 x 3 TB = 84 TB.

Please remember that it will be about 2.73 TiB per drive, roughly
76 TiB overall. So nearly 8 TB are "missing" in reality, depending on
how you calculated the source space you have and the backup space you
need. In multi-TB environments people are still surprised by the big
"rounding error" from hard-disk TB to real TiB. And I've had customers
who couldn't even do their first backup, as the backup system was full
already: they had calculated backup space only for one backup, and
without the TB->TiB conversion.

https://en.wikipedia.org/wiki/Byte
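
A quick sanity check of the conversion:

  echo '28 * 3 * 10^12 / 2^40' | bc -l   # ~76.4 TiB usable out of "84 TB"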

-- 
mit freundlichen Grüssen,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [gesprochen: Prot-e-schee]
Tel: +43 660 / 415 6531


* Re: creating a new 80 TB XFS
  2012-02-27 11:56 ` Michael Monnerie
@ 2012-02-27 12:20   ` Richard Ems
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Ems @ 2012-02-27 12:20 UTC (permalink / raw)
  To: xfs

On 02/27/2012 12:56 PM, Michael Monnerie wrote:
> Am Freitag, 24. Februar 2012, 13:52:40 schrieb Richard Ems:
>> The raw storage space will then be 28 x 3 TB = 84 TB.
> 
> Please remember that it will be 2,79 TiB per drive, that's 78 TiB 
> overall. So it's missing 6 TB in reality, depending on how you 
> calculated the source space you have and backup space you need. In 
> Multi-TB environments people still are surprised by the big "rounding 
> error" from harddisk TB to real TiB. And I've had customers who couldn't 
> even do their first backup, as the backup system was full already. They 
> calculated backup space only for one backup, and without the TB->TiB 
> conversion.

Thanks Michael, I have already taken this into account.

We now have about 40 TB to back up, and about 60 TB max. And there is
space for 32 more HDDs in the backup system, which we will add later as
needed.

Richard

-- 
Richard Ems       mail: Richard.Ems@Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924
http://www.cape-horn-eng.com


* Re: creating a new 80 TB XFS
  2012-02-24 15:17 ` Eric Sandeen
@ 2012-10-01 14:28   ` Richard Ems
  2012-10-01 14:36     ` Richard Ems
  2012-10-01 14:39     ` Eric Sandeen
  0 siblings, 2 replies; 21+ messages in thread
From: Richard Ems @ 2012-10-01 14:28 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On 02/24/2012 04:17 PM, Eric Sandeen wrote:
>> *MKFS*
>> > We also heavily use ACLs for almost all of our files. Christoph Hellwig
>> > suggested in a previous mail to use "-i size=512" on XFS creation, so my
>> > mkfs.xfs would look something like:
>> > 
>> > mkfs.xfs -i size=512 -d su=stripe_size,sw=28 -L Backup_2 /dev/sdX1
> Be sure the stripe geometry matches the way the raid controller is
> set up.
> 
> You know the size of your acls, so you can probably do some testing
> to find out how well 512-byte inodes keep ACLs in-line.


Hi Eric,

This is a reply to an email from you sent 7 months ago ...

How could I do the testing you were proposing? How can I find out if my
512-byte inodes keep our ACLs in-line?

I am going to create a similar new RAID set, and wanted to check this
beforehand on the one already in production.

Thanks,
Richard

-- 
Richard Ems       mail: Richard.Ems@Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924
http://www.cape-horn-eng.com


* Re: creating a new 80 TB XFS
  2012-10-01 14:28   ` Richard Ems
@ 2012-10-01 14:36     ` Richard Ems
  2012-10-01 14:39     ` Eric Sandeen
  1 sibling, 0 replies; 21+ messages in thread
From: Richard Ems @ 2012-10-01 14:36 UTC (permalink / raw)
  To: Richard Ems; +Cc: Eric Sandeen, xfs

On 10/01/2012 04:28 PM, Richard Ems wrote:
> On 02/24/2012 04:17 PM, Eric Sandeen wrote:
>>> *MKFS*
>>>> We also heavily use ACLs for almost all of our files. Christoph Hellwig
>>>> suggested in a previous mail to use "-i size=512" on XFS creation, so my
>>>> mkfs.xfs would look something like:
>>>>
>>>> mkfs.xfs -i size=512 -d su=stripe_size,sw=28 -L Backup_2 /dev/sdX1
>> Be sure the stripe geometry matches the way the raid controller is
>> set up.
>>
>> You know the size of your acls, so you can probably do some testing
>> to find out how well 512-byte inodes keep ACLs in-line.
> 
> 
> Hi Eric,
> 
> This is a reply to an email from you sent 7 months ago ...
> 
> How could I do the testing you were proposing? How can I find out if my
> 512-byte inodes keep our ACLs in-line?
> 
> I am going to create a similar new RAID set, and wanted to check this
> before on the one already in production.
> 
> Thanks,
> Richard
> 

Hi again,

If the "method" is to use "xfs_bmap -a ..." and check for "no extents",
then I found it !

Thanks, sorry for the noise,
Richard

-- 
Richard Ems       mail: Richard.Ems@Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924
http://www.cape-horn-eng.com


* Re: creating a new 80 TB XFS
  2012-10-01 14:28   ` Richard Ems
  2012-10-01 14:36     ` Richard Ems
@ 2012-10-01 14:39     ` Eric Sandeen
  2012-10-01 14:45       ` Richard Ems
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Sandeen @ 2012-10-01 14:39 UTC (permalink / raw)
  To: Richard Ems; +Cc: xfs

On 10/1/12 9:28 AM, Richard Ems wrote:
> On 02/24/2012 04:17 PM, Eric Sandeen wrote:
>>> *MKFS*
>>>> We also heavily use ACLs for almost all of our files. Christoph Hellwig
>>>> suggested in a previous mail to use "-i size=512" on XFS creation, so my
>>>> mkfs.xfs would look something like:
>>>>
>>>> mkfs.xfs -i size=512 -d su=stripe_size,sw=28 -L Backup_2 /dev/sdX1
>> Be sure the stripe geometry matches the way the raid controller is
>> set up.
>>
>> You know the size of your acls, so you can probably do some testing
>> to find out how well 512-byte inodes keep ACLs in-line.
> 
> 
> Hi Eric,
> 
> This is a reply to an email from you sent 7 months ago ...
> 
> How could I do the testing you were proposing? How can I find out if my
> 512-byte inodes keep our ACLs in-line?
> 
> I am going to create a similar new RAID set, and wanted to check this
> before on the one already in production.

you can use the xfs_bmap tool to map the attribute fork by using the "-a" option.

If it lists any block numbers, then it's outside the inode.

If you have varying sizes of acls, you'd just iterate over the fs to see what you've got.
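
A rough sketch of such an iteration (the mount point is illustrative;
"no extents" is what xfs_bmap -a prints when the attribute fork is
empty or held inside the inode):

  find /mount_point -xdev -type f | while read -r f; do
      xfs_bmap -a "$f" | grep -q 'no extents' || echo "attrs outside inode: $f"
  done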

-Eric

> Thanks,
> Richard
> 


* Re: creating a new 80 TB XFS
  2012-10-01 14:39     ` Eric Sandeen
@ 2012-10-01 14:45       ` Richard Ems
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Ems @ 2012-10-01 14:45 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On 10/01/2012 04:39 PM, Eric Sandeen wrote:
> you can use the xfs_bmap tool to map the attribute fork by using the "-a" option.
> 
> If it lists any block numbers, then it's outside the inode.
> 
> If you have varying sizes of acls, you'd just iterate over the fs to see what you've got.

Ok, many thanks Eric!

cheers,
Richard


-- 
Richard Ems       mail: Richard.Ems@Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924
http://www.cape-horn-eng.com

