* Question regarding XFS on LVM over hardware RAID.
From: C. Morgan Hamill @ 2014-01-29 14:26 UTC
To: xfs

Howdy folks,

I understand that XFS should have its stripe unit and stripe width
configured according to the underlying RAID device when using LVM, but
I was wondering whether this is still the case when a given
XFS-formatted logical volume takes up only part of the available space
on the RAID. In particular, I could imagine that stripe width would
need to be modified proportionally with the decrease in filesystem
size. My intuition says that's false, but I wanted to check with folks
who know for sure.

Thanks for any help!
--
Morgan Hamill

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Question regarding XFS on LVM over hardware RAID.
From: Eric Sandeen @ 2014-01-29 15:07 UTC
To: C. Morgan Hamill, xfs

On 1/29/14, 8:26 AM, C. Morgan Hamill wrote:
> Howdy folks,
>
> I understand that XFS should have its stripe unit and stripe width
> configured according to the underlying RAID device when using LVM,
> but I was wondering whether this is still the case when a given
> XFS-formatted logical volume takes up only part of the available
> space on the RAID. In particular, I could imagine that stripe width
> would need to be modified proportionally with the decrease in
> filesystem size. My intuition says that's false, but I wanted to
> check with folks who know for sure.

The stripe unit and width are units of geometry of the underlying
storage; a filesystem will span some number of stripe units, depending
on its size.

So no, the filesystem's notion of stripe geometry does not change with
the filesystem size.

You do want to make sure that stripe geometry is correct and aligned
from top to bottom.

I helped write up the RHEL storage admin guide, and there are some
nice words about geometry and alignment in there:

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-iolimits.html

(Hopefully this is available w/o login, I think it is)

-Eric
* Re: Question regarding XFS on LVM over hardware RAID.
From: C. Morgan Hamill @ 2014-01-29 19:11 UTC
To: xfs

Thanks for the quick reply.

Excerpts from Eric Sandeen's message of 2014-01-29 10:07:15 -0500:
> The stripe unit and width are units of geometry of the underlying
> storage; a filesystem will span some number of stripe units, depending
> on its size.
>
> So no, the filesystem's notion of stripe geometry does not change
> with the filesystem size.
>
> You do want to make sure that stripe geometry is correct and aligned
> from top to bottom.

Just to make sure I've understood: for 3 14-disk RAID 6 groups striped
together into a single RAID 60, with stripe units of 128k, split up
into some number of LVM logical volumes, I'd create the filesystems
with the following:

  mkfs.xfs -d su=128k,sw=36 ...

for all of the filesystems, regardless of how many and what size they
were. Does that sound right?

> I helped write up the RHEL storage admin guide, and there are some
> nice words about geometry and alignment in there:
>
> https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-iolimits.html

Thanks for the resource!
--
Morgan Hamill
* Re: Question regarding XFS on LVM over hardware RAID.
From: Stan Hoeppner @ 2014-01-29 23:55 UTC
To: C. Morgan Hamill, xfs

On 1/29/2014 1:11 PM, C. Morgan Hamill wrote:
> Thanks for the quick reply.
>
> Excerpts from Eric Sandeen's message of 2014-01-29 10:07:15 -0500:
>> The stripe unit and width are units of geometry of the underlying
>> storage; a filesystem will span some number of stripe units,
>> depending on its size.
>>
>> So no, the filesystem's notion of stripe geometry does not change
>> with the filesystem size.
>>
>> You do want to make sure that stripe geometry is correct and aligned
>> from top to bottom.
>
> Just to make sure I've understood: for 3 14-disk RAID 6 groups
> striped together into a single RAID 60, with stripe units of 128k,
> split up into some number of LVM logical volumes, I'd create the
> filesystems with the following:
>
>   mkfs.xfs -d su=128k,sw=36 ...

This is not correct. You must align to either the outer stripe or the
inner stripe when using a nested array. In this case it appears your
inner stripe is RAID6 su 128KB * sw 12 = 1536KB. You did not state
your outer RAID0 stripe geometry. Which one you align to depends
entirely on your workload.

However, given that you currently intend to assemble one large array
from 3 smaller arrays, then immediately carve it into smaller pieces,
it seems that RAID60 is probably not the correct architecture for your
workload. RAID60 is suitable for very large streaming write/read
workloads where you are evenly distributing filesystem blocks across a
very large spindle count, with a deterministic IO pattern, and with no
RMW. It is not very suitable for consolidation workloads, as you seem
to be describing here.

Everything starts and ends with the workload. You always design the
storage to meet the needs of the workload, not the other way round.
You seem to be designing your system from the storage up. This is
often a recipe for disaster.

Please describe your workload in more detail so we can provide better,
more detailed advice.

--
Stan
* Re: Question regarding XFS on LVM over hardware RAID.
From: C. Morgan Hamill @ 2014-01-30 14:28 UTC
To: stan; Cc: xfs

First, thanks very much for your help. We're weaning ourselves off
unnecessarily expensive storage and as such I unfortunately haven't
had as much experience with physical filesystems as I'd like. I am
also unfamiliar with XFS. I appreciate the help immensely.

Excerpts from Stan Hoeppner's message of 2014-01-29 18:55:48 -0500:
> This is not correct. You must align to either the outer stripe or
> the inner stripe when using a nested array. In this case it appears
> your inner stripe is RAID6 su 128KB * sw 12 = 1536KB. You did not
> state your outer RAID0 stripe geometry. Which one you align to
> depends entirely on your workload.

Ahh, this makes sense; it had occurred to me that something like this
might be the case. I'm not exactly sure what you mean by inner and
outer; I can imagine it going both ways.

Just to clarify, it looks like this:

XFS       | XFS       | XFS       | XFS
---------------------------------------------------------
LVM volume group
---------------------------------------------------------
RAID 0
---------------------------------------------------------
RAID 6 (14 disks) | RAID 6 (14 disks) | RAID 6 (14 disks)
---------------------------------------------------------
42 4TB SAS disks

...more or less. I agree that it's quite weird, but I'll describe the
workload and the constraints.

We're using commercial backup software to provide backup needs for the
University I work at (CrashPlan Pro enterprisey whathaveyou server).
We've got perhaps 1200 or so user desktops and a few hundred servers
on top of that, all of which currently adds up to just under 100TB on
the old backup system which we're moving away from (IBM Tivoli). So
this archive will be our primary store for on-site backups.

CrashPlan is more or less continually transferring some amount of data
from clients to itself, which it does all at once in a bundle after
determining what's changed. It ends up storing archives on disk as
files which look to max out at 4GB each before it opens up the next
one. Writes are probably more important than reads, as restores are
relatively infrequent, so I'd like to optimize for writes. I expect
the bottleneck to be IO, as the campus is predominantly 1Gbps
throughout and will become 10Gbps in the not-that-distant future, most
likely. I can virtually guarantee CPU will not be the bottleneck.

Now, here are the constraints, which is why I was planning on setting
things up as above:

- This is a budget job, so sane things like RAID 10 are out. RAID 6
  or 60 are (as far as I can tell, correct me if I'm wrong) our only
  real options here, as anything else either sacrifices too much
  storage or is too susceptible to failure from UREs.

- I need to expose, in the end, three-ish (two or four would be OK)
  filesystems to the backup software, which should come fairly close
  to minimizing the effects of the archive maintenance jobs (integrity
  checks, mostly). CrashPlan will spawn 2 jobs per store point, so a
  max of 8 at any given time should be a nice balance between
  under-utilizing and saturating the IO.

So I had thought LVM over RAID 60 would make sense because it would
give me the option of leaving a bit of disk unallocated and being able
to tweak filesystem sizes a bit as time goes on.

Now that I think of it, though, perhaps something like 2 or 3 RAID6
volumes would make more sense, with XFS directly on top of them. In
that case I have to balance the number of volumes against the loss of
2 parity disks each, however.

I'm not sure how best to proceed; any advice would be invaluable.
--
Morgan Hamill
* Re: Question regarding XFS on LVM over hardware RAID.
From: Dave Chinner @ 2014-01-30 20:28 UTC
To: C. Morgan Hamill; Cc: stan, xfs

On Thu, Jan 30, 2014 at 09:28:45AM -0500, C. Morgan Hamill wrote:
> First, thanks very much for your help. We're weaning ourselves off
> unnecessarily expensive storage and as such I unfortunately haven't
> had as much experience with physical filesystems as I'd like. I am
> also unfamiliar with XFS. I appreciate the help immensely.
>
> Excerpts from Stan Hoeppner's message of 2014-01-29 18:55:48 -0500:
>> This is not correct. You must align to either the outer stripe or
>> the inner stripe when using a nested array. In this case it appears
>> your inner stripe is RAID6 su 128KB * sw 12 = 1536KB. You did not
>> state your outer RAID0 stripe geometry. Which one you align to
>> depends entirely on your workload.
>
> Ahh, this makes sense; it had occurred to me that something like this
> might be the case. I'm not exactly sure what you mean by inner and
> outer; I can imagine it going both ways.
>
> Just to clarify, it looks like this:
>
> XFS       | XFS       | XFS       | XFS
> ---------------------------------------------------------
> LVM volume group
> ---------------------------------------------------------
> RAID 0
> ---------------------------------------------------------
> RAID 6 (14 disks) | RAID 6 (14 disks) | RAID 6 (14 disks)
> ---------------------------------------------------------
> 42 4TB SAS disks

So optimised for sequential IO. The time-honoured method of setting up
XFS for this, if the workload is large files, is to use a stripe unit
that is equal to the width of the underlying RAID6 volumes with a
stripe width of 3. That way XFS tries to align files to the start of
each RAID6 volume, and allocate in full RAID6 stripe chunks. This
mostly avoids RMW cycles for large files and sequential IO. i.e.
su = 1536k, sw = 3.

> ...more or less.
>
> I agree that it's quite weird, but I'll describe the workload and the
> constraints.

[snip]

summary: concurrent (initially slow) sequential writes of ~4GB files.

> Now, here are the constraints, which is why I was planning on setting
> things up as above:
>
> - This is a budget job, so sane things like RAID 10 are out. RAID 6
>   or 60 are (as far as I can tell, correct me if I'm wrong) our only
>   real options here, as anything else either sacrifices too much
>   storage or is too susceptible to failure from UREs.

RAID6 is fine for this.

> - I need to expose, in the end, three-ish (two or four would be OK)
>   filesystems to the backup software, which should come fairly close
>   to minimizing the effects of the archive maintenance jobs
>   (integrity checks, mostly). CrashPlan will spawn 2 jobs per store
>   point, so a max of 8 at any given time should be a nice balance
>   between under-utilizing and saturating the IO.

So concurrency is up to 8 files being written at a time. That's pretty
much on the money for striped RAID. Much more than this and you end up
with performance being limited by seeking on the slowest disk in the
RAID sets.

> So I had thought LVM over RAID 60 would make sense because it would
> give me the option of leaving a bit of disk unallocated and being
> able to tweak filesystem sizes a bit as time goes on.

*nod*

And it allows you, in future, to add more disks and grow across them
via linear concatenation of more RAID60 LUNs of the same layout...

> Now that I think of it, though, perhaps something like 2 or 3 RAID6
> volumes would make more sense, with XFS directly on top of them. In
> that case I have to balance the number of volumes against the loss of
> 2 parity disks each, however.

Probably not worth the complexity.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
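Dave's su/sw suggestion follows directly from the array arithmetic. A
small sketch of that arithmetic (plain shell; the constants are the
ones under discussion in this thread, nothing else is assumed):

```shell
# Geometry from this thread: 3 x (14-disk RAID6) legs, 128 KiB chunk per disk.
chunk_kib=128                    # RAID6 stripe unit (per-disk chunk)
data_disks=$((14 - 2))           # RAID6: 14 disks minus 2 parity disks
legs=3                           # RAID6 arrays striped together as RAID60

inner_kib=$((chunk_kib * data_disks))   # width of one RAID6 leg
outer_kib=$((inner_kib * legs))         # full RAID60 stripe

# Dave's mapping: XFS su = one whole RAID6 leg, sw = number of legs.
echo "mkfs.xfs -d su=${inner_kib}k,sw=${legs}   # full stripe = ${outer_kib} KiB"
```

With these inputs it reproduces su=1536k, sw=3 and the 4608 KiB full
stripe that the rest of the thread builds on.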
* Re: Question regarding XFS on LVM over hardware RAID.
From: Stan Hoeppner @ 2014-01-31 5:58 UTC
To: Dave Chinner, C. Morgan Hamill; Cc: xfs

On 1/30/2014 2:28 PM, Dave Chinner wrote:
> On Thu, Jan 30, 2014 at 09:28:45AM -0500, C. Morgan Hamill wrote:
>> First, thanks very much for your help. We're weaning ourselves off
>> unnecessarily expensive storage and as such I unfortunately haven't
>> had as much experience with physical filesystems as I'd like. I am
>> also unfamiliar with XFS. I appreciate the help immensely.
>>
>> Excerpts from Stan Hoeppner's message of 2014-01-29 18:55:48 -0500:
>>> This is not correct. You must align to either the outer stripe or
>>> the inner stripe when using a nested array. In this case it
>>> appears your inner stripe is RAID6 su 128KB * sw 12 = 1536KB. You
>>> did not state your outer RAID0 stripe geometry. Which one you
>>> align to depends entirely on your workload.
>>
>> Ahh, this makes sense; it had occurred to me that something like
>> this might be the case. I'm not exactly sure what you mean by inner
>> and outer; I can imagine it going both ways.
>>
>> Just to clarify, it looks like this:
>>
>> XFS       | XFS       | XFS       | XFS
>> ---------------------------------------------------------
>> LVM volume group
>> ---------------------------------------------------------
>> RAID 0
>> ---------------------------------------------------------
>> RAID 6 (14 disks) | RAID 6 (14 disks) | RAID 6 (14 disks)
>> ---------------------------------------------------------
>> 42 4TB SAS disks

RAID60 is a nested RAID level, just like RAID10 and RAID50. It is a
stripe, or RAID0, across multiple primary array types, RAID6 in this
case. The stripe width of each 'inner' RAID6 becomes the stripe unit
of the 'outer' RAID0 array:

RAID6 geometry 128KB * 12 = 1536KB
RAID0 geometry 1536KB * 3 = 4608KB

If you are creating your RAID60 array with a proprietary hardware
RAID/SAN management utility, it may not clearly show you the resulting
nested geometry I've demonstrated above, which is correct for your
RAID60.

It is possible with software RAID to continue nesting stripe upon
stripe to build infinitely large nested arrays. It is not practical
to do so for many reasons, but I'll not express those here as it is
out of scope for this discussion. I am simply attempting to explain
how nested RAID levels are constructed.

> So optimised for sequential IO. The time-honoured method of setting
> up XFS for this, if the workload is large files, is to use a stripe
> unit that is equal to the width of the underlying RAID6 volumes with
> a stripe width of 3. That way XFS tries to align files to the start
> of each RAID6 volume, and allocate in full RAID6 stripe chunks. This
> mostly avoids RMW cycles for large files and sequential IO. i.e.
> su = 1536k, sw = 3.

As Dave demonstrates, your hardware geometry is 1536*3=4608KB. Thus,
when you create your logical volumes they each need to start and end
on a 4608KB boundary, and be evenly divisible by 4608KB. This will
ensure that all of your logical volumes are aligned to the RAID60
geometry. When formatting the LVs with XFS you will use:

~# mkfs.xfs -d su=1536k,sw=3 /dev/[lv_device_path]

This aligns XFS to the RAID60 geometry. Geometry alignment must be
maintained throughout the entire storage stack. If a single layer is
not aligned properly, every layer will be misaligned. When this
occurs performance will suffer, and could suffer tremendously.

You'll want to add "inode64" to your fstab mount options for these
filesystems. This has nothing to do with geometry, but with how XFS
allocates inodes and how/where files are written to AGs. It is the
default in very recent kernels, but I don't know in which it was made
so.

>> ...more or less.
>>
>> I agree that it's quite weird, but I'll describe the workload and
>> the constraints.
>
> [snip]
>
> summary: concurrent (initially slow) sequential writes of ~4GB files.
>
>> Now, here are the constraints, which is why I was planning on
>> setting things up as above:
>>
>> - This is a budget job, so sane things like RAID 10 are out. RAID 6
>>   or 60 are (as far as I can tell, correct me if I'm wrong) our only
>>   real options here, as anything else either sacrifices too much
>>   storage or is too susceptible to failure from UREs.
>
> RAID6 is fine for this.
>
>> - I need to expose, in the end, three-ish (two or four would be OK)
>>   filesystems to the backup software, which should come fairly close
>>   to minimizing the effects of the archive maintenance jobs
>>   (integrity checks, mostly). CrashPlan will spawn 2 jobs per store
>>   point, so a max of 8 at any given time should be a nice balance
>>   between under-utilizing and saturating the IO.
>
> So concurrency is up to 8 files being written at a time. That's
> pretty much on the money for striped RAID. Much more than this and
> you end up with performance being limited by seeking on the slowest
> disk in the RAID sets.
>
>> So I had thought LVM over RAID 60 would make sense because it would
>> give me the option of leaving a bit of disk unallocated and being
>> able to tweak filesystem sizes a bit as time goes on.
>
> *nod*
>
> And it allows you, in future, to add more disks and grow across them
> via linear concatenation of more RAID60 LUNs of the same layout...
>
>> Now that I think of it, though, perhaps something like 2 or 3 RAID6
>> volumes would make more sense, with XFS directly on top of them. In
>> that case I have to balance the number of volumes against the loss
>> of 2 parity disks each, however.
>
> Probably not worth the complexity.

You'll lose 2 disks to parity per RAID6 regardless. Three standalone
arrays cost you 6 disks, the same as making a RAID60 of those 3
arrays.

The problem you'll have with XFS directly on RAID6 is the inability to
easily expand. The only way to do it is by adding disks to each RAID6
and having the controller reshape the array. Reshapes with 4TB drives
will take more than a day to complete, and the array will be very slow
during the reshape. Every time you reshape the array your geometry
will change. XFS has the ability to align to a new geometry using a
mount option, but it's best to avoid this.

LVM typically affords you much more flexibility here than your
RAID/SAN controller. Just be mindful that when you expand you need to
keep your geometry, i.e. stripe width, the same. Let's say some time
in the future you want to expand but can only afford, or only need,
one 14-disk chassis at the time, not another 3 for another RAID60.
Here you could create a single 14-drive RAID6 with stripe geometry
384KB * 12 = 4608KB.

You could then carve it up into 1-3 pieces, each aligned to the
start/end of a 4608KB stripe and evenly divisible by 4608KB, and add
them to one or more of your LVs/XFS filesystems. This maintains the
same overall stripe width geometry as the RAID60 to which all of your
XFS filesystems are already aligned. The volume manager in your RAID
hardware may not, probably won't, allow doing this type of expansion
after the fact, meaning after the original RAID60 has been created.

If you remember only 3 words of my post, remember:

Alignment, alignment, alignment.

For a RAID60 setup such as you're describing, you'll want to use LVM,
and you must maintain consistent geometry throughout the stack, from
array to filesystem. This means every physical volume you create must
start and end on a 4608KB stripe boundary. Every volume group you
create must do the same. And every logical volume must also start and
end on a 4608KB stripe boundary. If you don't verify that each layer
is aligned, all of your XFS filesystems will likely be unaligned. And
again, performance will suffer, possibly horribly so.

--
Stan
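Stan's boundary rule ("start and end on a 4608KB boundary, and be
evenly divisible by 4608KB") can be checked mechanically. A minimal
sketch: the `lv_aligned` helper name and the example offsets are
made up for illustration, not taken from a real layout:

```shell
# Check that an LV start offset and size (both in KiB) sit on whole
# 4608 KiB RAID60 stripes.
stripe_kib=4608

lv_aligned() {
    # $1 = start offset in KiB, $2 = size in KiB
    [ $(($1 % stripe_kib)) -eq 0 ] && [ $(($2 % stripe_kib)) -eq 0 ]
}

# Hypothetical examples:
lv_aligned 0 $((4608 * 1000))    && echo "aligned"
lv_aligned 4096 $((4608 * 1000)) || echo "a 4 MiB start is NOT stripe-aligned"
```

The second example is the trap Stan is warning about: an offset that
looks "round" in power-of-two terms (4 MiB) is not a multiple of the
4608 KiB stripe.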
* Re: Question regarding XFS on LVM over hardware RAID.
From: C. Morgan Hamill @ 2014-01-31 21:14 UTC
To: stan; Cc: xfs

Excerpts from Stan Hoeppner's message of 2014-01-31 00:58:46 -0500:
> RAID60 is a nested RAID level, just like RAID10 and RAID50. It is a
> stripe, or RAID0, across multiple primary array types, RAID6 in this
> case. The stripe width of each 'inner' RAID6 becomes the stripe unit
> of the 'outer' RAID0 array:
>
> RAID6 geometry 128KB * 12 = 1536KB
> RAID0 geometry 1536KB * 3 = 4608KB
>
> If you are creating your RAID60 array with a proprietary hardware
> RAID/SAN management utility, it may not clearly show you the
> resulting nested geometry I've demonstrated above, which is correct
> for your RAID60.
>
> It is possible with software RAID to continue nesting stripe upon
> stripe to build infinitely large nested arrays. It is not practical
> to do so for many reasons, but I'll not express those here as it is
> out of scope for this discussion. I am simply attempting to explain
> how nested RAID levels are constructed.
>
>> So optimised for sequential IO. The time-honoured method of setting
>> up XFS for this, if the workload is large files, is to use a stripe
>> unit that is equal to the width of the underlying RAID6 volumes with
>> a stripe width of 3. That way XFS tries to align files to the start
>> of each RAID6 volume, and allocate in full RAID6 stripe chunks.
>> This mostly avoids RMW cycles for large files and sequential IO.
>> i.e. su = 1536k, sw = 3.

Makes perfect sense.

> As Dave demonstrates, your hardware geometry is 1536*3=4608KB. Thus,
> when you create your logical volumes they each need to start and end
> on a 4608KB boundary, and be evenly divisible by 4608KB. This will
> ensure that all of your logical volumes are aligned to the RAID60
> geometry.
>
> When formatting the LVs with XFS you will use:
>
> ~# mkfs.xfs -d su=1536k,sw=3 /dev/[lv_device_path]

Noted.

> This aligns XFS to the RAID60 geometry. Geometry alignment must be
> maintained throughout the entire storage stack. If a single layer is
> not aligned properly, every layer will be misaligned. When this
> occurs performance will suffer, and could suffer tremendously.
>
> You'll want to add "inode64" to your fstab mount options for these
> filesystems. This has nothing to do with geometry, but with how XFS
> allocates inodes and how/where files are written to AGs. It is the
> default in very recent kernels, but I don't know in which it was made
> so.

Yes, I was aware of this.

> LVM typically affords you much more flexibility here than your
> RAID/SAN controller. Just be mindful that when you expand you need
> to keep your geometry, i.e. stripe width, the same. Let's say some
> time in the future you want to expand but can only afford, or only
> need, one 14-disk chassis at the time, not another 3 for another
> RAID60. Here you could create a single 14-drive RAID6 with stripe
> geometry 384KB * 12 = 4608KB.
>
> You could then carve it up into 1-3 pieces, each aligned to the
> start/end of a 4608KB stripe and evenly divisible by 4608KB, and add
> them to one or more of your LVs/XFS filesystems. This maintains the
> same overall stripe width geometry as the RAID60 to which all of your
> XFS filesystems are already aligned.

OK, so the upshot is that any additions to the volume group must be
arrays with su*sw=4608k, and all logical volumes and filesystems must
begin and end on multiples of 4608k from the start of the block
device.

As long as these things hold true, is it all right for logical
volumes/filesystems to begin on one physical device and end on
another?

> If you remember only 3 words of my post, remember:
>
> Alignment, alignment, alignment.

Yes, I am hearing you. :-)

> For a RAID60 setup such as you're describing, you'll want to use LVM,
> and you must maintain consistent geometry throughout the stack, from
> array to filesystem. This means every physical volume you create
> must start and end on a 4608KB stripe boundary. Every volume group
> you create must do the same. And every logical volume must also
> start and end on a 4608KB stripe boundary. If you don't verify that
> each layer is aligned, all of your XFS filesystems will likely be
> unaligned. And again, performance will suffer, possibly horribly so.

So, basically, --dataalignment is my friend during pvcreate and
lvcreate.

Thanks so much for your and Dave's help; this has been tremendously
helpful.
--
Morgan Hamill
* Re: Question regarding XFS on LVM over hardware RAID.
From: Stan Hoeppner @ 2014-02-01 21:06 UTC
To: C. Morgan Hamill; Cc: xfs

On 1/31/2014 3:14 PM, C. Morgan Hamill wrote:
> Excerpts from Stan Hoeppner's message of 2014-01-31 00:58:46 -0500:
...
>> LVM typically affords you much more flexibility here than your
>> RAID/SAN controller. Just be mindful that when you expand you need
>> to keep your geometry, i.e. stripe width, the same. Let's say some
>> time in the future you want to expand but can only afford, or only
>> need, one 14-disk chassis at the time, not another 3 for another
>> RAID60. Here you could create a single 14-drive RAID6 with stripe
>> geometry 384KB * 12 = 4608KB.
>>
>> You could then carve it up into 1-3 pieces, each aligned to the
>> start/end of a 4608KB stripe and evenly divisible by 4608KB, and add
>> them to one or more of your LVs/XFS filesystems. This maintains the
>> same overall stripe width geometry as the RAID60 to which all of
>> your XFS filesystems are already aligned.
>
> OK, so the upshot is that any additions to the volume group must be
> arrays with su*sw=4608k, and all logical volumes and filesystems must
> begin and end on multiples of 4608k from the start of the block
> device.
>
> As long as these things hold true, is it all right for logical
> volumes/filesystems to begin on one physical device and end on
> another?

Yes, that's one of the beauties of LVM. However, there are other
reasons you may not want to do this. For example, if you have
allocated space from two different JBOD or SAN units to a single LVM
volume, and you lack multipath connections, a cable, switch, HBA, or
other failure disconnecting one LUN will wreak havoc on your mounted
XFS filesystem. If you have multipath and the storage device
disappears due to some other failure such as backplane, UPS, etc., you
have the same problem.

This isn't a deal breaker. There are many large XFS filesystems in
production that span multiple storage arrays. You just need to be
mindful of your architecture at all times, and it needs to be
documented.

Scenario: XFS unmounts due to an IO error. You're not yet aware an
entire chassis is offline. You can't remount the filesystem, so you
start a destructive xfs_repair thinking that will fix the problem.
Doing so will wreck your filesystem, and you'll likely lose access to
all the files on the offline chassis, with no ability to get them back
short of some magic and a full restore from tape or a D2D backup
server. We had a case similar to this reported a couple of years ago.

>> If you remember only 3 words of my post, remember:
>>
>> Alignment, alignment, alignment.
>
> Yes, I am hearing you. :-)
>
>> For a RAID60 setup such as you're describing, you'll want to use
>> LVM, and you must maintain consistent geometry throughout the stack,
>> from array to filesystem. This means every physical volume you
>> create must start and end on a 4608KB stripe boundary. Every volume
>> group you create must do the same. And every logical volume must
>> also start and end on a 4608KB stripe boundary. If you don't verify
>> that each layer is aligned, all of your XFS filesystems will likely
>> be unaligned. And again, performance will suffer, possibly horribly
>> so.
>
> So, basically, --dataalignment is my friend during pvcreate and
> lvcreate.

If the logical sector size reported by your RAID controller is 512
bytes, then "--dataalignment=9216s" should start your data section on
a RAID60 stripe boundary after the metadata section.

The PhysicalExtentSize should probably also match the 4608KB stripe
width, but this is apparently not possible: PhysicalExtentSize must be
a power of 2 value. I don't know if or how this will affect XFS
aligned write-out. You'll need to consult with someone more
knowledgeable of LVM.

> Thanks so much for your and Dave's help; this has been tremendously
> helpful.

You bet.

--
Stan
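Putting Stan's numbers together, the LVM side might look roughly like
the following. This is a hedged sketch, not from the thread: the
device path, VG/LV names, and sizes are hypothetical and chosen only
to illustrate whole-stripe multiples:

```shell
# Hypothetical device; all sizes follow the 4608 KiB RAID60 stripe.
# Start the PV data area on a stripe boundary (9216 x 512 B sectors):
pvcreate --dataalignment 9216s /dev/sdb

# PE size must be a power of two, so it cannot equal 4608 KiB; pick a
# convenient value and keep LV sizes in whole-stripe multiples instead:
vgcreate -s 4m backupvg /dev/sdb

# 46080 MiB = 10240 x 4608 KiB, a whole number of RAID60 stripes
# (and also a whole number of 4 MiB extents):
lvcreate -L 46080m -n store1 backupvg

# Align XFS to the same geometry:
mkfs.xfs -d su=1536k,sw=3 /dev/backupvg/store1
```

Because each LV size is a multiple of both the extent size and the
stripe, the next LV allocated after it also begins on a stripe
boundary, which is the property Stan keeps stressing.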
* Re: Question regarding XFS on LVM over hardware RAID.
From: Dave Chinner @ 2014-02-02 21:21 UTC
To: Stan Hoeppner; Cc: C. Morgan Hamill, xfs

On Sat, Feb 01, 2014 at 03:06:17PM -0600, Stan Hoeppner wrote:
> On 1/31/2014 3:14 PM, C. Morgan Hamill wrote:
>> So, basically, --dataalignment is my friend during pvcreate and
>> lvcreate.
>
> If the logical sector size reported by your RAID controller is 512
> bytes, then "--dataalignment=9216s" should start your data section on
> a RAID60 stripe boundary after the metadata section.
>
> The PhysicalExtentSize should probably also match the 4608KB stripe
> width, but this is apparently not possible: PhysicalExtentSize must
> be a power of 2 value. I don't know if or how this will affect XFS
> aligned write-out. You'll need to consult with someone more
> knowledgeable of LVM.

You can't do single IOs of that size anyway, so this is where the BBWC
on the RAID controller does its magic and caches sequential IOs until
it has full stripe writes cached....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-02 21:21 ` Dave Chinner @ 2014-02-03 16:12 ` C. Morgan Hamill 2014-02-03 21:41 ` Dave Chinner 0 siblings, 1 reply; 27+ messages in thread From: C. Morgan Hamill @ 2014-02-03 16:12 UTC (permalink / raw) To: Dave Chinner; +Cc: Stan Hoeppner, xfs Excerpts from Dave Chinner's message of 2014-02-02 16:21:52 -0500: > On Sat, Feb 01, 2014 at 03:06:17PM -0600, Stan Hoeppner wrote: > > On 1/31/2014 3:14 PM, C. Morgan Hamill wrote: > > > So, basically, --dataalignment is my friend during pvcreate and > > > lvcreate. > > > > If the logical sector size reported by your RAID controller is 512 > > bytes, then "--dataalignment=9216s" should start your data section on a > > RAID60 stripe boundary after the metadata section. > > > > The PhysicalExtentSize should probably also match the 4608KB stripe > > width, but this is apparently not possible. PhysicalExtentSize must be > > a power of 2 value. I don't know if or how this will affect XFS aligned > > write out. You'll need to consult with someone more knowledgeable of LVM. > > You can't do single IOs of that size, anyway, so this is where the > BBWC on the raid controller does its magic and caches sequential IOs > until it has full stripe writes cached.... So I am probably missing something here, could you clarify? Are you saying that I can't do single IOs of that size (by which I take your meaning to be IOs as small as 9216 sectors) because my RAID controller won't let me (i.e., it will cache anything smaller than the stripe size anyway)? Or are you saying that XFS with these given settings won't make writes that small (which seems false, since I'm essentially telling it to do writes of precisely that size)? I'm a bit unclear on that. In addition, does this in effect mean that when it comes to LVM, extent size makes no difference for alignment purposes? 
So I don't have to worry about anything other than aligning the beginning and ending of logical volumes, volume groups, etc. to 9216 sector multiples? Thanks again! -- Morgan Hamill
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-03 16:12 ` C. Morgan Hamill @ 2014-02-03 21:41 ` Dave Chinner 2014-02-04 8:00 ` Stan Hoeppner 0 siblings, 1 reply; 27+ messages in thread From: Dave Chinner @ 2014-02-03 21:41 UTC (permalink / raw) To: C. Morgan Hamill; +Cc: Stan Hoeppner, xfs On Mon, Feb 03, 2014 at 11:12:39AM -0500, C. Morgan Hamill wrote: > Excerpts from Dave Chinner's message of 2014-02-02 16:21:52 -0500: > > On Sat, Feb 01, 2014 at 03:06:17PM -0600, Stan Hoeppner wrote: > > > On 1/31/2014 3:14 PM, C. Morgan Hamill wrote: > > > > So, basically, --dataalignment is my friend during pvcreate and > > > > lvcreate. > > > > > > If the logical sector size reported by your RAID controller is 512 > > > bytes, then "--dataalignment=9216s" should start your data section on a > > > RAID60 stripe boundary after the metadata section. > > > > > > The PhysicalExtentSize should probably also match the 4608KB stripe > > > width, but this is apparently not possible. PhysicalExtentSize must be > > > a power of 2 value. I don't know if or how this will affect XFS aligned > > > write out. You'll need to consult with someone more knowledgeable of LVM. > > > > You can't do single IOs of that size, anyway, so this is where the > > BBWC on the raid controller does its magic and caches sequential IOs > > until it has full stripe writes cached.... > > So I am probably missing something here, could you clarify? Are you > saying that I can't do single IOs of that size (by which I take your > meaning to be IOs as small as 9216 sectors) because my RAID > controller won't let me (i.e., it will cache anything smaller than the > stripe size anyway)? Typical limitations on IO size are the size of the hardware DMA scatter-gather rings of the HBA/raid controller. For example, the two hardware RAID controllers in my largest test box have limitations of 70 and 80 segments and maximum IO sizes of 280k and 320k. 
And looking at the IO being dispatched with blktrace, I see the maximum size is: 8,80 2 61 0.769857112 44866 D WS 12423408 + 560 [qemu-system-x86] 8,80 2 71 0.769877563 44866 D WS 12423968 + 560 [qemu-system-x86] 8,80 2 72 0.769889767 44866 D WS 12424528 + 560 [qemu-system-x86] ^^^ 560 sectors or 280k. So for this hardware, sequential 280k writes are hitting the BBWC. And because they are sequential, the BBWC is writing them back as full stripe writes after aggregating them in NVRAM. Hence there are no performance diminishing RMW cycles occurring, even though the individual IO size is much smaller than the stripe unit/width.... > Or are you saying that XFS with these given > settings won't make writes that small (which seems false, since I'm > essentially telling it to do writes of precisely that size). I'm a bit > unclear on that. What su/sw tells XFS is how to align allocation of files, so that when we dispatch sequential IO to that file it is aligned to the underlying storage because the extents that the filesystem allocated for it are aligned. This means that if you write exactly one stripe width of data, it will hit each disk exactly once. It might take 10 IOs to get the data to the storage, but it will only hit each disk once. The function of the stripe cache (in software raid) and the BBWC (in hardware RAID) is to prevent RMW cycles while the filesystem/hardware is still flinging data at the RAID lun. Only once the controller has complete stripe widths will it calculate parity and write back the data, thereby avoiding a RMW cycle.... > In addition, does this in effect mean that when it comes to LVM, extent > size makes no difference for alignment purposes? So I don't have to > worry about anything other than aligning the beginning and ending of > logical volumes, volume groups, etc. to 9216 sector multiples? No, you still have to align everything to the underlying storage so that the filesystem on top of the volumes is correctly aligned. 
Where the data will be written (i.e. how the filesystem allocates the underlying blocks) determines the IO alignment of sequential/large user IOs, and that matters far more than the size of the sequential IOs the kernel uses to write the data. Cheers, Dave. -- Dave Chinner david@fromorbit.com
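Dave's blktrace sample can be decoded with the same sector arithmetic used elsewhere in the thread: the "+ 560" field is the request size in 512-byte sectors, which matches the 280k maximum IO size he cites. A minimal sketch:

```shell
# "D WS 12423408 + 560" from blktrace: a write of 560 sectors was
# dispatched; at 512 bytes per sector, that is the per-IO ceiling
# imposed by the controller's scatter-gather limits.
sectors=560
kib=$(( sectors * 512 / 1024 ))
echo "${kib}k"    # prints 280k
```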
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-03 21:41 ` Dave Chinner @ 2014-02-04 8:00 ` Stan Hoeppner 2014-02-18 19:44 ` C. Morgan Hamill 0 siblings, 1 reply; 27+ messages in thread From: Stan Hoeppner @ 2014-02-04 8:00 UTC (permalink / raw) To: Dave Chinner, C. Morgan Hamill; +Cc: xfs On 2/3/2014 3:41 PM, Dave Chinner wrote: > On Mon, Feb 03, 2014 at 11:12:39AM -0500, C. Morgan Hamill wrote: >> Excerpts from Dave Chinner's message of 2014-02-02 16:21:52 -0500: >>> On Sat, Feb 01, 2014 at 03:06:17PM -0600, Stan Hoeppner wrote: >>>> On 1/31/2014 3:14 PM, C. Morgan Hamill wrote: >>>>> So, basically, --dataalignment is my friend during pvcreate and >>>>> lvcreate. >>>> >>>> If the logical sector size reported by your RAID controller is 512 >>>> bytes, then "--dataalignment=9216s" should start your data section on a >>>> RAID60 stripe boundary after the metadata section. >>>> >>>> The PhysicalExtentSize should probably also match the 4608KB stripe >>>> width, but this is apparently not possible. PhysicalExtentSize must be >>>> a power of 2 value. I don't know if or how this will affect XFS aligned >>>> write out. You'll need to consult with someone more knowledgeable of LVM. >>> >>> You can't do single IOs of that size, anyway, so this is where the >>> BBWC on the raid controller does its magic and caches sequential IOs >>> until it has full stripe writes cached.... >> >> So I am probably missing something here, could you clarify? Are you >> saying that I can't do single IOs of that size (by which I take your >> meaning to be IOs as small as 9216 sectors) because my RAID >> controller won't let me (i.e., it will cache anything smaller than the >> stripe size anyway)? > > Typical limitations on IO size are the size of the hardware DMA > scatter-gather rings of the HBA/raid controller. For example, the > two hardware RAID controllers in my largest test box have > limitations of 70 and 80 segments and maximum IO sizes of 280k and > 320k. 
> > And looking at the IO being dispatched with blktrace, I see the > maximum size is: > > 8,80 2 61 0.769857112 44866 D WS 12423408 + 560 [qemu-system-x86] > 8,80 2 71 0.769877563 44866 D WS 12423968 + 560 [qemu-system-x86] > 8,80 2 72 0.769889767 44866 D WS 12424528 + 560 [qemu-system-x86] > ^^^ > > 560 sectors or 280k. So for this hardware, sequential 280k writes > are hitting the BBWC. And because they are sequential, the BBWC is > writing them back as full stripe writes after aggregating them in > NVRAM. Hence there are no performance diminishing RMW cycles > occurring, even though the individual IO size is much smaller than > the stripe unit/width.... > >> Or are you saying that XFS with these given >> settings won't make writes that small (which seems false, since I'm >> essentially telling it to do writes of precisely that size). I'm a bit >> unclear on that. > > What su/sw tells XFS is how to align allocation of files, so that > when we dispatch sequential IO to that file it is aligned to the > underlying storage because the extents that the filesystem allocated > for it are aligned. This means that if you write exactly one stripe > width of data, it will hit each disk exactly once. It might take 10 > IOs to get the data to the storage, but it will only hit each disk > once. > > The function of the stripe cache (in software raid) and the BBWC (in > hardware RAID) is to prevent RMW cycles while the > filesystem/hardware is still flinging data at the RAID lun. Only > once the controller has complete stripe widths will it calculate > parity and write back the data, thereby avoiding a RMW cycle.... ------- >> In addition, does this in effect mean that when it comes to LVM, extent >> size makes no difference for alignment purposes? So I don't have to >> worry about anything other than aligning the beginning and ending of >> logical volumes, volume groups, etc. to 9216 sector multiples? 
> > No, you still have to align everything to the underlying storage so > that the filesystem on top of the volumes is correctly aligned. > Where the data will be written (i.e. how the filesystem allocates the > underlying blocks) determines the IO alignment of sequential/large > user IOs, and that matters far more than the size of the sequential > IOs the kernel uses to write the data. After a little digging and thinking this through... The default PE size is 4MB but up to 16GB with LVM1, and apparently unlimited size with LVM2. It can be a few thousand times larger than any sane stripe width. This makes it pretty clear that PEs exist strictly for volume management operations, used by the LVM tools, but have no relationship to regular write IOs. Thus the PE size need not match nor be evenly divisible by the stripe width. It's not part of the alignment equation. -- Stan
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-04 8:00 ` Stan Hoeppner @ 2014-02-18 19:44 ` C. Morgan Hamill 2014-02-18 23:07 ` Stan Hoeppner 0 siblings, 1 reply; 27+ messages in thread From: C. Morgan Hamill @ 2014-02-18 19:44 UTC (permalink / raw) To: xfs Howdy, sorry for digging up this thread, but I've run into an issue again, and am looking for advice. Excerpts from Stan Hoeppner's message of 2014-02-04 03:00:54 -0500: > After a little digging and thinking this through... > > The default PE size is 4MB but up to 16GB with LVM1, and apparently > unlimited size with LVM2. It can be a few thousand times larger than > any sane stripe width. This makes it pretty clear that PEs exist > strictly for volume management operations, used by the LVM tools, but > have no relationship to regular write IOs. Thus the PE size need not > match nor be evenly divisible by the stripe width. It's not part of the > alignment equation. So in the course of actually going about this, I realized that this actually is not true (I think). Logical volumes can only have sizes that are multiples of the physical extent size (by definition, really), and so there's no way to have logical volumes end on a multiple of the array's stripe width: given my stripe width of 9216s, there are no integer solutions to 2^n mod 9216 = 0. So my question is, then, does it matter if logical volumes (or, really, XFS file systems) actually end right on a multiple of the stripe width, or only that they _begin_ on a multiple of it (leaving a bit of dead space before the next logical volume)? If not, I'll tweak things to ensure my stripe width is a power of 2. Thanks again! -- Morgan Hamill
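Morgan's observation checks out: 9216 = 2^10 * 9 has an odd factor, so no power of two can ever be a multiple of it. A quick sketch verifying this with shell arithmetic:

```shell
# 9216 = 2^10 * 9; the factor of 9 means 2^n mod 9216 is never 0.
found=0
for n in $(seq 1 40); do
    if [ $(( (1 << n) % 9216 )) -eq 0 ]; then
        found=1
    fi
done
echo "$found"    # prints 0: no solutions up to 2^40
```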
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-18 19:44 ` C. Morgan Hamill @ 2014-02-18 23:07 ` Stan Hoeppner 2014-02-20 18:31 ` C. Morgan Hamill 0 siblings, 1 reply; 27+ messages in thread From: Stan Hoeppner @ 2014-02-18 23:07 UTC (permalink / raw) To: C. Morgan Hamill, xfs On 2/18/2014 1:44 PM, C. Morgan Hamill wrote: > Howdy, sorry for digging up this thread, but I've run into an issue > again, and am looking for advice. > > Excerpts from Stan Hoeppner's message of 2014-02-04 03:00:54 -0500: >> After a little digging and thinking this through... >> >> The default PE size is 4MB but up to 16GB with LVM1, and apparently >> unlimited size with LVM2. It can be a few thousand times larger than >> any sane stripe width. This makes it pretty clear that PEs exist >> strictly for volume management operations, used by the LVM tools, but >> have no relationship to regular write IOs. Thus the PE size need not >> match nor be evenly divisible by the stripe width. It's not part of the >> alignment equation. > > So in the course of actually going about this, I realized that this > actually is not true (I think). Two different issues. > Logical volumes can only have sizes that are multiple of the physical > extent size (by definition, really), and so there's no way to have > logical volumes end on a multiple of the array's stripe width, given my > stripe width of 9216s, there doesn't seem to be an abundance of integer > solutions to 2^n mod 9216 = 0. > > So my question is, then, does it matter if logical volumes (or, really, > XFS file systems) actually end right on a multiple of the stripe width, > or only that it _begin_ on a multiple of it (leaving a bit of dead space > before the next logical volume)? Create each LV starting on a stripe boundary. There will be some unallocated space between LVs. Use the mkfs.xfs -d size= option to create your filesystems inside of each LV such that the filesystem total size is evenly divisible by the stripe width. 
This results in an additional small amount of unallocated space within, and at the end of, each LV. It's nice if you can line everything up, but when using RAID6 and one or two bays for hot spares, one rarely ends up with 8 or 16 data spindles. > If not, I'll tweak things to ensure my stripe width is a power of 2. That's not possible with 12 data spindles per RAID, not possible with 42 drives in 3 chassis. Not without a bunch of idle drives. I still don't understand why you believe you need LVM in the mix, and more than one filesystem. > - I need to expose, in the end, three-ish (two or four would be OK) > filesystems to the backup software, which should come fairly close > to minimizing the effects of the archive maintenance jobs (integrity > checks, mostly). CrashPlan will spawn 2 jobs per store point, so > a max of 8 at any given time should be a nice balance between > under-utilizing and saturating the IO. Backup software is unaware of mount points. It uses paths just like every other program. The number of XFS filesystems is irrelevant to "minimizing the effects of the archive maintenance jobs". You cannot bog down XFS. You will bog down the drives no matter how many filesystems when using RAID60. Here is what you should do: Format the RAID60 directly with XFS. Create 3 or 4 directories for CrashPlan to use as its "store points". If you need to expand in the future, as I said previously, simply add another 14 drive RAID6 chassis, format it directly with XFS, mount it at an appropriate place in the directory tree and give that path to CrashPlan. Does it have a limit on the number of "store points"? -- Stan
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-18 23:07 ` Stan Hoeppner @ 2014-02-20 18:31 ` C. Morgan Hamill 2014-02-21 3:33 ` Stan Hoeppner 0 siblings, 1 reply; 27+ messages in thread From: C. Morgan Hamill @ 2014-02-20 18:31 UTC (permalink / raw) To: xfs Quoting Stan Hoeppner (2014-02-18 18:07:24) > Create each LV starting on a stripe boundary. There will be some > unallocated space between LVs. Use the mkfs.xfs -d size= option to > create your filesystems inside of each LV such that the filesystem total > size is evenly divisible by the stripe width. This results in an > additional small amount of unallocated space within, and at the end of, > each LV. Of course, this occurred to me just after sending the message... ;) > It's nice if you can line everything up, but when using RAID6 and one or > two bays for hot spares, one rarely ends up with 8 or 16 data spindles. > > > If not, I'll tweak things to ensure my stripe width is a power of 2. > > That's not possible with 12 data spindles per RAID, not possible with 42 > drives in 3 chassis. Not without a bunch of idle drives. The closest I can come is with 4 RAID 6 arrays of 10 disks each, then striped over: 8 * 128k = 1024k 1024k * 4 = 4096k Which leaves me with 5 disks unused. I might be able to live with that if it makes things work better. Sounds like I won't have to. > I still don't understand why you believe you need LVM in the mix, and > more than one filesystem. > Backup software is unaware of mount points. It uses paths just like > every other program. The number of XFS filesystems is irrelevant to > "minimizing the effects of the archive maintenance jobs". You cannot > bog down XFS. You will bog down the drives no matter how many > filesystems when using RAID60. 
A limitation of the software in question is that placing multiple archive paths onto a single filesystem is a bit ugly: the software does not let you specify a maximum size for the archive paths, and so will think all of them are the size of the filesystem. This isn't an issue in isolation, but we need to make use of a data-balancing feature the software has, which will not work if we place multiple archive paths on a single filesystem. It's a stupid issue to have, but it is what it is. > Here is what you should do: > > Format the RAID60 directly with XFS. Create 3 or 4 directories for > CrashPlan to use as its "store points". If you need to expand in the > future, as I said previously, simply add another 14 drive RAID6 chassis, > format it directly with XFS, mount it at an appropriate place in the > directory tree and give that path to CrashPlan. Does it have a limit on > the number of "store points"? Yes, this is what I *want* to do. There's a limit to the number of store points, but it's large, so this would work fine if not for the multiple-stores-on-one-filesystem issue. Which is frustrating. The *only* reason for LVM in the middle is to allow some flexibility of sizing without dealing with the annoyances of the partition table. I want to intentionally under-provision to start with because we are using a small corner of this storage for a separate purpose but do not know precisely how much yet. LVM lets me leave, say, 10TB empty, until I know exactly how big things are going to be. It's a pile of little annoyances, but so it goes with these kinds of things. It sounds like the little empty spots method will be fine though. Thanks, yet again, for all your help. -- Morgan Hamill
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-20 18:31 ` C. Morgan Hamill @ 2014-02-21 3:33 ` Stan Hoeppner 2014-02-21 8:57 ` Emmanuel Florac 2014-02-21 19:17 ` C. Morgan Hamill 0 siblings, 2 replies; 27+ messages in thread From: Stan Hoeppner @ 2014-02-21 3:33 UTC (permalink / raw) To: C. Morgan Hamill, xfs On 2/20/2014 12:31 PM, C. Morgan Hamill wrote: > Quoting Stan Hoeppner (2014-02-18 18:07:24) >> Create each LV starting on a stripe boundary. There will be some >> unallocated space between LVs. Use the mkfs.xfs -d size= option to >> create your filesystems inside of each LV such that the filesystem total >> size is evenly divisible by the stripe width. This results in an >> additional small amount of unallocated space within, and at the end of, >> each LV. > > Of course, this occurred to me just after sending the message... ;) That's the right way to do that, but you really don't want to do this with LVM. It's just a mess. You can easily do this with a single XFS filesystem and a concatenation, with none of these alignment and sizing headaches. Read on. ... > 8 * 128k = 1024k > 1024k * 4 = 4096k > > Which leaves me with 5 disks unused. I might be able to live with that > if it makes things work better. Sounds like I won't have to. Forget all of this. Forget RAID60. I think you'd be best served by a concatenation. You have a RAID chassis with 15 drives and two 15 drive JBODs daisy chained to it, all 4TB drives, correct? Your original setup was 1 spare and one 14 drive RAID6 array per chassis, 12 data spindles. Correct? Stick with that. Export each RAID6 as a distinct LUN to the host. Make an mdadm --linear array of the 3 RAID6 LUNs/devices. Then format the md linear device, e.g. /dev/md0, using the geometry of a single RAID6 array. We want to make sure each allocation group is wholly contained within a RAID6 array. You have 48TB per array and 3 arrays, 144TB total. 1TB = 1000^4 bytes, while XFS deals in TebiBytes, i.e. 1024^4 bytes. Max agsize is 1TiB. 
So to get exactly 48 AGs per array, 144 total AGs, we'd format with # mkfs.xfs -d su=128k,sw=12,agcount=144 The --linear array, or generically concatenation, stitches the RAID6 arrays together end-to-end. Here the filesystem starts at LBA0 on the first array and ends on the last LBA of the 3rd array, hence "linear". XFS performs all operations at the AG level. Since each AG sits atop only one RAID6, the filesystem alignment geometry is that of a single RAID6. Any individual write will peak at ~1.2GB/s. Since you're limited by the network to 100MB/s throughput this shouldn't be an issue. Using an md linear array you can easily expand in the future without all the LVM headaches, by simply adding another identical RAID6 array to the linear array (see mdadm grow) and then growing the filesystem with xfs_growfs. In doing so, you will want to add the new chassis before the filesystem reaches ~70% capacity. If you let it grow past that point, most of your new writes may go to only the new RAID6 where the bulk of your large free space extents now exist. This will create an IO hotspot on the new chassis, while the original 3 will see fewer writes. Also, don't forget to mount with the "inode64" option in fstab. ... > A limitation of the software in question is that placing multiple > archive paths onto a single filesystem is a bit ugly: the software does > not let you specify a maximum size for the archive paths, and so will > think all of them are the size of the filesystem. This isn't an issue > in isolation, but we need to make use of a data-balancing feature the > software has, which will not work if we place multiple archive paths on > a single filesystem. It's a stupid issue to have, but it is what it is. So the problem is capacity reported to the backup application. Easy to address, see below. ... > Yes, this is what I *want* to do. 
There's a limit to the number of > store points, but it's large, so this would work fine if not for the > multiple-stores-on-one-filesystem issue. Which is frustrating. ... > The *only* reason for LVM in the middle is to allow some flexibility of > sizing without dealing with the annoyances of the partition table. > I want to intentionally under-provision to start with because we are > using a small corner of this storage for a separate purpose but do not > know precisely how much yet. LVM lets me leave, say, 10TB empty, until > I know exactly how big things are going to be. XFS has had filesystem quotas for exactly this purpose, for almost as long as it has existed, well over 15 years. There are 3 types of quotas: user, group, and project. You must enable quotas with a mount option. You manipulate quotas with the xfs_quota command. See man xfs_quota man mount Project quotas are set on a directory tree level. Set a soft and hard project quota on a directory and the available space reported to any process writing into it or its subdirectories is that of the project quota, not the actual filesystem free space. The quota can be increased or decreased at will using xfs_quota. That solves your "sizing" problem rather elegantly. Now, when using a concatenation, md linear array, to reap the rewards of parallelism the requirement is that the application creates lots of directories with a fairly even spread of file IO. In this case, to get all 3 RAID6 arrays into play, that requires creation and use of at minimum 97 directories. Most backup applications make tons of directories so you should be golden here. > It's a pile of little annoyances, but so it goes with these kinds of things. > > It sounds like the little empty spots method will be fine though. No empty spaces required. No LVM required. XFS atop an md linear array with project quotas should solve all of your problems. > Thanks, yet again, for all your help. You're welcome Morgan. 
I hope this helps steer you towards what I think is a much better architecture for your needs. Dave and I both initially said RAID60 was an ok way to go, but the more I think this through, considering ease of expansion, using a single filesystem and project quotas, it's hard to beat the concat setup. -- Stan
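Stan's recommended layout can be sketched end to end. This is a hedged outline, not a tested recipe: the /dev/sd* names, mount point, project id, and 10t limit are placeholders, and the commands assume the three RAID6 LUNs are already exported by the controller.

```shell
# 1. Concatenate the three RAID6 LUNs (placeholder device names).
mdadm --create /dev/md0 --level=linear --raid-devices=3 \
    /dev/sdb /dev/sdc /dev/sdd

# 2. Format with the geometry of a single RAID6 (su=128k, 12 data
#    spindles); agcount=144 keeps every AG inside one array and under
#    the 1 TiB allocation group size limit.
mkfs.xfs -d su=128k,sw=12,agcount=144 /dev/md0

# 3. Project quotas require the prjquota mount option, alongside
#    inode64 as Stan advises.
mount -o inode64,prjquota /dev/md0 /srv/backup

# 4. Carve out a capacity-limited store point: tag the directory tree
#    with a project id, then set a hard block limit on that project.
mkdir -p /srv/backup/store1
xfs_quota -x -c 'project -s -p /srv/backup/store1 42' /srv/backup
xfs_quota -x -c 'limit -p bhard=10t 42' /srv/backup
```

With the quota in place, software writing under /srv/backup/store1 sees the project limit as its available capacity rather than the whole filesystem, which is the sizing behavior Morgan wanted from LVM.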
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-21 3:33 ` Stan Hoeppner @ 2014-02-21 8:57 ` Emmanuel Florac 2014-02-22 2:21 ` Stan Hoeppner 2014-02-21 19:17 ` C. Morgan Hamill 1 sibling, 1 reply; 27+ messages in thread From: Emmanuel Florac @ 2014-02-21 8:57 UTC (permalink / raw) To: stan; +Cc: C. Morgan Hamill, xfs On Thu, 20 Feb 2014 21:33:31 -0600, you wrote: > Forget all of this. Forget RAID60. I think you'd be best served by a > concatenation. I fully agree, though I'd use... LVM to perform the concatenation, much more convenient and easy to use than md IMO. -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-21 8:57 ` Emmanuel Florac @ 2014-02-22 2:21 ` Stan Hoeppner 2014-02-25 17:04 ` C. Morgan Hamill 0 siblings, 1 reply; 27+ messages in thread From: Stan Hoeppner @ 2014-02-22 2:21 UTC (permalink / raw) To: Emmanuel Florac; +Cc: C. Morgan Hamill, xfs On 2/21/2014 2:57 AM, Emmanuel Florac wrote: > On Thu, 20 Feb 2014 21:33:31 -0600, you wrote: > >> Forget all of this. Forget RAID60. I think you'd be best served by a >> concatenation. > > I fully agree, though I'd use... LVM to perform the concatenation, > much more convenient and easy to use than md IMO. Using md linear eliminates the LVM physical extent size non-power-of-2 misalignment issue we discussed at length up thread. Using LVM makes things decidedly more difficult and for zero gain. LVM just isn't appropriate for Morgan's situation. Now, it's possible he could do this entirely in the RAID firmware. However he has not stated which storage product he has, and thus I don't know its capabilities, whether it can create or seamlessly expand a concatenation. Linux md can do all of this very easily and is deployed by many people in this exact scenario. -- Stan
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-22 2:21 ` Stan Hoeppner @ 2014-02-25 17:04 ` C. Morgan Hamill 2014-02-25 17:17 ` Emmanuel Florac 2014-02-25 20:08 ` Stan Hoeppner 0 siblings, 2 replies; 27+ messages in thread From: C. Morgan Hamill @ 2014-02-25 17:04 UTC (permalink / raw) To: stan; +Cc: xfs Excerpts from Stan Hoeppner's message of 2014-02-21 21:21:27 -0500: > Now, it's possible he could do this entirely in the RAID firmware. > However he has not stated which storage product he has, and thus I don't > know its capabilities, whether it can create or seamlessly expand a > concatenation. Linux md can do all of this very easily and is deployed > by many people in this exact scenario. On this note, I'm using an Areca ARC-1882. I've been looking for documentation regarding concatenation with this, and having a bit of trouble. Do you happen to be familiar with the product? -- Morgan Hamill
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-25 17:04 ` C. Morgan Hamill @ 2014-02-25 17:17 ` Emmanuel Florac 2014-02-25 20:08 ` Stan Hoeppner 1 sibling, 0 replies; 27+ messages in thread From: Emmanuel Florac @ 2014-02-25 17:17 UTC (permalink / raw) To: xfs On Tue, 25 Feb 2014 12:04:10 -0500, "C. Morgan Hamill" <chamill@wesleyan.edu> wrote: > On this note, I'm using an Areca ARC-1882. I've been looking for > documentation regarding concatenation with this, and having a bit of > trouble. > Unless Areca cards changed a lot in capabilities recently, it's not possible at all. You can expand a RAID set but it's generally a bad idea. -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-25 17:04 ` C. Morgan Hamill 2014-02-25 17:17 ` Emmanuel Florac @ 2014-02-25 20:08 ` Stan Hoeppner 2014-02-26 14:19 ` C. Morgan Hamill 1 sibling, 1 reply; 27+ messages in thread From: Stan Hoeppner @ 2014-02-25 20:08 UTC (permalink / raw) To: C. Morgan Hamill; +Cc: xfs On 2/25/2014 11:04 AM, C. Morgan Hamill wrote: > Excerpts from Stan Hoeppner's message of 2014-02-21 21:21:27 -0500: >> Now, it's possible he could do this entirely in the RAID firmware. >> However he has not stated which storage product he has, and thus I don't >> know its capabilities, whether it can create or seamlessly expand a >> concatenation. Linux md can do all of this very easily and is deployed >> by many people in this exact scenario. > > On this note, I'm using an Areca ARC-1882. I've been looking for > documentation regarding concatenation with this, and having a bit of > trouble. > > Do you happen to be familiar with the product? Only enough to recommend you replace it immediately with an LSI or Adaptec. Areca is an absolutely tiny Taiwanese company with an inferior product and, from what I gather, horrible support for North American customers, and Linux customers in general. The vast majority of their customers seem to be SOHOs and individuals using the boards in MS Windows servers, with very few running more than a handful of drives, and few running lots of drives doing serious work. If you run into any kind of performance issue with their board, and explain to them your number of drives and arrays, OS platform and workload, they'll be baffled like a 3rd grader and have no idea how to respond. The odd thing is that this isn't reflected in the price of their products, which are not substantially less money than the best of breed LSI boards, which come with LSI's phenomenal support structure. And there are plenty of LSI Linux customers running hundreds of drives with real workloads. 
Areca has no real presence in North America, or in any other country for that matter. They're headquartered in Taiwan and have a "global office" there. Speaking of their "North American support", their ~1000 ft^2 office is in an industrial park in Walnut, CA, directly across the street from "Steve's Refrigeration Supply". Check out the Google street view for their office address, 150 Commerce Way, Walnut, CA 91789. Now let's have a look at LSI's North American presence. http://www.lsi.com/northamerica/pages/northamerica.aspx#tab/tab-contactus Now let's look at prices for the ARC-1882 and LSI's fastest 8P card. Areca ARC-1882I PCIe 2.0/3.0 x8, 1GB DDR3-1333, 800 MHz Dual Core RAID-on-Chip ASIC, 2x SFF-8088 6G SAS, supports 128 drives http://www.newegg.com/Product/Product.aspx?Item=N82E16816151105 $620 Battery Backed Write Cache module, 72hr max backup time, ARC-6120-T121 $130 Solution cost: $750 LSI 9361-8i PCIe 3.0 x8, 1GB DDR3-1866, 1.2GHz LSISAS3108 dual core RAID-On-Chip ASIC, 2x SFF-8643 12G SAS, supports 128 drives http://www.newegg.com/Product/Product.aspx?Item=N82E16816118230 $570 Flash Backed Write Cache module, LSICVM02, unlimited backup time http://www.newegg.com/Product/Product.aspx?Item=N82E16816118232 $190 Solution cost: $760 The Areca uses inferior older technology, has inferior performance, a limited firmware feature set which doesn't support spans (concatenation), near non-existent US support especially for advanced Linux workloads/users, only offers battery cache backup, and is all of ... $10 USD cheaper than the category equivalent yet vastly superior LSI. By some off chance you don't already know, LSI is the industry gold standard RAID HBA. They are the sole RAID HBA OEM board supplier to Dell, IBM, Intel, Lenovo, Fujitsu/Siemens, etc, and their ASICs are used by many others on their in house designs. LSI's ASICs and firmware have seen more Linux workloads of all shapes and sizes than all other vendors' RAID HBAs combined. 
Given all of the above, and that there are at least 3 other LSI boards of superior performance, in the same price range for the past year, why did you go with Areca? -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-25 20:08 ` Stan Hoeppner @ 2014-02-26 14:19 ` C. Morgan Hamill 2014-02-26 17:49 ` Stan Hoeppner 0 siblings, 1 reply; 27+ messages in thread From: C. Morgan Hamill @ 2014-02-26 14:19 UTC (permalink / raw) To: stan; +Cc: xfs Excerpts from Stan Hoeppner's message of 2014-02-25 15:08:44 -0500: > Only enough to recommend you to replace it immediately with an LSI or > Adaptec. Areca is an absolutely tiny Taiwanese company with inferior > product and, from what I gather, horrible support for North American > customers, and Linux customers in general. The vast majority of their > customers seem to be SOHOs and individuals using the boards in MS > Windows servers, with very few running more than a handful of drives, > and few running lots of drives doing serious work. Noted. > If you run into any kind of performance issue with their board, and > explain to them your number of drives and arrays, OS platform and > workload, they'll be baffled like a 3rd grader and have no idea how to > respond. For better or worse, this will be in line with the "support" I've experienced from the vast majority of vendors I've had to deal with. > The Areca uses inferior older technology, has inferior performance, > limited firmware feature set which doesn't support spans > (concatenation), near non-existent US support especially for advanced > Linux workloads/users, only offers battery cache backup, and is all of ... > > $10 USD cheaper than the category equivalent yet vastly superior LSI. Does seem to be the case. > By some off chance you don't already know, LSI is the industry gold > standard RAID HBA. They are the sole RAID HBA OEM board supplier to > Dell, IBM, Intel, Lenovo, Fujitsu/Siemens, etc, and their ASICs are used > by many others on their in house designs. LSI's ASICs and firmware have > seen more Linux workloads of all shapes and sizes than all other > vendors' RAID HBAs combined. 
I am aware; all our servers have LSI in them for boot arrays and whatnot. > Given all of the above, and that there are at least 3 other LSI boards > of superior performance, in the same price range for the past year, why > did you go with Areca? For better or worse, they're what we were able to get from our white box vendor. It will, unfortunately, have to do for now. I'll be sure to make a note for future expansion. Until then, we'll just have to tread carefully. Thanks again for all of your help. -- Morgan Hamill _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-26 14:19 ` C. Morgan Hamill @ 2014-02-26 17:49 ` Stan Hoeppner 0 siblings, 0 replies; 27+ messages in thread From: Stan Hoeppner @ 2014-02-26 17:49 UTC (permalink / raw) To: C. Morgan Hamill; +Cc: xfs On 2/26/2014 8:19 AM, C. Morgan Hamill wrote: > Excerpts from Stan Hoeppner's message of 2014-02-25 15:08:44 -0500: >> Only enough to recommend you to replace it immediately with an LSI or >> Adaptec. Areca is an absolutely tiny Taiwanese company with inferior >> product and, from what I gather, horrible support for North American >> customers, and Linux customers in general. The vast majority of their >> customers seem to be SOHOs and individuals using the boards in MS >> Windows servers, with very few running more than a handful of drives, >> and few running lots of drives doing serious work. > > Noted. > >> If you run into any kind of performance issue with their board, and >> explain to them your number of drives and arrays, OS platform and >> workload, they'll be baffled like a 3rd grader and have no idea how to >> respond. > > For better or worse, this will be in line with the "support" I've > experienced from the vast majority of vendors I've had to deal with. Edu's often have tight(er) budgets so this often goes with the territory, unfortunately. On the bright side, one tends to learn quite a bit about the hardware industry, the secret sauce that separates two vendors using the same ASICs, where the value add comes from, etc. This, out of necessity. >> The Areca uses inferior older technology, has inferior performance, >> limited firmware feature set which doesn't support spans >> (concatenation), near non-existent US support especially for advanced >> Linux workloads/users, only offers battery cache backup, and is all of ... >> >> $10 USD cheaper than the category equivalent yet vastly superior LSI. > > Does seem to be the case. 
> >> By some off chance you don't already know, LSI is the industry gold >> standard RAID HBA. They are the sole RAID HBA OEM board supplier to >> Dell, IBM, Intel, Lenovo, Fujitsu/Siemens, etc, and their ASICs are used >> by many others on their in house designs. LSI's ASICs and firmware have >> seen more Linux workloads of all shapes and sizes than all other >> vendors' RAID HBAs combined. > > I am aware; all our servers have LSI in them for boot arrays and > whatnot. > >> Given all of the above, and that there are at least 3 other LSI boards >> of superior performance, in the same price range for the past year, why >> did you go with Areca? > > For better or worse, they're what we were able to get from our white box > vendor. It will, unfortunately, have to do for now. I'll be sure to > make a note for future expansion. In that case, exercise it mercilessly with your workload to surface any problems the firmware may have with the triple RAID6 setup. Yank a drive from each array while under full IO load, etc. Even if Areca can't provide answers or fixes to problems you uncover, if you can identify problem spots before production, you can document these and take steps to mitigate them. > Until then, we'll just have to tread carefully. From what I understand their hardware QC is decent so board failure shouldn't be an issue. The issues usually deal with firmware immaturity. They're a tiny company with limited resources, thus they simply can't do much workload testing with multiple array configurations. Thus their customers running higher end workloads often end up being guinea pigs and identifying firmware deficiencies for them, and suffering performance chasms in the process. LSI, Adaptec, etc do have firmware issues as well on occasion. But their test lab resources allow them to flush most of these out before the boards reach customers. > Thanks again for all of your help. You bet. 
-- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-21 3:33 ` Stan Hoeppner 2014-02-21 8:57 ` Emmanuel Florac @ 2014-02-21 19:17 ` C. Morgan Hamill 1 sibling, 0 replies; 27+ messages in thread From: C. Morgan Hamill @ 2014-02-21 19:17 UTC (permalink / raw) To: stan; +Cc: xfs On Thu, February 20, 2014 10:33 pm, Stan Hoeppner wrote: > Forget all of this. Forget RAID60. I think you'd be best served by a > concatenation. > > You have a RAID chassis with 15 drives and two 15 drive JBODs daisy > chained to it, all 4TB drives, correct? Your original setup was 1 spare > and one 14 drive RAID6 array per chassis, 12 data spindles. Correct? > Stick with that. It's all in one chassis, but correct. > Export each RAID6 as a distinct LUN to the host. Make an mdadm --linear > array of the 3 RAID6 LUNs, devices. Then format the md linear device, > e.g. /dev/md0 using the geometry of a single RAID6 array. We want to > make sure each allocation group is wholly contained within a RAID6 > array. You have 48TB per array and 3 arrays, 144TB total. 1TB=1000^4 > and XFS deals with TebiBytes, or 1024^4. Max agsize is 1TiB. So to get > exactly 48 AGs per array, 144 total AGs, we'd format with > > # mkfs.xfs -d su=128k,sw=12,agcount=144 I am intrigued... > The --linear array, or generically concatenation, stitches the RAID6 > arrays together end-to-end. Here the filesystem starts at LBA0 on the > first array and ends on the last LBA of the 3rd array, hence "linear". > XFS performs all operations at the AG level. Since each AG sits atop > only one RAID6, the filesystem alignment geometry is that of a single > RAID6. Any individual write will peak at ~1.2GB/s. Since you're > limited by the network to 100MB/s throughput this shouldn't be an issue. > > Using an md linear array you can easily expand in the future without all > the LVM headaches, by simply adding another identical RAID6 array to the > linear array (see mdadm grow) and then growing the filesystem with > xfs_growfs. 
How does this differ from standard linear LVM? Is it simply that we avoid the extent size issue? > In doing so, you will want to add the new chassis before > the filesystem reaches ~70% capacity. If you let it grow past that > point, most of your new writes may go to only the new RAID6 where the > bulk of your large free space extents now exist. This will create an IO > hotspot on the new chassis, while the original 3 will see fewer writes. Good to know. > XFS has had filesystem quotas for exactly this purpose, for almost as > long as it has existed, well over 15 years. There are 3 types of > quotas: user, group, and project. You must enable quotas with a mount > option. You manipulate quotas with the xfs_quota command. See > > man xfs_quota > man mount > > Project quotas are set on a directory tree level. Set a soft and hard > project quota on a directory and the available space reported to any > process writing into it or its subdirectories is that of the project > quota, not the actual filesystem free space. The quota can be increased > or decreased at will using xfs_quota. That solves your "sizing" problem > rather elegantly. Oh, I was unaware of project quotas. > Now, when using a concatenation, md linear array, to reap the rewards of > parallelism the requirement is that the application creates lots of > directories with a fairly even spread of file IO. In this case, to get > all 3 RAID6 arrays into play, that requires creation and use of at > minimum 97 directories. Most backup applications make tons of > directories so you should be golden here. Yes, quite a few directories are created. > You're welcome Morgan. I hope this helps steer you towards what I think > is a much better architecture for your needs. > > Dave and I both initially said RAID60 was an ok way to go, but the more > I think this through, considering ease of expansion, using a single > filesystem and project quotas, it's hard to beat the concat setup. Seems like this will work quite well. 
Thanks so much for all your help. -- Morgan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 27+ messages in thread
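[Editor's note: the concat recipe discussed in this exchange can be sketched end to end as a command sequence. This is a sketch only — the device names, mount point, project name/ID, and quota limits below are assumptions for illustration, not taken from the thread, and these commands require root and real hardware:]

```shell
# Sketch of the md linear concat setup discussed above. Assumptions:
# the controller exports the three 14-drive RAID6 LUNs as /dev/sdb,
# /dev/sdc, /dev/sdd; the mount point is /backup.

# Stitch the three RAID6 LUNs together end-to-end (linear, not striped).
mdadm --create /dev/md0 --level=linear --raid-devices=3 \
    /dev/sdb /dev/sdc /dev/sdd

# Format with the geometry of a SINGLE RAID6 (128k stripe unit, 12 data
# spindles) and 144 AGs so each AG sits wholly inside one array.
mkfs.xfs -d su=128k,sw=12,agcount=144 /dev/md0

# Enable project quotas at mount time.
mount -o prjquota /dev/md0 /backup

# "Size" a directory tree with a project quota (name, ID, and limits
# are made-up examples).
echo '42:/backup/clientA' >> /etc/projects
echo 'clientA:42'         >> /etc/projid
xfs_quota -x -c 'project -s clientA' /backup
xfs_quota -x -c 'limit -p bsoft=10t bhard=11t clientA' /backup

# Later expansion: add a fourth identical RAID6 LUN, then grow XFS.
mdadm --grow /dev/md0 --add /dev/sde
xfs_growfs /backup
```

With prjquota in effect, a process writing under /backup/clientA sees free space relative to the project's quota rather than the whole filesystem, which is the "sizing" behavior described above.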
* Re: Question regarding XFS on LVM over hardware RAID. 2014-02-01 21:06 ` Stan Hoeppner 2014-02-02 21:21 ` Dave Chinner @ 2014-02-03 16:07 ` C. Morgan Hamill 1 sibling, 0 replies; 27+ messages in thread From: C. Morgan Hamill @ 2014-02-03 16:07 UTC (permalink / raw) To: stan; +Cc: xfs Excerpts from Stan Hoeppner's message of 2014-02-01 16:06:17 -0500: > Yes, that's one of the beauties of LVM. However, there are other > reasons you may not want to do this. For example, if you have allocated > space from two different JBOD or SAN units to a single LVM volume, and > you lack multipath connections, if you have a cable, switch, HBA, or > other failure disconnecting one LUN that will wreak havoc on your > mounted XFS filesystem. If you have multipath and the storage device > disappears due to some other failure such as backplane, UPS, etc, you > have the same problem. Very true; I gather this would only take out any volumes which at least partially rest on the failed device, however? As in, I don't lose the whole volume group, correct? > This isn't a deal breaker. There are many large XFS filesystems in > production that span multiple storage arrays. You just need to be > mindful of your architecture at all times, and it needs to be > documented. Scenario: XFS unmounts due to an IO error. You're not yet > aware an entire chassis is offline. You can't remount the filesystem so > you start a destructive xfs_repair thinking that will fix the problem. > Doing so will wreck your filesystem and you'll likely lose access to all > the files on the offline chassis, with no ability to get it back short > of some magic and a full restore from tape or D2D backup server. We had > a case similar to this reported a couple of years ago. Oh God, that sounds terrible. My sysadmininess is wondering why the chassis wasn't monitored, but hindsight, etc. etc. 
;-) > If the logical sector size reported by your RAID controller is 512 > bytes, then "--dataalignment=9216s" should start your data section on a > RAID60 stripe boundary after the metadata section. I see that 9216s == 4608k/512b, but I'm missing something: is the default metadata size guaranteed to be less than a single stripe, or is there more to it? Oh, wait, I think I just got it: '--dataalignment' will take care to start on some multiple of 9216 sectors, regardless of the size of the metadata section. Doy. > The PhysicalExtentSize should probably also match the 4608KB stripe > width, but this is apparently not possible. PhysicalExtentSize must be > a power of 2 value. I don't know if or how this will affect XFS aligned > write out. You'll need to consult with someone more knowledgeable of LVM. Makes sense. If it would have an impact, then I'd probably just end up going with RAID 0 on top of 2 or 4 RAID 6 groups, where it looks like the math would work out. > You bet. Honestly, this is the most helpful and straightforward I've ever found any project's mailing list, so kudos++. -- Morgan Hamill _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 27+ messages in thread
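[Editor's note: as a sketch, the LVM alignment knobs discussed in this message would be applied roughly as below. Device and VG names are assumptions, and the 4m extent size is shown only as a neutral power-of-two example, not a recommendation from the thread:]

```shell
# Start the PV data area on a RAID60 stripe boundary:
# 9216 sectors x 512 bytes = 4608 KiB, the full stripe width.
pvcreate --dataalignment=9216s /dev/sdb

# PhysicalExtentSize must be a power of two, so it cannot equal 4608k;
# 4m is an arbitrary power-of-two choice for illustration.
vgcreate -s 4m backupvg /dev/sdb

# Verify where the data area actually starts (pe_start should be a
# multiple of 4608k).
pvs -o pv_name,pe_start --units k
```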
* Re: Question regarding XFS on LVM over hardware RAID. 2014-01-29 15:07 ` Eric Sandeen 2014-01-29 19:11 ` C. Morgan Hamill @ 2014-01-29 22:40 ` Stan Hoeppner 1 sibling, 0 replies; 27+ messages in thread From: Stan Hoeppner @ 2014-01-29 22:40 UTC (permalink / raw) To: Eric Sandeen, C. Morgan Hamill, xfs On 1/29/2014 9:07 AM, Eric Sandeen wrote: > On 1/29/14, 8:26 AM, C. Morgan Hamill wrote: >> Howdy folks, >> >> I understand that XFS have stripe unit and width configured according to >> the underlying RAID device when using LVM, but I was wondering if this >> is still the case when a given XFS-formatted logical volume takes up >> only part of the available space on the RAID. In particular, I could >> imagine that stripe width would need to be modified proportionally with >> the decrease in filesystem size. My intuition says that's false, but >> I wanted to check with folks who know for sure. > > The stripe unit and width are units of geometry of the underlying > storage; a filesystem will span some number of stripe units, depending > on its size. > > So no, the filesystem's notion of stripe geometry does not change > with the filesystem size. > > You do want to make sure that stripe geometry is correct and aligned > from top to bottom. This is correct if indeed stripe alignment is beneficial to the workload. But not all workloads benefit from stripe alignment. Some may perform worse when XFS is stripe aligned to the underlying storage. For instance, when a workload performs lots of allocations that are significantly smaller than the RAID stripe width. Here you end up with a small file allocated at the start of each stripe and the rest of the stripe left empty. This can create an IO hot spot on the first one or two drives in the array, and the others may sit idle. This obviously has a negative impact on throughput with such a workload. 
Thus for a workload that performs lots of predominantly small allocations, it is best to not align during mkfs.xfs with hardware RAID that doesn't provide geometry to Linux. If the underlying storage device does do so, or if it is a striped md/RAID device, you will want to manually specify 4K alignment, as mkfs.xfs will auto align to md geometry. -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 27+ messages in thread
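[Editor's note: a minimal sketch of the two options Stan describes for small-file workloads, with hypothetical device names:]

```shell
# Hardware RAID that hides its geometry from Linux: nothing to
# override, but you can disable stripe alignment explicitly.
mkfs.xfs -d noalign /dev/sdb

# md/RAID (or geometry-exposing storage) where mkfs.xfs would
# auto-align to the stripe: force 4k alignment instead.
mkfs.xfs -d su=4k,sw=1 /dev/md0

# Inspect the result on the mounted filesystem; sunit/swidth appear
# in the data section of the geometry output.
xfs_info /mnt
```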
end of thread, other threads:[~2014-02-26 17:49 UTC | newest] Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-01-29 14:26 Question regarding XFS on LVM over hardware RAID C. Morgan Hamill 2014-01-29 15:07 ` Eric Sandeen 2014-01-29 19:11 ` C. Morgan Hamill 2014-01-29 23:55 ` Stan Hoeppner 2014-01-30 14:28 ` C. Morgan Hamill 2014-01-30 20:28 ` Dave Chinner 2014-01-31 5:58 ` Stan Hoeppner 2014-01-31 21:14 ` C. Morgan Hamill 2014-02-01 21:06 ` Stan Hoeppner 2014-02-02 21:21 ` Dave Chinner 2014-02-03 16:12 ` C. Morgan Hamill 2014-02-03 21:41 ` Dave Chinner 2014-02-04 8:00 ` Stan Hoeppner 2014-02-18 19:44 ` C. Morgan Hamill 2014-02-18 23:07 ` Stan Hoeppner 2014-02-20 18:31 ` C. Morgan Hamill 2014-02-21 3:33 ` Stan Hoeppner 2014-02-21 8:57 ` Emmanuel Florac 2014-02-22 2:21 ` Stan Hoeppner 2014-02-25 17:04 ` C. Morgan Hamill 2014-02-25 17:17 ` Emmanuel Florac 2014-02-25 20:08 ` Stan Hoeppner 2014-02-26 14:19 ` C. Morgan Hamill 2014-02-26 17:49 ` Stan Hoeppner 2014-02-21 19:17 ` C. Morgan Hamill 2014-02-03 16:07 ` C. Morgan Hamill 2014-01-29 22:40 ` Stan Hoeppner