* inode64 directory placement determinism
From: Stan Hoeppner @ 2014-08-18  3:29 UTC
  To: xfs

Say I have a single 4TB disk in an md linear device.  The md device has a
filesystem on it formatted with defaults.  It has 4 AGs, 0-3.  I have
created 4 directories.  Each should reside in a different AG, the first in
AG0.  Now I expand the linear device with an identical 4TB disk and execute
xfs_growfs.  I now have 4 more AGs, 4-7.  I create 4 more directories.

Will these 4 new dirs be created sequentially in AGs 4-7, or in the first
4 AGs?  Is this deterministic, or is there an element of chance involved?
On the real system these 4TB drives are actually 48TB LUNs.  I'm after
deterministic parallel bandwidth to each subsequently added RAID after a
grow operation, simply by writing to the proper directory.

Currently we have kernel 3.4.26 to work with, if that's relevant.  I may be
able to get the kernel team to go for 3.4.103 for the bugfixes, but I don't
know about anything newer.  This is an embedded-type development process.

Thanks,
Stan


* Re: inode64 directory placement determinism
From: Dave Chinner @ 2014-08-18  7:01 UTC
  To: Stan Hoeppner; +Cc: xfs

On Sun, Aug 17, 2014 at 10:29:21PM -0500, Stan Hoeppner wrote:
> Say I have a single 4TB disk in an md linear device.  The md device has a
> filesystem on it formatted with defaults.  It has 4 AGs, 0-3.  I have
> created 4 directories.  Each should reside in a different AG, the first in
> AG0.  Now I expand the linear device with an identical 4TB disk and execute
> xfs_growfs.  I now have 4 more AGs, 4-7.  I create 4 more directories.
> 
> Will these 4 new dirs be created sequentially in AGs 4-7, or in the first
> 4 AGs?  Is this deterministic, or is there an element of chance involved?

Deterministic, assuming single threaded *file-system-wide* directory
creation. Completely unpredictable under concurrent directory
creations.  See xfs_ialloc_ag_select/xfs_ialloc_next_ag.

Note that the rotor used to select the next AG is set to
zero at mount.

i.e. single threaded behaviour at agcount = 4:

dir number	rotor value	  destination AG
 1		  0			0
 2		  1			1
 3		  2			2
 4		  3			3
 5		  0			0
 6		  1			1
....

So, if you do what you suggest, and grow *after* the first 4 dirs
are created, the above is what you'll get because the rotor goes
back to zero on the fourth directory create.  Now, changing from
4 to 8 AGs after the first 4:

dir number	rotor value	  new inode location (AG)
 1		  0			0
 2		  1			1
 3		  2			2
 4		  3			3
<grow to 8 AGs>
 5		  0			0
 6		  1			1
 7		  2			2
 8		  3			3
 9		  4			4
 10		  5			5
 11		  6			6
 12		  7			7
 13		  0			0
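
To make the wrap-around concrete, here's a minimal user-space model of
the mount-wide rotor.  It's just a sketch that reproduces the table
above; the real logic lives in xfs_ialloc_next_ag(), which also
serialises the rotor under a lock:

	#include <stdio.h>

	int main(void)
	{
		unsigned int rotor = 0;		/* zeroed at mount */
		unsigned int agcount = 4;	/* 4 AGs at mkfs time */
		int dir;

		for (dir = 1; dir <= 13; dir++) {
			if (dir == 5) {
				agcount = 8;	/* xfs_growfs: 4 -> 8 AGs */
				puts("<grow to 8 AGs>");
			}
			/* the current rotor value is the destination AG... */
			printf("dir %2d  rotor %u  ->  AG %u\n", dir, rotor, rotor);
			/* ...then it advances, wrapping at the current AG count */
			rotor = (rotor + 1) % agcount;
		}
		return 0;
	}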

> On the real system these 4TB drives are actually 48TB LUNs.  I'm after
> deterministic parallel bandwidth to each subsequently added RAID after a
> grow operation, simply by writing to the proper directory.

Just create new directories and use the inode number to
determine their location. If the directory is not in the correct AG,
remove it and create a new one, until you have directories located
in the AGs you want.
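
A rough sketch of that retry loop, assuming the usual inode64 inode
number layout where the AG number sits in the high bits, i.e.
agno = ino >> (agblklog + inopblog); both log values can be read from
the superblock with xfs_db.  The path, shift and target AG here are
illustrative assumptions, not values from your system:

	#include <stdio.h>
	#include <sys/stat.h>
	#include <sys/types.h>
	#include <unistd.h>

	int main(void)
	{
		const char *base = "/mnt/big/stream";	/* assumed mount point */
		unsigned int shift = 22 + 9;	/* agblklog + inopblog: per-fs! */
		unsigned int want_ag = 4;	/* first AG on the newly added LUN */
		char path[4096];
		struct stat st;
		int try;

		for (try = 0; try < 1000; try++) {
			snprintf(path, sizeof(path), "%s.%d", base, try);
			if (mkdir(path, 0755) < 0 || stat(path, &st) < 0) {
				perror(path);
				return 1;
			}
			if ((unsigned int)(st.st_ino >> shift) == want_ag) {
				printf("%s is in AG %u\n", path, want_ag);
				return 0;	/* keep this one */
			}
			rmdir(path);	/* wrong AG: discard and try again */
		}
		return 1;
	}

Each create advances the rotor by one, so a single-threaded caller
should hit the wanted AG within agcount attempts; concurrent creates
elsewhere on the filesystem make the count unpredictable, as above.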

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: inode64 directory placement determinism
From: Stan Hoeppner @ 2014-08-18 16:16 UTC
  To: Dave Chinner; +Cc: xfs

On Mon, 18 Aug 2014 17:01:53 +1000, Dave Chinner <david@fromorbit.com>
wrote:
> Deterministic, assuming single threaded *file-system-wide* directory
> creation. Completely unpredictable under concurrent directory
> creations.  See xfs_ialloc_ag_select/xfs_ialloc_next_ag.
[...]
> Just create new directories and use the inode number to
> determine their location. If the directory is not in the correct AG,
> remove it and create a new one, until you have directories located
> in the AGs you want.

Thanks for the info, Dave.  I was hoping it would be more straightforward.
Modifying the app for this is out of the question.  They've spent 3+ years
developing with ext4 and decided to try XFS at the last minute.  The
product ships in October, so the optimizations I can suggest are limited.


-- 
Stan


* Re: inode64 directory placement determinism
From: Dave Chinner @ 2014-08-18 22:48 UTC
  To: Stan Hoeppner; +Cc: xfs

On Mon, Aug 18, 2014 at 11:16:12AM -0500, Stan Hoeppner wrote:
[...]
> Thanks for the info, Dave.  I was hoping it would be more straightforward.
> Modifying the app for this is out of the question.  They've spent 3+ years
> developing with ext4 and decided to try XFS at the last minute.  The
> product ships in October, so the optimizations I can suggest are limited.

Perhaps you could actually tell us what the requirement for
layout/separation is, and how they are achieving it with ext4. We
really need a more "directed" allocation ability, but it's not clear
exactly what requirements need to drive that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: inode64 directory placement determinism
From: Stan Hoeppner @ 2014-08-19  0:02 UTC
  To: Dave Chinner; +Cc: xfs

On Tue, 19 Aug 2014 08:48:53 +1000, Dave Chinner <david@fromorbit.com>
wrote:
[...]
> Perhaps you could actually tell us what the requirement for
> layout/separation is, and how they are achieving it with ext4. We
> really need a more "directed" allocation ability, but it's not clear
> exactly what requirements need to drive that.

The test harness app writes to thousands of preallocated files in hundreds
of directories.  The target is ~250MB/s at the application per array, more
if achievable, writing a mix of fast and slow streams from up to ~1000
threads, to different files, circularly.  The mix of stream rates and the
files they write will depend on the end customers' needs.  Currently they
have 1 FS per array with 3 top-level dirs, each with 3 subdirs, 2 of those
with ~100 subdirs each, and hundreds of files in each of those.  Simply
doing a concat, growing, and just running with it might work fine.  The
concern is ending up with too many fast stream writers hitting the AGs on
a single array, which won't be able to keep up.  Currently the application
simply duplicates the same directory layout on each new filesystem it
mounts and does its own load balancing across the group of them.

Ideally they'd obviously like to simply add files to existing directories
after growing, but that won't achieve scalable bandwidth.

-- 
Stan


* Re: inode64 directory placement determinism
From: stan hoeppner @ 2014-08-24 20:14 UTC
  To: Dave Chinner; +Cc: xfs


On 08/18/2014 07:02 PM, Stan Hoeppner wrote:
> The test harness app writes to thousands of preallocated files in
> hundreds of directories. [...]
>
> Ideally they'd obviously like to simply add files to existing directories
> after growing, but that won't achieve scalable bandwidth.


My apologies Dave.  The above isn't really a description of a
requirement, but simply how they do things currently.  So let me take
another stab at this.  I think the generic requirement is best described
as:

	Create a directory in the first AG of a specified range of
	AGs.  Create all child directories and files in AGs within
	that range, starting with the first.  In other words, we take
	the default behavior of the inode64 allocator and apply it to
	a subset of the AGs within the filesystem.  Something like...

agr = allocation group range

1.  mkdir $directory agr=0,47

2.  create $directory in AG0 and set a flag in its metadata so the
    inode64 allocator rotors new child directories of this parent
    across only the AGs in the specified range

3.  the file allocation policy need not be altered: files go in the
    parent directory's AG.  If we spill due to AG free space
    exhaustion, do what we already do and allow allocation outside
    the AGs in agr


So when we expand the concat and grow XFS we simply do

~$ mkdir $directory agr=48,95

All child directories and files created in $directory will be
allocated in AGs 48-95, only on the new LUN.  Rinse and repeat.

Such a feature would, I think, provide everything needed for this
particular workload.  I can imagine similar workloads out there that
would benefit, given the prevalence of large concatenated RAID6s today.
Another scenario that might benefit is short-stroking of mechanical
storage, but controlled at the filesystem level instead of the block or
controller layer.

Setting agr with an mkdir switch might not fly, since mkdir is a generic
command for all filesystems, but it would be the most straightforward
and easiest-to-use approach.

Due to the timetable and other restrictions I wouldn't be able to use
patches that might come from fleshing out our ideas here, but I think it
would be very useful functionality for others.

Cheers,

Stan


* Re: inode64 directory placement determinism
From: Stan Hoeppner @ 2014-08-25  2:15 UTC
  To: Dave Chinner; +Cc: xfs

On 8/24/2014 3:14 PM, stan hoeppner wrote:

> Due to the timetable and other restrictions I wouldn't be able to use
> patches that might come from fleshing out our ideas here, but I think it
> would be very useful functionality for others.

Let me restate the above: I don't "think" we'd be able to use patches
in the short term for version 1 of the product.  That may change if said
hypothetical patches become available within the next 3 weeks, which is
probably highly unlikely.  They brought me in very late in the game,
unfortunately, so I'm racing against the clock.  And of course I wasn't
able to assist in architectural planning and make such a feature request
here long ago, allowing for sufficient lead time.


Stan


* Re: inode64 directory placement determinism
From: Dave Chinner @ 2014-08-25  2:19 UTC
  To: stan hoeppner; +Cc: xfs

On Sun, Aug 24, 2014 at 03:14:44PM -0500, stan hoeppner wrote:
[...]
> So when we expand the concat and grow XFS we simply do
> 
> ~$ mkdir $directory agr=48,95
> 
> All child directories and files created in $directory will be
> allocated in AGs 48-95, only on the new LUN.  Rinse and repeat.

So you want a persistent, configurable AG rotor for a specific
directory and all its children?  That's not all that simple to do,
because there's no direct connection between the top level directory
and indirect children.

What you are really asking for is a specific instance of the more
generic concept of specifying per-file allocation policy. That's
been on the radar for a long time, but it's not as simple as it
first sounds.  This is something I started prototyping years ago
when I was back at SGI:

http://oss.sgi.com/archives/xfs/2009-02/msg00250.html

but that patch series is *extremely* experimental. There are parts
we should pull from it to start putting generic allocation policy
frameworks in place, but the really difficult parts of per-file
allocation policy are the bits that I never got to:

	1. persistence, and what to do with kernels that don't
	understand specific policies
	2. how to implement the policies generically so that we
	don't make a huge mess of the code
	3. a user interface for managing policies, which has not
	really been thought through

So, if someone wants a project that will keep them busy for many,
many months...

> Due to the timetable and other restrictions I wouldn't be able to
> use patches that might come from fleshing out our ideas here, but I
> think it would be very useful functionality for others.

Yes, such things have long been considered useful. The problem is
finding enough people to implement all the stuff we consider
useful...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

