* [Lustre-devel] Wide striping
@ 2011-10-03 20:15 Nathan Rutman
  2011-10-04  0:17 ` David Dillow
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Nathan Rutman @ 2011-10-03 20:15 UTC (permalink / raw)
  To: lustre-devel

Oracle BZ-4424 (continued in WC LU-80) adds support for larger OST stripe counts via increased EXT4 EA sizes.
Some problems with this are:
1) increased MDT storage and network loading for transmitting the object list 
2) relatively low new limit (1350, up from 160)


We have been thinking about a different wide-striping method that doesn't have these problems. The basic idea is to create a new stripe type that encodes the list of OSTs compactly, and then use the same (or a calculable) object identifier (or FID) on all these OSTs.
 <https://lh4.googleusercontent.com/mxm5R4Yd000I_v5qNcpYH6ZzHBvryGEE6pjxOBWz6ysHUNK0Yjh1J81kmP-5zVaoCiOU8RJv04WMhNoe1JqipOOmtRd7otrZ0saWKUnNyNVvaWvLRD8> 

Our version of widestriping does not involve increasing the EA size at all, but instead utilizes a new stripe pattern.  (This will not be understandable by older Lustre versions, which will generate an error locally, or potentially we can convert into the BZ-4424 form if the layout fits in that format). A bitmap will identify which OSTs hold a stripe of this file. The bitmap should probably fit within the current ext4 EA size limit, giving us ~32k stripes.
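
As a rough back-of-the-envelope comparison of the two encodings, here is a minimal sketch. The 24-byte per-stripe entry size and the 4096-byte EA budget are assumptions chosen for illustration, not figures stated in this thread, and layout header overhead is ignored.

```python
# Assumed sizes, for illustration only (not taken from this thread).
ASSUMED_ENTRY_BYTES = 24   # hypothetical size of one per-stripe object entry
ASSUMED_EA_BYTES = 4096    # hypothetical EA budget available to the bitmap


def object_list_bytes(stripe_count: int) -> int:
    """Bytes needed when the layout lists one object entry per stripe."""
    return stripe_count * ASSUMED_ENTRY_BYTES


def bitmap_bytes(ost_count: int) -> int:
    """Bytes needed for one bit per OST index, rounded up to whole bytes."""
    return (ost_count + 7) // 8


def max_bitmap_osts(ea_bytes: int = ASSUMED_EA_BYTES) -> int:
    """Number of OST indices that a bitmap of ea_bytes can describe."""
    return ea_bytes * 8


if __name__ == "__main__":
    print(object_list_bytes(1350))  # explicit object list at 1350 stripes
    print(bitmap_bytes(1350))       # bitmap covering the same 1350 OSTs
    print(max_bitmap_osts())        # OST indices covered by the assumed EA budget
```

Under these assumptions the bitmap grows by one bit per OST rather than one entry per stripe, which is where the "~32k stripes" estimate comes from.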

Some OSTs may be down at file creation time, or new OSTs added later; hence there will likely be holes in the bitmap (but relatively few). Start index will still be used, but stripe order will be strictly round-robin (we will wrap around).  In other words, the stripe sequence will always be in linear OST order, starting from start_index, maybe skipping some holes, wrapping around to start_index-1.
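
The ordering rule can be sketched as follows. This is only an illustration of the description above, assuming bit position i of the bitmap corresponds to OST index i; the function name and the list-of-booleans representation are hypothetical.

```python
from typing import List


def stripe_order(bitmap: List[bool], start_index: int) -> List[int]:
    """Return OST indices in stripe order.

    bitmap[i] is True when OST index i holds a stripe of the file;
    False entries are holes. The walk starts at start_index, proceeds
    in linear OST order, wraps at the end of the bitmap, and finishes
    at start_index - 1.
    """
    n = len(bitmap)
    order = []
    for step in range(n):
        ost = (start_index + step) % n  # linear order, wrapping past the end
        if bitmap[ost]:                 # skip holes
            order.append(ost)
    return order


if __name__ == "__main__":
    # OSTs 0, 1, 3, 4, 5 hold stripes; OST 2 was down; 6 and 7 do not exist yet.
    layout = [True, True, False, True, True, True, False, False]
    print(stripe_order(layout, 3))  # [3, 4, 5, 0, 1]
```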

Widestripe objects do not need a special sequence number (fid_seq); the MDT knows the file was created as widestriped and marks it as such (LOV_PATTERN_BITMAP).  There are two options for OST object identification: common object ID and FID-on-OST.

Common Object ID

The MDT tracks a special range of OST object IDs ("wide stripe objectid" = WSO) that are used on all OSTs.  The MDT assigns the next available WSO to the file, and this objectid is used on all the OSTs.  The OSTs must never use these objects for regular striped files.  A special precreation group for these objects is probably necessary, as well as orphan cleanup (the MDT should purge "hole" objects that aren't allocated from a particular OST). The MDT should track the last assigned WSO; this will be the starting point for new wide striped files after recovery.  Objects cannot be migrated from one OST to another, since this would result in out-of-order access. Similarly, stripes can never be added to holes.
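
A toy model of the common-object-ID idea, as a sketch only: the class name, the reserved range, and the in-memory state are all hypothetical, and precreation and orphan cleanup are not modelled.

```python
from typing import Dict, List, Optional


class WsoAllocator:
    """Hands out wide-stripe object IDs (WSOs) from a reserved range."""

    def __init__(self, first: int, last: int,
                 last_assigned: Optional[int] = None):
        self.first = first
        self.last = last
        # After recovery, resume from the last WSO recorded before restart.
        if last_assigned is None:
            last_assigned = first - 1
        self.last_assigned = last_assigned

    def allocate(self) -> int:
        """Return the next unused WSO, or fail if the range is exhausted."""
        nxt = self.last_assigned + 1
        if nxt > self.last:
            raise RuntimeError("reserved WSO range exhausted")
        self.last_assigned = nxt
        return nxt


def objects_for_file(wso: int, bitmap: List[bool]) -> Dict[int, int]:
    """Map every OST index with a set bit to the one shared object ID."""
    return {ost: wso for ost, present in enumerate(bitmap) if present}


if __name__ == "__main__":
    alloc = WsoAllocator(first=1000, last=1999)
    wso = alloc.allocate()
    print(objects_for_file(wso, [True, False, True]))  # {0: 1000, 2: 1000}
```

The layout then only needs to store one object ID plus the bitmap, rather than one object ID per stripe.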

FID-on-OST

Use a mapping of the MDT FID to uniquely determine an OST object.  The clients and MDT add in the OST number to the MDT FID (probably just reserve one sequence per OST).  (This allows the objects to potentially migrate to different OSTs).  The OSTs then internally must map the FID to a local object id.  Note this allows OST-local precreation pools, getting the MDT out of the precreate/orphan cleanup business and potentially improving create speeds, and also facilitates "create on write" semantics.  The FID can be assigned during the first access to OST object.
The big problem here is that FID->OBJID (or better, FID->inode id) translation is absent from the OSTs today. See http://wiki.lustre.org/images/e/e9/SC09-FID-on-OST.pdf (what is the current state of this?)  There is also some work in this direction in the OST restructuring work ("Orion" WC branch, ORI-300(?), scheduled for Lustre 2.4).


There are a few questions here, probably the first of which is "is it worthwhile to spend effort on this, or is BZ4424 good enough?" Then there is the question of object identification, where FID-on-OST is more flexible, but also significantly more work (and risk). Also, I thought I understood from the EOFS Summit that WC also has a separate FID-on-OST project (separate from Orion that is) -- can someone tell me the state of that?









______________________________________________________________________
This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
 
Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
 
Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
 
The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
______________________________________________________________________
 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Lustre-devel] Wide striping
  2011-10-03 20:15 [Lustre-devel] Wide striping Nathan Rutman
@ 2011-10-04  0:17 ` David Dillow
  2011-10-04 17:44   ` Nathan Rutman
  2011-10-05  0:25 ` wangdi
  2011-10-20 16:24 ` Alex Kulyavtsev
  2 siblings, 1 reply; 22+ messages in thread
From: David Dillow @ 2011-10-04  0:17 UTC (permalink / raw)
  To: lustre-devel

On Mon, 2011-10-03 at 13:15 -0700, Nathan Rutman wrote:

> Some OSTs may be down at file creation time, or new OSTs added later;
> hence there will likely be holes in the bitmap (but relatively few).
> Start index will still be used, but stripe order will be strictly
> round-robin (we will wrap around).  In other words, the stripe
> sequence will always be in linear OST order, starting from
> start_index, maybe skipping some holes, wrapping around to
> start_index-1.

It didn't occur to me when we spoke at EOFS, but you'd need to store the
number of OSTs in the system when the mapping was created if you allow
it to wrap around -- otherwise, adding OSTs later would cause existing
files to lose track of the objects after the wrap point.

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Lustre-devel] Wide striping
  2011-10-04  0:17 ` David Dillow
@ 2011-10-04 17:44   ` Nathan Rutman
  2011-10-04 21:16     ` David Dillow
  0 siblings, 1 reply; 22+ messages in thread
From: Nathan Rutman @ 2011-10-04 17:44 UTC (permalink / raw)
  To: lustre-devel


On Oct 3, 2011, at 5:17 PM, David Dillow wrote:

> On Mon, 2011-10-03 at 13:15 -0700, Nathan Rutman wrote:
> 
>> Some OSTs may be down at file creation time, or new OSTs added later;
>> hence there will likely be holes in the bitmap (but relatively few).
>> Start index will still be used, but stripe order will be strictly
>> round-robin (we will wrap around).  In other words, the stripe
>> sequence will always be in linear OST order, starting from
>> start_index, maybe skipping some holes, wrapping around to
>> start_index-1.
> 
> It didn't occur to me when we spoke at EOFS, but you'd need to store the
> number of OSTs in the system when the mapping was created if you allow
> it to wrap around -- otherwise, adding OSTs later would cause existing
> files to lose track of the objects after the wrap point.

That's done inherently in the bitmap, where everything beyond the current number of OSTs is marked as a hole.
(So actually, there will typically be one giant hole at the end of every bitmap, and then maybe some singletons for
deactivated OSTs.)
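
To illustrate why the stored bitmap makes a separate OST count unnecessary, here is a small sketch. The fixed 16-bit width and all function names are hypothetical, and `naive_order` is a contrasting scheme invented for this example (wrapping at the current OST count), not something proposed in this thread.

```python
from typing import List

BITMAP_BITS = 16  # hypothetical fixed bitmap width, for illustration only


def make_layout(active_osts: List[int],
                bitmap_bits: int = BITMAP_BITS) -> List[bool]:
    """Build a fixed-width bitmap; every index not listed is a hole."""
    active = set(active_osts)
    return [i in active for i in range(bitmap_bits)]


def stripe_order(bitmap: List[bool], start_index: int) -> List[int]:
    """Walk the bitmap from start_index, wrap at its end, skip holes."""
    n = len(bitmap)
    candidates = [(start_index + s) % n for s in range(n)]
    return [ost for ost in candidates if bitmap[ost]]


def naive_order(stripe_count: int, start_index: int,
                ost_count: int) -> List[int]:
    """Contrasting scheme: wrap at the *current* OST count, no bitmap."""
    return [(start_index + i) % ost_count for i in range(stripe_count)]


if __name__ == "__main__":
    # File created while OSTs 0..5 exist and OST 2 is deactivated.
    layout = make_layout([0, 1, 3, 4, 5])
    print(stripe_order(layout, 4))  # [4, 5, 0, 1, 3]
    # The stored bitmap does not change when OSTs 6 and 7 are added later,
    # so the order above stays the same. The naive scheme shifts instead:
    print(naive_order(6, 4, 6))     # [4, 5, 0, 1, 2, 3]
    print(naive_order(6, 4, 8))     # [4, 5, 6, 7, 0, 1]
```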

> 
> -- 
> Dave Dillow
> National Center for Computational Science
> Oak Ridge National Laboratory
> (865) 241-6602 office
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Lustre-devel] Wide striping
  2011-10-04 17:44   ` Nathan Rutman
@ 2011-10-04 21:16     ` David Dillow
  2011-10-05 15:06       ` Nathan Rutman
  0 siblings, 1 reply; 22+ messages in thread
From: David Dillow @ 2011-10-04 21:16 UTC (permalink / raw)
  To: lustre-devel

On Tue, 2011-10-04 at 10:44 -0700, Nathan Rutman wrote:
> On Oct 3, 2011, at 5:17 PM, David Dillow wrote:
> 
> > On Mon, 2011-10-03 at 13:15 -0700, Nathan Rutman wrote:
> > 
> >> Some OSTs may be down at file creation time, or new OSTs added later;
> >> hence there will likely be holes in the bitmap (but relatively few).
> >> Start index will still be used, but stripe order will be strictly
> >> round-robin (we will wrap around).  In other words, the stripe
> >> sequence will always be in linear OST order, starting from
> >> start_index, maybe skipping some holes, wrapping around to
> >> start_index-1.
> > 
> > It didn't occur to me when we spoke at EOFS, but you'd need to store the
> > number of OSTs in the system when the mapping was created if you allow
> > it to wrap around -- otherwise, adding OSTs later would cause existing
> > files to lose track of the objects after the wrap point.
> 
> That's done inherently in the bitmap, where everything beyond the
> current number of OSTs is marked as a hole. (So actually, there will
> typically be one giant hole at the end of every bitmap, and then maybe
> some singletons for deactivated OSTs.)

Perhaps I'm misunderstanding something, then.

I understood you to say that we would have a linear OST order that
starts from the start_index. So bitmap position 0 would be start_index,
position 1 would be start_index + 1, and so on. If those bits are on,
then there is an object for this file on those OSTs.

Am I on the same page so far?

Now, above you mention wrapping around to start_index - 1; I take this
to mean that at some point, we'd say bitmap position N is no longer OST
start_index + N, but would be OST 0. Bitmap position N + 1 would be OST
1, etc. This scheme may allow for a more compact bitmap when our file
consists of OSTs at the extreme ends of the ones available, but you have
to store the maximum OST number when creating the file to avoid having
the bitmap wrap point shift when you add new OSTs.

Or perhaps I just misunderstood what you meant by wrapping? Did you mean
bitmap position 0 is always OST 0, and the OST indicated by start_index
will hold the first object, and each set bit in turn indicates the next
OST/object, and if we run out of bits in the bitmap before we hit
stripe_count, we'll start checking again at bitmap position/OST 0?
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Lustre-devel] Wide striping
  2011-10-03 20:15 [Lustre-devel] Wide striping Nathan Rutman
  2011-10-04  0:17 ` David Dillow
@ 2011-10-05  0:25 ` wangdi
  2011-10-05  9:28   ` Alexey Lyashkov
  2011-10-05 16:06   ` Nathan Rutman
  2011-10-20 16:24 ` Alex Kulyavtsev
  2 siblings, 2 replies; 22+ messages in thread
From: wangdi @ 2011-10-05  0:25 UTC (permalink / raw)
  To: lustre-devel

Hello, Nathan

On 10/03/2011 01:15 PM, Nathan Rutman wrote:
> Oracle BZ-4424 (continued in WC LU-80) adds support for larger OST 
> stripe counts via increased EXT4 EA sizes.
> Some problems with this are:
> 1) increased MDT storage and network loading for transmitting the 
> object list
> 2) relatively low new limit (1350, up from 160)
>
> We have been thinking about a different wide-striping method that 
> doesn't have these problems. The basic idea is to create a new stripe 
> type that encodes the list of OSTs compactly, and then use the same 
> (or a calculable) object identifier (or FID) on all these OSTs.
>
>
> Our version of widestriping does not involve increasing the EA size at 
> all, but instead utilizes a new stripe pattern.  (This will not be 
> understandable by older Lustre versions, which will generate an error 
> locally, or potentially we can convert into the BZ-4424 form if the 
> layout fits in that format). A bitmap will identify which OSTs hold a 
> stripe of this file. The bitmap should probably fit into current ext4 
> EA size limit, giving us ~32k stripes.
>
> Some OSTs may be down at file creation time, or new OSTs added later; 
> hence there will likely be holes in the bitmap (but relatively few). 
> Start index will still be used, but stripe order will be strictly 
> round-robin (we will wrap around). In other words, the stripe sequence 
> will always be in linear OST order, starting from start_index, maybe 
> skipping some holes, wrapping around to start_index-1.
>
> Widestripe objects do not need a special sequence number (fid_seq); 
> the MDT knows the file was created as widestriped and marks it as such 
> (LOV_PATTERN_BITMAP).  There are two options for OST object 
> identification: common object ID and FID-on-OST.

Actually, we also discussed using a real object (IAM or another index 
format) to store the stripe pattern, instead of using an EA. Of course it 
would use more space, but it would give us the potential to explore the 
stripe pattern.

> Common Object ID
> The MDT tracks a special range of OST object IDs ("wide stripe 
> objectid" = WSO) that are used on all OSTs.  The MDT assigns the next 
> available WSO to the file, and this objectid is used on all the OSTs. 
>  The OSTs must never use these objects for regular striped files.  A 
> special precreation group for these objects is probably necessary, as 
> well as orphan cleanup (the MDT should purge "hole" objects that 
> aren't allocated from a particular OST). The MDT should track the last 
> assigned WSO; this will be the starting point for new wide striped 
> files after recovery. Objects cannot be migrated from one OST to 
> another, since this would result in out-of-order access. Similarly, 
> stripes can never be added to holes.
> FID-on-OST
> Use a mapping of the MDT FID to uniquely determine an OST object.  The 
> clients and MDT add in the OST number to the MDT FID (probably just 
> reserve one sequence per OST).  (This allows the objects to 
> potentially migrate to different OSTs).  The OSTs then internally must 
> map the FID to a local object id.  Note this allows OST-local 
> precreation pools, getting the MDT out of the precreate/orphan cleanup 
> business and potentially improving create speeds, and also facilitates 
> "create on write" semantics.  The FID can be assigned during the first 
> access to OST object.

I am not sure I follow your idea here. Do you mean the OST needs to internally 
map the MDT FID (with the OST number added in) to an object id (or inode ino)? 
So there is no real OST FID? But you also said "The FID can be assigned during 
the first access to OST object." Could you please explain more here?


> The big problem here is that FID->OBJID (or better, FID->inode id) 
> translation is absent from the OSTs today. See 
> http://wiki.lustre.org/images/e/e9/SC09-FID-on-OST.pdf (what is the 
> current state of this?)  There is also some work in this direction in 
> the OST restructuring work ("Orion" WC branch, ORI-300(?), scheduled 
> for Lustre 2.4).
>
> There are a few questions here, probably the first of which is "is it 
> worthwhile to spend effort on this, or is BZ4424 good enough?" Then 
> there is the question of object identification, where FID-on-OST is 
> more flexible, but also significantly more work (and risk). Also, I 
> thought I understood from the EOFS Summit that WC also has a separate 
> FID-on-OST project (separate from Orion that is) -- can someone tell 
> me the state of that?

FID-on-OST is actually part of DNE (distributed namespace) phase I.  It 
basically follows the current FID client/server infrastructure.

1. The MDT is the FID client, which requests FIDs from the OST and allocates 
FIDs for the objects during pre-creation.
2. The OST is the FID server, which allocates FIDs to the MDTs and 
requests a super FID sequence from the FID control server (the root MDT).
3. As with MDT FIDs, there will be an OI to map a FID to an object inside the OST.

The code will be released with DNE sometime next year.

Thanks
WangDi





^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Lustre-devel] Wide striping
  2011-10-05  0:25 ` wangdi
@ 2011-10-05  9:28   ` Alexey Lyashkov
  2011-10-05 18:02     ` Eric Barton
  2011-10-05 18:18     ` Nathan Rutman
  2011-10-05 16:06   ` Nathan Rutman
  1 sibling, 2 replies; 22+ messages in thread
From: Alexey Lyashkov @ 2011-10-05  9:28 UTC (permalink / raw)
  To: lustre-devel

Hi All,
> 
> FID-on-OST is actually part of DNE (distributed namespace) phase I.  It basically follows the current FID client/server infrastructure.
> 
> 1. The MDT is the FID client, which requests FIDs from the OST and allocates FIDs for the objects during pre-creation. 
> 2. The OST is the FID server, which allocates FIDs to the MDTs and requests a super FID sequence from the FID control server (the root MDT).
> 3. As with MDT FIDs, there will be an OI to map a FID to an object inside the OST.
> 
> The code will be released with DNE sometime next year.
> 
I don't think we need special FIDs for OST objects, unless we want to migrate an object between different data containers across the cluster.
I don't think that is a priority for now, so we can simplify FID management for OSTs.
Each data object can be identified by the pair {OST_INDEX / OST_UUID, MDT_FID}.
In that case the OST does not need to allocate any FIDs, and the MDT can reuse the current reallocation scheme.
In fact, we do not need to assign a FID to the OST object at file creation time (that is, when creating the LSM), but we do need a guarantee that a free OST object exists when the client tries to access that object.
For that, the OST can preallocate a pool of objects and report its size to the MDT.
The MDT knows it is using some objects from that pool, but does not know which object ID is assigned to the file.
To avoid confusion on the OST, the client sends the MDT FID to the OST when it needs to access the OST object.
The OST looks in the OI database and checks whether that FID is already assigned to something.
If it is, the lookup returns an inode; otherwise the OST grabs any free object from the pool and assigns it to that FID.
That's all.

Orphan cleanup does not need to change in that case: the MDT sends the last allocated objid, and the OST kills the unallocated objects and returns the last index to the MDT.
The open-unlink case needs to change to put a FID in the LLOG record, and the OST needs to change to handle a FID as the object index.
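
The lookup-or-assign step described above can be sketched roughly as follows. This is a toy in-memory model under stated assumptions: the class and method names are hypothetical, the FID is represented as a plain tuple, and a real OI would be a persistent on-disk index rather than a dict.

```python
from typing import Dict, List, Tuple

# Illustrative FID shape only: (sequence, object id, version).
Fid = Tuple[int, int, int]


class OstObjectIndex:
    """Toy OI: maps an MDT FID to a local object, assigned on first access."""

    def __init__(self, preallocated: List[int]):
        self.free_pool = list(preallocated)  # local objects created in advance
        self.oi: Dict[Fid, int] = {}         # MDT FID -> local object

    def pool_size(self) -> int:
        """Number of free objects left, as the OST would report to the MDT."""
        return len(self.free_pool)

    def lookup_or_assign(self, mdt_fid: Fid) -> int:
        """Return the local object for mdt_fid, assigning one if needed."""
        obj = self.oi.get(mdt_fid)
        if obj is None:
            if not self.free_pool:
                raise RuntimeError("no preallocated objects left")
            obj = self.free_pool.pop(0)      # grab any free object
            self.oi[mdt_fid] = obj           # remember the assignment
        return obj


if __name__ == "__main__":
    ost = OstObjectIndex(preallocated=[10, 11, 12])
    fid = (0x200000400, 1, 0)
    print(ost.lookup_or_assign(fid))  # first access assigns object 10
    print(ost.lookup_or_assign(fid))  # later accesses return the same object
    print(ost.pool_size())            # 2
```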



--------------------------------------------
Alexey Lyashkov
alexey_lyashkov at xyratex.com





^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Lustre-devel] Wide striping
  2011-10-04 21:16     ` David Dillow
@ 2011-10-05 15:06       ` Nathan Rutman
  2011-10-05 15:33         ` David Dillow
  0 siblings, 1 reply; 22+ messages in thread
From: Nathan Rutman @ 2011-10-05 15:06 UTC (permalink / raw)
  To: lustre-devel


On Oct 4, 2011, at 2:16 PM, David Dillow wrote:

> On Tue, 2011-10-04 at 10:44 -0700, Nathan Rutman wrote:
>> On Oct 3, 2011, at 5:17 PM, David Dillow wrote:
>> 
>>> On Mon, 2011-10-03 at 13:15 -0700, Nathan Rutman wrote:
>>> 
>>>> Some OSTs may be down at file creation time, or new OSTs added later;
>>>> hence there will likely be holes in the bitmap (but relatively few).
>>>> Start index will still be used, but stripe order will be strictly
>>>> round-robin (we will wrap around).  In other words, the stripe
>>>> sequence will always be in linear OST order, starting from
>>>> start_index, maybe skipping some holes, wrapping around to
>>>> start_index-1.
>>> 
>>> It didn't occur to me when we spoke at EOFS, but you'd need to store the
>>> number of OSTs in the system when the mapping was created if you allow
>>> it to wrap around -- otherwise, adding OSTs later would cause existing
>>> files to lose track of the objects after the wrap point.
>> 
>> That's done inherently in the bitmap, where everything beyond the
>> current number of OSTs is marked as a hole. (So actually, there will
>> typically be one giant hole at the end of every bitmap, and then maybe
>> some singletons for deactivated OSTs.)
> 
> Perhaps I'm misunderstanding something, then.
> 
> I understood you to say that we would have a linear OST order that
> starts from the start_index. So bitmap position 0 would be start_index,
> position 1 would be start_index + 1, and so on. If those bits are on,
> then there is an object for this file on those OSTs.

Sorry if I'm being unclear.

start_index is just an offset into the bitmap.  That's the OST where the first 
stripe will be.  Next stripe will be on the next OST index (unless a hole). 
When we get to the big hole at the end of the used OSTs, these OST index 
locations are all skipped (since they are holes), and the next stripe will 
be at OST index 0, then 1, etc, up to start_index-1 (again, unless holes).



> 
> Am I on the same page so far?
> 
> Now, above you mention wrapping around to start_index - 1; I take this
> to mean that at some point, we'd say bitmap position N is no longer OST
> start_index + N, but would be OST 0. Bitmap position N + 1 would be OST
> 1, etc. This scheme may allow for a more compact bitmap when our file
> consists of OSTs at the extreme ends of the ones available, but you have
> to store the maximum OST number when creating the file to avoid having
> the bitmap wrap point shift when you add new OSTs.
> 
> Or perhaps I just misunderstood what you meant by wrapping? Did you mean
> bitmap position 0 is always OST 0, and the OST indicated by start_index
> will hold the first object, and each set bit in turn indicates the next
> OST/object, and if we run out of bits in the bitmap before we hit
> stripe_count, we'll start checking again at bitmap position/OST 0?
> -- 
> Dave Dillow
> National Center for Computational Science
> Oak Ridge National Laboratory
> (865) 241-6602 office
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Lustre-devel] Wide striping
  2011-10-05 15:06       ` Nathan Rutman
@ 2011-10-05 15:33         ` David Dillow
  2011-10-06  1:51           ` Andreas Dilger
  0 siblings, 1 reply; 22+ messages in thread
From: David Dillow @ 2011-10-05 15:33 UTC (permalink / raw)
  To: lustre-devel

On Wed, 2011-10-05 at 11:06 -0400, Nathan Rutman wrote:
> Sorry if I'm being unclear.
> 
> start_index is just an offset into the bitmap.  That's the OST where the first 
> stripe will be.  Next stripe will be on the next OST index (unless a hole). 
> When we get to the big hole at the end of the used OSTs, these OST index 
> locations are all skipped (since they are holes), and the next stripe will 
> be at OST index 0, then 1, etc, up to start_index-1 (again, unless holes).

Ok, so bitmap position 0 is always OST 0; thanks for clearing up my
misunderstanding.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Lustre-devel] Wide striping
  2011-10-05  0:25 ` wangdi
  2011-10-05  9:28   ` Alexey Lyashkov
@ 2011-10-05 16:06   ` Nathan Rutman
  2011-10-05 19:44     ` wangdi
  1 sibling, 1 reply; 22+ messages in thread
From: Nathan Rutman @ 2011-10-05 16:06 UTC (permalink / raw)
  To: lustre-devel


On Oct 4, 2011, at 5:25 PM, wangdi wrote:


	Hello, Nathan
	
	On 10/03/2011 01:15 PM, Nathan Rutman wrote: 

		Oracle BZ-4424 (continued in WC LU-80) adds support for larger OST stripe counts via increased EXT4 EA sizes.
		Some problems with this are:
		1) increased MDT storage and network loading for transmitting the object list 
		2) relatively low new limit (1350, up from 160)
		
		
		We have been thinking about a different wide-striping method that doesn't have these problems. The basic idea is to create a new stripe type that encodes the list of OSTs compactly, and then use the same (or a calculable) object identifier (or FID) on all these OSTs.
		 <https://lh4.googleusercontent.com/mxm5R4Yd000I_v5qNcpYH6ZzHBvryGEE6pjxOBWz6ysHUNK0Yjh1J81kmP-5zVaoCiOU8RJv04WMhNoe1JqipOOmtRd7otrZ0saWKUnNyNVvaWvLRD8> 
		
		Our version of widestriping does not involve increasing the EA size at all, but instead utilizes a new stripe pattern.  (This will not be understandable by older Lustre versions, which will generate an error locally, or potentially we can convert into the BZ-4424 form if the layout fits in that format). A bitmap will identify which OSTs hold a stripe of this file. The bitmap should probably fit into current ext4 EA size limit, giving us ~32k stripes.
		
		Some OSTs may be down at file creation time, or new OSTs added later; hence there will likely be holes in the bitmap (but relatively few). Start index will still be used, but stripe order will be strictly round-robin (we will wrap around).  In other words, the stripe sequence will always be in linear OST order, starting from start_index, maybe skipping some holes, wrapping around to start_index-1.
		
		Widestripe objects do not need a special sequence number (fid_seq); the MDT knows the file was created as widestriped and marks it as such (LOV_PATTERN_BITMAP).  There are two options for OST object identification: common object ID and FID-on-OST.
		


	Actually, we also discussed using a real object (IAM or another index format) to store the stripe pattern, instead of using an EA. Of course it would use more space, but it would give us the potential to explore the stripe pattern.
	


One of the main (the only?) benefits of our design (over current BZ4424 widestriping) is that it does not need any more space than the old MDT stripe pattern.  No additional storage, and no additional network traffic to transmit the pattern.




		Common Object ID
		
		The MDT tracks a special range of OST object IDs ("wide stripe objectid" = WSO) that are used on all OSTs.  The MDT assigns the next available WSO to the file, and this objectid is used on all the OSTs.  The OSTs must never use these objects for regular striped files.  A special precreation group for these objects is probably necessary, as well as orphan cleanup (the MDT should purge "hole" objects that aren't allocated from a particular OST). The MDT should track the last assigned WSO; this will be the starting point for new wide striped files after recovery.  Objects cannot be migrated from one OST to another, since this would result in out-of-order access. Similarly, stripes can never be added to holes.
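
A minimal sketch of the common object ID idea, assuming the WSO range is a simple inclusive interval and that the last assigned WSO is persisted somewhere by the MDT. The class and function names are hypothetical, and the range bounds in the usage below are made up.

```python
from typing import List, Optional, Sequence


class WideStripeAllocator:
    """Hypothetical MDT-side allocator for wide stripe object IDs (WSOs)."""

    def __init__(self, wso_start: int, wso_end: int,
                 last_assigned: Optional[int] = None) -> None:
        self.wso_start = wso_start
        self.wso_end = wso_end  # inclusive upper bound of the reserved range
        # After recovery, allocation resumes from the persisted value.
        if last_assigned is None:
            last_assigned = wso_start - 1
        self.last_assigned = last_assigned

    def is_wso(self, object_id: int) -> bool:
        """True if object_id is reserved and must not back a regular file."""
        return self.wso_start <= object_id <= self.wso_end

    def assign(self) -> int:
        """Assign the next WSO; the same ID is then used on every OST."""
        nxt = self.last_assigned + 1
        if nxt > self.wso_end:
            raise RuntimeError("wide stripe object ID range exhausted")
        self.last_assigned = nxt
        return nxt


def hole_osts(bitmap: Sequence[bool]) -> List[int]:
    """OSTs not in the file's bitmap, whose precreated object with this
    WSO would be a "hole" object for the MDT to purge."""
    return [ost for ost, used in enumerate(bitmap) if not used]
```

Usage: `WideStripeAllocator(1000, 1999).assign()` returns 1000 for the first wide-striped file, and an allocator rebuilt with `last_assigned=1000` continues at 1001.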
		
		FID-on-OST
		
		Use a mapping of the MDT FID to uniquely determine an OST object.  The clients and MDT add in the OST number to the MDT FID (probably just reserve one sequence per OST).  (This allows the objects to potentially migrate to different OSTs).  The OSTs then internally must map the FID to a local object id.  Note this allows OST-local precreation pools, getting the MDT out of the precreate/orphan cleanup business and potentially improving create speeds, and also facilitates "create on write" semantics.  The FID can be assigned during the first access to OST object.
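
The OST-local mapping can be sketched like this, assuming a FID is modelled as a `(sequence, object id, version)` tuple and the OST keeps a pool of precreated local object IDs. `OstObjectIndex` is a hypothetical stand-in for an OI-like lookup, not existing Lustre code.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Tuple

Fid = Tuple[int, int, int]  # (sequence, object id, version)


class OstObjectIndex:
    """Hypothetical OST-local index from a FID to a local object ID."""

    def __init__(self, precreated_ids: Iterable[int]) -> None:
        # OST-local precreation pool; the MDT is not involved in filling it.
        self.free: Deque[int] = deque(precreated_ids)
        self.oi: Dict[Fid, int] = {}

    def lookup_or_assign(self, fid: Fid) -> int:
        """Return the local object for fid, binding one on first access."""
        obj = self.oi.get(fid)
        if obj is None:
            if not self.free:
                raise RuntimeError("precreation pool is empty")
            obj = self.free.popleft()
            self.oi[fid] = obj
        return obj
```

Because the binding happens inside `lookup_or_assign`, nothing has to be assigned at file creation time; the first access to the object triggers it, and later accesses with the same FID resolve to the same local object.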


	I am not sure I follow your idea here. You mean the OST needs to internally map the MDT FID (with the OST number added in) to an object id (or inode number)?

yes.


	So there is no real OST FID?

I suppose -- this is just a mapping of the MDT fid to the local OST object id, via a local lookup on the OST.  There would be something like the OI to do this mapping.


	But you also said "The FID can be assigned during the first access to OST object.", Could you please explain more here? 
	


Since the FID -> Objid mapping is performed locally, it doesn't need to be assigned until the first write.  This is not integral to the design, just a side effect.




		The big problem here is that FID->OBJID (or better, FID->inode id) translation is absent from the OSTs today. See http://wiki.lustre.org/images/e/e9/SC09-FID-on-OST.pdf (what is the current state of this?)  There is also some work in this direction in the OST restructuring work ("Orion" WC branch, ORI-300(?), scheduled for Lustre 2.4).
		
		
		There are a few questions here, probably the first of which is "is it worthwhile to spend effort on this, or is BZ4424 good enough?" Then there is the question of object identification, where FID-on-OST is more flexible, but also significantly more work (and risk). Also, I thought I understood from the EOFS Summit that WC also has a separate FID-on-OST project (separate from Orion that is) -- can someone tell me the state of that?


	FID-on-OST is actually part of DNE (Distributed Namespace) phase I.  It basically follows the current FID client/server infrastructure.
	
	1. The MDT is the FID client, which requests FIDs from the OST and allocates FIDs for objects during pre-creation.
	2. The OST is the FID server, which allocates FIDs to MDTs and requests a super FID sequence from the FID control server (root MDT).
	3. As with MDT FIDs, there will be an OI to map a FID to an object inside the OST.
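
The three roles above can be sketched as a small allocation hierarchy. This is a toy model under stated assumptions: sequence ranges are half-open integer intervals, a FID is reduced to a `(sequence, object id)` pair, and the widths and class names are hypothetical rather than taken from the DNE code.

```python
from typing import Tuple


class ControlServer:
    """Stand-in for the FID control server (root MDT)."""

    def __init__(self, super_width: int) -> None:
        self.super_width = super_width  # sequences per super range (assumed)
        self.next_start = 1

    def alloc_super_range(self) -> Tuple[int, int]:
        """Hand out the next half-open range of sequences."""
        start = self.next_start
        self.next_start += self.super_width
        return start, start + self.super_width


class FidServer:
    """Stand-in for an OST acting as FID server."""

    def __init__(self, control: ControlServer) -> None:
        self.control = control
        self.next_seq, self.end_seq = control.alloc_super_range()

    def alloc_sequence(self) -> int:
        """Give one sequence to a client, refilling from the control server."""
        if self.next_seq == self.end_seq:
            self.next_seq, self.end_seq = self.control.alloc_super_range()
        seq = self.next_seq
        self.next_seq += 1
        return seq


class FidClient:
    """Stand-in for an MDT allocating FIDs during pre-creation."""

    def __init__(self, server: FidServer, seq_width: int) -> None:
        self.server = server
        self.seq_width = seq_width  # object IDs per sequence (assumed)
        self.seq = server.alloc_sequence()
        self.next_oid = 1

    def alloc_fid(self) -> Tuple[int, int]:
        """Return the next (sequence, object id) pair."""
        if self.next_oid > self.seq_width:
            self.seq = self.server.alloc_sequence()
            self.next_oid = 1
        fid = (self.seq, self.next_oid)
        self.next_oid += 1
        return fid
```

The point of the layering is that the MDT only contacts the OST when a sequence runs out, and the OST only contacts the control server when its super range runs out.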
	


To integrate with this, we would need to have a reserved sequence on each OST that the MDT could assign FIDs from --
the MDT would need to use the same Object ID on all OSTs.  For DNE, there would need to be a reserved sequence per OST per MDT.
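
One way to picture the reserved sequences is sketched below. The numbering scheme is purely hypothetical (the thread does not define a layout): `RESERVED_SEQ_BASE` and `MAX_OSTS` are made-up constants, and a FID is reduced to a `(sequence, object id)` pair.

```python
from typing import List, Sequence, Tuple

# Hypothetical constants for illustration only.
RESERVED_SEQ_BASE = 0x100000000
MAX_OSTS = 32768


def reserved_seq(mdt_index: int, ost_index: int) -> int:
    """Reserved sequence for wide-striped objects of one MDT on one OST."""
    if mdt_index < 0:
        raise ValueError("mdt_index must be non-negative")
    if not 0 <= ost_index < MAX_OSTS:
        raise ValueError("ost_index out of range")
    return RESERVED_SEQ_BASE + mdt_index * MAX_OSTS + ost_index


def stripe_fids(mdt_index: int, ost_indices: Sequence[int],
                common_oid: int) -> List[Tuple[int, int]]:
    """FIDs of one file's objects: the sequence differs per OST while
    the object ID is the same on all of them."""
    return [(reserved_seq(mdt_index, ost), common_oid) for ost in ost_indices]
```

Since the sequence is calculable from the MDT and OST indices, the layout still only needs the bitmap plus the one common object ID.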




	The code will be released with DNE sometime next year.
	
	Thanks
	WangDi
	
	
	   


______________________________________________________________________
This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
 
Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
 
Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
 
The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
______________________________________________________________________
 



* [Lustre-devel] Wide striping
  2011-10-05  9:28   ` Alexey Lyashkov
@ 2011-10-05 18:02     ` Eric Barton
  2011-10-05 18:44       ` Nathan Rutman
  2011-10-05 18:18     ` Nathan Rutman
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Barton @ 2011-10-05 18:02 UTC (permalink / raw)
  To: lustre-devel

Shadow,

Your comment describes create-on-write (CROW), which is vulnerable
to orphan creation by clients which have been evicted from the MDS
but are not actually dead, unless further safeguards are implemented
such as capabilities or server-cluster-wide client eviction.

I also think that the decision to use FIDs in the way you suggest
has architectural implications which would benefit from further
discussion.  The original idea was that a FID would be all you need
to identify any object (including its target) and that using them
uniformly in this way could help simplify the code and enable further
development - e.g. to allow unified targets which mix namespace and
data objects to better support small/sparse files.

Making the FID just a unique identifier which requires a target index
to specify a specific object doesn't have to be inconsistent with
uniform usage for data and metadata, but it has further knock-on
implications which must be acknowledged and debated explicitly
before we go further.  We really must be confident we've thought
through all the consequences of our architectural decisions before
we invest development effort in them.  It's just too expensive to
reverse a bad decision otherwise.

          Cheers,
                   Eric

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf
> Of Alexey Lyashkov
> Sent: 05 October 2011 10:29 AM
> To: wangdi
> Cc: Alexander Boyko; Lustre Development Mailing List; Artem Blagodarenko; Nathan Rutman
> Subject: Re: [Lustre-devel] Wide striping
> 
> Hi All,
> >
> > FID-on-OST is actually part of DNE (Distributed Namespace) phase I.  It basically follows the current
> > FID client/server infrastructure.
> >
> > 1. The MDT is the FID client, which requests FIDs from the OST and allocates FIDs for objects during
> > pre-creation.
> > 2. The OST is the FID server, which allocates FIDs to MDTs and requests a super FID sequence from the
> > FID control server (root MDT).
> > 3. As with MDT FIDs, there will be an OI to map a FID to an object inside the OST.
> >
> > The code will be released with DNE sometime next year.
> >
> I think we do not need special FIDs for OST objects, unless we want to migrate an object between
> different data containers across the cluster.
> I think that is not a priority for now.
> So we can simplify FID management for OSTs now.
> Each data object may be identified via the pair {OST_INDEX / OST_UUID, MDT_FID}.
> In that case the OST does not need to allocate any FIDs, and the MDT can reuse the current
> reallocation scheme.
> In fact we do not need to assign a FID to the OST object at file creation time (aka creating the
> LSM), but we need a guarantee that a free OST object exists when the client tries to access it.
> In that case the OST can preallocate a pool and report its size to the MDT;
> the MDT knows it uses some objects from that pool, but does not know which object ID is assigned
> to the file.
> To avoid confusing the OST, the client sends the MDT FID to the OST when it needs access to the
> OST object.
> The OST looks in the OI database and checks whether that FID is assigned to something or not.
> If it is assigned, the OI returns an inode; otherwise the OST needs to grab any free object from
> the pool and assign it to that FID.
> That's all.
> 
> Orphan cleanup does not need to change in that case: the MDT sends the last allocated objid, and
> the OST kills the unallocated objects and returns the last index to the MDT.
> The open-unlink case needs to change to put a FID in the LLOG record, and the OST needs to change
> to handle a FID as the object index.
> 
> 
> 
> --------------------------------------------
> Alexey Lyashkov
> alexey_lyashkov at xyratex.com
> 
> 
> 
> 
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel


* [Lustre-devel] Wide striping
  2011-10-05  9:28   ` Alexey Lyashkov
  2011-10-05 18:02     ` Eric Barton
@ 2011-10-05 18:18     ` Nathan Rutman
  2011-10-05 18:23       ` Nathan Rutman
  1 sibling, 1 reply; 22+ messages in thread
From: Nathan Rutman @ 2011-10-05 18:18 UTC (permalink / raw)
  To: lustre-devel


On Oct 5, 2011, at 2:28 AM, Alexey Lyashkov wrote:

> Hi All,
>> 
>> FID-on-OST is actually part of DNE (Distributed Namespace) phase I.  It basically follows the current FID client/server infrastructure.
>> 
>> 1. The MDT is the FID client, which requests FIDs from the OST and allocates FIDs for objects during pre-creation.
>> 2. The OST is the FID server, which allocates FIDs to MDTs and requests a super FID sequence from the FID control server (root MDT).
>> 3. As with MDT FIDs, there will be an OI to map a FID to an object inside the OST.
>> 
>> The code will be released with DNE sometime next year.
>> 
> I think we do not need special FIDs for OST objects, unless we want to migrate an object between different data containers across the cluster.
> I think that is not a priority for now.
> So we can simplify FID management for OSTs now.
> Each data object may be identified via the pair {OST_INDEX / OST_UUID, MDT_FID}.
> In that case the OST does not need to allocate any FIDs, and the MDT can reuse the current reallocation scheme.
> In fact we do not need to assign a FID to the OST object at file creation time (aka creating the LSM), but we need a guarantee that a free OST object exists when the client tries to access it.
> In that case the OST can preallocate a pool and report its size to the MDT;
> the MDT knows it uses some objects from that pool, but does not know which object ID is assigned to the file.
> To avoid confusing the OST, the client sends the MDT FID to the OST when it needs access to the OST object.
> The OST looks in the OI database and checks whether that FID is assigned to something or not.
> If it is assigned, the OI returns an inode; otherwise the OST needs to grab any free object from the pool and assign it to that FID.
> That's all.
> 
> Orphan cleanup does not need to change in that case: the MDT sends the last allocated objid, and the OST kills the unallocated objects and returns the last index to the MDT.
> The open-unlink case needs to change to put a FID in the LLOG record, and the OST needs to change to handle a FID as the object index.
> 
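
The MDT side of the quoted scheme can be sketched as pure bookkeeping: the MDT tracks how many free objects each OST has reported, never which object ID a file gets, and each data object is named by the pair of OST index and MDT FID. This is one reading of the proposal, and `OstPoolAccount` and `create_layout` are hypothetical names.

```python
from typing import Dict, List, Sequence, Tuple

Fid = Tuple[int, int, int]  # (sequence, object id, version)


class OstPoolAccount:
    """MDT-side view of one OST's preallocated pool: a count, no object IDs."""

    def __init__(self, reported_size: int) -> None:
        self.available = reported_size

    def reserve(self) -> None:
        """Consume one free object from the reported pool size."""
        if self.available <= 0:
            raise RuntimeError("no free objects reported by this OST")
        self.available -= 1


def create_layout(mdt_fid: Fid, ost_indices: Sequence[int],
                  accounts: Dict[int, OstPoolAccount]) -> List[Tuple[int, Fid]]:
    """Reserve one object per OST and return the {OST_INDEX, MDT_FID} keys.

    ost_indices is assumed to contain each OST at most once.
    """
    # Check every OST first so a failure leaves no partial reservations.
    for ost in ost_indices:
        if accounts[ost].available <= 0:
            raise RuntimeError("OST %d has no free objects" % ost)
    for ost in ost_indices:
        accounts[ost].reserve()
    return [(ost, mdt_fid) for ost in ost_indices]
```

The reservation only guarantees that a free object will exist when the client first sends the MDT FID to the OST; binding it to a concrete local object is left to the OST.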

What Shadow is saying here (correct me if I'm wrong) is that full-blown FIDs on OSTs are really needed; just a way to map the MDT fid to the local object id.
(The other general class of solution being to reserve a specific range of common ost object id's, and do no mapping.)  Both of these are significantly less
complicated than the DNE FID-on-OST description.

As I was hinting at before, perhaps there's not a very strong case to be made for doing anything other than using the "just make it bigger" solution of BZ4424.
I was trying to gauge the interest of the community in an intermediate solution.


* [Lustre-devel] Wide striping
  2011-10-05 18:18     ` Nathan Rutman
@ 2011-10-05 18:23       ` Nathan Rutman
  0 siblings, 0 replies; 22+ messages in thread
From: Nathan Rutman @ 2011-10-05 18:23 UTC (permalink / raw)
  To: lustre-devel


On Oct 5, 2011, at 11:18 AM, Nathan Rutman wrote:

> 
> On Oct 5, 2011, at 2:28 AM, Alexey Lyashkov wrote:
> 
>> Hi All,
>>> 
>>> FID-on-OST is actually part of DNE (Distributed Namespace) phase I.  It basically follows the current FID client/server infrastructure.
>>> 
>>> 1. The MDT is the FID client, which requests FIDs from the OST and allocates FIDs for objects during pre-creation.
>>> 2. The OST is the FID server, which allocates FIDs to MDTs and requests a super FID sequence from the FID control server (root MDT).
>>> 3. As with MDT FIDs, there will be an OI to map a FID to an object inside the OST.
>>> 
>>> The code will be released with DNE sometime next year.
>>> 
>> I think we do not need special FIDs for OST objects, unless we want to migrate an object between different data containers across the cluster.
>> I think that is not a priority for now.
>> So we can simplify FID management for OSTs now.
>> Each data object may be identified via the pair {OST_INDEX / OST_UUID, MDT_FID}.
>> In that case the OST does not need to allocate any FIDs, and the MDT can reuse the current reallocation scheme.
>> In fact we do not need to assign a FID to the OST object at file creation time (aka creating the LSM), but we need a guarantee that a free OST object exists when the client tries to access it.
>> In that case the OST can preallocate a pool and report its size to the MDT;
>> the MDT knows it uses some objects from that pool, but does not know which object ID is assigned to the file.
>> To avoid confusing the OST, the client sends the MDT FID to the OST when it needs access to the OST object.
>> The OST looks in the OI database and checks whether that FID is assigned to something or not.
>> If it is assigned, the OI returns an inode; otherwise the OST needs to grab any free object from the pool and assign it to that FID.
>> That's all.
>> 
>> Orphan cleanup does not need to change in that case: the MDT sends the last allocated objid, and the OST kills the unallocated objects and returns the last index to the MDT.
>> The open-unlink case needs to change to put a FID in the LLOG record, and the OST needs to change to handle a FID as the object index.
>> 
> 
> What Shadow is saying here (correct me if I'm wrong) is that full-blown FIDs on OSTs are really needed;
s/are/aren't/   :(

> just a way to map the MDT fid to the local object id.
> (The other general class of solution being to reserve a specific range of common ost object id's, and do no mapping.)  Both of these are significantly less
> complicated than the DNE FID-on-OST description.
> 
> As I was hinting at before, perhaps there's not a very strong case to be made for doing anything other than using the "just make it bigger" solution of BZ4424.
> I was trying to gauge the interest of the community in an intermediate solution.


* [Lustre-devel] Wide striping
  2011-10-05 18:02     ` Eric Barton
@ 2011-10-05 18:44       ` Nathan Rutman
  0 siblings, 0 replies; 22+ messages in thread
From: Nathan Rutman @ 2011-10-05 18:44 UTC (permalink / raw)
  To: lustre-devel


On Oct 5, 2011, at 11:02 AM, Eric Barton wrote:

> Shadow,
> 
> Your comment describes create-on-write (CROW), which is vulnerable
> to orphan creation by clients which have been evicted from the MDS
> but are not actually dead, unless further safeguards are implemented
> such as capabilities or server-cluster-wide client eviction.
create-on-write isn't really an integral part of this design, just a side thought.  Let's leave it out of the 
discussion for now.
> 
> I also think that the decision to use FIDs in the way you suggest
> has architectural implications which would benefit from further
> discussion.  The original idea was that a FID would be all you need
> to identify any object (including its target) and that using them
> uniformly in this way could help simplify the code and enable further
> development - e.g. to allow unified targets which mix namespace and
> data objects to better support small/sparse files.
> 
> Making the FID just a unique identifier which requires a target index
> to specify a specific object doesn't have to be inconsistent with
> uniform usage for data and metadata, but it has further knock-on
> implications which must be acknowledged and debated explicitly
> before we go further.  We really must be confident we've thought
> through all the consequences of our architectural decisions before
> we invest development effort in them.  It's just too expensive to
> reverse a bad decision otherwise.
That's what we're trying to do now :)

The issue as I see it is that we're thinking about a feature that could be useful 
today, and is implementable today, except for the fact that there are some 
longer term plans that might conflict.  Our wide-striping could be implemented
on top of WC's future FID-on-OST plans -- but would require that code to exist.
So then we have to decide if waiting is the best option, or whether a more 
minimal change (probably the "common object ID" from my original arch email)
could land first, and then DNE FID-on-OST could change it later. 


> 
>          Cheers,
>                   Eric
> 
>> -----Original Message-----
>> From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf
>> Of Alexey Lyashkov
>> Sent: 05 October 2011 10:29 AM
>> To: wangdi
>> Cc: Alexander Boyko; Lustre Development Mailing List; Artem Blagodarenko; Nathan Rutman
>> Subject: Re: [Lustre-devel] Wide striping
>> 
>> Hi All,
>>> 
>>> FID-on-OST is actually part of DNE (Distributed Namespace) phase I.  It basically follows the current
>>> FID client/server infrastructure.
>>> 
>>> 1. The MDT is the FID client, which requests FIDs from the OST and allocates FIDs for objects during
>>> pre-creation.
>>> 2. The OST is the FID server, which allocates FIDs to MDTs and requests a super FID sequence from the
>>> FID control server (root MDT).
>>> 3. As with MDT FIDs, there will be an OI to map a FID to an object inside the OST.
>>> 
>>> The code will be released with DNE sometime next year.
>>> 
>> I think we do not need special FIDs for OST objects, unless we want to migrate an object between
>> different data containers across the cluster.
>> I think that is not a priority for now.
>> So we can simplify FID management for OSTs now.
>> Each data object may be identified via the pair {OST_INDEX / OST_UUID, MDT_FID}.
>> In that case the OST does not need to allocate any FIDs, and the MDT can reuse the current
>> reallocation scheme.
>> In fact we do not need to assign a FID to the OST object at file creation time (aka creating the
>> LSM), but we need a guarantee that a free OST object exists when the client tries to access it.
>> In that case the OST can preallocate a pool and report its size to the MDT;
>> the MDT knows it uses some objects from that pool, but does not know which object ID is assigned
>> to the file.
>> To avoid confusing the OST, the client sends the MDT FID to the OST when it needs access to the
>> OST object.
>> The OST looks in the OI database and checks whether that FID is assigned to something or not.
>> If it is assigned, the OI returns an inode; otherwise the OST needs to grab any free object from
>> the pool and assign it to that FID.
>> That's all.
>> 
>> Orphan cleanup does not need to change in that case: the MDT sends the last allocated objid, and
>> the OST kills the unallocated objects and returns the last index to the MDT.
>> The open-unlink case needs to change to put a FID in the LLOG record, and the OST needs to change
>> to handle a FID as the object index.
>> 
>> 
>> 
>> --------------------------------------------
>> Alexey Lyashkov
>> alexey_lyashkov at xyratex.com
>> 
>> 
>> 
>> 


* [Lustre-devel] Wide striping
  2011-10-05 16:06   ` Nathan Rutman
@ 2011-10-05 19:44     ` wangdi
  2011-10-05 23:31       ` Oleg Drokin
  0 siblings, 1 reply; 22+ messages in thread
From: wangdi @ 2011-10-05 19:44 UTC (permalink / raw)
  To: lustre-devel

On 10/05/2011 09:06 AM, Nathan Rutman wrote:
>
>>
>>> Common Object ID
>>>
>>> The MDT tracks a special range of OST object IDs ("wide stripe
>>> objectid" = WSO) that are used on all OSTs.  The MDT assigns the
>>> next available WSO to the file, and this objectid is used on all
>>> the OSTs.  The OSTs must never use these objects for regular
>>> striped files.  A special precreation group for these objects is
>>> probably necessary, as well as orphan cleanup (the MDT should purge
>>> "hole" objects that aren't allocated from a particular OST). The
>>> MDT should track the last assigned WSO; this will be the starting
>>> point for new wide striped files after recovery.  Objects cannot be
>>> migrated from one OST to another, since this would result in
>>> out-of-order access. Similarly, stripes can never be added to
>>> holes.
>>>
>>> FID-on-OST
>>>
>>> Use a mapping of the MDT FID to uniquely determine an OST object.
>>> The clients and MDT add in the OST number to the MDT FID (probably
>>> just reserve one sequence per OST).  (This allows the objects to
>>> potentially migrate to different OSTs).  The OSTs then internally
>>> must map the FID to a local object id.  Note this allows OST-local
>>> precreation pools, getting the MDT out of the precreate/orphan
>>> cleanup business and potentially improving create speeds, and also
>>> facilitates "create on write" semantics.  The FID can be assigned
>>> during the first access to OST object.
>>
>> I am not sure I follow your idea here. You mean the OST needs 
>> internally map MDT FID(added in OST number) to object id (or inode ino) ?
> yes.
>
>> So there are no real OST FID?
> I suppose -- this is just a mapping of the MDT fid to the local OST 
> object id, via a local lookup on the OST.  There would be something 
> like the OI to do this mapping.
>
>> But you also said "The FID can be
>> assigned during the first access to OST object.", Could you please
>> explain more here?
>
> Since the FID -> Objid mapping is performed locally, it doesn't need 
> to be assigned until the first write.  This is not integral to the 
> design, just a side effect.

Ah, you mean the object ID can be assigned during the first access, 
instead of the FID? This is indeed an interesting idea, and does not need 
extra space. But it may impose some limits in the future (what if we 
decide to store some small file data on the MDT directly?). It would also 
add another difference between MDT and OST, which probably conflicts with 
the efforts to unify MDT and OST.  I still prefer to have a real OST 
FID, i.e. every object has its own identification in the cluster. Please 
correct me if I missed the point of your suggestion.

Thanks
Wangdi

>
>>
>>
>>> The big problem here is that FID->OBJID (or better,
>>> FID->inode id) translation is absent from the OSTs
>>> today. See http://wiki.lustre.org/images/e/e9/SC09-FID-on-OST.pdf
>>> (what is the current state of this?)  There is also some
>>> work in this direction in the OST restructuring work
>>> ("Orion" WC branch, ORI-300(?), scheduled for Lustre 2.4).
>>>
>>>
>>> There's
>>> a few questions here, probably the first of which is "is it 
>>> worthwhile to
>>> spend effort on this, or is BZ4424 good enough?" Then there
>>> is the question of object identification, where FID-on-OST
>>> is more flexible, but also significantly more work (and
>>> risk). Also, I thought I understood from the EOFS Summit
>>> that WC also has a separate FID-on-OST project (separate
>>> from Orion that is) -- can someone tell me the state of
>>> that?
>>
>> FID-on-OST is actually part of DNE (Distributed Namespace) phase I.  
>> It basically follows the current FID client/server infrastructure.
>>
>> 1. The MDT is the FID client, which requests FIDs from the OST and 
>> allocates FIDs for objects during pre-creation.
>> 2. The OST is the FID server, which allocates FIDs to MDTs and 
>> requests a super FID sequence from the FID control server (root MDT).
>> 3. As with MDT FIDs, there will be an OI to map a FID to an object inside the OST.
>
> To integrate with this, we would need to have a reserved sequence on 
> each OST that the MDT could assign FIDs from --
> the MDT would need to use the same Object ID on all OSTs.  For DNE, 
> there would need to be a reserved sequence per OST per MDT.
>
>
>>
>> The code will be released with DNE sometime next year.
>>
>> Thanks
>> WangDi
>>
>>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20111005/837cc869/attachment.htm>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Lustre-devel] Wide striping
  2011-10-05 19:44     ` wangdi
@ 2011-10-05 23:31       ` Oleg Drokin
  2011-10-05 23:56         ` David Dillow
  0 siblings, 1 reply; 22+ messages in thread
From: Oleg Drokin @ 2011-10-05 23:31 UTC (permalink / raw)
  To: lustre-devel

Hello!

On Oct 5, 2011, at 3:44 PM, wangdi wrote:
>> Since the FID -> Objid mapping is performed locally, it doesn't need to be assigned until the first write.  This is not integral to the design, just a side effect.
> Ah, you mean the object ID can be assigned during the first access, instead of the FID? This is indeed an interesting idea, and it does not need extra space. But it may impose some limits in the future (what if we decide to store some small file data on the MDT directly?). It will also add another difference between MDT and OST, which probably conflicts with the efforts to unify MDT and OST. I still prefer to have a real OST FID, i.e. every object has its own identification in the cluster. Please correct me if I missed the point of your suggestion.

Other problems I see here are similar to those of create-on-write.
Say we delete a file: do we purge its entry from this mapping table too? And then, when a stale client comes along, do we recreate an orphan object?
Or do we not purge the table and let it grow indefinitely, using more and more space and eventually slowing down lookups?
Or do we purge only really old objects from it? What triggers that, and what failure scenarios are there for this process?
How do we recover from disasters that happen to this table?

Bye,
    Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.

* [Lustre-devel] Wide striping
  2011-10-05 23:31       ` Oleg Drokin
@ 2011-10-05 23:56         ` David Dillow
  2011-10-07 10:29           ` Oleg Drokin
  0 siblings, 1 reply; 22+ messages in thread
From: David Dillow @ 2011-10-05 23:56 UTC (permalink / raw)
  To: lustre-devel

On Wed, 2011-10-05 at 19:31 -0400, Oleg Drokin wrote:
> Hello!
> 
> On Oct 5, 2011, at 3:44 PM, wangdi wrote:
> >> Since the FID -> Objid mapping is performed locally, it doesn't
> >> need to be assigned until the first write.  This is not integral to
> >> the design, just a side effect.
> > Ah, you mean the object ID can be assigned during the first access,
> > instead of the FID? This is indeed an interesting idea, and it does not
> > need extra space. But it may impose some limits in the future (what if
> > we decide to store some small file data on the MDT directly?). It will
> > also add another difference between MDT and OST, which probably
> > conflicts with the efforts to unify MDT and OST. I still prefer to
> > have a real OST FID, i.e. every object has its own identification in
> > the cluster. Please correct me if I missed the point of your suggestion.
> 
> Other problems I see here are similar to those of create-on-write.
> Say we delete a file: do we purge its entry from this mapping table too?
> And then, when a stale client comes along, do we recreate an orphan object?
> Or do we not purge the table and let it grow indefinitely, using more
> and more space and eventually slowing down lookups?
> Or do we purge only really old objects from it? What triggers that, and
> what failure scenarios are there for this process?
> How do we recover from disasters that happen to this table?

Wouldn't the online lfsck work being done for OpenSFS catch and correct
these types of problems?

It seems like it could provide a base for purging/compacting the table
as well, but that's obviously going to be a complicated endeavor....
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

* [Lustre-devel] Wide striping
  2011-10-05 15:33         ` David Dillow
@ 2011-10-06  1:51           ` Andreas Dilger
  2011-10-06 19:09             ` Nathan Rutman
  0 siblings, 1 reply; 22+ messages in thread
From: Andreas Dilger @ 2011-10-06  1:51 UTC (permalink / raw)
  To: lustre-devel

On 2011-10-05, at 9:33 AM, David Dillow <dillowda@ornl.gov> wrote:
> On Wed, 2011-10-05 at 11:06 -0400, Nathan Rutman wrote:
>> Sorry if I'm being unclear.
>> 
>> start_index is just an offset into the bitmap.  That's the OST where the first 
>> stripe will be.  Next stripe will be on the next OST index (unless a hole). 
>> When we get to the big hole at the end of the used OSTs, these OST index 
>> locations are all skipped (since they are holes), and the next stripe will 
>> be at OST index 0, then 1, etc, up to start_index-1 (again, unless holes).
> 
> Ok, so bitmap position 0 is always OST 0; thanks for clearing up my
> misunderstanding.

But this means that the table always needs to be as large as the maximum OST number. If the bitmap started at the starting OST index it would only need to be as large as the number of stripes. 
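
To make the ordering under discussion concrete, here is a minimal sketch of how a client might walk a full-width bitmap (position 0 is always OST 0) starting from start_index and wrapping around. The function name and the list-of-booleans representation are illustrative assumptions, not anything from the proposal itself.

```python
def stripe_order(bitmap, start_index):
    """Return OST indices in stripe order.

    bitmap[i] is True when OST i holds a stripe of the file
    (bitmap position 0 is always OST 0). Holes are skipped and the
    walk wraps from the end of the bitmap back to OST 0.
    """
    n = len(bitmap)
    return [(start_index + k) % n
            for k in range(n)
            if bitmap[(start_index + k) % n]]


# Hypothetical 8-slot bitmap: OST 3 is a hole, and OSTs 6-7 are the
# "big hole" past the end of the used OSTs.
bitmap = [True, True, True, False, True, True, False, False]
order = stripe_order(bitmap, start_index=2)
print(order)  # [2, 4, 5, 0, 1]
```

The stripe sequence is fully determined by the bitmap and start_index, which is also why this layout implies a fixed ordering.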

That said, the limitation of not being able to migrate objects with this layout is a major one. The ability to do online object migration is just arriving with the layout lock (from HSM), so I expect this to be useful to many users.

Cheers, Andreas

* [Lustre-devel] Wide striping
  2011-10-06  1:51           ` Andreas Dilger
@ 2011-10-06 19:09             ` Nathan Rutman
  0 siblings, 0 replies; 22+ messages in thread
From: Nathan Rutman @ 2011-10-06 19:09 UTC (permalink / raw)
  To: lustre-devel


On Oct 5, 2011, at 6:51 PM, Andreas Dilger wrote:

> On 2011-10-05, at 9:33 AM, David Dillow <dillowda@ornl.gov> wrote:
>> On Wed, 2011-10-05 at 11:06 -0400, Nathan Rutman wrote:
>>> Sorry if I'm being unclear.
>>> 
>>> start_index is just an offset into the bitmap.  That's the OST where the first 
>>> stripe will be.  Next stripe will be on the next OST index (unless a hole). 
>>> When we get to the big hole at the end of the used OSTs, these OST index 
>>> locations are all skipped (since they are holes), and the next stripe will 
>>> be at OST index 0, then 1, etc, up to start_index-1 (again, unless holes).
>> 
>> Ok, so bitmap position 0 is always OST 0; thanks for clearing up my
>> misunderstanding.
> 
> But this means that the table always needs to be as large as the maximum OST number. If the bitmap started at the starting OST index it would only need to be as large as the number of stripes. 
Yes, the table is as large as the maximum possible OST number.  32,000 stripes fit in a bitmap in the current (non-extended) EA size.  If you started at the starting OST index, you would need to record the last OST number also.  Either way, I don't see it as a problem.
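
As a rough check of the sizing claim, the arithmetic can be sketched as below. The 4096-byte figure is an assumption standing in for the non-extended EA limit (the thread only says "~32k stripes"); any header overhead in the layout EA is ignored here.

```python
def bitmap_bytes(max_osts):
    """Bytes needed for a bitmap with one bit per possible OST index."""
    return (max_osts + 7) // 8


# Assumption for illustration: the non-extended EA holds about one
# 4 KiB block. The exact usable size is not stated in the thread.
ASSUMED_EA_LIMIT = 4096

print(bitmap_bytes(32000))     # 4000 bytes for 32,000 possible OSTs
print(ASSUMED_EA_LIMIT * 8)    # 32768 bits available under the assumption
```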

> 
> That said, the limitation of not being able to migrate objects with this layout is a major one. The ability to do online object migration is just arriving with the layout lock (from HSM), so I expect this to be useful to many users.
Well, that's why we added the complication of embedding the OST index into the object FIDs that the clients would ask for. Then you could migrate that object to a new OST - but really only for exceptional cases.  General migration for e.g. space rebalancing would result in a bunch of additional overhead to figure out where all the stripes moved to.  So I agree - this is a weakness of the bitmap design, which really implies a fixed ordering.

* [Lustre-devel] Wide striping
  2011-10-05 23:56         ` David Dillow
@ 2011-10-07 10:29           ` Oleg Drokin
  0 siblings, 0 replies; 22+ messages in thread
From: Oleg Drokin @ 2011-10-07 10:29 UTC (permalink / raw)
  To: lustre-devel

Hello!

On Oct 5, 2011, at 7:56 PM, David Dillow wrote:
>> Other problems I see here are similar to those of create-on-write.
>> Say we delete a file: do we purge its entry from this mapping table too?
>> And then, when a stale client comes along, do we recreate an orphan object?
>> Or do we not purge the table and let it grow indefinitely, using more
>> and more space and eventually slowing down lookups?
>> Or do we purge only really old objects from it? What triggers that, and
>> what failure scenarios are there for this process?
>> How do we recover from disasters that happen to this table?
> Wouldn't the online lfsck work being done for OpenSFS catch and correct
> these types of problems?

It probably would, once the online lfsck is implemented.

Bye,
    Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.

* [Lustre-devel] Wide striping
  2011-10-03 20:15 [Lustre-devel] Wide striping Nathan Rutman
  2011-10-04  0:17 ` David Dillow
  2011-10-05  0:25 ` wangdi
@ 2011-10-20 16:24 ` Alex Kulyavtsev
  2011-10-20 18:45   ` Andreas Dilger
  2 siblings, 1 reply; 22+ messages in thread
From: Alex Kulyavtsev @ 2011-10-20 16:24 UTC (permalink / raw)
  To: lustre-devel


On Oct 3, 2011, at 3:15 PM, Nathan Rutman wrote:

> ... snip...
> We have been thinking about a different wide-striping method that  
> doesn't have these problems. The basic idea is to create a new  
> stripe type that encodes the list of OSTs compactly, and then using  
> the same (or a calculable) object identifier (or FID) on all these  
> OSTs.
>
>
> Our version of widestriping does not involve increasing the EA size  
> at all, but instead utilizes a new stripe pattern.  (This will not  
> be understandable by older Lustre versions, which will generate an  
> error locally, or potentially we can convert into the BZ-4424 form  
> if the layout fits in that format). A bitmap will identify which  
> OSTs hold a stripe of this file. The bitmap should probably fit into  
> current ext4 EA size limit, giving us ~32k stripes.
>
> Some OSTs may be down at file creation time, or new OSTs added
> later; hence there will likely be holes in the bitmap (but  
> relatively few).

1) There will be holes when OST pools are used: if the file can be written
only to the set of OSTs from a specific OST pool, and if by virtue of
configuration the OSTs in the pool do not form a contiguous set, then
there will be holes in the OST bitmap even if all OSTs are online.

2) "relatively few holes [in bitmap]" - did you consider compressing the
bitmap, e.g. with BBC or WAH as described at
en.wikipedia.org/wiki/Bitmap_index#Compression ? Reportedly you can do
bitwise operations without decompression. This way you can go up in the
number of stripes (well, 32k is a big number). It may also help control
RPC size: you may represent wide striping with a few integers effectively
representing contiguous blocks and OST holes. The size of the descriptor
is then a function of the number of blocks and holes and, to a lesser
extent, of the number of stripes.
More:
It is possible to have two bitmaps:
0000000111111111000000111111 - one describing general "blocks" of OSTs
= ((beg1,end1),(beg2,end2))
0000000000000010100100100000 - the other describing "corrections" - drop
two OSTs, add two OSTs; here 4 bits, compressed to X bytes
0000000111111101100100011111 - the OST map, computed on the client as the
bitwise XOR of uncompressed maps (1) and (2)
Each of the two maps is compressed for transfer, and thus should not take
much space.
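
The XOR step on the two uncompressed maps can be checked with a few lines of Python. This is only a sketch of the final combination on the client; the compression of each map for transfer is not modelled, and the helper name is made up for illustration.

```python
# The "blocks" and "corrections" maps from the example above,
# written as bit strings of equal length.
blocks      = "0000000111111111000000111111"
corrections = "0000000000000010100100100000"


def xor_bitmaps(a, b):
    """Bitwise XOR of two equal-length bit strings, zero-padded."""
    if len(a) != len(b):
        raise ValueError("bitmaps must have the same length")
    return format(int(a, 2) ^ int(b, 2), "0{}b".format(len(a)))


ost_map = xor_bitmaps(blocks, corrections)
print(ost_map)  # 0000000111111101100100011111
```

Each set bit in the corrections map flips one position: a 1 inside a block drops that OST, a 1 outside a block adds one.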

3) If the metadata file format is going to be changed, is this the right
time to reserve descriptors for a few replicas of the file data?

In that case we need the number of replicas and a layout descriptor
for each replica. Each replica may have a different number of stripes,
so you can have a widely striped file replica on SAS disks (or in
flash) and replicate it to slower disk storage with one or a "few"
stripes for further tape archival.
I assume that after the initial writes a file has more or less "stable"
content. Replicas can be on different media types, like flash / SAS /
SATA, fast / cheap disks, effectively Hierarchical Storage.
I'm thinking about "lazy" replication, as you implemented to replicate
data to another file system, but in this case replication is within the
same Lustre file system. The client becomes aware of multiple replicas and
can choose which file replica to use (e.g. when some OSTs are down). It
eliminates the OST as a single point of failure.

Alex.

* [Lustre-devel] Wide striping
  2011-10-20 16:24 ` Alex Kulyavtsev
@ 2011-10-20 18:45   ` Andreas Dilger
       [not found]     ` <73AED5C780AE05478241DB067651A921023788CF@XYUS-EX22.xyus.xyratex.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Andreas Dilger @ 2011-10-20 18:45 UTC (permalink / raw)
  To: lustre-devel

On 2011-10-20, at 10:24 AM, Alex Kulyavtsev wrote:
> On Oct 3, 2011, at 3:15 PM, Nathan Rutman wrote:
>> We have been thinking about a different wide-striping method that doesn't have these problems. The basic idea is to create a new stripe type that encodes the list of OSTs compactly, and then using the same (or a calculable) object identifier (or FID) on all these OSTs.
>> 
>> Our version of widestriping does not involve increasing the EA size at all, but instead utilizes a new stripe pattern.  (This will not be understandable by older Lustre versions, which will generate an error locally, or potentially we can convert into the BZ-4424 form if the layout fits in that format). A bitmap will identify which OSTs hold a stripe of this file. The bitmap should probably fit into current ext4 EA size limit, giving us ~32k stripes.
>> 
>> Some OSTs may be down at file creation time, or new OSTs added later; hence there will likely be holes in the bitmap (but relatively few). 
> 
> 1) There will be holes when OST pools used: if the file can be written only to the set of OST from specific OST POOL and if by the virtue of configuration OSTs in the  pool do not represent continuous set then there will be holes in OST bit map even if all OSTs are online.

Since the membership in a pool can change after a file is allocated,
there cannot be anything in the layout that depends on the current
membership of the pool.  In this regard, the layout of a file that
is allocated in the pool should be identical to a non-pool file, with
the exception that it saves the pool name in which the file was created.
That allows future operations (migration, replication, etc) to take the
originally requested pool of the user into account.

> 2) "relatively few holes [in bitmap]" - did you consider compressing bitmap? Like BBC or WAH described at en.wikipedia.org/wiki/Bitmap_index#Compression ? Reportedly you can do bitwise operations without decompression. This way you can go up in number of stripes (well, 32k is big number). But it may help control RPC size - you may represent wide striping with few integers effectively representing continuous blocks and OST holes, the size of the descriptor is the function of # of blocks and holes and to the less extent function of number of stripes.

I think that having some kind of bitmap compression seems reasonable,
and extends the number of stripes that can be fit into a single layout
for most cases.  Originally I was thinking that in addition to saving
the starting index of the bitmap, we could also save the index at which
the bitmap wraps back to 0 (i.e. bit N = (start_idx + N) % wrap_idx),
but if there is bitmap compression then the run of zeroes between the
starting index and the (lower) ending index could be stored efficiently
as well.
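
A small sketch of the wrap-index mapping mentioned above, under the reading that bit N of a start-relative bitmap refers to OST (start_idx + N) % wrap_idx. The function name and sample values are illustrative only.

```python
def bit_to_ost(n, start_idx, wrap_idx):
    """OST index that bit n of a start-relative bitmap refers to."""
    return (start_idx + n) % wrap_idx


# Hypothetical layout: the bitmap starts at OST 5 and wraps back to
# OST 0 after OST 7 (wrap_idx = 8).
osts = [bit_to_ost(n, start_idx=5, wrap_idx=8) for n in range(6)]
print(osts)  # [5, 6, 7, 0, 1, 2]
```

With this mapping the bitmap only needs as many bits as the span of OSTs used, at the cost of storing wrap_idx alongside start_idx.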

> More:
> It is possible to have two bitmaps: 
> 0000000111111111000000111111 - one describing general "blocks" of OST = ((beg1,end1),(beg2,end2))
> 0000000000000010100100100000 - other describing "corrections" - drop two OST, add two OST ; here 4 bits, compressed to X bytes 
> 0000000111111101100100011111 - OST map, computed on client as bitwise XOR to uncompressed maps (1) and (2)
> Each of two maps is compressed for transfer, thus shall not take much space.

Originally, I was thinking that we don't need to do boolean operations
on the compressed bitmaps, but then I recall an idea I had many, many
years ago about clients sending the "desired" (AND "available") OSC
bitmap to the MDS.  When the MDS is allocating objects on the OSTs it
can AND the client bitmap with its allocation bitmap ("pool" bitmap AND
"available objects" bitmap) to get the subset of OSTs where objects can
be allocated.

If we can do operations directly on the compressed bitmaps, not only does
it save space, but it also saves cycles doing the operations.
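
On uncompressed bitmaps the selection described above is a plain chain of ANDs. The sketch below uses Python integers as bitmaps with bit i standing for OST i; all four sample bitmaps are made-up values, and operating directly on compressed (BBC or WAH) bitmaps is not shown.

```python
# Bit i (counting from the least significant bit) stands for OST i.
# All values below are invented for illustration.
client_desired   = 0b11110000  # OSTs the client would like to use
client_available = 0b11111100  # OSTs the client can currently reach
pool_bitmap      = 0b10101010  # OSTs belonging to the requested pool
has_objects      = 0b11111110  # OSTs with precreated objects available

client_bitmap = client_desired & client_available
mds_bitmap = pool_bitmap & has_objects
candidates = client_bitmap & mds_bitmap

# Expand the result into a list of OST indices.
candidate_osts = [i for i in range(8) if (candidates >> i) & 1]
print(candidate_osts)  # [5, 7]
```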

> 3) If metadata file format going to be changed, is it right time to reserve descriptors to have few replicas of the file data?
>  
> In such case we need to have number of replicas, and layout descriptor for each replica. Each replica may have different number of stripes, thus you can have widely striped file replica on SAS disks (or in flash) and replicate it to slower disk storage with one or "few" stripes for further tape archival.

Right.  I've always thought that the different replicas of the file
would have completely independent layouts, to allow what you suggest.
The striping of a file would be completely different for nearline storage
and archival storage (different OST counts at each layer vs. tape drives).

>  I assume after initial writes file has more or less "stable" content. Replicas can be on different media type, like flash/ SAS/ SATA, fast / cheap disks, effectively Hierarchical Storage.
> I'm thinking about "lazy" replication as you implemented to replicate data to another file system but in this case replication is within the same lustre file system. Client became aware of multiple replicas and can chose  what file replica to use (e.g when some OSTs down). It eliminates OST as single point of failure.

Yes, my initial goal is to have background file replication as opposed
to real-time replication.  The main reason is due to the complexity of
the implementation being lower.  In fact, once we have decided on a new
layout format for RAID-1+0 files, background replication and internal
file migration can largely be implemented with the HSM code.

Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.

* [Lustre-devel] Wide striping
       [not found]     ` <73AED5C780AE05478241DB067651A921023788CF@XYUS-EX22.xyus.xyratex.com>
@ 2011-10-20 20:15       ` Alex Kulyavtsev
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Kulyavtsev @ 2011-10-20 20:15 UTC (permalink / raw)
  To: lustre-devel


On Oct 20, 2011, at 2:08 PM, Nathan Rutman wrote:

>
> On Oct 20, 2011, at 11:45 AM, Andreas Dilger wrote:
>
>> On 2011-10-20, at 10:24 AM, Alex Kulyavtsev wrote:
>>> On Oct 3, 2011, at 3:15 PM, Nathan Rutman wrote:
>>>> We have been thinking about a different wide-striping method that  
>>>> doesn't have these problems. The basic idea is to create a new  
>>>> stripe type that encodes the list of OSTs compactly, and then  
>>>> using the same (or a calculable) object identifier (or FID) on  
>>>> all these OSTs.
>>>>
>>>
>>> 1) There will be holes when OST pools used: if the file can be  
>>> written only to the set of OST from specific OST POOL and if by  
>>> the virtue of configuration OSTs in the  pool do not represent  
>>> continuous set then there will be holes in OST bit map even if all  
>>> OSTs are online.
>>
>> Since the membership in a pool can change after a file is allocated,
>> there cannot be anything in the layout that depends on the current
>> membership of the pool.  In this regard, the layout of a file that
>> is allocated in the pool should be identical to a non-pool file, with
>> the exception that it saves the pool name in which the file was  
>> created.
>> That allows future operations (migration, replication, etc) to take  
>> the
>> originally requested pool of the user into account.
> Yes, exactly like current striping works -- pool name is recorded,  
> but is only informational: actual striping is explicitly recorded.
Sorry for not being clear; I agree the file is laid out at creation
time.
I'm just trying to make the point that pool configuration is another source
of holes in the bitmap, in addition to OSTs being down.

Suppose a user purchased eight OSTs each year for three years, and
allocated four OSTs to pool1 and four to pool2 each year.
The OST numbering gets mixed, and the OSTs are assigned as follows:
1111 0000   1111 0000   1111 0000 - pool1
0000 1111   0000 1111   0000 1111 - pool2

All OSTs are up, and a file was striped across all OSTs in pool1. Thus
the file layout is like
1111 0000   1111 0000   1111 0000
The file has holes in its OST layout because of the pool configuration.
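
To put numbers on this example, the sketch below counts the holes in the pool1 layout and shows what a naive run-length encoding of it would look like. Plain run-length encoding is used here only as a stand-in; it is not BBC or WAH, and the helper name is hypothetical.

```python
from itertools import groupby


def run_lengths(bits):
    """Collapse a bit string into (bit, run_length) pairs."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]


# Layout of a file striped across every OST in pool1 (spaces removed).
layout = "111100001111000011110000"

holes = layout.count("0")
runs = run_lengths(layout)
print(holes)      # 12 of the 24 bitmap positions are holes
print(len(runs))  # 6 runs describe the whole layout
```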

>
>>
>>> 2) "relatively few holes [in bitmap]" - did you consider  
>>> compressing bitmap? Like BBC or WAH described at
>>> en.wikipedia.org/wiki/Bitmap_index#Compression ? Reportedly you can
>>> do bitwise operations without decompression. This way you can go up in number
>>> of stripes (well, 32k is big number). But it may help control RPC  
>>> size - you may represent wide striping with few integers  
>>> effectively representing continuous blocks and OST holes, the size  
>>> of the descriptor is the function of # of blocks and holes and to  
>>> the less extent function of number of stripes.
>>
>> I think that having some kind of bitmap compression seems reasonable,
>> and extends the number of stripes that can be fit into a single  
>> layout
>> for most cases.  Originally I was thinking that in addition to saving
>> the starting index of the bitmap, we could also save the index at  
>> which
>> the bitmap wraps back to 0 (i.e. bit N = (start_idx + N) % wrap_idx),
>> but if there is bitmap compression then the run of zeroes between the
>> starting index and the (lower) ending index could be stored  
>> efficiently
>> as well.
>
> I don't think there's any point of compressing this.  32,000 stripes  
> fit in the old EA limit, and there's going to be plenty of other  
> limits hit before
> we start using 32,000 OSTs.  And even then, we can use the larger EA  
> size.   So perhaps we turn the question around and ask, "how many  
> stripes do you want to support"?
Frankly, we do not use wide striping at this point, and 32k is a "large
number."
Having said that, if you have a flash OST on each compute node and/or
have replication and can use the local disk on a compute node for
opportunistic storage (a "local file replica"), the number of OSTs is
O(compute nodes) in the cluster, and that can be a "large number" too.

Best regards, Alex.

end of thread, other threads:[~2011-10-20 20:15 UTC | newest]

Thread overview: 22+ messages
2011-10-03 20:15 [Lustre-devel] Wide striping Nathan Rutman
2011-10-04  0:17 ` David Dillow
2011-10-04 17:44   ` Nathan Rutman
2011-10-04 21:16     ` David Dillow
2011-10-05 15:06       ` Nathan Rutman
2011-10-05 15:33         ` David Dillow
2011-10-06  1:51           ` Andreas Dilger
2011-10-06 19:09             ` Nathan Rutman
2011-10-05  0:25 ` wangdi
2011-10-05  9:28   ` Alexey Lyashkov
2011-10-05 18:02     ` Eric Barton
2011-10-05 18:44       ` Nathan Rutman
2011-10-05 18:18     ` Nathan Rutman
2011-10-05 18:23       ` Nathan Rutman
2011-10-05 16:06   ` Nathan Rutman
2011-10-05 19:44     ` wangdi
2011-10-05 23:31       ` Oleg Drokin
2011-10-05 23:56         ` David Dillow
2011-10-07 10:29           ` Oleg Drokin
2011-10-20 16:24 ` Alex Kulyavtsev
2011-10-20 18:45   ` Andreas Dilger
     [not found]     ` <73AED5C780AE05478241DB067651A921023788CF@XYUS-EX22.xyus.xyratex.com>
2011-10-20 20:15       ` Alex Kulyavtsev
