* [Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
@ 2009-04-05 20:50 Alexander Zarochentsev
0 siblings, 0 replies; 7+ messages in thread
From: Alexander Zarochentsev @ 2009-04-05 20:50 UTC (permalink / raw)
To: lustre-devel
Hello,
There are ideas about the WBC client MD stack, the WBC protocol, and the
changes needed on the server side: Global OSD, and another idea (let's
name it CMD3+), both explained in the WBC HLD outline draft.
Brief descriptions of the ideas:
GOSD:
a portable component (called MDS in Alex's presentation) translates MD
operations into OSD operations (updates).
MDS may be at client side (WBC-client), proxy server or MD server.
The MDS component is very similar to current MDD (Local MD server) layer
in CMD3 server stack. I.e. it works like a local MD server, but the OSD
layer below is not local, it is GOSD.
It is as simple as the local MD server and simplifies the MD server
stack a lot. The current MD stack processes MD operations at every
level: MDT, CMM and MDD. The first two levels must understand CMD, and
the MDD layer must understand that some MD operations can be partial.
This sounds like an unneeded complication. With GOSD those layers will
be replaced by a single one, as simple as the MDD layer! (However, LDLM
locking still has to be added.)
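The GOSD idea can be pictured with a small sketch: a portable MDS/MDD-like
component turns one metadata operation into a batch of generic OSD updates
that can be applied locally or shipped over the network. This is purely
illustrative Python, not Lustre code; every name here (md_create,
object_create, index_insert, attr_set) is made up for the example.

```python
def md_create(parent_fid, name, new_fid, attrs):
    """Translate one MD operation ('create parent/name') into OSD updates.

    Hypothetical sketch: the returned batch could be executed by a local
    OSD or sent to a remote one (GOSD) unchanged.
    """
    return [
        ("object_create", new_fid, attrs),            # allocate the new object
        ("index_insert", parent_fid, name, new_fid),  # add the directory entry
        ("attr_set", parent_fid, {"mtime": attrs["ctime"]}),  # update parent times
    ]
```

The point of the sketch is that the same three updates make sense regardless
of whether the OSD below is local or remote, which is why one MDD-like layer
could replace MDT/CMM/MDD.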
CMD3+:
The component running on the WBC client is based on the MDT, excluding
transport things. Code reuse is possible.
The WBC protocol is logically the current MD protocol plus partial
MD operations (object create w/o name, for example). Partial operations
are already used between MD servers for distributed MD operations. MD
operations will be packed into batches.
Both ideas (GOSD and CMD3+) assume a cache manager at WBC client to do
caching & redo-logging of operations.
I think CMD3+ has minimal impact on the current Lustre-2.x design. It is
closer to the original goal of just implementing the WBC feature. But
GOSD is an attractive idea and may be potentially better.
With GOSD I worry about making Lustre 2.x unstable for some period of
time. It would be good to think about a plan for incremental integration
of the new stack into the existing code.
This is a request for comments and new ideas, because design mistakes
would be too costly.
Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
2009-04-07 4:27 ` Alex Zhuravlev
@ 2009-05-18 21:01 ` Eric Barton
0 siblings, 0 replies; 7+ messages in thread
From: Eric Barton @ 2009-05-18 21:01 UTC (permalink / raw)
To: lustre-devel
Zam,
A couple of things to consider when splitting up operations
into updates....
1. Each update must contain some information about its peer
updates so that in the absence of the client (e.g. on
client eviction) we can check that all the operation's
updates have been applied, and apply a correction if not.
I think there is an advantage if every update includes
sufficient information to reconstruct all its peer
updates.
2. The current security design grants capabilities to clients
to perform operations on Lustre objects. If you allow
remote "raw" OSD ops, you're effectively distributing the
Lustre clustered server further - i.e. nodes allowed to
do such operations are being trusted just as much as
servers to keep the filesystem consistent.
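Eric's first point can be sketched as follows: every update record carries a
descriptor of the whole operation, so a server holding any one update can
reconstruct its peers and detect which ones never got applied. Illustrative
Python only; all names and record shapes are invented for the example.

```python
def make_updates(op_desc, per_server_updates):
    """Tag every update with the full operation descriptor, so any single
    surviving update is enough to reconstruct its peers on recovery."""
    return [{"server": srv, "update": upd, "op": op_desc}
            for srv, upd in per_server_updates]

def missing_updates(found, op_desc):
    """Given the updates actually found applied, list the peers still
    missing; these are the ones recovery must apply (or compensate for)."""
    return [u for u in op_desc["updates"] if u not in found]
```

The cost of this scheme is that each update grows by the size of the
operation descriptor, which is the trade-off Eric's "advantage" hedge refers to.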
Cheers,
Eric
* [Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
2009-04-06 22:02 ` di wang
@ 2009-04-07 4:27 ` Alex Zhuravlev
2009-05-18 21:01 ` Eric Barton
0 siblings, 1 reply; 7+ messages in thread
From: Alex Zhuravlev @ 2009-04-07 4:27 UTC (permalink / raw)
To: lustre-devel
>>>>> di wang (dw) writes:
dw> I am not sure you can (or should) translate all the partial MD
dw> operations into object RPCs. For example rename (a/b ---> c/d, with
dw> a/b on MDS1 and c/d on MDS2).
dw> The RPC goes to MDS1.
dw> 1) delete d (entry and object) from c on MDS2.
dw> 2) create the b entry under c on MDS2.
dw> 3) delete the b entry under a on MDS1.
dw> So if you do 1) and 2) by object RPC (skipping MDD), then you might
dw> need to create all 4 objects (a and b are local objects, c and d are
dw> remote objects) and do the permission check locally (whether you can
dw> delete d under c). I am not sure it is a good way.
sorry, I don't quite understand what you mean. for this rename you'd
need those updates anyway: an index delete and object destroy for d on
MDS2, an index insert of b under c on MDS2, and an index delete of b
under a on MDS1.
if you're worried about additional RPCs, then we can (and should)
optimize this with an intents-like mechanism: mdd_rename() (or its
caller) initializes an intent describing the rename in terms of fids and
names, then the network-aware layer (something like the current MDC,
accepting the OSD API) can put the additional OSD calls into the enqueue
RPC.
for example, when it finds an enqueue on c, it can form a compound RPC
consisting of: the enqueue itself, osd's attr_get(c), and osd's lookup(c, d).
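The compound-RPC idea above can be sketched like this: the client-side layer
inspects the intent and piggybacks the extra OSD calls onto the lock enqueue,
so the server can answer everything in one round trip. Illustrative Python
only; the function and field names are invented, not the Lustre API.

```python
def build_enqueue_rpc(lock_fid, intent):
    """Piggyback intent-driven OSD calls onto a lock enqueue (sketch)."""
    rpc = [("ldlm_enqueue", lock_fid)]           # the lock request itself
    if intent.get("op") == "rename":
        rpc.append(("attr_get", intent["tgt_dir"]))                    # data for the permission check
        rpc.append(("lookup", intent["tgt_dir"], intent["tgt_name"]))  # does the target entry exist?
    return rpc
```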
also notice that for such a rename you need to check that you won't
create a disconnected subtree, which is much worse than just a few
additional RPCs.
dw> And also some quota stuff is handled in these partial operations in
dw> the remote MDD, so I am not sure we should skip MDD totally here.
dw> Am I missing something?
can you explain why we need MDD on a server for quota?
given we're about to use OSD for data and metadata, I'd think that:
1) for chown/chgrp, MDD (wherever it runs) finds the LOV EA and uses an
epoch (for example) to change uid/gid on the MDS and all OSTs in an
atomic manner
2) the quota code isn't part of the data or metadata stack; rather it's
a side service like LDLM
3) the quota code registers hooks (or probably adds a very small module
right above OSD) to see all quota-related activity: attr_set, write,
index insert, etc.
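Point 3 can be sketched as a thin pass-through stacked above the OSD that
reports every quota-relevant update to a registered hook, so quota never needs
to live in the MD stack. This is a toy Python model: the "OSD" is just a dict
and all method names are stand-ins, not the real Lustre OSD API.

```python
class QuotaOsd:
    """Toy pass-through OSD wrapper that notifies a quota hook of every
    quota-relevant update before applying it."""

    def __init__(self, osd, hook):
        self.osd = osd    # toy OSD: dict mapping fid -> attrs
        self.hook = hook  # quota callback: hook(kind, fid, detail)

    def attr_set(self, fid, attrs):
        self.hook("attr_set", fid, attrs)           # quota sees uid/gid changes
        self.osd.setdefault(fid, {}).update(attrs)

    def index_insert(self, dir_fid, name, fid):
        self.hook("index_insert", dir_fid, name)    # quota sees new entries
        self.osd.setdefault(dir_fid, {}).setdefault("entries", {})[name] = fid
```

The design choice here matches the email: quota observes the generic update
stream instead of requiring a full MDD on the server.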
also, I think we still don't have a proper design for quota; this is yet
to be done to implement quota for the DMU.
thanks, Alex
* [Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
2009-04-06 10:26 ` Alex Zhuravlev
@ 2009-04-06 22:02 ` di wang
2009-04-07 4:27 ` Alex Zhuravlev
0 siblings, 1 reply; 7+ messages in thread
From: di wang @ 2009-04-06 22:02 UTC (permalink / raw)
To: lustre-devel
> 2) "create w/o name" (this is what the MDT accepts these days) isn't an
> operation, it's a partial operation. But for partial operations we
> already have OSD - clear, simple and generic. Having one more kind of
> "partial operation" adds nothing besides confusion, IMHO
>
I am not sure you can (or should) translate all the partial MD
operations into object RPCs. For example rename (a/b ---> c/d, with a/b
on MDS1 and c/d on MDS2).
The RPC goes to MDS1.
1) delete d (entry and object) from c on MDS2.
2) create the b entry under c on MDS2.
3) delete the b entry under a on MDS1.
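The three partial operations above, grouped per server, can be sketched as
follows. Illustrative Python only; the operation names and server labels are
made up to mirror the steps in the email.

```python
def rename_partial_ops():
    """Partial operations for 'mv a/b c/d', with a/b on MDS1 and c/d on MDS2."""
    return {
        "MDS2": [
            ("unlink", "c", "d"),  # 1) delete d (entry and object) from c
            ("insert", "c", "b"),  # 2) create the b entry under c
        ],
        "MDS1": [
            ("delete", "a", "b"),  # 3) delete the b entry under a
        ],
    }
```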
So if you do 1) and 2) by object RPC (skipping MDD), then you might need
to create all 4 objects (a and b are local objects, c and d are remote
objects) and do the permission check locally (whether you can delete d
under c). I am not sure it is a good way. And also some quota stuff is
handled in these partial operations in the remote MDD, so I am not sure
we should skip MDD totally here.
Am I missing something?
Thanks
WangDi
* [Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
2009-04-06 10:03 ` Andreas Dilger
@ 2009-04-06 10:26 ` Alex Zhuravlev
2009-04-06 22:02 ` di wang
0 siblings, 1 reply; 7+ messages in thread
From: Alex Zhuravlev @ 2009-04-06 10:26 UTC (permalink / raw)
To: lustre-devel
>>>>> Andreas Dilger (AD) writes:
AD> My internal thoughts (in the absence of ever having taken a close look
AD> at the HEAD MD stack) have always been that we would essentially be
AD> moving the CMM to the client, and have it always connect to remote
AD> MDTs (i.e. no local MDD) if we want to split "operations" into "updates".
AD> I'd always visualized that the MDT accepts "operations" (as it does
AD> today) and CMM is the component that decides what parts of the operation
AD> are local (passed to MDD) and which are remote (passed to MDC).
few thoughts here:
1) in order to organize a local cache with all this you'd need to
translate once more before the MD stack (you can't cache a create; you
can cache directory entries and objects). At the same time you need the
local cache to access just-made changes. This translation is already
done by MDD; if you don't run MDD locally you have to duplicate that
code (to some extent) for WBC
2) "create w/o name" (this is what the MDT accepts these days) isn't an
operation, it's a partial operation. But for partial operations we
already have OSD - clear, simple and generic. Having one more kind of
"partial operation" adds nothing besides confusion, IMHO
3) a local MDD is meaningless with CMD. CMD is a distributed thing and I
think any implementation of CMD using "metadata operations" (even
partial ones, in contrast with updates in terms of the OSD API) is a
hack. Exactly like we did in CMD1/CMD2, implementing local operations
with calls to vfs_create() and distributed operations with special
entries in fsfilt. Instead of all this we should just use OSD, always
and properly.
4) the only rational reason behind the current CMD3 design was that
rollback required making remote operations before any local one (to
align the epoch) - but it's very likely we don't need this any more.
Thank god (some will understand what I mean ;)
5) running MDD on the MDS for WBC clients also adds nothing in terms of
functionality or clarity, but adds code duplicating OSD
>> are already used between MD servers for distributed MD operations. MD
>> operations will be packed into batches.
>>
>> Both ideas (GOSD and CMD3+) assume a cache manager at WBC client to do
>> caching & redo-logging of operations.
>>
>> I think CMD3+ has minimal impact on the current Lustre-2.x design. It is
>> closer to the original goal of just implementing the WBC feature. But
>> GOSD is an attractive idea and may be potentially better.
>>
>> With GOSD I worry about making Lustre 2.x unstable for some period
>> of time. It would be good to think about a plan for incremental
>> integration of the new stack into the existing code.
AD> Wouldn't GOSD just end up being a new ptlrpc interface that exports the
AD> OSD protocol to the network? This would mean that we need to be able
AD> to have multiple services working on the same OSD (both MDD for classic
AD> clients, and GOSD for WBC clients). That isn't a terrible idea, because
AD> we have also discussed having both MDT and OST exports of the same OSD
AD> so that we can efficiently store small files directly on the MDT and/or
AD> scale the number of MDTs == OSTs for massive metadata performance.
yes, with GOSD you essentially have your object storage exported in
terms of the same API as local storage. You can use that to implement
remote services (proxy, WBC).
AD> I'd like to keep this kind of layering in mind also. Whether it makes
AD> sense to export yet another network protocol to clients, or instead to
AD> add new operations to the existing service handlers so that they can
AD> handle all of the operation types (with efficient passthrough to lower
AD> layers as needed) and be able to multiplex the underlying device
AD> to clients.
I think it's not "another" network protocol; I think it's the right
low-level protocol. Meaning that instead of having a very limited set of
partial metadata operations like "create w/o name", "link w/o inode",
etc., we may have a very simple, generic protocol allowing us to do
anything with remote storage.
for example, the core of replication with this protocol could look like
this: at one node you log OSD operations (an optional module in between
the regular disk OSD and the upper layers like MDD), then you just send
those operations to virtually any node in the cluster and execute them
there - and you get things replicated.
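The replication idea can be sketched in a few lines: log the generic OSD
updates on the primary, then replay the same log on any other node to
reproduce its state. Toy Python model only; the "OSD" is a plain dict and the
update format is invented for illustration.

```python
def apply_update(osd, op):
    """Apply one logged OSD update to a toy OSD (a dict of fid -> attrs)."""
    kind, fid, payload = op
    if kind == "object_create":
        osd[fid] = dict(payload)
    elif kind == "attr_set":
        osd.setdefault(fid, {}).update(payload)

def replicate(log, replica):
    """Replaying the primary's update log reproduces its state elsewhere."""
    for op in log:
        apply_update(replica, op)
    return replica
```

Because the log consists of generic updates rather than filesystem-specific
metadata operations, the replica node needs no MD stack at all, which is the
attraction of the low-level protocol.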
--
thanks, Alex
* [Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
2009-04-06 9:39 Alexander Zarochentsev
@ 2009-04-06 10:03 ` Andreas Dilger
2009-04-06 10:26 ` Alex Zhuravlev
0 siblings, 1 reply; 7+ messages in thread
From: Andreas Dilger @ 2009-04-06 10:03 UTC (permalink / raw)
To: lustre-devel
On Apr 06, 2009 13:39 +0400, Alexander Zarochentsev wrote:
> There are ideas about WBC client MD stack, WBC protocol and changes
> needed at server side. They are Global OSD and another idea (let's name
> it CMD3+) explained in the WBC HLD outline draft.
>
> Brief descriptions of the ideas:
>
> GOSD:
>
> a portable component (called MDS in Alex's presentation) translates MD
> operations into OSD operations (updates).
>
> MDS may be at client side (WBC-client), proxy server or MD server.
>
> The MDS component is very similar to current MDD (Local MD server) layer
> in CMD3 server stack. I.e. it works like a local MD server, but the OSD
> layer below is not local, it is GOSD.
>
> It is as simple as the local MD server and simplifies the MD server
> stack a lot. The current MD stack processes MD operations at every
> level: MDT, CMM and MDD. The first two levels must understand CMD, and
> the MDD layer must understand that some MD operations can be partial.
> This sounds like an unneeded complication. With GOSD those layers will
> be replaced by a single one, as simple as the MDD layer! (However, LDLM
> locking still has to be added.)
My internal thoughts (in the absence of ever having taken a close look
at the HEAD MD stack) have always been that we would essentially be
moving the CMM to the client, and have it always connect to remote
MDTs (i.e. no local MDD) if we want to split "operations" into "updates".
I'd always visualized that the MDT accepts "operations" (as it does
today) and CMM is the component that decides what parts of the operation
are local (passed to MDD) and which are remote (passed to MDC).
Maybe the MD stack layering isn't quite as clean as this?
> CMD3+:
>
> The component running on WBC client is based on MDT excluding transport
> things. Code reuse is possible.
>
> The WBC protocol logically is the current MD protocol with the partial
> MD operations (object create w/o name, for example). Partial operations
partial operations == updates?
> are already used between MD servers for distributed MD operations. MD
> operations will be packed into batches.
>
> Both ideas (GOSD and CMD3+) assume a cache manager at WBC client to do
> caching & redo-logging of operations.
>
> I think CMD3+ has minimal impact on the current Lustre-2.x design. It is
> closer to the original goal of just implementing the WBC feature. But
> GOSD is an attractive idea and may be potentially better.
>
> With GOSD I worry about making Lustre 2.x unstable for some period
> of time. It would be good to think about a plan for incremental
> integration of the new stack into the existing code.
Wouldn't GOSD just end up being a new ptlrpc interface that exports the
OSD protocol to the network? This would mean that we need to be able
to have multiple services working on the same OSD (both MDD for classic
clients, and GOSD for WBC clients). That isn't a terrible idea, because
we have also discussed having both MDT and OST exports of the same OSD
so that we can efficiently store small files directly on the MDT and/or
scale the number of MDTs == OSTs for massive metadata performance.
I'd like to keep this kind of layering in mind also. Whether it makes
sense to export yet another network protocol to clients, or instead to
add new operations to the existing service handlers so that they can
handle all of the operation types (with efficient passthrough to lower
layers as needed) and be able to multiplex the underlying device
to clients.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* [Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
@ 2009-04-06 9:39 Alexander Zarochentsev
2009-04-06 10:03 ` Andreas Dilger
0 siblings, 1 reply; 7+ messages in thread
From: Alexander Zarochentsev @ 2009-04-06 9:39 UTC (permalink / raw)
To: lustre-devel
... lustre-devel@ doesn't want to deliver the message, so I am adding
the CC list this time.
Hello,
There are ideas about the WBC client MD stack, the WBC protocol, and the
changes needed on the server side: Global OSD, and another idea (let's
name it CMD3+), both explained in the WBC HLD outline draft.
Brief descriptions of the ideas:
GOSD:
a portable component (called MDS in Alex's presentation) translates MD
operations into OSD operations (updates).
MDS may be at client side (WBC-client), proxy server or MD server.
The MDS component is very similar to current MDD (Local MD server) layer
in CMD3 server stack. I.e. it works like a local MD server, but the OSD
layer below is not local, it is GOSD.
It is as simple as the local MD server and simplifies the MD server
stack a lot. The current MD stack processes MD operations at every
level: MDT, CMM and MDD. The first two levels must understand CMD, and
the MDD layer must understand that some MD operations can be partial.
This sounds like an unneeded complication. With GOSD those layers will
be replaced by a single one, as simple as the MDD layer! (However, LDLM
locking still has to be added.)
CMD3+:
The component running on the WBC client is based on the MDT, excluding
transport things. Code reuse is possible.
The WBC protocol is logically the current MD protocol plus partial
MD operations (object create w/o name, for example). Partial operations
are already used between MD servers for distributed MD operations. MD
operations will be packed into batches.
Both ideas (GOSD and CMD3+) assume a cache manager at WBC client to do
caching & redo-logging of operations.
I think CMD3+ has minimal impact on the current Lustre-2.x design. It is
closer to the original goal of just implementing the WBC feature. But
GOSD is an attractive idea and may be potentially better.
With GOSD I worry about making Lustre 2.x unstable for some period of
time. It would be good to think about a plan for incremental integration
of the new stack into the existing code.
This is a request for comments and new ideas, because design mistakes
would be too costly.
Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
end of thread, other threads:[~2009-05-18 21:01 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-05 20:50 [Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache Alexander Zarochentsev
2009-04-06 9:39 Alexander Zarochentsev
2009-04-06 10:03 ` Andreas Dilger
2009-04-06 10:26 ` Alex Zhuravlev
2009-04-06 22:02 ` di wang
2009-04-07 4:27 ` Alex Zhuravlev
2009-05-18 21:01 ` Eric Barton