* Re: RBD/OSD questions
@ 2010-05-06 21:02 Martin Fick
2010-05-06 21:22 ` Sage Weil
2010-05-06 21:24 ` Cláudio Martins
0 siblings, 2 replies; 14+ messages in thread
From: Martin Fick @ 2010-05-06 21:02 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
--- On Thu, 5/6/10, Sage Weil <sage@newdream.net> wrote:
> > -Also, can reads be spread out over replicas?
> >
> > This might be a nice optimization to reduce seek
> > times under certain conditions, when there are no
> > writers or the writer is the only reader (and thus
> > is aware of all the writes even before they complete).
> > Under these conditions it seems like it
> > would be possible to not enforce the "tail reading"
> > order of replicas and thus additionally benefit
> > from "read stripping" across the replicas the way
> > many raid implementations do with RAID1.
> >
> > I thought that this might be particularly useful
> > for RBD when it is used exclusively (say by mounting
> > a local FS) since even with replicas, it seems like
> > it could then relax the replica tail reading
> > constraint.
>
> The idea certainly has its appeal, and I played with it
> for a while a few years back. At that time I had a
> _really_ hard time trying to manufacture a workload
> scenario where it actually made things faster
> and not slower. In general, spreading out reads will
> pollute caches (e.g., spreading across two replicas means
> caches are half as effective).
Hmm, I wonder if using a local FS on top of RBD is
such a different use case from Ceph that it may
not be very difficult to produce such a workload.
With a local FS on RBD I would expect massive local
kernel level caching. With this in mind I wonder how
effective OSD level caching would actually be.
I am particularly thinking of heavy seeky workloads
which perhaps are somewhat already spread out due to
striping. In other words, RAID1 (mirroring) can
decrease latencies over a non-RAID setup locally even
though that is not the objective of RAID1, but does
RAID01 decrease latencies much over RAID0? Maybe not.
That might explain the difficulty in creating such
a scenario.
To put this in the perspective of OSD setups, if you
already have striping, using the replicas also may
not make much of a difference, but I wonder how a two
node OSD setup with double redundancy would fare?
With such a setup there will not really be any
striping, will there? With such a setup (one that I
can easily see being popular for simple/minimal RBD
redundancy setups), perhaps replica "striping"
would help. A 'smart' RBD could detect non-contiguous
reads and spread the reads out in that case.
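To make the 'smart' RBD idea above concrete, here is a toy sketch (hypothetical client-side logic, not Ceph code; the class and method names are invented for illustration): contiguous reads stay on one OSD to preserve its cache/readahead, while seeky reads rotate across the replicas.

```python
# Toy sketch (not Ceph code) of the replica "striping" idea: when reads
# are non-contiguous, a hypothetical smart client rotates them across
# the replicas instead of always asking the primary.

class FanoutReader:
    def __init__(self, replicas):
        self.replicas = replicas      # e.g. ["osd0", "osd1"]
        self.next = 0
        self.last_end = None

    def pick_replica(self, offset, length):
        if self.last_end == offset:
            # Contiguous with the previous read: stay on the same OSD
            # to keep its readahead/cache effective.
            choice = self.replicas[self.next]
        else:
            # Seeky read: rotate to spread the seeks out.
            self.next = (self.next + 1) % len(self.replicas)
            choice = self.replicas[self.next]
        self.last_end = offset + length
        return choice

r = FanoutReader(["osd0", "osd1"])
print(r.pick_replica(0, 4096))        # first read
print(r.pick_replica(4096, 4096))     # contiguous -> same OSD
print(r.pick_replica(1 << 20, 4096))  # seek -> other OSD
```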
All theory I know, but it seems worth investigating
various RBD specific workloads, at least for the
RBD users/developers. :)
Also, with ceph many seeky workloads (small multi
file writes) might additionally already be spread
out (and thus "striped") due to CRUSH since they
are in different files. But with RBD, it is all
one file so CRUSH will not help as much in this
respect.
> What I tried to do was use fast heartbeats between OSDs to
> share average request queue lengths, so that the primary
> could 'shed' a read request to a replica if its queue
> length/request latency was significantly shorter.
> I wasn't really able to make it work.
This sounds more 'intelligent' than what I was
suggesting since it would take the status of the
entire OSD cluster into account, not just the
single RBD reads.
> For cold objects, shedding could help, but only if
> there is a sufficient load disparity between replicas to
> compensate for the overhead of shedding.
I could see how "shedding" as you mean it would
add some overhead, but a simple client based
fanout shouldn't really add much overhead. You
have designed CRUSH to allow fast direct IO with
the OSDs, shedding seems to be a step backwards
performance wise from this design, but client
fanout to replicas directly is really not much
different from striping using CRUSH; it should
be fast!
If this client fanout does help, one way to make
it smarter, or more cluster responsive would be
to expose some OSD queue/length info via the
client APIs allowing clients themselves to do some
smart load balancing in these situations. This
could be applicable not just for seeky workloads,
but also for unusual workloads which for some
reason might bog down a particular OSD. CRUSH
should normally prevent this from happening in
a well balanced cluster, but if a cluster is not
very homogeneous and instead has many OSD nodes with
varying latencies and perhaps other external
(non OSD) loads on them, your queue length idea
with smart clients could help balance such a
cluster on the clients themselves.
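The queue-length idea above could look something like this toy sketch (a hypothetical API, not anything exposed by Ceph; the function name and threshold are invented): the client diverts a read from the primary only when another replica's queue is meaningfully shorter, so caches aren't polluted for a marginal gain.

```python
# Toy sketch of client-side load balancing on exposed OSD queue
# lengths (hypothetical -- no such client API exists in the source).

def pick_least_loaded(replicas, queue_lengths, threshold=2):
    """replicas: ordered list, primary first.
    queue_lengths: dict osd -> outstanding requests.
    Only divert from the primary if another replica is at least
    `threshold` requests shorter, to limit cache pollution."""
    primary = replicas[0]
    best = min(replicas, key=lambda osd: queue_lengths[osd])
    if queue_lengths[primary] - queue_lengths[best] >= threshold:
        return best
    return primary

qs = {"osd0": 7, "osd1": 2}
print(pick_least_loaded(["osd0", "osd1"], qs))  # diverts to osd1
qs = {"osd0": 3, "osd1": 2}
print(pick_least_loaded(["osd0", "osd1"], qs))  # stays on primary osd0
```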
That's a lot of armchair talking I know,
sorry. ;) Thanks for listening...
-Martin
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: RBD/OSD questions
2010-05-06 21:02 RBD/OSD questions Martin Fick
@ 2010-05-06 21:22 ` Sage Weil
2010-05-06 21:24 ` Cláudio Martins
1 sibling, 0 replies; 14+ messages in thread
From: Sage Weil @ 2010-05-06 21:22 UTC (permalink / raw)
To: Martin Fick; +Cc: ceph-devel
On Thu, 6 May 2010, Martin Fick wrote:
> > For cold objects, shedding could help, but only if
> > there is a sufficient load disparity between replicas to
> > compensate for the overhead of shedding.
>
> I could see how "shedding" as you mean it would
> add some overhead, but a simple client based
> fanout shouldn't really add much overhead. You
> have designed CRUSH to allow fast direct IO with
> the OSDs, shedding seems to be a step backwards
> performance wise from this design, but client
> fanout to replicas directly is really not much
> different from striping using CRUSH; it should
> be fast!
>
> If this client fanout does help, one way to make
> it smarter, or more cluster responsive would be
> to expose some OSD queue/length info via the
> client APIs allowing clients themselves to do some
> smart load balancing in these situations. This
> could be applicable not just for seeky workloads,
> but also for unusual workloads which for some
> reason might bog down a particular OSD. CRUSH
> should normally prevent this from happening in
> a well balanced cluster, but if a cluster is not
> very homogeneous and instead has many OSD nodes with
> varying latencies and perhaps other external
> (non OSD) loads on them, your queue length idea
> with smart clients could help balance such a
> cluster on the clients themselves.
Yeah, allowing a client to read from other replicas is pretty
straightforward. The normal caps mechanism even tells the client when
this is safe (no racing writes). The hard part is knowing when it is
useful (since, in general, it isn't). In general, the OSDs won't be
conversing with an individual client frequently enough for it to have
accurate load information. I suppose in some circumstances it might be
(small number of clients and osds, heavy load).
One thing I've thought about is having some way for OSDs to piggyback "this
object is super hot!" on replies to clients, and for clients to piggyback that
information back to the mds, so that future clients reading that hot file
can direct their reads to replicas on a per-file basis...
sage
* Re: RBD/OSD questions
2010-05-06 21:02 RBD/OSD questions Martin Fick
2010-05-06 21:22 ` Sage Weil
@ 2010-05-06 21:24 ` Cláudio Martins
2010-05-06 21:31 ` Sage Weil
1 sibling, 1 reply; 14+ messages in thread
From: Cláudio Martins @ 2010-05-06 21:24 UTC (permalink / raw)
To: Martin Fick; +Cc: Sage Weil, ceph-devel
On Thu, 6 May 2010 14:02:40 -0700 (PDT) Martin Fick <mogulguy@yahoo.com> wrote:
>
> Hmm, I wonder if using a local FS on top of RBD is
> such a different use case from Ceph that it may
> not be very difficult to produce such a workload.
> With a local FS on RBD I would expect massive local
> kernel level caching. With this in mind I wonder how
> effective OSD level caching would actually be.
>
> I am particularly thinking of heavy seeky workloads
> which perhaps are somewhat already spread out due to
> striping. In other words, RAID1 (mirroring) can
> decrease latencies over a non-RAID setup locally even
> though that is not the objective of RAID1, but does
> RAID01 decrease latencies much over RAID0? Maybe not.
> That might explain the difficulty in creating such
> a scenario.
>
> To put this in the perspective of OSD setups, if you
> already have striping, using the replicas also may
> not make much of a difference, but I wonder how a two
> node OSD setup with double redundancy would fare?
> With such a setup there will not really be any
> striping, will there? With such a setup (one that I
> can easily see being popular for simple/minimal RBD
> redundancy setups), perhaps replica "striping"
> would help. A 'smart' RBD could detect non-contiguous
> reads and spread the reads out in that case.
>
Unless I misunderstood the Ceph papers, the current situation is
not that bad.
IIRC, a big file will be striped over many different objects. Each
object ID will map to its own primary replica, which will vary from
object to object. Thus, given many clients reading different chunks of
that file, even 2 OSDs should see a fairly equal amount of traffic. The
same should be true for small files, unless you have lots of clients
all reading the same file.
Am I getting it wrong?
Best regards.
Cláudio
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RBD/OSD questions
2010-05-06 21:24 ` Cláudio Martins
@ 2010-05-06 21:31 ` Sage Weil
2010-05-06 21:41 ` Martin Fick
0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2010-05-06 21:31 UTC (permalink / raw)
To: Cláudio Martins; +Cc: Martin Fick, ceph-devel
On Thu, 6 May 2010, Cláudio Martins wrote:
> On Thu, 6 May 2010 14:02:40 -0700 (PDT) Martin Fick <mogulguy@yahoo.com> wrote:
> >
> > Hmm, I wonder if using a local FS on top of RBD is
> > such a different use case from Ceph that it may
> > not be very difficult to produce such a workload.
> > With a local FS on RBD I would expect massive local
> > kernel level caching. With this in mind I wonder how
> > effective OSD level caching would actually be.
> >
> > I am particularly thinking of heavy seeky workloads
> > which perhaps are somewhat already spread out due to
> > striping. In other words, RAID1 (mirroring) can
> > decrease latencies over a non-RAID setup locally even
> > though that is not the objective of RAID1, but does
> > RAID01 decrease latencies much over RAID0? Maybe not.
> > That might explain the difficulty in creating such
> > a scenario.
> >
> > To put this in the perspective of OSD setups, if you
> > already have striping, using the replicas also may
> > not make much of a difference, but I wonder how a two
> > node OSD setup with double redundancy would fare?
> > With such a setup there will not really be any
> > striping, will there? With such a setup (one that I
> > can easily see being popular for simple/minimal RBD
> > redundancy setups), perhaps replica "striping"
> > would help. A 'smart' RBD could detect non-contiguous
> > reads and spread the reads out in that case.
> >
>
> Unless I misunderstood the Ceph papers, the current situation is
> not that bad.
>
> IIRC, a big file will be striped over many different objects. Each
> object ID will map to its own primary replica, which will vary from
> object to object. Thus, given many clients reading different chunks of
> that file, even 2 OSDs should see a fairly equal amount of traffic. The
> same should be true for small files. Unless you have lots of clients
> all reading the same file.
Yeah, you've got it right. The rbd image is striped over small objects,
which are independently assigned to OSDs. The load should be very well
distributed.
sage
* Re: RBD/OSD questions
2010-05-06 21:31 ` Sage Weil
@ 2010-05-06 21:41 ` Martin Fick
2010-05-06 21:54 ` Gregory Farnum
2010-05-06 22:20 ` Sage Weil
0 siblings, 2 replies; 14+ messages in thread
From: Martin Fick @ 2010-05-06 21:41 UTC (permalink / raw)
To: Cláudio Martins, Sage Weil; +Cc: ceph-devel
--- On Thu, 5/6/10, Sage Weil <sage@newdream.net> wrote:
> On Thu, 6 May 2010, Cláudio Martins
> wrote:
> > On Thu, 6 May 2010 14:02:40 -0700 (PDT) Martin Fick
> <mogulguy@yahoo.com>
> wrote:
> > > To put this in the perspective of OSD setups, if you
> > > already have striping, using the replicas also may
> > > not make much of a difference, but I wonder how a two
> > > node OSD setup with double redundancy would fare?
> > > With such a setup there will not really be any
> > > striping, will there? With such a setup (one that I
> > > can easily see being popular for simple/minimal RBD
> > > redundancy setups), perhaps replica "striping"
> > > would help. A 'smart' RBD could detect non-contiguous
> > > reads and spread the reads out in that case.
> >
> > Unless I misunderstood the Ceph papers, the
> > current situation is not that bad.
> >
> > IIRC, a big file will be striped over many
> > different objects. Each object ID will map to
> > its own primary replica, which will vary from
> > object to object. Thus, given many clients reading
> > different chunks of that file, even 2 OSDs should
> > see a fairly equal amount of traffic. The same
> > should be true for small files. Unless you have
> > lots of clients all reading the same file.
>
> Yeah, you've got it right. The rbd image is striped
> over small objects, which are independently assigned
> to OSDs. The load should be very well distributed.
How can that be on a 2 OSD setup with double redundancy?
In this case, if all of a replica's smaller objects are
not on a single node, how will it recover from an OSD
failure?
The only way I see this possible is if file foo is
split into small objects A1 A2 A3 A4 and replicas B1
B2 B3 B4, and you spread those across 2 OSDs like this:
replica 1 (A1 B2 A3 B4)
replica 2 (B1 A2 B3 A4)
but then A1 has to know that it is the same as B1. Is
that the case? If so, cool, that would mean that
redundancy would already be providing some striping
and thus it would indeed seem harder to find a case
where more striping/fanout is needed.
Ciao,
-Martin
* Re: RBD/OSD questions
2010-05-06 21:41 ` Martin Fick
@ 2010-05-06 21:54 ` Gregory Farnum
2010-05-06 22:20 ` Sage Weil
1 sibling, 0 replies; 14+ messages in thread
From: Gregory Farnum @ 2010-05-06 21:54 UTC (permalink / raw)
To: Martin Fick; +Cc: Cláudio Martins, Sage Weil, ceph-devel
2010/5/6 Martin Fick <mogulguy@yahoo.com>:
>> Yeah, you've got it right. The rbd image is striped
>> over small objects, which are independently assigned
>> to OSDs. The load should be very well distributed.
>
> How can that be on a 2 OSD setup with double redundancy?
> In this case, if all of a replica's smaller objects are
> not on a single node, how will it recover from an OSD
> failure?
>
> The only way I see this possible is if file foo is
> split into small objects A1 A2 A3 A4 and replicas B1
> B2 B3 B4 and you spread those across 2 OSDs like this:
>
> replica 1 (A1 B2 A3 B4)
> replica 2 (B1 A2 B3 A4)
>
> but then A1 has to know that it is the same as B1. Is
> that the case?
The hashing probably isn't quite even enough to alternate the objects,
but yes -- different objects (even those forming a single "file") will
have different primary replicas even in a small system.
Since the default RBD unit is 4MB in size, and the disk is presumably
several to hundreds of gigabytes, you've got a reasonably well-striped
system.
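The arithmetic behind that point is quick to check; the 4MB default unit is from the message above, and the image sizes below are just illustrative examples:

```python
# Back-of-the-envelope check: with 4MB objects, even a modest RBD
# image is spread over thousands of independently placed objects.

OBJECT_SIZE = 4 * 1024 * 1024          # default RBD object size

def object_count(image_bytes):
    return image_bytes // OBJECT_SIZE

print(object_count(10 * 2**30))    # 10GB image  -> 2560 objects
print(object_count(500 * 2**30))   # 500GB image -> 128000 objects
```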
-Greg
* Re: RBD/OSD questions
2010-05-06 21:41 ` Martin Fick
2010-05-06 21:54 ` Gregory Farnum
@ 2010-05-06 22:20 ` Sage Weil
2010-05-07 16:38 ` Andreas Grimm
1 sibling, 1 reply; 14+ messages in thread
From: Sage Weil @ 2010-05-06 22:20 UTC (permalink / raw)
To: Martin Fick; +Cc: Cláudio Martins, ceph-devel
> How can that be on a 2 OSD setup with double redundancy?
> In this case, if all of a replica's smaller objects are
> not on a single node, how will it recover from an OSD
> failure?
>
> The only way I see this possible is if file foo is
> split into small objects A1 A2 A3 A4 and replicas B1
> B2 B3 B4 and you spread those across 2 OSDs like this:
>
> replica 1 (A1 B2 A3 B4)
> replica 2 (B1 A2 B3 A4)
The image is striped over objects, _then_ the objects are replicated
across OSDs. Objects themselves aren't striped.
For example, if an image is striped over objects A B C D E F, each 4MB,
you might end up with
osd0: A B' C D E' F'
osd1: A' B C' D' E F
where A is the primary copy, A' is the replica, etc.
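The layout above can be illustrated with a toy hash placement (plain hashing for illustration only, not actual CRUSH): each object hashes to its own primary, and the other OSD holds the replica, so primaries end up roughly split across both OSDs.

```python
# Toy illustration of per-object primary selection on a 2-OSD cluster.
# Plain md5 hashing stands in for CRUSH here.

import hashlib

OSDS = ["osd0", "osd1"]

def placement(obj_name):
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    primary = OSDS[h % len(OSDS)]            # primary copy
    replica = OSDS[(h + 1) % len(OSDS)]      # replica on the other OSD
    return primary, replica

for obj in "ABCDEF":
    p, r = placement(obj)
    print(f"{obj}: primary on {p}, replica on {r}")
```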
sage
* Re: RBD/OSD questions
2010-05-06 22:20 ` Sage Weil
@ 2010-05-07 16:38 ` Andreas Grimm
2010-05-07 16:43 ` Sage Weil
0 siblings, 1 reply; 14+ messages in thread
From: Andreas Grimm @ 2010-05-07 16:38 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
That sounds promising. I have another question about the OSDs.
Following scenario:
I have a couple of servers, each having six disks. My idea is to
start an OSD for every single disk on each server.
Will Ceph place the replicas on different hosts by design, or can I
solve this via a CRUSH map?
Just a short question on terminology: what is RBD (something to do with RADOS?)?
Thanks
Andreas
> The image is striped over objects, _then_ the objects are replicated
> across OSDs. Objects themselves aren't striped.
>
> For example, if an image is striped over objects A B C D E F, each 4MB,
> you might end up with
>
> osd0: A B' C D E' F'
> osd1: A' B C' D' E F
>
> where A is the primary copy, A' is the replica, etc.
>
> sage
* Re: RBD/OSD questions
2010-05-07 16:38 ` Andreas Grimm
@ 2010-05-07 16:43 ` Sage Weil
2010-05-11 8:39 ` Anton
0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2010-05-07 16:43 UTC (permalink / raw)
To: Andreas Grimm; +Cc: ceph-devel
On Fri, 7 May 2010, Andreas Grimm wrote:
> That sounds promising. I have another question about the OSDs.
> Following scenario:
>
> I have a couple of servers, each having six disks. My idea is to
> start an OSD for every single disk on each server.
> Will Ceph place the replicas on different hosts by design, or can I
> solve this via a CRUSH map?
You need to set up a crush map. See this wiki page:
http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH
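For illustration, the relevant part of such a crush map is a rule like the following sketch. The syntax is recalled from later Ceph documentation and is not verified against the wiki page above; the point is the host-level choose step, which keeps an object's replicas on different servers even when each server runs six per-disk OSDs:

```text
# Sketch only -- check the wiki page for the exact format.
rule replicate_across_hosts {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take root
        step chooseleaf firstn 0 type host   # distinct hosts, then an
        step emit                            # OSD within each host
}
```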
> Just a short question on terminology: what is RBD (something to do with RADOS?)?
rbd = rados block device
sage
* Re: RBD/OSD questions
2010-05-07 16:43 ` Sage Weil
@ 2010-05-11 8:39 ` Anton
2010-05-11 16:26 ` Sage Weil
0 siblings, 1 reply; 14+ messages in thread
From: Anton @ 2010-05-11 8:39 UTC (permalink / raw)
To: Sage Weil; +Cc: Andreas Grimm, ceph-devel
Sage, what about rebalancing of already existing replicas
when the CRUSH map changes?
On Friday 07 May 2010, Sage Weil wrote:
> You need to set up a crush map. See this wiki page:
> http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH
>
* Re: RBD/OSD questions
2010-05-11 8:39 ` Anton
@ 2010-05-11 16:26 ` Sage Weil
0 siblings, 0 replies; 14+ messages in thread
From: Sage Weil @ 2010-05-11 16:26 UTC (permalink / raw)
To: Anton; +Cc: Andreas Grimm, ceph-devel
On Tue, 11 May 2010, Anton wrote:
> Sage, what about rebalancing of already existing replicas
> when the CRUSH map changes?
The cluster will do this transparently, as soon as the map changes.
sage
>
> On Friday 07 May 2010, Sage Weil wrote:
> > You need to set up a crush map. See this wiki page:
> > http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH
> >
>
>
* Re: RBD/OSD questions
@ 2010-05-06 22:28 Martin Fick
0 siblings, 0 replies; 14+ messages in thread
From: Martin Fick @ 2010-05-06 22:28 UTC (permalink / raw)
To: Sage Weil; +Cc: Cláudio Martins, ceph-devel
--- On Thu, 5/6/10, Sage Weil <sage@newdream.net> wrote:
> The image is striped over objects, _then_ the objects are
> replicated across OSDs. Objects themselves aren't striped.
>
> For example, if an image is striped over objects A B C D E
> F, each 4MB, you might end up with
>
> osd0: A B' C D E' F'
> osd1: A' B C' D' E F
>
> where A is the primary copy, A' is the replica, etc.
Yes, I see now, much clearer, thanks.
So, as long as each object has one copy on each OSD it
should be safe. And there might be some hash-based,
non-perfect extra striping as a benefit. Then, yeah,
it does seem like it would be hard to find a very
unbalanced workload, even on a 2 node OSD cluster. Cool.
Thanks,
-Martin
* Re: RBD/OSD questions
2010-05-06 16:07 Martin Fick
@ 2010-05-06 17:14 ` Sage Weil
0 siblings, 0 replies; 14+ messages in thread
From: Sage Weil @ 2010-05-06 17:14 UTC (permalink / raw)
To: Martin Fick; +Cc: ceph-devel
On Thu, 6 May 2010, Martin Fick wrote:
> I have a few more questions.
>
> -Can files stored in the OSD heal "incrementally"?
>
> Suppose there are 3 replicas for a large file and that
> a small byte range change occurs while replica 3 is
> down. Will replica 3 heal efficiently when it
> returns? Will only the small changed byte range
> be transferred?
Currently, no. This is a big item on the TODO list, both for efficiency
here, and also to facilitate better memory and network IO when
objects are large (recovery currently loads, sends, saves objects in their
entirety).
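The incremental recovery described as TODO above could work roughly like this toy sketch (illustrative only, not how the OSD actually recovers, per the answer; the chunk size and helper names are invented): compare per-chunk checksums and ship only the chunks that differ, instead of resending the whole object.

```python
# Toy sketch of incremental (delta-based) object recovery.

import hashlib

CHUNK = 64 * 1024

def chunk_sums(data):
    # One checksum per fixed-size chunk of the object.
    return [hashlib.sha1(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

def delta(src, stale_sums):
    """Return (index, bytes) for each chunk of src whose checksum
    differs from the stale replica's record."""
    out = []
    for i, s in enumerate(chunk_sums(src)):
        if i >= len(stale_sums) or stale_sums[i] != s:
            out.append((i, src[i * CHUNK:(i + 1) * CHUNK]))
    return out

obj = bytearray(4 * CHUNK)
stale = chunk_sums(bytes(obj))            # replica 3's view before it went down
obj[CHUNK + 10:CHUNK + 20] = b"x" * 10    # small write while replica 3 was down
changed = delta(bytes(obj), stale)
print(len(changed))   # only the one touched chunk needs to be resent
```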
> > -Also, can reads be spread out over replicas?
>
> This might be a nice optimization to reduce seek
> times under certain conditions, when there are no
> writers or the writer is the only reader (and thus
> is aware of all the writes even before they
> complete). Under these conditions it seems like it
> would be possible to not enforce the "tail reading"
> order of replicas and thus additionally benefit
> from "read stripping" across the replicas the way
> many raid implementations do with RAID1.
>
> I thought that this might be particularly useful
> for RBD when it is used exclusively (say by mounting
> a local FS) since even with replicas, it seems like
> it could then relax the replica tail reading
> constraint.
The idea certainly has its appeal, and I played with it for a while a few
years back. At that time I had a _really_ hard time trying to manufacture
a workload scenario where it actually made things faster and not slower.
In general, spreading out reads will pollute caches (e.g., spreading
across two replicas means caches are half as effective).
What I tried to do was use fast heartbeats between OSDs to share average
request queue lengths, so that the primary could 'shed' a read request to
a replica if its queue length/request latency was significantly shorter.
I wasn't really able to make it work.
In the case of very hot objects, the primary will already have it in
cache, and the fastest thing is to just serve it up immediately. Unless
the network port is fully saturated. For cold objects, shedding could
help, but only if there is a sufficient load disparity between replicas to
compensate for the overhead of shedding. At the time I had trouble
simulating either situation. Also, the client/osd interface has changed
such that only clients initiate connections, so the previous shed
path (client -> osd1 -> osd2 -> client) won't work.
We're certainly open to any ideas in this area...
sage
* RBD/OSD questions
@ 2010-05-06 16:07 Martin Fick
2010-05-06 17:14 ` Sage Weil
0 siblings, 1 reply; 14+ messages in thread
From: Martin Fick @ 2010-05-06 16:07 UTC (permalink / raw)
To: ceph-devel
I have a few more questions.
-Can files stored in the OSD heal "incrementally"?
Suppose there are 3 replicas for a large file and that
a small byte range change occurs while replica 3 is
down. Will replica 3 heal efficiently when it
returns? Will only the small changed byte range
be transferred?
-Also, can reads be spread out over replicas?
This might be a nice optimization to reduce seek
times under certain conditions, when there are no
writers or the writer is the only reader (and thus
is aware of all the writes even before they
complete). Under these conditions it seems like it
would be possible to not enforce the "tail reading"
order of replicas and thus additionally benefit
from "read stripping" across the replicas the way
many raid implementations do with RAID1.
I thought that this might be particularly useful
for RBD when it is used exclusively (say by mounting
a local FS) since even with replicas, it seems like
it could then relax the replica tail reading
constraint.
Any thoughts? Thanks,
-Martin