* Reliability model for RADOS - effects during second failures
@ 2014-07-02 22:33 Koleos Fuscus
  2014-07-03  5:09 ` Kyle Bader
  2014-07-03  7:10 ` Loic Dachary
  0 siblings, 2 replies; 5+ messages in thread
From: Koleos Fuscus @ 2014-07-02 22:33 UTC (permalink / raw)
  To: Loic Dachary, Kyle Bader; +Cc: Sage Weil, ceph-devel

Hi Kyle, Loic,

The current code uses a “FIT rate multiplier” to include for instance
the effect of operations done in parallel. That multiplier (n) has an
effect on Pfail. In the initial failure, it is calculated using the
number of replicas and the stripe count as seen in
https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L86.

What doesn’t make sense to me is the way the multiplier is
calculated for the failure of the remaining copies in
https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L92
Why are the stripes not taken into account? What is the purpose of
using the “declustering factor” in that equation? Is that equation
correct? I read this note by Sage
https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg01650.html
trying to clarify the role of PGs, but it didn’t help me understand it.

Besides, I have a simple question related to the equation on L86 for
the initial failure. The striping process splits user content into a
number of objects, which is equivalent to the stripe count. That group
of objects constitutes an object set. Each object is composed of one
or more stripe units. All stripe units (stripe count) are written in
parallel. Typically each object is mapped to a different disk. What
happens when the object set is full and a new object is started? Are
these new objects assigned to the same disks used for the previous full
object set?
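
The striping described above can be sketched in a few lines of Python. This is only an illustration, assuming the layout described in the Ceph architecture documentation (stripe unit, stripe count, object size); the function name `locate` and the example sizes are hypothetical and not taken from RadosRely.py. It shows that a full object set is followed by a new set with new object numbers. Which disks those objects land on is decided separately by placement and is not modelled here.

```python
def locate(offset, stripe_unit, stripe_count, object_size):
    """Map a byte offset of the user content to (object_set, object_no).

    Assumed layout: stripe units are written round-robin across
    `stripe_count` objects; when every object in the set has reached
    `object_size`, a new object set with new objects is started.
    """
    units_per_object = object_size // stripe_unit
    unit_no = offset // stripe_unit          # index of the stripe unit
    stripe_no = unit_no // stripe_count      # index of the stripe (row)
    stripe_pos = unit_no % stripe_count      # column inside the stripe
    object_set = stripe_no // units_per_object
    object_no = object_set * stripe_count + stripe_pos
    return object_set, object_no


# Hypothetical sizes: 4 objects per set, 2 stripe units per object.
SU, SC, OS = 64 * 1024, 4, 128 * 1024
for unit in range(10):
    print(unit, locate(unit * SU, SU, SC, OS))
```

With these sizes an object set holds 8 stripe units, so the ninth unit starts object set 1 in a new object.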

Best

koleosfuscus

________________________________________________________________
"My reply is: the software has no known bugs, therefore it has not
been updated."
Wietse Venema
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Reliability model for RADOS - effects during second failures
  2014-07-02 22:33 Reliability model for RADOS - effects during second failures Koleos Fuscus
@ 2014-07-03  5:09 ` Kyle Bader
  2014-07-04  0:58   ` Koleos Fuscus
  2014-07-03  7:10 ` Loic Dachary
  1 sibling, 1 reply; 5+ messages in thread
From: Kyle Bader @ 2014-07-03  5:09 UTC (permalink / raw)
  To: Koleos Fuscus; +Cc: Loic Dachary, Sage Weil, ceph-devel

> The current code uses a “FIT rate multiplier” to include for instance
> the effect of operations done in parallel. That multiplier (n) has an
> effect on Pfail. In the initial failure, it is calculated using the
> number of replicas and the stripe count as seen in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L86.

I'm not sure what term we want to use for the thing whose durability
we are calculating, but for the sake of this explanation I'll use
"artifact", which will refer to a collection of objects that compose
a:

1. RADOS object (stripe count=1)
2. RBD volume
3. RGW S3 or Swift object
4. RGW metadata pools
5. I'm probably forgetting something

My interpretation of the model's progression is:

1. Global population of placement groups, perhaps because we need the
entire pool intact, e.g. RGW metadata pools (upper bound for stripe
count).
2. Subsection of placement groups with which we will place portions of
our artifact, e.g. based on size of RBD/RGW artifacts striped across
RADOS objects.
3. Multiplier, to account for the fact that the placement group will
become degraded if any of its members are marked out due to failure.

> What doesn’t make sense to me is the way the multiplier is
> calculated for the failure of the remaining copies in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L92
> Why are the stripes not taken into account?

Stripes are not taken into account because at this point in the model
we are calculating the chances of the degraded placement group
becoming further degraded by suffering the loss of another member.
Failures of other placement groups in the same stripe, during the
recovery of our placement group should be calculated as an independent
event.

> What is the purpose of
> using the “declustering factor” in that equation?

My understanding is that the declustering factor is synonymous with
placement groups (PGs).

> Is that equation
> correct? I read this note by Sage
> https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg01650.html
> trying to clarify the role of PGs, but it didn’t help me understand it.

To distribute objects across the cluster we need to divvy up objects
into groupings; in the context of Ceph those groupings are PGs
(placement groups). There is a cost associated with maintaining each
placement group, and the benefit is that finer distribution granularity
can improve utilization at the high end. This should be reflected in the
full/nearfull tunables we set for our cluster:

http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity

> Besides, I have a simple question related to the equation on L86 for
> the initial failure. The striping process splits user content into a
> number of objects, which is equivalent to the stripe count. That group
> of objects constitutes an object set. Each object is composed of one
> or more stripe units. All stripe units (stripe count) are written in
> parallel. Typically each object is mapped to a different disk. What
> happens when the object set is full and a new object is started?

It places a second (or more) object in one of the placement groups
that already has another object belonging to the same artifact. In
this way you can have arbitrarily sized artifacts and still limit the
number of placement groups in order to reduce the probability of
failure.

-- 
Kyle Bader - Inktank
Senior Solution Architect

* Re: Reliability model for RADOS - effects during second failures
  2014-07-02 22:33 Reliability model for RADOS - effects during second failures Koleos Fuscus
  2014-07-03  5:09 ` Kyle Bader
@ 2014-07-03  7:10 ` Loic Dachary
  2014-07-07 14:55   ` Koleos Fuscus
  1 sibling, 1 reply; 5+ messages in thread
From: Loic Dachary @ 2014-07-03  7:10 UTC (permalink / raw)
  To: Koleos Fuscus; +Cc: ceph-devel

Hi koleosfuscus,

On 03/07/2014 00:33, Koleos Fuscus wrote:
> Hi Kyle, Loic,
> 
> The current code uses a “FIT rate multiplier” to include for instance
> the effect of operations done in parallel. That multiplier (n) has an
> effect on Pfail. In the initial failure, it is calculated using the
> number of replicas and the stripe count as seen in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L86.
> 
> What doesn’t make sense to me is the way the multiplier is
> calculated for the failure of the remaining copies in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L92
> Why are the stripes not taken into account? What is the purpose of
> using the “declustering factor” in that equation? Is that equation
> correct? I read this note by Sage
> https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg01650.html
> trying to clarify the role of PGs, but it didn’t help me understand it.

At the risk of adding confusion to the discussion, does the current reliability model make room for what is described in anrg.usc.edu/~maheswaran/Xorbas.pdf under "4. Reliability Analysis"? In other words, is there a place where one could set things like "disks fail % of the time", "network is X Gb/s" and "repairing a disk failure requires reading B bytes from M disks"? As far as I understand, such factors cannot be expressed with a single formula, and this is why a Markov model is useful.
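
To make the Markov idea concrete, here is a minimal sketch of a birth-death chain for an n-way replicated object. It is neither the model from the Xorbas paper nor anything found in RadosRely.py; the function name, the single constant repair rate, and the example rates are all assumptions chosen for illustration. State i means that i replicas are currently lost, and reaching state n (all replicas lost) is data loss.

```python
def mttdl(replicas, fail_rate, repair_rate):
    """Mean time to data loss for a simple birth-death Markov chain.

    From state i (i replicas lost), one of the remaining replicas fails
    at rate (replicas - i) * fail_rate, and a repair brings the chain
    back to state i - 1 at rate repair_rate (assumed constant).
    """
    total = 0.0
    step = 0.0  # after each pass: expected time to first go from i to i + 1
    for lost in range(replicas):
        fail = (replicas - lost) * fail_rate
        repair = repair_rate if lost > 0 else 0.0
        # E_i = (1 + repair_i * E_{i-1}) / fail_i
        step = (1.0 + repair * step) / fail
        total += step
    return total


# Hypothetical rates per hour: one failure per ~100,000 h, repair in ~24 h.
lam, mu = 1e-5, 1.0 / 24
print(mttdl(2, lam, mu))  # hours until both replicas are lost
print(mttdl(3, lam, mu))
```

For two replicas this reduces to the familiar mirrored-pair result (3*lam + mu) / (2*lam**2). Factors such as network bandwidth or the number of bytes read per repair would enter through the repair rate.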

> Besides, I have a simple question related to the equation on L86 for
> the initial failure. The striping process splits user content into a
> number of objects, which is equivalent to the stripe count. That group
> of objects constitutes an object set. Each object is composed of one
> or more stripe units. All stripe units (stripe count) are written in
> parallel. Typically each object is mapped to a different disk. What
> happens when the object set is full and a new object is started? Are
> these new objects assigned to the same disks used for the previous full
> object set?

In an ideal situation, if a disk / OSD is full it means the whole cluster is full. Is it reasonable to ignore this situation when thinking about the reliability model? If not, could you explain how?

Cheers 
> 
> Best
> 
> koleosfuscus
> 
> ________________________________________________________________
> "My reply is: the software has no known bugs, therefore it has not
> been updated."
> Wietse Venema
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: Reliability model for RADOS - effects during second failures
  2014-07-03  5:09 ` Kyle Bader
@ 2014-07-04  0:58   ` Koleos Fuscus
  0 siblings, 0 replies; 5+ messages in thread
From: Koleos Fuscus @ 2014-07-04  0:58 UTC (permalink / raw)
  To: Kyle Bader; +Cc: Loic Dachary, Sage Weil, ceph-devel

Hello Kyle,

Thanks for your e-mail.

> 1. RADOS object (stripe count=1)

If I understand correctly, a RADOS object can be stored in a stripe
with count=n; maybe 1 is the default.

> My interpretation of the model's progression is:
> 1. Global population of placement groups, perhaps because we need the
> entire pool intact, e.g. RGW metadata pools (upper bound for stripe
> count).
> 2. Subsection of placement groups with which we will place portions of
> our artifact, e.g. based on size of RBD/RGW artifacts striped across
> RADOS objects.
> 3. Multiplier, to account for the fact that the placement group will
> become degraded if any of its members are marked out due to failure.

I cannot understand what you said above. The current tool refers to a
RADOS object. Do we need to differentiate things at a finer grain (RBD,
RGW)? I am not sure it is relevant.

I will quote some of the text from
https://github.com/ceph/ceph-tools/blob/master/models/reliability/README.html

"This is a model of the durability of a single, arbitrary
object....That object lives in a PG."

I think it is more correct to say that the object doesn't live in a
PG but in a pool. If the pool is replicated, the number of PGs inside
a pool is (#OSDs x #PGs_per_OSD) / #replicas (rounded to the nearest
power of two).
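
The rule of thumb above can be written out as a short sketch. The function name and the example numbers are hypothetical, and rounding to the nearest power of two simply follows the sentence above; Ceph's own sizing guidance may round differently.

```python
def pg_count(num_osds, pgs_per_osd, replicas):
    """Estimate PGs in a replicated pool, rounded to the nearest power of two."""
    raw = num_osds * pgs_per_osd / replicas
    lower = 1
    while lower * 2 <= raw:
        lower *= 2
    upper = lower * 2
    # Pick whichever power of two is closer to the raw estimate.
    return lower if raw - lower < upper - raw else upper


print(pg_count(4, 100, 3))  # raw estimate is about 133.3
```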

Now we can list the components that can fail in our model. An
OSD node can fail. An OSD node can contain many disks, and each disk can
fail.

What does a PG failure mean? Does it make sense to have many PGs (from
the same pool) on the same disk? If multiple PGs reside on the same disk,
can a failure of a PG refer to a failure of a disk sector?

First failure:
At this point, we need to introduce stripes into the equation. Since
the original object gets striped and the stripes go to different OSDs,
the stripe count is important. Therefore, the FIT rate multiplier
includes "replicas*stripes" to calculate Pfail. That makes sense to
me.
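
As a rough sketch of that calculation: a FIT rate counts failures per billion device-hours, so the multiplier scales the rate before it is turned into a probability over a period. The exponential (Poisson) form, the function name, and the example numbers below are assumptions for illustration and are not copied from RadosRely.py.

```python
import math


def p_fail(fit_rate, multiplier, hours):
    """Probability of at least one failure within `hours`.

    fit_rate is in failures per 10**9 hours; `multiplier` is the number
    of devices whose failure counts (the "FIT rate multiplier").
    Assumes independent failures with a constant rate.
    """
    expected_failures = fit_rate * multiplier * hours / 1e9
    return 1.0 - math.exp(-expected_failures)


# Hypothetical values: 3 replicas, 4 stripes, one year, FIT rate of 5000.
replicas, stripes = 3, 4
print(p_fail(5000, replicas * stripes, 365 * 24))
```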

>
> Stripes are not taken into account because at this point in the model
> we are calculating the chances of the degraded placement group
> becoming further degraded by suffering the loss of another member.
> Failures of other placement groups in the same stripe, during the
> recovery of our placement group should be calculated as an independent
> event.
>

I think I follow. But the concept of pg/declustering is still giving
me some concerns.

To illustrate, I will use a toy example:
1. Object (example object: block of 100KB)
2. Object is striped in a 4-unit stripe: obj1 obj2 obj3 obj4 (each of 25KB)
3. Object is replicated 3-way: obj1_rep1, obj1_rep2, obj1_rep3, obj2_rep1, ....
4. Object is placed in different OSDs, and maybe in different PGs
inside the same OSDs
Imagine this situation for 4 OSDs and 100 PGs per OSD:
OSD1: obj1_rep1, obj2_rep2...
OSD2: obj2_rep1, obj3_rep2, obj1_rep3...
OSD3: obj3_rep1, obj4_rep2...
OSD4: obj4_rep1, obj1_rep2...

Now, imagine that OSD1 fails. Let's say OSD1 has only one PG, so all
the chunks inside OSD1 are missing. We focus our study on the
durability of obj1. With the first failure, obj1_rep1 is lost. In
addition, obj2_rep2 is also missing, but we ignore other elements of
the same stripe. As you said, we are not interested in independent
elements of degraded stripes... (some doubts remain regarding whether
or not this obj2_rep2 should be considered in the repair process)
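
The toy layout can be written down as data to check which chunks disappear with OSD1. This sketch uses only the chunks named explicitly in the listing above (the entries hidden behind "..." are left out), and the helper names are hypothetical.

```python
# Chunk placement as listed above; entries hidden behind "..." are omitted.
placement = {
    "OSD1": ["obj1_rep1", "obj2_rep2"],
    "OSD2": ["obj2_rep1", "obj3_rep2", "obj1_rep3"],
    "OSD3": ["obj3_rep1", "obj4_rep2"],
    "OSD4": ["obj4_rep1", "obj1_rep2"],
}


def lost_chunks(placement, failed_osd):
    """Chunks that become unavailable when `failed_osd` fails."""
    return list(placement[failed_osd])


def surviving_replicas(placement, failed_osd, obj):
    """Map each surviving replica of `obj` to the OSD that holds it."""
    return {
        chunk: osd
        for osd, chunks in placement.items()
        if osd != failed_osd
        for chunk in chunks
        if chunk.startswith(obj + "_")
    }


print(lost_chunks(placement, "OSD1"))
print(surviving_replicas(placement, "OSD1", "obj1"))
```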

The repair process is launched after the first failure. It needs to
copy all replicas to a spare OSD. I understand that declustering is
necessary for performance, but... why is it used here in the model?

A second failure occurs. The FIT rate multiplier considers '#copies-1'
and the 'declustering factor/PGs'.
The period used to calculate Pfail is not the lifetime of the object but
the repair time. Repair time is the bytes to be recovered divided
by repair speed and declustering factor. Adding the declustering factor
to the FIT multiplier actually cancels the declustering factor in the
repair time. So I wonder why it is considered in the repair time in the
first place. Is it equivalent to the stripe count (pg=4 instead of the
default value=100)?
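
The cancellation described above can be checked numerically. This is a sketch under the assumptions stated in this mail (multiplier = (copies - 1) x declustering factor, repair time = bytes / (repair speed x declustering factor)); the function name and the numbers are hypothetical.

```python
def expected_second_failures(fit_rate, copies, declustering,
                             bytes_to_recover, repair_speed):
    """Expected failures among surviving copies during the repair window."""
    multiplier = (copies - 1) * declustering
    repair_hours = bytes_to_recover / (repair_speed * declustering) / 3600.0
    return fit_rate * multiplier * repair_hours / 1e9


# Hypothetical: 1 TB to recover at 50 MB/s, 3 copies, FIT rate of 5000.
for declustering in (1, 4, 100):
    print(declustering,
          expected_second_failures(5000, 3, declustering, 1e12, 50e6))
```

The printed values agree up to floating-point rounding, which is the cancellation in question: in this form the declustering factor drops out of the product of multiplier and repair time.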

Best,
koleosfuscus

* Re: Reliability model for RADOS - effects during second failures
  2014-07-03  7:10 ` Loic Dachary
@ 2014-07-07 14:55   ` Koleos Fuscus
  0 siblings, 0 replies; 5+ messages in thread
From: Koleos Fuscus @ 2014-07-07 14:55 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel, Kyle Bader

Hi Loic,

> At the risk of adding confusion to the discussion, does

Indeed, you are right, answering questions with new questions adds confusion ;)
I will open another thread to discuss your e-mail.

I am aware that it might be difficult to answer my previous mail,
but I need to understand which parts of Ceph are being modelled in
the original tool. The documentation is too vague. The author does not
even mention Markov anywhere in the code documentation.

Cheers,
Koleosfuscus
