* Forcing Ceph into mapping all objects to a single PG
@ 2014-07-21 22:27 Daniel Hofmann
  2014-07-21 22:50 ` Gregory Farnum
  2014-07-31  0:29 ` Daniel Hofmann
  0 siblings, 2 replies; 8+ messages in thread
From: Daniel Hofmann @ 2014-07-21 22:27 UTC (permalink / raw)
  To: ceph-devel

Preamble: you might want to read the decently formatted version of this
mail at:
> https://gist.github.com/daniel-j-h/2daae2237bb21596c97d


Motivation
----------

Recently I was wondering how Ceph (i.e. Rados in particular) maps objects
to Placement Groups (PGs).

It basically works like the following:

    pg = hash(object)
    osd = crush(pg)

By hashing the object's name (not the content!) Ceph determines the PG.
For assigning PGs to OSDs, the CRUSH algorithm is used.
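
To make the two steps concrete, here is a rough sketch. Both steps are
stand-ins (a toy FNV-1a hash and a plain modulo rather than the real
rjenkins/CRUSH code); only the structure -- name to PG to OSD -- mirrors
the actual pipeline:

    #include <cstdint>
    #include <iostream>
    #include <string>

    // Toy stand-in for Ceph's rjenkins name hash (FNV-1a, illustration only).
    uint32_t toy_hash(const std::string& name) {
      uint32_t h = 2166136261u;
      for (unsigned char c : name) { h ^= c; h *= 16777619u; }
      return h;
    }

    int main() {
      const uint32_t pg_num   = 128;  // hypothetical pool size
      const uint32_t num_osds = 6;    // hypothetical cluster size

      std::string object_name("my-object");
      uint32_t pg  = toy_hash(object_name) % pg_num;  // pg  = hash(object)
      uint32_t osd = pg % num_osds;                   // osd = crush(pg), crudely
      std::cout << object_name << " -> pg " << pg << " -> osd " << osd << "\n";
    }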

But what happens if the hash function always returns the same value,
i.e. what happens on hash collisions?
Does Ceph really map objects to the same PG then?

I asked about this on the IRC channel and was a bit surprised to get the
response that if Ceph really maps all objects to a single PG, I'm simply
out of luck:

> with enough PG and enough objects of reasonable sizes, it should yield
a decent data distribution

This confirms that the hash function is indeed responsible for the
distribution of objects to PGs.

So, what could possibly go wrong? Let's try this!


Setting up the environment
--------------------------

I modified the librados usage example
(src/examples/librados/hello_world.cc) to write objects with specific
names:

    std::string object_name("\x41\x5b\xd2\xc2\xc3");  // bear with me here
    //std::string object_name("\x41\x1e\xf8\x5e\x22");
    //std::string object_name("\x01\x01\x01\x17\x2e");

After setting up a local cluster via the vstart.sh tool and writing
those three objects, executing

    ./ceph pg dump | less -S

indeed shows our three **objects getting mapped to the same PG**!


Collisions
----------

How is it possible to come up with such colliding names? It's actually
quite simple:

Ceph defines two hash functions (src/common/ceph_hash.cc), of which the
so-called rjenkins hash function is used by default.

(Note: src/crush/hash.c actually implements the rjenkins function again
-- why?)

The hash function's implementation is deterministic in the sense that
the output depends only on its input.
No random or setup-specific seed is used.

What implications does this have?

Once we compute collisions, we're able to force the object mapping on
**every rados object store out there**.
And because the hash function's output is only 32 bits wide, collisions
are not that hard to find at all!

Here are a few if you're following along and want to test your setup:

| hash (u32) |byte sequence (hex) |
| -----------|--------------------|
| 381933444  | 41 5b d2 c2 c3     |
| 381933444  | 41 1e f8 5e 22     |
| 381933444  | 1 1 1 17 2e        |
| 381933444  | 1 a4 34 f6 c0      |
| 381933444  | 42 ca 78 4f 47     |
| 381933444  | 82 e8 ea fe 16     |


Implications
-------------

I did this as a proof of concept in a few hours, but I'm still not sure
about the full implications.
I'm able to force object mappings. Can this be used as a DoS attack
vector? Maybe.

I looked through the RadosGW source to find out how object names are
generated there.
I only found the second hash function implementation being used -- but
I'm not familiar enough with the source to say more about it at this time.


Open questions
--------------

I'd like to know more about what impact this could have. In particular:

* How do higher-level APIs (e.g. RadosGW) determine the object's name?
Can this be influenced or set by the user?
* Is it possible to introduce some randomness in order to mitigate the
scenario described above? Maybe on a per-cluster basis, or derived from
some internal state. (Note: SipHash does something similar.)


Summary
-------

Ceph's object mapping depends on the rjenkins hash function.
It's possible to force Ceph into mapping all objects to a single PG.

Please discuss!

Thanks,
Daniel J. Hofmann


* Re: Forcing Ceph into mapping all objects to a single PG
  2014-07-21 22:27 Forcing Ceph into mapping all objects to a single PG Daniel Hofmann
@ 2014-07-21 22:50 ` Gregory Farnum
  2014-07-22 22:44   ` Alex Elsayed
  2014-07-31  0:29 ` Daniel Hofmann
  1 sibling, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2014-07-21 22:50 UTC (permalink / raw)
  To: Daniel Hofmann; +Cc: ceph-devel

On Mon, Jul 21, 2014 at 3:27 PM, Daniel Hofmann <daniel@trvx.org> wrote:
> Preamble: you might want to read the decently formatted version of this
> mail at:
>> https://gist.github.com/daniel-j-h/2daae2237bb21596c97d
>
>
> Motivation
> ----------
>
> Recently I was wondering how Ceph (i.e. Rados in particular) maps objects
> to Placement Groups (PGs).
>
> It basically works like the following:
>
>     pg = hash(object)
>     osd = crush(pg)
>
> By hashing the object's name (not the content!) Ceph determines the PG.
> For assigning PGs to OSDs, the CRUSH algorithm is used.
>
> But what happens if the hash function always returns the same value,
> i.e. what happens on hash collisions?
> Does Ceph really map objects to the same PG then?
>
> I asked about this on the IRC channel and was a bit surprised to get the
> response that if Ceph really maps all objects to a single PG, I'm simply
> out of luck:
>
>> with enough PG and enough objects of reasonable sizes, it should yield
> a decent data distribution
>
> This confirms that the hash function is indeed responsible for the
> distribution of objects to PGs.
>
> So, what could possibly go wrong? Let's try this!
>
>
> Setting up the environment
> --------------------------
>
> I modified the librados usage example
> (src/examples/librados/hello_world.cc) to write objects with specific
> names:
>
>     std::string object_name("\x41\x5b\xd2\xc2\xc3");  // bear with me here
>     //std::string object_name("\x41\x1e\xf8\x5e\x22");
>     //std::string object_name("\x01\x01\x01\x17\x2e");
>
> After setting up a local cluster via the vstart.sh tool and writing
> those three objects, executing
>
>     ./ceph pg dump | less -S
>
> indeed shows our three **objects getting mapped to the same PG**!
>
>
> Collisions
> ----------
>
> How is it possible to come up with such colliding names? It's actually
> quite simple:
>
> Ceph defines two hash functions (src/common/ceph_hash.cc), of which the
> so-called rjenkins hash function is used by default.
>
> (Note: src/crush/hash.c actually implements the rjenkins function again
> -- why?)
>
> The hash function's implementation is deterministic in the sense that
> the output depends only on its input.
> No random or setup-specific seed is used.
>
> What implications does this have?
>
> Once we compute collisions, we're able to force the object mapping on
> **every rados object store out there**.
> And because the hash function's output is only 32 bits wide, collisions
> are not that hard to find at all!
>
> Here are a few if you're following along and want to test your setup:
>
> | hash (u32) |byte sequence (hex) |
> | -----------|--------------------|
> | 381933444  | 41 5b d2 c2 c3     |
> | 381933444  | 41 1e f8 5e 22     |
> | 381933444  | 1 1 1 17 2e        |
> | 381933444  | 1 a4 34 f6 c0      |
> | 381933444  | 42 ca 78 4f 47     |
> | 381933444  | 82 e8 ea fe 16     |
>
>
> Implications
> -------------
>
> I did this as a proof of concept in a few hours, but I'm still not sure
> about the full implications.
> I'm able to force object mappings. Can this be used as a DoS attack
> vector? Maybe.
>
> I looked through the RadosGW source to find out how object names are
> generated there.
> I only found the second hash function implementation being used -- but
> I'm not familiar enough with the source to say more about it at this time.
>
>
> Open questions
> --------------
>
> I'd like to know more about what impact this could have. In particular:
>
> * How do higher-level APIs (e.g. RadosGW) determine the object's name?
> Can this be influenced or set by the user?
> * Is it possible to introduce some randomness in order to mitigate the
> scenario described above? Maybe on a per-cluster basis, or derived from
> some internal state. (Note: SipHash does something similar.)
>
>
> Summary
> -------
>
> Ceph's object mapping depends on the rjenkins hash function.
> It's possible to force Ceph into mapping all objects to a single PG.
>
> Please discuss!

Yes, this is an attack vector. It functions against...well, any system
using hash-based placement.
RGW mangles names on its own, although the mangling is deterministic
enough that an attacker could perhaps manipulate it into mangling them
onto the same PG. (Within the constraints, though, it'd be pretty
difficult.)
RBD names objects in a way that users can't really control, so I guess
it's safe, sort of? (But users of rbd will still have write permission
to some class of objects in which they may be able to find an attack.)

The real issue, though, is that any user with permission to write to
*any* set of objects directly in the cluster will be able to exploit
this regardless of what barriers we erect. Deterministic placement, in
that anybody directly accessing the cluster can compute data
locations, is central to Ceph's design. We could add "salts" or
something to try and prevent attackers from *outside* the direct set
(eg, users of RGW) exploiting it directly, but anybody who can read or
write from the cluster would need to be able to read the salt in order
to compute locations themselves. So I'm pretty sure this attack vector
is:
1) Inherent to all hash-placement systems,
2) not something we can reasonably defend against *anyway*.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: Forcing Ceph into mapping all objects to a single PG
  2014-07-21 22:50 ` Gregory Farnum
@ 2014-07-22 22:44   ` Alex Elsayed
  2014-07-22 22:46     ` Alex Elsayed
  2014-07-25 12:26     ` Daniel Hofmann
  0 siblings, 2 replies; 8+ messages in thread
From: Alex Elsayed @ 2014-07-22 22:44 UTC (permalink / raw)
  To: ceph-devel

Gregory Farnum wrote:

> On Mon, Jul 21, 2014 at 3:27 PM, Daniel Hofmann <daniel@trvx.org> wrote:
>> Preamble: you might want to read the decent formatted version of this
>> mail at:
>>> https://gist.github.com/daniel-j-h/2daae2237bb21596c97d
<snip aggressively>
>> -------
>>
>> Ceph's object mapping depends on the rjenkins hash function.
>> It's possible to force Ceph into mapping all objects to a single PG.
>>
>> Please discuss!
> 
> Yes, this is an attack vector. It functions against...well, any system
> using hash-based placement.

Sort of. How well it functions is a function (heh) of how easy it is to find 
a preimage against the hash (a collision only gives you a pair; you need 
preimages to get beyond that).

With fletcher4, preimages aren't particularly difficult to find. With a more 
robust hash[1], preimages become more computationally expensive, since you 
need to brute-force each value rather than exploit a weakness in the 
algorithm.

This doesn't buy a huge amount, since the brute-force effort per iteration is 
still bounded by the number of PGs, but it does help - and it means that as 
PGs are split, resistance to the attack increases as well.

> RGW mangles names on its own, although the mangling is deterministic
> enough that an attacker could perhaps manipulate it into mangling them
> onto the same PG. (Within the constraints, though, it'd be pretty
> difficult.)
> RBD names objects in a way that users can't really control, so I guess
> it's safe, sort of? (But users of rbd will still have write permission
> to some class of objects in which they may be able to find an attack.)
> 
> The real issue, though, is that any user with permission to write to
> *any* set of objects directly in the cluster will be able to exploit
> this regardless of what barriers we erect. Deterministic placement, in
> that anybody directly accessing the cluster can compute data
> locations, is central to Ceph's design. We could add "salts" or
> something to try and prevent attackers from *outside* the direct set
> (eg, users of RGW) exploiting it directly, but anybody who can read or
> write from the cluster would need to be able to read the salt in order
> to compute locations themselves.

Actually, doing (say) per-pool salts does help in a notable way: even 
someone who can write to two pools can't reuse the computation of colliding 
values across pools. It forces them to expend the work factor for each pool 
they attack, rather than being able to amortize.

> So I'm pretty sure this attack vector
> is:
> 1) Inherent to all hash-placement systems,
> 2) not something we can reasonably defend against *anyway*.

I'd agree that in the absolute sense it's inherent and insoluble, but that 
doesn't imply that _mitigations_ are worthless.

A more drastic option would be to look at how the sfq network scheduler 
handles it - it hashes flows onto a fixed number of queues, and gets around 
collisions by periodically perturbing the salt (resulting in a _stochastic_ 
avoidance of clumping). It'd definitely require some research to find a way 
to do this such that it doesn't cause huge data movement, but it might be 
worth thinking about for the longer term.
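
To make the trade-off concrete, here's a toy sketch (a stand-in FNV hash
rather than sfq's actual hash; the point is only that rotating the salt
re-shuffles nearly every name, which for PGs translates into data movement):

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Stand-in 32-bit hash (FNV-1a); only meant to illustrate the effect of
    // perturbing the salt, not how sfq or Ceph actually hash.
    uint32_t toy_hash(const std::string& s) {
      uint32_t h = 2166136261u;
      for (unsigned char c : s) { h ^= c; h *= 16777619u; }
      return h;
    }

    uint32_t bucket(const std::string& name, uint32_t salt, uint32_t num_buckets) {
      return toy_hash(std::to_string(salt) + '\0' + name) % num_buckets;
    }

    int main() {
      const uint32_t num_buckets = 128;
      std::vector<std::string> names;
      for (int i = 0; i < 1000; ++i) names.push_back("object-" + std::to_string(i));

      // Changing the salt breaks up any clumping under the old salt, but at the
      // price that most names land in a different bucket afterwards.
      int moved = 0;
      for (const std::string& n : names)
        if (bucket(n, /*salt=*/1, num_buckets) != bucket(n, /*salt=*/2, num_buckets))
          ++moved;
      std::printf("%d of %zu names changed buckets after perturbation\n",
                  moved, names.size());
    }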

[1] I'm thinking along the lines of SipHash, not any heavy-weight 
cryptographic hash; however with network latencies on the table those might 
not be too bad regardless



* Re: Forcing Ceph into mapping all objects to a single PG
  2014-07-22 22:44   ` Alex Elsayed
@ 2014-07-22 22:46     ` Alex Elsayed
  2014-07-25 12:26     ` Daniel Hofmann
  1 sibling, 0 replies; 8+ messages in thread
From: Alex Elsayed @ 2014-07-22 22:46 UTC (permalink / raw)
  To: ceph-devel

Gah, typed "fletcher4" when I meant "rjenkins" - still, the same applies.



* Re: Forcing Ceph into mapping all objects to a single PG
  2014-07-22 22:44   ` Alex Elsayed
  2014-07-22 22:46     ` Alex Elsayed
@ 2014-07-25 12:26     ` Daniel Hofmann
  2014-07-25 14:12       ` Sage Weil
  1 sibling, 1 reply; 8+ messages in thread
From: Daniel Hofmann @ 2014-07-25 12:26 UTC (permalink / raw)
  To: Alex Elsayed, ceph-devel

The main issue, however, is not the hash's strength but the fact that,
once precomputed, I'm able to use preimages on **every Ceph cluster out
there**. (The hash function's output is a deterministic function of
the object's name only.)

I agree in that the general issue is inherent in hash-placement systems.

But what I don't agree with is the following:

Why do I have to be able to calculate my object's placement for **every
Ceph cluster** out there?

Why does it not suffice for me to be able to calculate the placement
only for the cluster I'm currently accessing?

From a logical standpoint it seems reasonable. Why then, are we not able
to constrain the placement calculation in that regard?


If the placement is bound to a specific cluster it should suffice to
derive e.g. a key for SipHash based on cluster specifics.

Is this doable from an implementation point of view?
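
To illustrate what I have in mind, a minimal sketch with a stand-in hash
(the real construction would be a properly keyed function such as SipHash,
and, as Gregory pointed out, any client that computes placements would
still need to be able to read the key):

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Stand-in 32-bit hash; the real thing would be a keyed function such as
    // SipHash, not FNV-1a with a prefixed key.
    uint32_t toy_hash(const std::string& s) {
      uint32_t h = 2166136261u;
      for (unsigned char c : s) { h ^= c; h *= 16777619u; }
      return h;
    }

    // Mix a cluster-specific key (e.g. derived from the fsid) into the name hash.
    uint32_t keyed_name_hash(const std::string& cluster_key, const std::string& name) {
      return toy_hash(cluster_key + '\0' + name);   // crude domain separation
    }

    int main() {
      const std::string name("\x41\x5b\xd2\xc2\xc3");  // collides under plain rjenkins
      const uint32_t pg_num = 128;

      // The same name maps to different PGs on clusters with different keys,
      // so a precomputed collision set no longer transfers between clusters.
      std::printf("cluster A: pg %u\n", keyed_name_hash("fsid-aaaa", name) % pg_num);
      std::printf("cluster B: pg %u\n", keyed_name_hash("fsid-bbbb", name) % pg_num);
    }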


Note: I only did this as a proof of concept for the object store.
Think about the implications if you're able to do this e.g. for every
RadosGW out there and for services using RadosGW.


* Re: Forcing Ceph into mapping all objects to a single PG
  2014-07-25 12:26     ` Daniel Hofmann
@ 2014-07-25 14:12       ` Sage Weil
  2014-07-25 18:14         ` Alex Elsayed
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2014-07-25 14:12 UTC (permalink / raw)
  To: Daniel Hofmann; +Cc: Alex Elsayed, ceph-devel

On Fri, 25 Jul 2014, Daniel Hofmann wrote:
> The main issue, however, is not the hash's strength but the fact that,
> once precomputed, I'm able to use preimages on **every Ceph cluster out
> there**. (The hash function's output is a deterministic function of
> the object's name only.)
> 
> I agree in that the general issue is inherent in hash-placement systems.
> 
> But what I don't agree with is the following:
> 
> Why do I have to be able to calculate my object's placement for **every
> Ceph cluster** out there?
> 
> Why does it not suffice for me to be able to calculate the placement
> only for the cluster I'm currently accessing?
> 
> From a logical standpoint it seems reasonable. Why then, are we not able
> to constrain the placement calculation in that regard?
> 
> 
> If the placement is bound to a specific cluster it should suffice to
> derive e.g. a key for SipHash based on cluster specifics.
> 
> Is this doable from an implementation point of view?
> 
> 
> Note: I only did this as a proof of concept for the object store.
> Think about the implications if you're able to do this e.g. for every
> RadosGW out there and for services using RadosGW.

It would be really easy to add a random salt to the pg_pool_t and feed 
that into the object -> pg hash mapping.  Note, by the way, that for new 
clusters the pool id is already fed in here, so there is a *tiny* bit of 
variation depending on what order the pools were created in (although 
probably not enough to meaningfully improve security).

We could also add a new hash type that is more secure.  Rjenkins is used 
by default but the choice of hash is already parameterized.

sage


* Re: Forcing Ceph into mapping all objects to a single PG
  2014-07-25 14:12       ` Sage Weil
@ 2014-07-25 18:14         ` Alex Elsayed
  0 siblings, 0 replies; 8+ messages in thread
From: Alex Elsayed @ 2014-07-25 18:14 UTC (permalink / raw)
  To: ceph-devel

Sage Weil wrote:

> On Fri, 25 Jul 2014, Daniel Hofmann wrote:
>> The main issue, however, is not the hash's strength but the fact that,
>> once precomputed, I'm able to use preimages on **every Ceph cluster out
>> there**. (The hash function's output is a deterministic function of
>> the object's name only.)
>> 
>> I agree in that the general issue is inherent in hash-placement systems.
>> 
>> But what I don't agree with is the following:
>> 
>> Why do I have to be able to calculate my object's placement for **every
>> Ceph cluster** out there?
>> 
>> Why does it not suffice for me to be able to calculate the placement
>> only for the cluster I'm currently accessing?
>> 
>> From a logical standpoint it seems reasonable. Why then, are we not able
>> to constrain the placement calculation in that regard?
>> 
>> 
>> If the placement is bound to a specific cluster it should suffice to
>> derive e.g. a key for SipHash based on cluster specifics.
>> 
>> Is this doable from an implementation point of view?
>> 
>> 
>> Note: I only did this as a proof of concept for the object store.
>> Think about the implications if you're able to do this e.g. for every
>> RadosGW out there and for services using RadosGW.
> 
> It would be really easy to add a random salt to the pg_pool_t and feed
> that into the object -> pg hash mapping.  Note, by the way, that for new
> clusters the pool id is already fed in here, so there is a *tiny* bit of
> variation depending on what order the pools were created in (although
> probably not enough to meaningfully improve security).

I just had a thought - is the FSID a parameter? If it is, then there's 
actually no need for a further salt: FSID is per cluster, preventing cross-
cluster precomputation, and if the pool ID is fed in then each pool needs 
collisions calculated separately as well. In other words, the (fsid, poolid) 
tuple is almost certainly sufficient salt in and of itself.

> We could also add a new hash type that is more secure.  Rjenkins is used
> by default but the choice of hash is already parameterized.
> 
> sage





* Re: Forcing Ceph into mapping all objects to a single PG
  2014-07-21 22:27 Forcing Ceph into mapping all objects to a single PG Daniel Hofmann
  2014-07-21 22:50 ` Gregory Farnum
@ 2014-07-31  0:29 ` Daniel Hofmann
  1 sibling, 0 replies; 8+ messages in thread
From: Daniel Hofmann @ 2014-07-31  0:29 UTC (permalink / raw)
  To: ceph-devel

Follow-up: I wrote a small Python script for evaluating the quality of the
object-to-placement-group mapping. It uses a chi-squared statistical
hypothesis test.

Here is the script, along with example output for a few rados benchmark
writes:

> https://gist.github.com/daniel-j-h/d7d87dfe5de3c5bbfd0f

Maybe you're interested in checking your cluster's mapping quality.

More on the method and the output's meaning:
> https://en.wikipedia.org/wiki/Chi-squared_test
> http://en.wikipedia.org/wiki/P-value
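
The statistic itself is simple; here is a minimal sketch of the idea (not
the linked script -- the per-PG counts would come from ceph pg dump, and
the resulting value still has to be checked against the chi-squared
distribution to get a p-value):

    #include <cstdio>
    #include <vector>

    int main() {
      // Hypothetical per-PG object counts, e.g. taken from "ceph pg dump".
      std::vector<long> counts = {103, 97, 111, 89, 95, 102, 108, 99};

      long total = 0;
      for (long c : counts) total += c;
      const double expected = double(total) / counts.size();   // uniform mapping

      double chi2 = 0.0;
      for (long c : counts) {
        const double d = c - expected;
        chi2 += d * d / expected;
      }

      // Compare chi2 against the chi-squared distribution with
      // counts.size() - 1 degrees of freedom to judge uniformity.
      std::printf("chi^2 = %.2f, df = %zu\n", chi2, counts.size() - 1);
    }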

I'm still interested in a cluster-specific mapping as discussed in the
other mailing list posts.

Cheers,
Daniel

