* Re: ceph data locality
From: Milosz Tanski @ 2014-09-04 15:59 UTC
To: ceph-devel

Johnu,

Keep in mind that HDFS was more or less designed, and thus optimized, for
MR jobs rather than general filesystem use. It was also optimized for the
hardware of the past, e.g. slower networks than today (1GigE or less).
There are lots of little hacks in Hadoop to optimize for that (for example,
local mmapped reads in the HDFS client). It will be tough to beat MR on
HDFS in that scenario. If Hadoop is a smaller piece of a larger data
pipeline (one that also includes non-Hadoop, regular filesystem work), then
it makes more sense.

Now, if you're talking about the hardware and networks of tomorrow (10GigE
or 40GigE), then locality of placement starts to matter less. For example,
the Mellanox people claim that they are able to get 20% more performance
out of Ceph in the 40GigE scenario.

And if we're designing for the network of the future, there's a lot we can
glean from the Quantcast filesystem, QFS (http://quantcast.github.io/qfs/).
Take a look at their recent publication:
http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p808-ovsiannikov.pdf
They essentially forked KFS, added erasure coding support, and created a
Hadoop filesystem driver for it. They were able to get much better write
performance by reducing write amplification (1.5x versus 3x copies), thus
reducing network traffic and possibly freeing up that precious bandwidth
for read traffic. They also claim to have improved read performance
compared to HDFS a tad.

QFS, unlike Ceph, places the erasure coding logic inside the client, so
it's not an apples-to-apples comparison. But I think you get my point, and
it would be possible to implement a rich Ceph (filesystem/hadoop) client
like this as well.

In summary, if Hadoop on Ceph is a major priority, I think it would be best
to "borrow" the good ideas from QFS and implement them in the Hadoop Ceph
filesystem and Ceph itself (letting a smart client get chunks directly and
write chunks directly). I don't doubt that it's a lot of work, but the
results might be worth it in terms of the performance you get for the cost.
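
To make the write-amplification argument concrete, here is a
back-of-the-envelope sketch. It assumes a Reed-Solomon style 6 data + 3
parity layout (which is what I believe QFS defaults to); the numbers are
illustrative arithmetic, not benchmarks from either system.

// Rough comparison of bytes pushed to disk/network for one logical write:
// 3x replication versus a hypothetical 6+3 erasure-coded layout.
public class WriteAmplification {
    public static void main(String[] args) {
        long fileBytes = 1L << 30; // one 1 GiB logical write

        // Triple replication: every byte is stored (and shipped) three times.
        long replicatedBytes = fileBytes * 3;

        // 6 data + 3 parity chunks per stripe: 9/6 = 1.5x amplification.
        long erasureCodedBytes = fileBytes * (6 + 3) / 6;

        System.out.printf("3x replication: %d bytes (%.1fx)%n",
                replicatedBytes, (double) replicatedBytes / fileBytes);
        System.out.printf("RS(6,3) coding: %d bytes (%.1fx)%n",
                erasureCodedBytes, (double) erasureCodedBytes / fileBytes);
    }
}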

Some food for thought. I don't have a horse in this particular game, but I
am interested in DFSs and VLDBs, so I'm constantly reading up on research
and what folks are building.

Cheers,
- Milosz

P.S.: Forgot to Reply-to-all, haven't had my coffee yet.

On Thu, Sep 4, 2014 at 3:16 AM, Johnu George (johnugeo)
<johnugeo@cisco.com> wrote:
> Hi All,
>          I was reading more on Hadoop over Ceph. I heard from Noah that
> tuning of Hadoop on Ceph is going on. I am just curious to know if there
> is any reason to keep the default object size as 64MB. Is it because of
> the fact that it becomes difficult to encode getBlockLocations if blocks
> are divided into objects, and to choose the best location for tasks if no
> node in the system has a complete block?
>
> I am wondering if someone has any benchmark results for various object
> sizes. If you have them, it will be helpful if you share them.
>
> I see that Ceph doesn't place objects considering the client location or
> the distance between the client and the OSDs where data is stored
> (data locality), while data locality is the key idea behind HDFS block
> placement and retrieval for maximum throughput. So, how does Ceph plan to
> perform better than HDFS, given that Ceph relies on random placement
> using hashing, unlike HDFS block placement? Can someone also point out
> some performance results comparing Ceph random placement vs. HDFS
> locality-aware placement?
>
> Also, Sage wrote about a way to specify a node to be primary for
> Hadoop-like environments.
> (http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/1548) Is
> this through primary affinity configuration?
>
> Thanks,
> Johnu

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com
* Re: ceph data locality
From: Gregory Farnum @ 2014-09-08 19:51 UTC
To: Milosz Tanski; +Cc: ceph-devel

On Thu, Sep 4, 2014 at 12:16 AM, Johnu George (johnugeo)
<johnugeo@cisco.com> wrote:
> Hi All,
>          I was reading more on Hadoop over Ceph. I heard from Noah that
> tuning of Hadoop on Ceph is going on. I am just curious to know if there
> is any reason to keep the default object size as 64MB. Is it because of
> the fact that it becomes difficult to encode getBlockLocations if blocks
> are divided into objects, and to choose the best location for tasks if no
> node in the system has a complete block?

We used 64MB because it's the HDFS default, and in some *very* stupid tests
it seemed to be about the fastest. You could certainly make it smaller if
you wanted, and it would probably work to multiply it by 2-4x, but then
you're using bigger objects than most people do.
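
If you want to experiment with that, the knob is exposed through the normal
Hadoop configuration path. Here's a rough sketch; the ceph.object.size and
ceph.conf.file property names are from memory and may not match the current
bindings exactly, so treat them as assumptions and check the hadoop-cephfs
documentation before relying on this.

// Hypothetical tuning sketch for Hadoop on CephFS. Property names other
// than fs.defaultFS are assumptions about the CephFS Hadoop bindings.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CephObjectSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "ceph://mon-host:6789/");    // assumed monitor address
        conf.set("ceph.conf.file", "/etc/ceph/ceph.conf");    // assumed ceph.conf path
        conf.setLong("ceph.object.size", 128L * 1024 * 1024); // 2x the 64MB default

        // Report the block size the filesystem advertises to MapReduce.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Block size: " + fs.getDefaultBlockSize(new Path("/")));
    }
}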

> I see that Ceph doesn't place objects considering the client location or
> the distance between the client and the OSDs where data is stored
> (data locality), while data locality is the key idea behind HDFS block
> placement and retrieval for maximum throughput. So, how does Ceph plan to
> perform better than HDFS, given that Ceph relies on random placement
> using hashing, unlike HDFS block placement? Can someone also point out
> some performance results comparing Ceph random placement vs. HDFS
> locality-aware placement?

I don't think we have any serious performance results; there hasn't been
enough focus on productizing it for that kind of work. Anecdotally, I've
seen people on social media claim that it's as fast or even many times
faster than HDFS (I suspect that if it's many times faster, they had a
misconfiguration somewhere in HDFS, though!).
In any case, Ceph has two plans for being faster than HDFS:
1) big users indicate that always writing locally is often a mistake, and
it tends to overfill certain nodes within your cluster. Plus, networks are
much faster now, so it doesn't cost as much to write over them, and Ceph
*does* export locations so that follow-up jobs can be scheduled
appropriately.

> Also, Sage wrote about a way to specify a node to be primary for
> Hadoop-like environments.
> (http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/1548) Is
> this through primary affinity configuration?

That mechanism ("preferred" PGs) is dead. Primary affinity is a completely
different thing.


On Thu, Sep 4, 2014 at 8:59 AM, Milosz Tanski <milosz@adfin.com> wrote:
> QFS, unlike Ceph, places the erasure coding logic inside the client, so
> it's not an apples-to-apples comparison. But I think you get my point,
> and it would be possible to implement a rich Ceph (filesystem/hadoop)
> client like this as well.
>
> In summary, if Hadoop on Ceph is a major priority, I think it would be
> best to "borrow" the good ideas from QFS and implement them in the Hadoop
> Ceph filesystem and Ceph itself (letting a smart client get chunks
> directly and write chunks directly). I don't doubt that it's a lot of
> work, but the results might be worth it in terms of the performance you
> get for the cost.

Unfortunately, implementing CephFS on top of RADOS' EC pools is going to be
a major project which we haven't done anything to scope out yet, so it's
going to be a while before that's really an option. But it is a "real"
filesystem, so we still have that going for us. ;)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
* Re: ceph data locality
From: Johnu George (johnugeo) @ 2014-09-08 22:53 UTC
To: Gregory Farnum, Milosz Tanski; +Cc: ceph-devel

Hi Greg,
       Thanks. Can you explain more on "Ceph *does* export locations so
the follow-up jobs can be scheduled appropriately"?

Thanks,
Johnu


On 9/8/14, 12:51 PM, "Gregory Farnum" <greg@inktank.com> wrote:

>On Thu, Sep 4, 2014 at 12:16 AM, Johnu George (johnugeo)
><johnugeo@cisco.com> wrote:
>> Hi All,
>>          I was reading more on Hadoop over Ceph. I heard from Noah that
>> tuning of Hadoop on Ceph is going on. I am just curious to know if there
>> is any reason to keep the default object size as 64MB. Is it because of
>> the fact that it becomes difficult to encode getBlockLocations if blocks
>> are divided into objects, and to choose the best location for tasks if
>> no node in the system has a complete block?
>
>We used 64MB because it's the HDFS default, and in some *very* stupid
>tests it seemed to be about the fastest. You could certainly make it
>smaller if you wanted, and it would probably work to multiply it by 2-4x,
>but then you're using bigger objects than most people do.
>
>> I see that Ceph doesn't place objects considering the client location or
>> the distance between the client and the OSDs where data is stored
>> (data locality), while data locality is the key idea behind HDFS block
>> placement and retrieval for maximum throughput. So, how does Ceph plan
>> to perform better than HDFS, given that Ceph relies on random placement
>> using hashing, unlike HDFS block placement? Can someone also point out
>> some performance results comparing Ceph random placement vs. HDFS
>> locality-aware placement?
>
>I don't think we have any serious performance results; there hasn't been
>enough focus on productizing it for that kind of work. Anecdotally, I've
>seen people on social media claim that it's as fast or even many times
>faster than HDFS (I suspect that if it's many times faster, they had a
>misconfiguration somewhere in HDFS, though!).
>In any case, Ceph has two plans for being faster than HDFS:
>1) big users indicate that always writing locally is often a mistake, and
>it tends to overfill certain nodes within your cluster. Plus, networks
>are much faster now, so it doesn't cost as much to write over them, and
>Ceph *does* export locations so that follow-up jobs can be scheduled
>appropriately.
>
>> Also, Sage wrote about a way to specify a node to be primary for
>> Hadoop-like environments.
>> (http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/1548) Is
>> this through primary affinity configuration?
>
>That mechanism ("preferred" PGs) is dead. Primary affinity is a completely
>different thing.
>
>
>On Thu, Sep 4, 2014 at 8:59 AM, Milosz Tanski <milosz@adfin.com> wrote:
>> QFS, unlike Ceph, places the erasure coding logic inside the client, so
>> it's not an apples-to-apples comparison. But I think you get my point,
>> and it would be possible to implement a rich Ceph (filesystem/hadoop)
>> client like this as well.
>>
>> In summary, if Hadoop on Ceph is a major priority, I think it would be
>> best to "borrow" the good ideas from QFS and implement them in the
>> Hadoop Ceph filesystem and Ceph itself (letting a smart client get
>> chunks directly and write chunks directly). I don't doubt that it's a
>> lot of work, but the results might be worth it in terms of the
>> performance you get for the cost.
>
>Unfortunately, implementing CephFS on top of RADOS' EC pools is going to
>be a major project which we haven't done anything to scope out yet, so
>it's going to be a while before that's really an option. But it is a
>"real" filesystem, so we still have that going for us. ;)
>-Greg
>Software Engineer #42 @ http://inktank.com | http://ceph.com
* Re: ceph data locality
From: Gregory Farnum @ 2014-09-08 23:11 UTC
To: Johnu George (johnugeo); +Cc: Milosz Tanski, ceph-devel

It implements the getBlockLocations() API (or whatever it is) in the Hadoop
FileSystem interface. The upshot of this is that the Hadoop scheduler can
do the exact same scheduling job on tasks with Ceph that it does with HDFS.
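
To make that concrete, here's a rough sketch of what the contract looks
like on the Hadoop side. This is illustrative only, not the actual
CephFileSystem code; the osdHostsFor() helper is a hypothetical stand-in
for asking libcephfs/CRUSH which hosts actually store a given byte range.

// Sketch: a Hadoop FileSystem that reports block locations so the
// MapReduce scheduler can place tasks near the data, as it does for HDFS.
import java.io.IOException;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

public abstract class LocalityAwareFileSystem extends FileSystem {

    /** Hypothetical placement lookup: hosts storing the given byte range. */
    protected abstract String[] osdHostsFor(FileStatus file, long offset, long length)
            throws IOException;

    @Override
    public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)
            throws IOException {
        if (file == null || len <= 0) {
            return new BlockLocation[0];
        }
        long blockSize = file.getBlockSize(); // e.g. the 64MB object size discussed above
        if (blockSize <= 0) {
            blockSize = 64L * 1024 * 1024;    // fall back to the 64MB default
        }
        long firstBlock = start / blockSize;
        long lastBlock = (start + len - 1) / blockSize;

        BlockLocation[] locations = new BlockLocation[(int) (lastBlock - firstBlock + 1)];
        for (long b = firstBlock; b <= lastBlock; b++) {
            long offset = b * blockSize;
            String[] hosts = osdHostsFor(file, offset, blockSize);
            // Both "names" and "hosts" are plain host strings in this sketch.
            locations[(int) (b - firstBlock)] =
                    new BlockLocation(hosts, hosts, offset, blockSize);
        }
        return locations;
    }
}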

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Sep 8, 2014 at 3:53 PM, Johnu George (johnugeo)
<johnugeo@cisco.com> wrote:
> Hi Greg,
>        Thanks. Can you explain more on "Ceph *does* export locations so
> the follow-up jobs can be scheduled appropriately"?
>
> Thanks,
> Johnu
>
>
> On 9/8/14, 12:51 PM, "Gregory Farnum" <greg@inktank.com> wrote:
>
>>On Thu, Sep 4, 2014 at 12:16 AM, Johnu George (johnugeo)
>><johnugeo@cisco.com> wrote:
>>> Hi All,
>>>          I was reading more on Hadoop over Ceph. I heard from Noah that
>>> tuning of Hadoop on Ceph is going on. I am just curious to know if
>>> there is any reason to keep the default object size as 64MB. Is it
>>> because of the fact that it becomes difficult to encode
>>> getBlockLocations if blocks are divided into objects, and to choose the
>>> best location for tasks if no node in the system has a complete block?
>>
>>We used 64MB because it's the HDFS default, and in some *very* stupid
>>tests it seemed to be about the fastest. You could certainly make it
>>smaller if you wanted, and it would probably work to multiply it by 2-4x,
>>but then you're using bigger objects than most people do.
>>
>>> I see that Ceph doesn't place objects considering the client location
>>> or the distance between the client and the OSDs where data is stored
>>> (data locality), while data locality is the key idea behind HDFS block
>>> placement and retrieval for maximum throughput. So, how does Ceph plan
>>> to perform better than HDFS, given that Ceph relies on random placement
>>> using hashing, unlike HDFS block placement? Can someone also point out
>>> some performance results comparing Ceph random placement vs. HDFS
>>> locality-aware placement?
>>
>>I don't think we have any serious performance results; there hasn't been
>>enough focus on productizing it for that kind of work. Anecdotally, I've
>>seen people on social media claim that it's as fast or even many times
>>faster than HDFS (I suspect that if it's many times faster, they had a
>>misconfiguration somewhere in HDFS, though!).
>>In any case, Ceph has two plans for being faster than HDFS:
>>1) big users indicate that always writing locally is often a mistake, and
>>it tends to overfill certain nodes within your cluster. Plus, networks
>>are much faster now, so it doesn't cost as much to write over them, and
>>Ceph *does* export locations so that follow-up jobs can be scheduled
>>appropriately.
>>
>>> Also, Sage wrote about a way to specify a node to be primary for
>>> Hadoop-like environments.
>>> (http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/1548) Is
>>> this through primary affinity configuration?
>>
>>That mechanism ("preferred" PGs) is dead. Primary affinity is a
>>completely different thing.
>>
>>
>>On Thu, Sep 4, 2014 at 8:59 AM, Milosz Tanski <milosz@adfin.com> wrote:
>>> QFS, unlike Ceph, places the erasure coding logic inside the client, so
>>> it's not an apples-to-apples comparison. But I think you get my point,
>>> and it would be possible to implement a rich Ceph (filesystem/hadoop)
>>> client like this as well.
>>>
>>> In summary, if Hadoop on Ceph is a major priority, I think it would be
>>> best to "borrow" the good ideas from QFS and implement them in the
>>> Hadoop Ceph filesystem and Ceph itself (letting a smart client get
>>> chunks directly and write chunks directly). I don't doubt that it's a
>>> lot of work, but the results might be worth it in terms of the
>>> performance you get for the cost.
>>
>>Unfortunately, implementing CephFS on top of RADOS' EC pools is going to
>>be a major project which we haven't done anything to scope out yet, so
>>it's going to be a while before that's really an option. But it is a
>>"real" filesystem, so we still have that going for us. ;)
>>-Greg
>>Software Engineer #42 @ http://inktank.com | http://ceph.com