* Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
@ 2012-12-13 14:54 Lachfeld, Jutta
2012-12-13 17:27 ` Sage Weil
2012-12-14 14:53 ` Mark Nelson
0 siblings, 2 replies; 12+ messages in thread
From: Lachfeld, Jutta @ 2012-12-13 14:54 UTC (permalink / raw)
To: ceph-devel
Hi all,
I am currently doing some comparisons between CEPH FS and HDFS as a file system for Hadoop using Hadoop's integrated benchmark TeraSort. This benchmark first generates the specified amount of data in the file system used by Hadoop, e.g. 1TB of data, and then sorts the data via the MapReduce framework of Hadoop, sending the sorted output again to the file system used by Hadoop. The benchmark measures the elapsed time of a sort run.
I am wondering about my best result achieved with CEPH FS in comparison to the ones achieved with HDFS. With CEPH, the runtime of the benchmark is somewhat longer, the factor is about 1.2 when comparing with an HDFS run using the default HDFS block size of 64MB. When comparing with an HDFS run using an HDFS block size of 512MB the factor is even 1.5.
Could you please take a look at the configuration? Perhaps some key factor, e.g. the CEPH version, will catch your eye.
OS: SLES 11 SP2
CEPH:
OSDs are distributed over several machines.
There is 1 MON and 1 MDS process on yet another machine.
Replication of the data pool is set to 1.
Underlying file systems for data are btrfs.
Mount options are only "rw,noatime".
For each CEPH OSD, we use a RAM disk of 256MB for the journal.
Package ceph has version 0.48-13.1, package ceph-fuse has version 0.48-13.1.
HDFS:
HDFS is distributed over the same machines.
HDFS name node on yet another machine.
Replication level is set to 1.
HDFS block size is set to 64MB or even 512MB.
Underlying file systems for data are btrfs.
Mount options are only "rw,noatime".
Hadoop version is 1.0.3.
Applied the CEPH patch for Hadoop that was generated for 0.20.205.0.
The same maximum number of Hadoop map tasks has been used for HDFS and for CEPH FS.
The same disk partitions are either formatted for HDFS or for CEPH usage.
CPU usage in both cases is almost 100 percent on all data related nodes.
There is enough memory on all nodes for the joint load of ceph-osd and Hadoop java processes.
Best regards,
Jutta Lachfeld.
--
jutta.lachfeld@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. +49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
2012-12-13 14:54 Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue Lachfeld, Jutta
@ 2012-12-13 17:27 ` Sage Weil
2012-12-13 17:41 ` Gregory Farnum
2012-12-14 14:53 ` Mark Nelson
1 sibling, 1 reply; 12+ messages in thread
From: Sage Weil @ 2012-12-13 17:27 UTC (permalink / raw)
To: Lachfeld, Jutta; +Cc: ceph-devel
Hi Jutta,
On Thu, 13 Dec 2012, Lachfeld, Jutta wrote:
> Hi all,
>
> I am currently doing some comparisons between CEPH FS and HDFS as a file system for Hadoop using Hadoop's integrated benchmark TeraSort. This benchmark first generates the specified amount of data in the file system used by Hadoop, e.g. 1TB of data, and then sorts the data via the MapReduce framework of Hadoop, sending the sorted output again to the file system used by Hadoop. The benchmark measures the elapsed time of a sort run.
>
> I am wondering about my best result achieved with CEPH FS in comparison to the ones achieved with HDFS. With CEPH, the runtime of the benchmark is somewhat longer, the factor is about 1.2 when comparing with an HDFS run using the default HDFS block size of 64MB. When comparing with an HDFS run using an HDFS block size of 512MB the factor is even 1.5.
>
> Could you please take a look at the configuration, perhaps some key factor already catches your eye, e.g. CEPH version.
>
> OS: SLES 11 SP2
>
> CEPH:
> OSDs are distributed over several machines.
> There is 1 MON and 1 MDS process on yet another machine.
>
> Replication of the data pool is set to 1.
> Underlying file systems for data are btrfs.
> Mount options are only "rw,noatime".
> For each CEPH OSD, we use a RAM disk of 256MB for the journal.
> Package ceph has version 0.48-13.1, package ceph-fuse has version 0.48-13.1.
>
> HDFS:
> HDFS is distributed over the same machines.
> HDFS name node on yet another machine.
>
> Replication level is set to 1.
> HDFS block size is set to 64MB or even 512MB.
I suspect that this is part of it, especially since the differential
increases with larger blocks: the default Ceph block size is only
4MB. I'm not sure whether setting the block size is properly wired up; it
depends on which version of the Hadoop bindings you are using. Noah would
know more.
You can adjust the default block/object size for the fs with the cephfs
utility from a kernel mount. There isn't yet a convenient way to do this
via ceph-fuse.
sage
> Underlying file systems for data are btrfs.
> Mount options are only "rw,noatime".
>
> Hadoop version is 1.0.3.
> Applied the CEPH patch for Hadoop that was generated for 0.20.205.0.
> The same maximum number of Hadoop map tasks has been used for HDFS and for CEPH FS.
>
> The same disk partitions are either formatted for HDFS or for CEPH usage.
>
> CPU usage in both cases is almost 100 percent on all data related nodes.
> There is enough memory on all nodes for the joint load of ceph-osd and Hadoop java processes.
>
> Best regards,
>
> Jutta Lachfeld.
>
> --
> jutta.lachfeld@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. +49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
* Re: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
2012-12-13 17:27 ` Sage Weil
@ 2012-12-13 17:41 ` Gregory Farnum
2012-12-13 20:23 ` Cameron Bahar
0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2012-12-13 17:41 UTC (permalink / raw)
To: Sage Weil, Lachfeld, Jutta; +Cc: ceph-devel, Noah Watkins, Joe Buck
On Thu, Dec 13, 2012 at 9:27 AM, Sage Weil <sage@inktank.com> wrote:
> Hi Jutta,
>
> On Thu, 13 Dec 2012, Lachfeld, Jutta wrote:
>> Hi all,
>>
>> I am currently doing some comparisons between CEPH FS and HDFS as a file system for Hadoop using Hadoop's integrated benchmark TeraSort. This benchmark first generates the specified amount of data in the file system used by Hadoop, e.g. 1TB of data, and then sorts the data via the MapReduce framework of Hadoop, sending the sorted output again to the file system used by Hadoop. The benchmark measures the elapsed time of a sort run.
>>
>> I am wondering about my best result achieved with CEPH FS in comparison to the ones achieved with HDFS. With CEPH, the runtime of the benchmark is somewhat longer, the factor is about 1.2 when comparing with an HDFS run using the default HDFS block size of 64MB. When comparing with an HDFS run using an HDFS block size of 512MB the factor is even 1.5.
>>
>> Could you please take a look at the configuration, perhaps some key factor already catches your eye, e.g. CEPH version.
>>
>> OS: SLES 11 SP2
>>
>> CEPH:
>> OSDs are distributed over several machines.
>> There is 1 MON and 1 MDS process on yet another machine.
>>
>> Replication of the data pool is set to 1.
>> Underlying file systems for data are btrfs.
>> Mount options are only "rw,noatime".
>> For each CEPH OSD, we use a RAM disk of 256MB for the journal.
>> Package ceph has version 0.48-13.1, package ceph-fuse has version 0.48-13.1.
>>
>> HDFS:
>> HDFS is distributed over the same machines.
>> HDFS name node on yet another machine.
>>
>> Replication level is set to 1.
>> HDFS block size is set to 64MB or even 512MB.
>
> I suspect that this is part of it. The default ceph block size is only
> 4MB. Especially since the differential increases with larger blocks.
> I'm not sure if the setting of block sizes is properly wired up; it
> depends on what version of the hadoop bindings you are using. Noah would
> know more.
>
> You can adjust the default block/object size for the fs with the cephfs
> utility from a kernel mount. There isn't yet a convenient way to do this
> via ceph-fuse.
If Jutta is using the *old* ones I last worked on in 2009, then this
is already wired up for 64MB blocks. A "ceph pg dump" would let us get
a rough estimate of the block sizes in use.
"ceph -s" would also be useful to check that everything is set up reasonably.
Other than that, it would be fair to describe these bindings as
little-used: minimal performance tests indicated rough parity back in
2009, but those runs were only a couple of minutes long and on very small
clusters, so a 1.2x factor might be normal. Noah and Joe are working on
new bindings now, which will be tuned and accompanied by backend changes
where necessary. They might also have a better eye for typical results.
-Greg
* Re: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
2012-12-13 17:41 ` Gregory Farnum
@ 2012-12-13 20:23 ` Cameron Bahar
2012-12-13 20:27 ` Gregory Farnum
0 siblings, 1 reply; 12+ messages in thread
From: Cameron Bahar @ 2012-12-13 20:23 UTC (permalink / raw)
To: Gregory Farnum
Cc: Sage Weil, Lachfeld, Jutta, ceph-devel, Noah Watkins, Joe Buck
Is the chunk size tunable in a Ceph cluster? I don't mean dynamically; is it even statically configurable when a cluster is first installed?
Thanks,
Cameron
Sent from my iPhone
On Dec 13, 2012, at 9:41 AM, Gregory Farnum <greg@inktank.com> wrote:
> On Thu, Dec 13, 2012 at 9:27 AM, Sage Weil <sage@inktank.com> wrote:
>> Hi Jutta,
>>
>> On Thu, 13 Dec 2012, Lachfeld, Jutta wrote:
>>> Hi all,
>>>
>>> I am currently doing some comparisons between CEPH FS and HDFS as a file system for Hadoop using Hadoop's integrated benchmark TeraSort. This benchmark first generates the specified amount of data in the file system used by Hadoop, e.g. 1TB of data, and then sorts the data via the MapReduce framework of Hadoop, sending the sorted output again to the file system used by Hadoop. The benchmark measures the elapsed time of a sort run.
>>>
>>> I am wondering about my best result achieved with CEPH FS in comparison to the ones achieved with HDFS. With CEPH, the runtime of the benchmark is somewhat longer, the factor is about 1.2 when comparing with an HDFS run using the default HDFS block size of 64MB. When comparing with an HDFS run using an HDFS block size of 512MB the factor is even 1.5.
>>>
>>> Could you please take a look at the configuration, perhaps some key factor already catches your eye, e.g. CEPH version.
>>>
>>> OS: SLES 11 SP2
>>>
>>> CEPH:
>>> OSDs are distributed over several machines.
>>> There is 1 MON and 1 MDS process on yet another machine.
>>>
>>> Replication of the data pool is set to 1.
>>> Underlying file systems for data are btrfs.
>>> Mount options are only "rw,noatime".
>>> For each CEPH OSD, we use a RAM disk of 256MB for the journal.
>>> Package ceph has version 0.48-13.1, package ceph-fuse has version 0.48-13.1.
>>>
>>> HDFS:
>>> HDFS is distributed over the same machines.
>>> HDFS name node on yet another machine.
>>>
>>> Replication level is set to 1.
>>> HDFS block size is set to 64MB or even 512MB.
>>
>> I suspect that this is part of it. The default ceph block size is only
>> 4MB. Especially since the differential increases with larger blocks.
>> I'm not sure if the setting of block sizes is properly wired up; it
>> depends on what version of the hadoop bindings you are using. Noah would
>> know more.
>>
>> You can adjust the default block/object size for the fs with the cephfs
>> utility from a kernel mount. There isn't yet a convenient way to do this
>> via ceph-fuse.
>
> If Jutta is using the *old* ones I last worked on in 2009, then this
> is already wired up for 64MB blocks. A "ceph pg dump" would let us get
> a rough estimate of the block sizes in use.
>
> "ceph -s" would also be useful to check that everything is set up reasonably.
>
> Other than that, it would be fair to describe these bindings as
> little-used — minimal performance tests indicated rough parity back in
> 2009, but those were only a couple minutes long and on very small
> clusters, so 1.2x might be normal. Noah and Joe are working on new
> bindings now, and those will be tuned and accompany some backend
> changes if necessary. They might also have a better eye for typical
> results.
> -Greg
* Re: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
2012-12-13 20:23 ` Cameron Bahar
@ 2012-12-13 20:27 ` Gregory Farnum
2012-12-13 20:33 ` Noah Watkins
0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2012-12-13 20:27 UTC (permalink / raw)
To: Cameron Bahar
Cc: Sage Weil, Lachfeld, Jutta, ceph-devel, Noah Watkins, Joe Buck
On Thu, Dec 13, 2012 at 12:23 PM, Cameron Bahar <cbahar@gmail.com> wrote:
> Is the chunk size tunable in a Ceph cluster? I don't mean dynamically; is it even statically configurable when a cluster is first installed?
Yeah. You can set the chunk size on a per-file basis; you just can't
change it once the file has any data written to it.
In the context of Hadoop, the question is just whether the bindings are
configured correctly to do so automatically.
-Greg
* Re: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
2012-12-13 20:27 ` Gregory Farnum
@ 2012-12-13 20:33 ` Noah Watkins
2012-12-14 14:09 ` Lachfeld, Jutta
2013-01-09 15:11 ` Lachfeld, Jutta
0 siblings, 2 replies; 12+ messages in thread
From: Noah Watkins @ 2012-12-13 20:33 UTC (permalink / raw)
To: Gregory Farnum
Cc: Cameron Bahar, Sage Weil, Lachfeld, Jutta, ceph-devel,
Noah Watkins, Joe Buck
The bindings use the default Hadoop settings (e.g. 64 or 128 MB
chunks) when creating new files. The chunk size can also be specified
on a per-file basis using the same interface as Hadoop. Additionally,
while Hadoop doesn't provide an interface to configuration parameters
beyond chunk size, we will also let users fully configure any Ceph
striping strategy: http://ceph.com/docs/master/dev/file-striping/
-Noah
On Thu, Dec 13, 2012 at 12:27 PM, Gregory Farnum <greg@inktank.com> wrote:
> On Thu, Dec 13, 2012 at 12:23 PM, Cameron Bahar <cbahar@gmail.com> wrote:
>> Is the chunk size tunable in a Ceph cluster? I don't mean dynamically; is it even statically configurable when a cluster is first installed?
>
> Yeah. You can set chunk size on a per-file basis; you just can't
> change it once the file has any data written to it.
> In the context of Hadoop the question is just if the bindings are
> configured correctly to do so automatically.
> -Greg
* RE: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
2012-12-13 20:33 ` Noah Watkins
@ 2012-12-14 14:09 ` Lachfeld, Jutta
2013-01-05 0:17 ` Gregory Farnum
2013-01-09 15:11 ` Lachfeld, Jutta
1 sibling, 1 reply; 12+ messages in thread
From: Lachfeld, Jutta @ 2012-12-14 14:09 UTC (permalink / raw)
To: Noah Watkins, Gregory Farnum
Cc: Cameron Bahar, Sage Weil, ceph-devel, Noah Watkins, Joe Buck
Hi Noah, Gregory and Sage,
first of all, thanks for your quick replies. Here are some answers to your questions.
Gregory, I have got the output of "ceph -s" before and after this specific TeraSort run, and to me it looks ok; all 30 osds are "up":
health HEALTH_OK
monmap e1: 1 mons at {0=192.168.111.18:6789/0}, election epoch 0, quorum 0 0
osdmap e22: 30 osds: 30 up, 30 in
pgmap v13688: 5760 pgs: 5760 active+clean; 1862 GB data, 1868 GB used, 6142 GB / 8366 GB avail
mdsmap e4: 1/1/1 up {0=0=up:active}
health HEALTH_OK
monmap e1: 1 mons at {0=192.168.111.18:6789/0}, election epoch 0, quorum 0 0
osdmap e22: 30 osds: 30 up, 30 in
pgmap v19657: 5760 pgs: 5760 active+clean; 1862 GB data, 1868 GB used, 6142 GB / 8366 GB avail
mdsmap e4: 1/1/1 up {0=0=up:active}
I do not have the full output of "ceph pg dump" for that specific TeraSort run, but here is a typical output after automatically preparing CEPH for a benchmark run
(I removed almost all lines of the long pg_stat table, hoping that you do not need them):
dumped all in format plain
version 403
last_osdmap_epoch 22
last_pg_scan 1
full_ratio 0.95
nearfull_ratio 0.85
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp
2.314 0 0 0 0 0 0 0 active+clean 2012-12-14 08:31:24.524152 0'0 11'17 [23,7] [23,7] 0'0 2012-12-14 08:31:24.524096
0.316 0 0 0 0 0 0 0 active+clean 2012-12-14 08:25:12.780643 0'0 11'19 [23] [23] 0'0 2012-12-14 08:24:08.394930
1.317 0 0 0 0 0 0 0 active+clean 2012-12-14 08:27:56.400997 0'0 3'17 [11,17] [11,17] 0'0 2012-12-14 08:27:56.400953
[...]
pool 0 1 0 0 0 4 136 136
pool 1 21 0 0 0 23745 5518 5518
pool 2 0 0 0 0 0 0 0
sum 22 0 0 0 23749 5654 5654
osdstat kbused kbavail kb hb in hb out
0 2724 279808588 292420608 [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] []
1 2892 279808588 292420608 [3,4,5,6,8,9,11,12,13,14,15,16,17,18,20,22,24,25,26,27,28] []
2 2844 279808588 292420608 [3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,22,23,24,25,26,27,29] []
3 2716 279808588 292420608 [0,1,2,6,7,8,9,10,11,12,13,14,15,16,17,19,20,22,23,24,25,26,27,28,29] []
4 2556 279808588 292420608 [1,2,7,8,9,12,13,14,15,16,17,18,19,20,21,22,24,25,26,27,28,29] []
5 2856 279808584 292420608 [0,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,28,29] []
6 2840 279808584 292420608 [0,1,2,3,4,5,9,10,11,12,13,14,15,16,17,18,19,20,22,24,25,26,27,28,29] []
7 2604 279808588 292420608 [1,2,3,4,5,9,10,11,12,13,15,17,18,19,20,21,23,24,25,26,27,28,29] []
8 2564 279808588 292420608 [1,2,3,4,5,9,10,11,12,14,16,17,18,19,20,21,22,23,24,25,27,28,29] []
9 2804 279808588 292420608 [1,2,3,4,5,6,8,12,13,14,15,16,17,18,19,20,21,22,23,24,26,27,29] []
10 2556 279808588 292420608 [0,1,2,4,5,6,7,8,12,13,14,15,16,17,19,20,21,22,23,24,25,26,27,28] []
11 3084 279808588 292420608 [0,1,2,3,4,5,6,7,8,12,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] []
12 2572 279808588 292420608 [0,1,2,3,4,5,7,8,10,11,15,16,18,20,21,22,23,24,27,28,29] []
13 2912 279808560 292420608 [0,1,2,3,5,6,7,8,9,10,11,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] []
14 2992 279808584 292420608 [1,2,3,4,5,6,7,8,9,10,11,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] []
15 2652 279808588 292420608 [1,2,3,4,5,6,7,8,9,10,11,13,14,19,20,21,22,23,25,26,27,28,29] []
16 3028 279808588 292420608 [0,1,2,3,5,6,7,8,9,10,11,12,14,18,20,21,22,24,25,26,27,28,29] []
17 2772 279808588 292420608 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,18,19,21,22,23,24,25,26,27,28,29] []
18 2804 279808588 292420608 [0,1,2,3,5,6,8,9,10,11,12,14,15,16,17,21,22,23,24,25,26,27,29] []
19 2620 279808588 292420608 [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,21,22,23,25,26,27,28,29] []
20 2956 279808588 292420608 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,21,22,23,24,25,27,29] []
21 2876 279808588 292420608 [0,1,2,3,4,5,6,8,9,10,12,13,15,16,17,18,19,20,24,25,26,27,29] []
22 3044 279808588 292420608 [1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,24,25,26,27,28,29] []
23 2752 279808584 292420608 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,24,25,27,28,29] []
24 2948 279808588 292420608 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,27,28,29] []
25 3068 279808588 292420608 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,27,28,29] []
26 2540 279808588 292420608 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,27,28] []
27 3012 279808588 292420608 [0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,19,20,21,22,23,24,25,26] []
28 2800 279808560 292420608 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23,24,25,26] []
29 3052 279808588 292420608 [1,2,3,4,5,7,8,9,10,11,12,13,14,16,17,18,19,20,21,22,23,24,25,26] []
sum 84440 8394257568 8772618240
Does this information help? Is it really 64MB? That is what I had assumed.
As I am relatively new to CEPH, I need some time to digest and understand all your answers.
Regards,
Jutta.
jutta.lachfeld@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. +49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint
-----Original Message-----
From: Noah Watkins [mailto:jayhawk@cs.ucsc.edu]
Sent: Thursday, December 13, 2012 9:33 PM
To: Gregory Farnum
Cc: Cameron Bahar; Sage Weil; Lachfeld, Jutta; ceph-devel@vger.kernel.org; Noah Watkins; Joe Buck
Subject: Re: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
The bindings use the default Hadoop settings (e.g. 64 or 128 MB
chunks) when creating new files. The chunk size can also be specified on a per-file basis using the same interface as Hadoop. Additionally, while Hadoop doesn't provide an interface to configuration parameters beyond chunk size, we will also let users fully configure for any Ceph striping strategy. http://ceph.com/docs/master/dev/file-striping/
-Noah
On Thu, Dec 13, 2012 at 12:27 PM, Gregory Farnum <greg@inktank.com> wrote:
> On Thu, Dec 13, 2012 at 12:23 PM, Cameron Bahar <cbahar@gmail.com> wrote:
>> Is the chunk size tunable in a Ceph cluster? I don't mean dynamically; is it even statically configurable when a cluster is first installed?
>
> Yeah. You can set chunk size on a per-file basis; you just can't
> change it once the file has any data written to it.
> In the context of Hadoop the question is just if the bindings are
> configured correctly to do so automatically.
> -Greg
* Re: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
2012-12-13 14:54 Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue Lachfeld, Jutta
2012-12-13 17:27 ` Sage Weil
@ 2012-12-14 14:53 ` Mark Nelson
1 sibling, 0 replies; 12+ messages in thread
From: Mark Nelson @ 2012-12-14 14:53 UTC (permalink / raw)
To: Lachfeld, Jutta; +Cc: ceph-devel
On 12/13/2012 08:54 AM, Lachfeld, Jutta wrote:
> Hi all,
Hi! Sorry to send this a bit late, it looks like the reply I authored
yesterday from my phone got eaten by vger.
>
> I am currently doing some comparisons between CEPH FS and HDFS as a file system for Hadoop using Hadoop's integrated benchmark TeraSort. This benchmark first generates the specified amount of data in the file system used by Hadoop, e.g. 1TB of data, and then sorts the data via the MapReduce framework of Hadoop, sending the sorted output again to the file system used by Hadoop. The benchmark measures the elapsed time of a sort run.
>
> I am wondering about my best result achieved with CEPH FS in comparison to the ones achieved with HDFS. With CEPH, the runtime of the benchmark is somewhat longer, the factor is about 1.2 when comparing with an HDFS run using the default HDFS block size of 64MB. When comparing with an HDFS run using an HDFS block size of 512MB the factor is even 1.5.
>
> Could you please take a look at the configuration, perhaps some key factor already catches your eye, e.g. CEPH version.
>
> OS: SLES 11 SP2
Beyond what the others have said, this could be an issue. If I recall,
that's an older version of SLES and won't have syncfs support in glibc
(you need 2.14+). In newer versions of Ceph you can still use syncfs if
your kernel is new enough (2.6.38+), but in 0.48 you need support for it
in glibc too. This will have a performance impact, especially if you
have more than one OSD per server.
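As a quick sanity check, something like the following sketch (the thresholds are the ones mentioned above; treat the parsing helper as an assumption about version-string formats) can report whether a node's kernel and glibc are new enough for syncfs:

```python
import platform

def version_tuple(s):
    """Parse the leading numeric components of a version string,
    e.g. '2.6.32-220.el6' -> (2, 6, 32)."""
    parts = []
    for piece in s.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break  # stop at suffixes like '-220' or 'el6'
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def syncfs_ok(kernel, glibc):
    # syncfs needs kernel >= 2.6.38; ceph 0.48 additionally needs
    # glibc >= 2.14 to expose it.
    return version_tuple(kernel) >= (2, 6, 38) and version_tuple(glibc) >= (2, 14)

if __name__ == "__main__":
    kernel = platform.release()
    glibc = platform.libc_ver()[1] or "0"
    print("kernel %s, glibc %s, syncfs usable: %s"
          % (kernel, glibc, syncfs_ok(kernel, glibc)))
```

A SLES 11 SP2 node (3.0-series kernel, glibc 2.11.x) would pass the kernel test but fail the glibc one, matching Mark's diagnosis.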
>
> CEPH:
> OSDs are distributed over several machines.
> There is 1 MON and 1 MDS process on yet another machine.
>
> Replication of the data pool is set to 1.
> Underlying file systems for data are btrfs.
What kernel are you using? If it's older, this could also be an issue.
We've seen pretty bad btrfs fragmentation on older kernels, which seems
to be related to performance degrading over time.
> Mount options are only "rw,noatime".
> For each CEPH OSD, we use a RAM disk of 256MB for the journal.
> Package ceph has version 0.48-13.1, package ceph-fuse has version 0.48-13.1.
>
> HDFS:
> HDFS is distributed over the same machines.
> HDFS name node on yet another machine.
>
> Replication level is set to 1.
> HDFS block size is set to 64MB or even 512MB.
> Underlying file systems for data are btrfs.
> Mount options are only "rw,noatime".
The large block size may be an issue (at least with some of our default
tunable settings). You might want to try 4 or 16MB and see if it's any
better or worse.
>
> Hadoop version is 1.0.3.
> Applied the CEPH patch for Hadoop that was generated for 0.20.205.0.
> The same maximum number of Hadoop map tasks has been used for HDFS and for CEPH FS.
>
> The same disk partitions are either formatted for HDFS or for CEPH usage.
>
> CPU usage in both cases is almost 100 percent on all data related nodes.
If you run sysprof, you can probably get an idea of where the time is
being spent. perf sort of works but doesn't seem to report ceph-osd
symbols properly.
> There is enough memory on all nodes for the joint load of ceph-osd and Hadoop java processes.
>
> Best regards,
>
> Jutta Lachfeld.
>
> --
> jutta.lachfeld@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. +49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint
>
>
* Re: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
2012-12-14 14:09 ` Lachfeld, Jutta
@ 2013-01-05 0:17 ` Gregory Farnum
0 siblings, 0 replies; 12+ messages in thread
From: Gregory Farnum @ 2013-01-05 0:17 UTC (permalink / raw)
To: Lachfeld, Jutta
Cc: Cameron Bahar, Sage Weil, ceph-devel, Noah Watkins, Joe Buck,
Mark Nelson
Sorry for the delay; I've been out on vacation...
On Fri, Dec 14, 2012 at 6:09 AM, Lachfeld, Jutta
<jutta.lachfeld@ts.fujitsu.com> wrote:
> I do not have the full output of "ceph pg dump" for that specific TeraSort run, but here is a typical output after automatically preparing CEPH for a benchmark run
> (removed almost all lines in the long pg_stat table hoping that you do not need them):
Actually those were exactly what I was after; they include output on
the total PG size and the number of objects so we can check on average
size. :) If you'd like to do it yourself, look at some of the PGs
which correspond to your data pool (the PG ids are all of the form
0.123a, and the number before the decimal point is the pool ID; by
default you'll be looking for 0).
On Fri, Dec 14, 2012 at 6:53 AM, Mark Nelson <mark.nelson@inktank.com> wrote:
> The large block size may be an issue (at least with some of our default
> tunable settings). You might want to try 4 or 16MB and see if it's any
> better or worse.
Unless you've got a specific reason to think this is busted, I am
pretty confident it's not a problem. :)
Jutta, do you have any finer-grained numbers than total run time
(specifically, how much time is spent on data generation versus the
read-and-sort for each FS)? HDFS doesn't do any journaling like Ceph
does and the fact that the Ceph journal is in-memory might not be
helping much since it's so small compared to the amount of data being
written.
-Greg
* RE: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
2012-12-13 20:33 ` Noah Watkins
2012-12-14 14:09 ` Lachfeld, Jutta
@ 2013-01-09 15:11 ` Lachfeld, Jutta
2013-01-09 16:00 ` Noah Watkins
1 sibling, 1 reply; 12+ messages in thread
From: Lachfeld, Jutta @ 2013-01-09 15:11 UTC (permalink / raw)
To: Noah Watkins, Gregory Farnum
Cc: Cameron Bahar, Sage Weil, ceph-devel, Noah Watkins, Joe Buck
Hi Noah,
the current content of the web page http://ceph.com/docs/master/cephfs/hadoop shows a configuration parameter ceph.object.size.
Is it the CEPH equivalent to the "HDFS block size" parameter which I have been looking for?
Does the parameter ceph.object.size apply to version 0.56.1?
I would be interested in setting this parameter to values higher than 64MB, e.g. 256MB or 512MB, similar to the values I have used for HDFS to increase the performance of the TeraSort benchmark. Would these values be allowed, and would they make sense at all for the mechanisms used in CEPH?
Regards,
Jutta.
--
jutta.lachfeld@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. +49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint
> -----Original Message-----
> From: Noah Watkins [mailto:jayhawk@cs.ucsc.edu]
> Sent: Thursday, December 13, 2012 9:33 PM
> To: Gregory Farnum
> Cc: Cameron Bahar; Sage Weil; Lachfeld, Jutta; ceph-devel@vger.kernel.org; Noah
> Watkins; Joe Buck
> Subject: Re: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark
> performance comparison issue
>
> The bindings use the default Hadoop settings (e.g. 64 or 128 MB
> chunks) when creating new files. The chunk size can also be specified on a per-file basis
> using the same interface as Hadoop. Additionally, while Hadoop doesn't provide an
> interface to configuration parameters beyond chunk size, we will also let users fully
> configure for any Ceph striping strategy. http://ceph.com/docs/master/dev/file-striping/
>
> -Noah
>
> On Thu, Dec 13, 2012 at 12:27 PM, Gregory Farnum <greg@inktank.com> wrote:
> > On Thu, Dec 13, 2012 at 12:23 PM, Cameron Bahar <cbahar@gmail.com> wrote:
> >> Is the chunk size tunable in a Ceph cluster? I don't mean dynamically; is it even statically
> configurable when a cluster is first installed?
> >
> > Yeah. You can set chunk size on a per-file basis; you just can't
> > change it once the file has any data written to it.
> > In the context of Hadoop the question is just if the bindings are
> > configured correctly to do so automatically.
> > -Greg
* Re: Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
2013-01-09 15:11 ` Lachfeld, Jutta
@ 2013-01-09 16:00 ` Noah Watkins
2013-01-10 21:42 ` Gregory Farnum
0 siblings, 1 reply; 12+ messages in thread
From: Noah Watkins @ 2013-01-09 16:00 UTC (permalink / raw)
To: Lachfeld, Jutta
Cc: Noah Watkins, Gregory Farnum, Cameron Bahar, Sage Weil,
ceph-devel, Joe Buck
Hi Jutta,
On Wed, Jan 9, 2013 at 7:11 AM, Lachfeld, Jutta
<jutta.lachfeld@ts.fujitsu.com> wrote:
>
> the current content of the web page http://ceph.com/docs/master/cephfs/hadoop shows a configuration parameter ceph.object.size.
> Is it the CEPH equivalent to the "HDFS block size" parameter which I have been looking for?
Yes. By specifying ceph.object.size, Hadoop will use a default
Ceph file layout with stripe unit = object size and stripe count = 1.
This has effectively the same meaning as dfs.block.size in HDFS.
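[Editor's note: for reference, a minimal sketch of what such a setting might look like in a Hadoop core-site.xml fragment. The property name comes from the page Jutta cites; the 512MB value is just an example, not a recommendation.]

```xml
<!-- Sketch of a core-site.xml fragment; ceph.object.size is documented at
     http://ceph.com/docs/master/cephfs/hadoop. The value is an example. -->
<property>
  <name>ceph.object.size</name>
  <value>536870912</value> <!-- 512MB, analogous to dfs.block.size in HDFS -->
</property>
```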
> Does the parameter ceph.object.size apply to version 0.56.1?
The Ceph/Hadoop file system plugin is being developed here:
git://github.com/ceph/hadoop-common cephfs/branch-1.0
There is an old version of the Hadoop plugin in the Ceph tree which
will be removed shortly. Regarding the versions, development is taking
place in cephfs/branch-1.0 and in ceph.git master. We don't yet have a
system in place for dealing with compatibility across versions because
the code is in heavy development.
If you are running 0.56.1, then a recent version of cephfs/branch-1.0
should work with it, but may not for long, as development continues.
> I would be interested in setting this parameter to values higher than 64MB, e.g. 256MB or 512MB, similar to the values I have used for HDFS to increase the performance of the TeraSort benchmark. Would such values be allowed, and would they make sense at all given the mechanisms used in CEPH?
I can't think of any reason why a large size would cause concern, but
maybe someone else can chime in?
- Noah
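[Editor's note: as a back-of-envelope check (my own arithmetic, not from the thread), larger object sizes mainly reduce the number of RADOS objects a given dataset is spread over, which is one reason a large value is unlikely to cause trouble.]

```python
# Back-of-envelope: how many RADOS objects does 1TB of TeraSort data
# become at various object sizes? Plain arithmetic, not a Ceph API.
MB = 1 << 20
TB = 1 << 40

def objects_for(total_bytes, object_size):
    """Ceiling division: number of objects needed to hold total_bytes."""
    return -(-total_bytes // object_size)

for object_size in (64 * MB, 256 * MB, 512 * MB):
    print(f"{object_size // MB:4d}MB objects -> "
          f"{objects_for(TB, object_size):6d} objects for 1TB")
```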
* Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
2013-01-09 16:00 ` Noah Watkins
@ 2013-01-10 21:42 ` Gregory Farnum
0 siblings, 0 replies; 12+ messages in thread
From: Gregory Farnum @ 2013-01-10 21:42 UTC (permalink / raw)
To: Noah Watkins
Cc: Lachfeld, Jutta, Noah Watkins, Cameron Bahar, Sage Weil,
ceph-devel, Joe Buck
On Wed, Jan 9, 2013 at 8:00 AM, Noah Watkins <noah.watkins@inktank.com> wrote:
> Hi Jutta,
>
> On Wed, Jan 9, 2013 at 7:11 AM, Lachfeld, Jutta
> <jutta.lachfeld@ts.fujitsu.com> wrote:
>>
>> the current content of the web page http://ceph.com/docs/master/cephfs/hadoop shows a configuration parameter ceph.object.size.
>> Is it the CEPH equivalent to the "HDFS block size" parameter which I have been looking for?
>
> Yes. By specifying ceph.object.size, Hadoop will use a default
> Ceph file layout with stripe unit = object size and stripe count = 1.
> This has effectively the same meaning as dfs.block.size in HDFS.
>
>> Does the parameter ceph.object.size apply to version 0.56.1?
>
> The Ceph/Hadoop file system plugin is being developed here:
>
> git://github.com/ceph/hadoop-common cephfs/branch-1.0
>
> There is an old version of the Hadoop plugin in the Ceph tree which
> will be removed shortly. Regarding the versions, development is taking
> place in cephfs/branch-1.0 and in ceph.git master. We don't yet have a
> system in place for dealing with compatibility across versions because
> the code is in heavy development.
If you are using the old version in the Ceph tree, you should be
setting fs.ceph.blockSize rather than ceph.object.size. :)
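[Editor's note: the analogous fragment for the old in-tree plugin would look like this; the property name is taken from Greg's remark above, and the value is only an example.]

```xml
<!-- Old in-tree plugin: same idea as ceph.object.size, different name. -->
<property>
  <name>fs.ceph.blockSize</name>
  <value>536870912</value> <!-- 512MB, example value -->
</property>
```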
>> I would be interested in setting this parameter to values higher than 64MB, e.g. 256MB or 512MB, similar to the values I have used for HDFS to increase the performance of the TeraSort benchmark. Would such values be allowed, and would they make sense at all given the mechanisms used in CEPH?
>
> I can't think of any reason why a large size would cause concern, but
> maybe someone else can chime in?
Yep, totally fine.
-Greg
end of thread, other threads:[~2013-01-10 21:42 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-13 14:54 Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue Lachfeld, Jutta
2012-12-13 17:27 ` Sage Weil
2012-12-13 17:41 ` Gregory Farnum
2012-12-13 20:23 ` Cameron Bahar
2012-12-13 20:27 ` Gregory Farnum
2012-12-13 20:33 ` Noah Watkins
2012-12-14 14:09 ` Lachfeld, Jutta
2013-01-05 0:17 ` Gregory Farnum
2013-01-09 15:11 ` Lachfeld, Jutta
2013-01-09 16:00 ` Noah Watkins
2013-01-10 21:42 ` Gregory Farnum
2012-12-14 14:53 ` Mark Nelson