* CephFS Slow writes with 1MB files
@ 2015-03-27 16:40 Barclay Jameson
  2015-03-27 16:47 ` Barclay Jameson
  0 siblings, 1 reply; 15+ messages in thread
From: Barclay Jameson @ 2015-03-27 16:40 UTC (permalink / raw)
  To: ceph-users, ceph-devel

I did a Ceph cluster install 2 weeks ago and was getting great
performance (~= PanFS): I could write 100,000 1MB files in 61
minutes (PanFS took 59 minutes). I thought I could increase the
performance by adding a better MDS server, so I redid the entire build.

Now it takes 4 times as long to write the same data as it did before.
The only thing that changed was the MDS server. (I even tried moving
the MDS back on the old slower node and the performance was the same.)

The first install was on CentOS 7. I tried going down to CentOS 6.6
and the results are the same.
I use the same scripts to install the OSDs (which I created because I
can never get ceph-deploy to behave correctly), although I did use
ceph-deploy for the MDS, the MON, and the initial cluster creation.

I use btrfs on the OSDs as I can get 734 MB/s write and 1100 MB/s read
with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
-p cephfs_data 500 seq (XFS was 734 MB/s write but only 200 MB/s read).
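
For reference, that is the combined one-liner split into its two steps
(the pool name and the 500-second duration are just the values used above):

  # write benchmark; --no-cleanup leaves the objects behind for the read test
  rados bench -p cephfs_data 500 write --no-cleanup
  # sequential read benchmark against the objects written above
  rados bench -p cephfs_data 500 seq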

Can anybody think of a reason why I am now getting such a huge regression?

Hardware Setup:
[OSDs]
64 GB 2133 MHz
Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
40Gb Mellanox NIC

[MDS/MON new]
128 GB 2133 MHz
Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
40Gb Mellanox NIC

[MDS/MON old]
32 GB 800 MHz
Dual Proc E5472  @ 3.00GHz (8 Cores)
10Gb Intel NIC


* Re: CephFS Slow writes with 1MB files
  2015-03-27 16:40 CephFS Slow writes with 1MB files Barclay Jameson
@ 2015-03-27 16:47 ` Barclay Jameson
       [not found]   ` <CAMzumdbezcb-p1_MpcSL-h8tTR0RKATt93Om6NPejtn1G6yPeQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-03-27 20:04   ` [ceph-users] " Gregory Farnum
  0 siblings, 2 replies; 15+ messages in thread
From: Barclay Jameson @ 2015-03-27 16:47 UTC (permalink / raw)
  To: ceph-users, ceph-devel

Oops, I should have said that I am not just writing the data but copying it:

time cp Small1/* Small2/*

Thanks,

BJ

On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
<almightybeeij@gmail.com> wrote:
> I did a Ceph cluster install 2 weeks ago where I was getting great
> performance (~= PanFS) where I could write 100,000 1MB files in 61
> Mins (Took PanFS 59 Mins). I thought I could increase the performance
> by adding a better MDS server so I redid the entire build.
>
> Now it takes 4 times as long to write the same data as it did before.
> The only thing that changed was the MDS server. (I even tried moving
> the MDS back on the old slower node and the performance was the same.)
>
> The first install was on CentOS 7. I tried going down to CentOS 6.6
> and it's the same results.
> I use the same scripts to install the OSDs (which I created because I
> can never get ceph-deploy to behave correctly. Although, I did use
> ceph-deploy to create the MDS and MON and initial cluster creation.)
>
> I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>
> Could anybody think of a reason as to why I am now getting a huge regression.
>
> Hardware Setup:
> [OSDs]
> 64 GB 2133 MHz
> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
> 40Gb Mellanox NIC
>
> [MDS/MON new]
> 128 GB 2133 MHz
> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
> 40Gb Mellanox NIC
>
> [MDS/MON old]
> 32 GB 800 MHz
> Dual Proc E5472  @ 3.00GHz (8 Cores)
> 10Gb Intel NIC


* Re: CephFS Slow writes with 1MB files
       [not found]   ` <CAMzumdbezcb-p1_MpcSL-h8tTR0RKATt93Om6NPejtn1G6yPeQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-03-27 16:50     ` Mark Nelson
  0 siblings, 0 replies; 15+ messages in thread
From: Mark Nelson @ 2015-03-27 16:50 UTC (permalink / raw)
  To: Barclay Jameson, ceph-users-Qp0mS5GaXlQ,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA

Specifically related to BTRFS, if you have random IO to existing objects 
it will cause terrible fragmentation due to COW.  BTRFS is often faster 
than XFS initially but after it starts fragmenting can become much 
slower for sequential reads.  You may want to try XFS again and see if 
you can improve the read performance (increasing read ahead both on the 
cephfs client and on the underlying OSD block devices to something like 
4MB might help).
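
For example, one way that might be applied (a rough sketch only; the
device names, monitor address, and mount point are placeholders, and
rasize is the kernel client's readahead mount option, in bytes):

  # on each OSD node: raise readahead on the underlying data disks
  for q in /sys/block/sd*/queue; do echo 4096 > $q/read_ahead_kb; done
  # on the client: mount the kernel CephFS client with ~4MB readahead
  mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,rasize=4194304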

Mark

On 03/27/2015 11:47 AM, Barclay Jameson wrote:
> Opps I should have said that I am not just writing the data but copying it :
>
> time cp Small1/* Small2/*
>
> Thanks,
>
> BJ
>
> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> I did a Ceph cluster install 2 weeks ago where I was getting great
>> performance (~= PanFS) where I could write 100,000 1MB files in 61
>> Mins (Took PanFS 59 Mins). I thought I could increase the performance
>> by adding a better MDS server so I redid the entire build.
>>
>> Now it takes 4 times as long to write the same data as it did before.
>> The only thing that changed was the MDS server. (I even tried moving
>> the MDS back on the old slower node and the performance was the same.)
>>
>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>> and it's the same results.
>> I use the same scripts to install the OSDs (which I created because I
>> can never get ceph-deploy to behave correctly. Although, I did use
>> ceph-deploy to create the MDS and MON and initial cluster creation.)
>>
>> I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>>
>> Could anybody think of a reason as to why I am now getting a huge regression.
>>
>> Hardware Setup:
>> [OSDs]
>> 64 GB 2133 MHz
>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>> 40Gb Mellanox NIC
>>
>> [MDS/MON new]
>> 128 GB 2133 MHz
>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>> 40Gb Mellanox NIC
>>
>> [MDS/MON old]
>> 32 GB 800 MHz
>> Dual Proc E5472  @ 3.00GHz (8 Cores)
>> 10Gb Intel NIC


* Re: [ceph-users] CephFS Slow writes with 1MB files
  2015-03-27 16:47 ` Barclay Jameson
       [not found]   ` <CAMzumdbezcb-p1_MpcSL-h8tTR0RKATt93Om6NPejtn1G6yPeQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-03-27 20:04   ` Gregory Farnum
  2015-03-27 21:46     ` Barclay Jameson
  1 sibling, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2015-03-27 20:04 UTC (permalink / raw)
  To: Barclay Jameson; +Cc: ceph-users, ceph-devel

So this is exactly the same test you ran previously, but now it's on
faster hardware and the test is slower?

Do you have more data in the test cluster? One obvious possibility is
that previously you were working entirely in the MDS' cache, but now
you've got more dentries and so it's kicking data out to RADOS and
then reading it back in.

If you've got the memory (you appear to) you can pump up the "mds
cache size" config option quite dramatically from its default 100000.

Other things to check are that you've got an appropriately-sized
metadata pool, that you've not got clients competing against each
other inappropriately, etc.
-Greg

On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
<almightybeeij@gmail.com> wrote:
> Opps I should have said that I am not just writing the data but copying it :
>
> time cp Small1/* Small2/*
>
> Thanks,
>
> BJ
>
> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
> <almightybeeij@gmail.com> wrote:
>> I did a Ceph cluster install 2 weeks ago where I was getting great
>> performance (~= PanFS) where I could write 100,000 1MB files in 61
>> Mins (Took PanFS 59 Mins). I thought I could increase the performance
>> by adding a better MDS server so I redid the entire build.
>>
>> Now it takes 4 times as long to write the same data as it did before.
>> The only thing that changed was the MDS server. (I even tried moving
>> the MDS back on the old slower node and the performance was the same.)
>>
>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>> and it's the same results.
>> I use the same scripts to install the OSDs (which I created because I
>> can never get ceph-deploy to behave correctly. Although, I did use
>> ceph-deploy to create the MDS and MON and initial cluster creation.)
>>
>> I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>>
>> Could anybody think of a reason as to why I am now getting a huge regression.
>>
>> Hardware Setup:
>> [OSDs]
>> 64 GB 2133 MHz
>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>> 40Gb Mellanox NIC
>>
>> [MDS/MON new]
>> 128 GB 2133 MHz
>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>> 40Gb Mellanox NIC
>>
>> [MDS/MON old]
>> 32 GB 800 MHz
>> Dual Proc E5472  @ 3.00GHz (8 Cores)
>> 10Gb Intel NIC


* Re: [ceph-users] CephFS Slow writes with 1MB files
  2015-03-27 20:04   ` [ceph-users] " Gregory Farnum
@ 2015-03-27 21:46     ` Barclay Jameson
  2015-03-27 21:50       ` Gregory Farnum
  0 siblings, 1 reply; 15+ messages in thread
From: Barclay Jameson @ 2015-03-27 21:46 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-users, ceph-devel

Yes it's the exact same hardware except for the MDS server (although I
tried using the MDS on the old node).
I have not tried moving the MON back to the old node.

My default cache size is "mds cache size = 10000000"
The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
I created 2048 PGs for both the data and metadata pools:
ceph osd pool create cephfs_data 2048 2048
ceph osd pool create cephfs_metadata 2048 2048
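
A quick way to double-check the resulting pool sizing afterwards (just a sketch):

  ceph osd pool get cephfs_data pg_num
  ceph osd pool get cephfs_metadata pg_num
  ceph df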


To your point on clients competing against each other... how would I check that?

Thanks for the input!


On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg@gregs42.com> wrote:
> So this is exactly the same test you ran previously, but now it's on
> faster hardware and the test is slower?
>
> Do you have more data in the test cluster? One obvious possibility is
> that previously you were working entirely in the MDS' cache, but now
> you've got more dentries and so it's kicking data out to RADOS and
> then reading it back in.
>
> If you've got the memory (you appear to) you can pump up the "mds
> cache size" config option quite dramatically from it's default 100000.
>
> Other things to check are that you've got an appropriately-sized
> metadata pool, that you've not got clients competing against each
> other inappropriately, etc.
> -Greg
>
> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
> <almightybeeij@gmail.com> wrote:
>> Opps I should have said that I am not just writing the data but copying it :
>>
>> time cp Small1/* Small2/*
>>
>> Thanks,
>>
>> BJ
>>
>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>> <almightybeeij@gmail.com> wrote:
>>> I did a Ceph cluster install 2 weeks ago where I was getting great
>>> performance (~= PanFS) where I could write 100,000 1MB files in 61
>>> Mins (Took PanFS 59 Mins). I thought I could increase the performance
>>> by adding a better MDS server so I redid the entire build.
>>>
>>> Now it takes 4 times as long to write the same data as it did before.
>>> The only thing that changed was the MDS server. (I even tried moving
>>> the MDS back on the old slower node and the performance was the same.)
>>>
>>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>>> and it's the same results.
>>> I use the same scripts to install the OSDs (which I created because I
>>> can never get ceph-deploy to behave correctly. Although, I did use
>>> ceph-deploy to create the MDS and MON and initial cluster creation.)
>>>
>>> I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
>>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>>>
>>> Could anybody think of a reason as to why I am now getting a huge regression.
>>>
>>> Hardware Setup:
>>> [OSDs]
>>> 64 GB 2133 MHz
>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>> 40Gb Mellanox NIC
>>>
>>> [MDS/MON new]
>>> 128 GB 2133 MHz
>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>> 40Gb Mellanox NIC
>>>
>>> [MDS/MON old]
>>> 32 GB 800 MHz
>>> Dual Proc E5472  @ 3.00GHz (8 Cores)
>>> 10Gb Intel NIC


* Re: [ceph-users] CephFS Slow writes with 1MB files
  2015-03-27 21:46     ` Barclay Jameson
@ 2015-03-27 21:50       ` Gregory Farnum
       [not found]         ` <CAC6JEv-4D2kF7rrnGMncbE-_63+hdDzecqpB+HrMOKp1YGwv_w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2015-03-27 21:50 UTC (permalink / raw)
  To: Barclay Jameson; +Cc: ceph-users, ceph-devel

On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
<almightybeeij@gmail.com> wrote:
> Yes it's the exact same hardware except for the MDS server (although I
> tried using the MDS on the old node).
> I have not tried moving the MON back to the old node.
>
> My default cache size is "mds cache size = 10000000"
> The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
> I created 2048 for data and metadata:
> ceph osd pool create cephfs_data 2048 2048
> ceph osd pool create cephfs_metadata 2048 2048
>
>
> To your point on clients competing against each other... how would I check that?

Do you have multiple clients mounted? Are they both accessing files in
the directory(ies) you're testing? Were they accessing the same
pattern of files for the old cluster?

If you happen to be running a hammer rc or something pretty new you
can use the MDS admin socket to explore a bit what client sessions
there are and what they have permissions on and check; otherwise
you'll have to figure it out from the client side.
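
For example (a sketch; mds.a stands in for whatever the active MDS is called):

  # on the MDS host, via the admin socket
  ceph daemon mds.a session ls            # list client sessions
  ceph daemon mds.a dump_ops_in_flight    # requests currently being processed
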
-Greg

>
> Thanks for the input!
>
>
> On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg@gregs42.com> wrote:
>> So this is exactly the same test you ran previously, but now it's on
>> faster hardware and the test is slower?
>>
>> Do you have more data in the test cluster? One obvious possibility is
>> that previously you were working entirely in the MDS' cache, but now
>> you've got more dentries and so it's kicking data out to RADOS and
>> then reading it back in.
>>
>> If you've got the memory (you appear to) you can pump up the "mds
>> cache size" config option quite dramatically from it's default 100000.
>>
>> Other things to check are that you've got an appropriately-sized
>> metadata pool, that you've not got clients competing against each
>> other inappropriately, etc.
>> -Greg
>>
>> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
>> <almightybeeij@gmail.com> wrote:
>>> Opps I should have said that I am not just writing the data but copying it :
>>>
>>> time cp Small1/* Small2/*
>>>
>>> Thanks,
>>>
>>> BJ
>>>
>>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>>> <almightybeeij@gmail.com> wrote:
>>>> I did a Ceph cluster install 2 weeks ago where I was getting great
>>>> performance (~= PanFS) where I could write 100,000 1MB files in 61
>>>> Mins (Took PanFS 59 Mins). I thought I could increase the performance
>>>> by adding a better MDS server so I redid the entire build.
>>>>
>>>> Now it takes 4 times as long to write the same data as it did before.
>>>> The only thing that changed was the MDS server. (I even tried moving
>>>> the MDS back on the old slower node and the performance was the same.)
>>>>
>>>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>>>> and it's the same results.
>>>> I use the same scripts to install the OSDs (which I created because I
>>>> can never get ceph-deploy to behave correctly. Although, I did use
>>>> ceph-deploy to create the MDS and MON and initial cluster creation.)
>>>>
>>>> I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
>>>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>>>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>>>>
>>>> Could anybody think of a reason as to why I am now getting a huge regression.
>>>>
>>>> Hardware Setup:
>>>> [OSDs]
>>>> 64 GB 2133 MHz
>>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>>> 40Gb Mellanox NIC
>>>>
>>>> [MDS/MON new]
>>>> 128 GB 2133 MHz
>>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>>> 40Gb Mellanox NIC
>>>>
>>>> [MDS/MON old]
>>>> 32 GB 800 MHz
>>>> Dual Proc E5472  @ 3.00GHz (8 Cores)
>>>> 10Gb Intel NIC


* Re: CephFS Slow writes with 1MB files
       [not found]         ` <CAC6JEv-4D2kF7rrnGMncbE-_63+hdDzecqpB+HrMOKp1YGwv_w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-03-28 17:12           ` Barclay Jameson
       [not found]             ` <CAMzumda7VserizM5PEoT8mwYnTHeE4CFiUiLjnjqN2xXaeNVQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-03-31  3:59             ` [ceph-users] " Yan, Zheng
  0 siblings, 2 replies; 15+ messages in thread
From: Barclay Jameson @ 2015-03-28 17:12 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users

I redid my entire Ceph build going back to CentOS 7, hoping to get
the same performance I did last time.
The rados bench test was the best I have ever had, at 740 MB/s write
and 1300 MB/s read. This was even better than the first rados bench
test that had performance equal to PanFS. I find that this does not
translate to my CephFS. Even with the following tweaking it is still
at least twice as slow as PanFS and my first *Magical* build (which had
absolutely no tweaking):

OSD
 osd_op_treads 8
 /sys/block/sd*/queue/nr_requests 4096
 /sys/block/sd*/queue/read_ahead_kb 4096

Client
 rsize=16777216
 readdir_max_bytes=16777216
 readdir_max_entries=16777216
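
Roughly how those were applied (a sketch; device names, monitor address,
and mount point are placeholders, and the OSD op-threads setting would
normally go in the [osd] section of ceph.conf):

  # OSD nodes: queue depth and readahead on the data disks
  for q in /sys/block/sd*/queue; do
      echo 4096 > $q/nr_requests
      echo 4096 > $q/read_ahead_kb
  done
  # client: kernel CephFS mount with the options listed above
  mount -t ceph mon1:6789:/ /mnt/cephfs \
      -o name=admin,rsize=16777216,readdir_max_bytes=16777216,readdir_max_entries=16777216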

~160 mins to copy 100000 (1MB) files for CephFS vs ~50 mins for PanFS.
Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s.

The strange thing is that none of the resources are taxed.
CPU, RAM, network, and disks are not even close to being taxed on the
client, the mon/mds, or the osd nodes.
The PanFS client node was on a 10Gb network, the same as the CephFS
client, but you can see the huge difference in speed.

As per Greg's questions before:
There is only one client reading and writing (time cp Small1/*
Small2/.) but three clients have cephfs mounted, although they aren't
doing anything on the filesystem.

I have done another test where I stream data into a file as fast as
the processor can put it there:
(for (i = 0; i < 1000000001; i++) { fprintf(out_file, "I is : %d\n", i); })
and it is faster than PanFS. CephFS wrote 16GB in 105 seconds with the
above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
seconds for CephFS, although the first build did it in 130 seconds
without any tuning.
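
A simpler way to reproduce a comparable streaming-write number without
the C loop (the mount point is a placeholder, and dd is not exactly the
same workload as the fprintf test):

  # write 16GB sequentially through the CephFS mount, syncing before dd exits
  dd if=/dev/zero of=/mnt/cephfs/streamtest bs=1M count=16384 conv=fdatasync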

This leads me to believe the bottleneck is the mds. Does anybody have
any thoughts on this?
Are there any tuning parameters that I would need to speed up the mds?

On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum <greg-3KCAGdo1P2hBDgjK7y7TUQ@public.gmane.org> wrote:
> On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Yes it's the exact same hardware except for the MDS server (although I
>> tried using the MDS on the old node).
>> I have not tried moving the MON back to the old node.
>>
>> My default cache size is "mds cache size = 10000000"
>> The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
>> I created 2048 for data and metadata:
>> ceph osd pool create cephfs_data 2048 2048
>> ceph osd pool create cephfs_metadata 2048 2048
>>
>>
>> To your point on clients competing against each other... how would I check that?
>
> Do you have multiple clients mounted? Are they both accessing files in
> the directory(ies) you're testing? Were they accessing the same
> pattern of files for the old cluster?
>
> If you happen to be running a hammer rc or something pretty new you
> can use the MDS admin socket to explore a bit what client sessions
> there are and what they have permissions on and check; otherwise
> you'll have to figure it out from the client side.
> -Greg
>
>>
>> Thanks for the input!
>>
>>
>> On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg-3KCAGdo1P2hBDgjK7y7TUQ@public.gmane.org> wrote:
>>> So this is exactly the same test you ran previously, but now it's on
>>> faster hardware and the test is slower?
>>>
>>> Do you have more data in the test cluster? One obvious possibility is
>>> that previously you were working entirely in the MDS' cache, but now
>>> you've got more dentries and so it's kicking data out to RADOS and
>>> then reading it back in.
>>>
>>> If you've got the memory (you appear to) you can pump up the "mds
>>> cache size" config option quite dramatically from it's default 100000.
>>>
>>> Other things to check are that you've got an appropriately-sized
>>> metadata pool, that you've not got clients competing against each
>>> other inappropriately, etc.
>>> -Greg
>>>
>>> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
>>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> Opps I should have said that I am not just writing the data but copying it :
>>>>
>>>> time cp Small1/* Small2/*
>>>>
>>>> Thanks,
>>>>
>>>> BJ
>>>>
>>>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>>>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>> I did a Ceph cluster install 2 weeks ago where I was getting great
>>>>> performance (~= PanFS) where I could write 100,000 1MB files in 61
>>>>> Mins (Took PanFS 59 Mins). I thought I could increase the performance
>>>>> by adding a better MDS server so I redid the entire build.
>>>>>
>>>>> Now it takes 4 times as long to write the same data as it did before.
>>>>> The only thing that changed was the MDS server. (I even tried moving
>>>>> the MDS back on the old slower node and the performance was the same.)
>>>>>
>>>>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>>>>> and it's the same results.
>>>>> I use the same scripts to install the OSDs (which I created because I
>>>>> can never get ceph-deploy to behave correctly. Although, I did use
>>>>> ceph-deploy to create the MDS and MON and initial cluster creation.)
>>>>>
>>>>> I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
>>>>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>>>>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>>>>>
>>>>> Could anybody think of a reason as to why I am now getting a huge regression.
>>>>>
>>>>> Hardware Setup:
>>>>> [OSDs]
>>>>> 64 GB 2133 MHz
>>>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>>>> 40Gb Mellanox NIC
>>>>>
>>>>> [MDS/MON new]
>>>>> 128 GB 2133 MHz
>>>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>>>> 40Gb Mellanox NIC
>>>>>
>>>>> [MDS/MON old]
>>>>> 32 GB 800 MHz
>>>>> Dual Proc E5472  @ 3.00GHz (8 Cores)
>>>>> 10Gb Intel NIC


* Re: CephFS Slow writes with 1MB files
       [not found]             ` <CAMzumda7VserizM5PEoT8mwYnTHeE4CFiUiLjnjqN2xXaeNVQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-03-30 18:30               ` Gregory Farnum
       [not found]                 ` <CAC6JEv_mMbKn+=nHjQmwxAk8T7=MjpJ7bXtHTeWKda7ogJ1GPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2015-03-30 18:30 UTC (permalink / raw)
  To: Barclay Jameson; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users

On Sat, Mar 28, 2015 at 10:12 AM, Barclay Jameson
<almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> I redid my entire Ceph build going back to to CentOS 7 hoping to the
> get the same performance I did last time.
> The rados bench test was the best I have ever had with a time of 740
> MB wr and 1300 MB rd. This was even better than the first rados bench
> test that had performance equal to PanFS. I find that this does not
> translate to my CephFS. Even with the following tweaking it still at
> least twice as slow as PanFS and my first *Magical* build (that had
> absolutely no tweaking):
>
> OSD
>  osd_op_treads 8
>  /sys/block/sd*/queue/nr_requests 4096
>  /sys/block/sd*/queue/read_ahead_kb 4096
>
> Client
>  rsize=16777216
>  readdir_max_bytes=16777216
>  readdir_max_entries=16777216
>
> ~160 mins to copy 100000 (1MB) files for CephFS vs ~50 mins for PanFS.
> Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s.
>
> Strange thing is none of the resources are taxed.
> CPU, ram, network, disks, are not even close to being taxed on either
> the client,mon/mds, or the osd nodes.
> The PanFS client node was a 10Gb network the same as the CephFS client
> but you can see the huge difference in speed.
>
> As per Gregs questions before:
> There is only one client reading and writing (time cp Small1/*
> Small2/.) but three clients have cephfs mounted, although they aren't
> doing anything on the filesystem.
>
> I have done another test where I stream data info a file as fast as
> the processor can put it there.
> (for (i=0; i < 1000000001; i++){ fprintf (out_file, "I is : %d\n",i);}
> ) and it is faster than the PanFS. CephFS 16GB in 105 seconds with the
> above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
> seconds for CephFS although the first build did it in 130 seconds
> without any tuning.
>
> This leads me to believe the bottleneck is the mds. Does anybody have
> any thoughts on this?
> Are there any tuning parameters that I would need to speed up the mds?

This is pretty likely, but 10 creates/second is just impossibly slow.
The only other thing I can think of is that you might have enabled
fragmentation but aren't now, which might make an impact on a
directory with 100k entries.

Or else your hardware is just totally wonky, which we've seen in the
past but your server doesn't look quite large enough to be hitting any
of the nasty NUMA stuff...but that's something else to look at which I
can't help you with, although maybe somebody else can.

If you're interested in diving into it and depending on the Ceph
version you're running you can also examine the mds perfcounters
(http://ceph.com/docs/master/dev/perf_counters/) and the op history
(dump_ops_in_flight etc) and look for any operations which are
noticeably slow.
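
Concretely, something along these lines (mds.a is a placeholder for the
MDS daemon name):

  ceph daemon mds.a perf dump              # MDS perf counters
  ceph daemon mds.a dump_ops_in_flight     # outstanding requests and how long they've been waiting
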
-Greg

>
> On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum <greg-3KCAGdo1P2hBDgjK7y7TUQ@public.gmane.org> wrote:
>> On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> Yes it's the exact same hardware except for the MDS server (although I
>>> tried using the MDS on the old node).
>>> I have not tried moving the MON back to the old node.
>>>
>>> My default cache size is "mds cache size = 10000000"
>>> The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
>>> I created 2048 for data and metadata:
>>> ceph osd pool create cephfs_data 2048 2048
>>> ceph osd pool create cephfs_metadata 2048 2048
>>>
>>>
>>> To your point on clients competing against each other... how would I check that?
>>
>> Do you have multiple clients mounted? Are they both accessing files in
>> the directory(ies) you're testing? Were they accessing the same
>> pattern of files for the old cluster?
>>
>> If you happen to be running a hammer rc or something pretty new you
>> can use the MDS admin socket to explore a bit what client sessions
>> there are and what they have permissions on and check; otherwise
>> you'll have to figure it out from the client side.
>> -Greg
>>
>>>
>>> Thanks for the input!
>>>
>>>
>>> On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg-3KCAGdo1P2hBDgjK7y7TUQ@public.gmane.org> wrote:
>>>> So this is exactly the same test you ran previously, but now it's on
>>>> faster hardware and the test is slower?
>>>>
>>>> Do you have more data in the test cluster? One obvious possibility is
>>>> that previously you were working entirely in the MDS' cache, but now
>>>> you've got more dentries and so it's kicking data out to RADOS and
>>>> then reading it back in.
>>>>
>>>> If you've got the memory (you appear to) you can pump up the "mds
>>>> cache size" config option quite dramatically from it's default 100000.
>>>>
>>>> Other things to check are that you've got an appropriately-sized
>>>> metadata pool, that you've not got clients competing against each
>>>> other inappropriately, etc.
>>>> -Greg
>>>>
>>>> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
>>>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>> Opps I should have said that I am not just writing the data but copying it :
>>>>>
>>>>> time cp Small1/* Small2/*
>>>>>
>>>>> Thanks,
>>>>>
>>>>> BJ
>>>>>
>>>>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>>>>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>> I did a Ceph cluster install 2 weeks ago where I was getting great
>>>>>> performance (~= PanFS) where I could write 100,000 1MB files in 61
>>>>>> Mins (Took PanFS 59 Mins). I thought I could increase the performance
>>>>>> by adding a better MDS server so I redid the entire build.
>>>>>>
>>>>>> Now it takes 4 times as long to write the same data as it did before.
>>>>>> The only thing that changed was the MDS server. (I even tried moving
>>>>>> the MDS back on the old slower node and the performance was the same.)
>>>>>>
>>>>>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>>>>>> and it's the same results.
>>>>>> I use the same scripts to install the OSDs (which I created because I
>>>>>> can never get ceph-deploy to behave correctly. Although, I did use
>>>>>> ceph-deploy to create the MDS and MON and initial cluster creation.)
>>>>>>
>>>>>> I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
>>>>>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>>>>>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>>>>>>
>>>>>> Could anybody think of a reason as to why I am now getting a huge regression.
>>>>>>
>>>>>> Hardware Setup:
>>>>>> [OSDs]
>>>>>> 64 GB 2133 MHz
>>>>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>>>>> 40Gb Mellanox NIC
>>>>>>
>>>>>> [MDS/MON new]
>>>>>> 128 GB 2133 MHz
>>>>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>>>>> 40Gb Mellanox NIC
>>>>>>
>>>>>> [MDS/MON old]
>>>>>> 32 GB 800 MHz
>>>>>> Dual Proc E5472  @ 3.00GHz (8 Cores)
>>>>>> 10Gb Intel NIC


* Re: CephFS Slow writes with 1MB files
       [not found]                 ` <CAC6JEv_mMbKn+=nHjQmwxAk8T7=MjpJ7bXtHTeWKda7ogJ1GPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-03-30 18:46                   ` Barclay Jameson
  0 siblings, 0 replies; 15+ messages in thread
From: Barclay Jameson @ 2015-03-30 18:46 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users

I will take a look into the perf counters.
Thanks for the pointers!

On Mon, Mar 30, 2015 at 1:30 PM, Gregory Farnum <greg-3KCAGdo1P2hBDgjK7y7TUQ@public.gmane.org> wrote:
> On Sat, Mar 28, 2015 at 10:12 AM, Barclay Jameson
> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> I redid my entire Ceph build going back to to CentOS 7 hoping to the
>> get the same performance I did last time.
>> The rados bench test was the best I have ever had with a time of 740
>> MB wr and 1300 MB rd. This was even better than the first rados bench
>> test that had performance equal to PanFS. I find that this does not
>> translate to my CephFS. Even with the following tweaking it still at
>> least twice as slow as PanFS and my first *Magical* build (that had
>> absolutely no tweaking):
>>
>> OSD
>>  osd_op_treads 8
>>  /sys/block/sd*/queue/nr_requests 4096
>>  /sys/block/sd*/queue/read_ahead_kb 4096
>>
>> Client
>>  rsize=16777216
>>  readdir_max_bytes=16777216
>>  readdir_max_entries=16777216
>>
>> ~160 mins to copy 100000 (1MB) files for CephFS vs ~50 mins for PanFS.
>> Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s.
>>
>> Strange thing is none of the resources are taxed.
>> CPU, ram, network, disks, are not even close to being taxed on either
>> the client,mon/mds, or the osd nodes.
>> The PanFS client node was a 10Gb network the same as the CephFS client
>> but you can see the huge difference in speed.
>>
>> As per Gregs questions before:
>> There is only one client reading and writing (time cp Small1/*
>> Small2/.) but three clients have cephfs mounted, although they aren't
>> doing anything on the filesystem.
>>
>> I have done another test where I stream data info a file as fast as
>> the processor can put it there.
>> (for (i=0; i < 1000000001; i++){ fprintf (out_file, "I is : %d\n",i);}
>> ) and it is faster than the PanFS. CephFS 16GB in 105 seconds with the
>> above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
>> seconds for CephFS although the first build did it in 130 seconds
>> without any tuning.
>>
>> This leads me to believe the bottleneck is the mds. Does anybody have
>> any thoughts on this?
>> Are there any tuning parameters that I would need to speed up the mds?
>
> This is pretty likely, but 10 creates/second is just impossibly slow.
> The only other thing I can think of is that you might have enabled
> fragmentation but aren't now, which might make an impact on a
> directory with 100k entries.
>
> Or else your hardware is just totally wonky, which we've seen in the
> past but your server doesn't look quite large enough to be hitting any
> of the nasty NUMA stuff...but that's something else to look at which I
> can't help you with, although maybe somebody else can.
>
> If you're interested in diving into it and depending on the Ceph
> version you're running you can also examine the mds perfcounters
> (http://ceph.com/docs/master/dev/perf_counters/) and the op history
> (dump_ops_in_flight etc) and look for any operations which are
> noticeably slow.
> -Greg
>
>>
>> On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum <greg-3KCAGdo1P2hBDgjK7y7TUQ@public.gmane.org> wrote:
>>> On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
>>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> Yes it's the exact same hardware except for the MDS server (although I
>>>> tried using the MDS on the old node).
>>>> I have not tried moving the MON back to the old node.
>>>>
>>>> My default cache size is "mds cache size = 10000000"
>>>> The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
>>>> I created 2048 for data and metadata:
>>>> ceph osd pool create cephfs_data 2048 2048
>>>> ceph osd pool create cephfs_metadata 2048 2048
>>>>
>>>>
>>>> To your point on clients competing against each other... how would I check that?
>>>
>>> Do you have multiple clients mounted? Are they both accessing files in
>>> the directory(ies) you're testing? Were they accessing the same
>>> pattern of files for the old cluster?
>>>
>>> If you happen to be running a hammer rc or something pretty new you
>>> can use the MDS admin socket to explore a bit what client sessions
>>> there are and what they have permissions on and check; otherwise
>>> you'll have to figure it out from the client side.
>>> -Greg
>>>
>>>>
>>>> Thanks for the input!
>>>>
>>>>
>>>> On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg-3KCAGdo1P2hBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>> So this is exactly the same test you ran previously, but now it's on
>>>>> faster hardware and the test is slower?
>>>>>
>>>>> Do you have more data in the test cluster? One obvious possibility is
>>>>> that previously you were working entirely in the MDS' cache, but now
>>>>> you've got more dentries and so it's kicking data out to RADOS and
>>>>> then reading it back in.
>>>>>
>>>>> If you've got the memory (you appear to) you can pump up the "mds
>>>>> cache size" config option quite dramatically from it's default 100000.
>>>>>
>>>>> Other things to check are that you've got an appropriately-sized
>>>>> metadata pool, that you've not got clients competing against each
>>>>> other inappropriately, etc.
>>>>> -Greg
>>>>>
>>>>> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
>>>>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>> Opps I should have said that I am not just writing the data but copying it :
>>>>>>
>>>>>> time cp Small1/* Small2/*
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> BJ
>>>>>>
>>>>>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>>>>>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>>> I did a Ceph cluster install 2 weeks ago where I was getting great
>>>>>>> performance (~= PanFS) where I could write 100,000 1MB files in 61
>>>>>>> Mins (Took PanFS 59 Mins). I thought I could increase the performance
>>>>>>> by adding a better MDS server so I redid the entire build.
>>>>>>>
>>>>>>> Now it takes 4 times as long to write the same data as it did before.
>>>>>>> The only thing that changed was the MDS server. (I even tried moving
>>>>>>> the MDS back on the old slower node and the performance was the same.)
>>>>>>>
>>>>>>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>>>>>>> and it's the same results.
>>>>>>> I use the same scripts to install the OSDs (which I created because I
>>>>>>> can never get ceph-deploy to behave correctly. Although, I did use
>>>>>>> ceph-deploy to create the MDS and MON and initial cluster creation.)
>>>>>>>
>>>>>>> I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
>>>>>>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>>>>>>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>>>>>>>
>>>>>>> Could anybody think of a reason as to why I am now getting a huge regression.
>>>>>>>
>>>>>>> Hardware Setup:
>>>>>>> [OSDs]
>>>>>>> 64 GB 2133 MHz
>>>>>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>>>>>> 40Gb Mellanox NIC
>>>>>>>
>>>>>>> [MDS/MON new]
>>>>>>> 128 GB 2133 MHz
>>>>>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>>>>>> 40Gb Mellanox NIC
>>>>>>>
>>>>>>> [MDS/MON old]
>>>>>>> 32 GB 800 MHz
>>>>>>> Dual Proc E5472  @ 3.00GHz (8 Cores)
>>>>>>> 10Gb Intel NIC


* Re: [ceph-users] CephFS Slow writes with 1MB files
  2015-03-28 17:12           ` Barclay Jameson
       [not found]             ` <CAMzumda7VserizM5PEoT8mwYnTHeE4CFiUiLjnjqN2xXaeNVQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-03-31  3:59             ` Yan, Zheng
       [not found]               ` <CAMzumdZG4Zqv5SWGTRTS_FTEUe3EwVbbBThjXVTP2gQadMhFsw@mail.gmail.com>
  1 sibling, 1 reply; 15+ messages in thread
From: Yan, Zheng @ 2015-03-31  3:59 UTC (permalink / raw)
  To: Barclay Jameson; +Cc: Gregory Farnum, ceph-devel, ceph-users

On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson
<almightybeeij@gmail.com> wrote:
> I redid my entire Ceph build going back to to CentOS 7 hoping to the
> get the same performance I did last time.
> The rados bench test was the best I have ever had with a time of 740
> MB wr and 1300 MB rd. This was even better than the first rados bench
> test that had performance equal to PanFS. I find that this does not
> translate to my CephFS. Even with the following tweaking it still at
> least twice as slow as PanFS and my first *Magical* build (that had
> absolutely no tweaking):
>
> OSD
>  osd_op_treads 8
>  /sys/block/sd*/queue/nr_requests 4096
>  /sys/block/sd*/queue/read_ahead_kb 4096
>
> Client
>  rsize=16777216
>  readdir_max_bytes=16777216
>  readdir_max_entries=16777216
>
> ~160 mins to copy 100000 (1MB) files for CephFS vs ~50 mins for PanFS.
> Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s.
>
> Strange thing is none of the resources are taxed.
> CPU, ram, network, disks, are not even close to being taxed on either
> the client,mon/mds, or the osd nodes.
> The PanFS client node was a 10Gb network the same as the CephFS client
> but you can see the huge difference in speed.
>
> As per Gregs questions before:
> There is only one client reading and writing (time cp Small1/*
> Small2/.) but three clients have cephfs mounted, although they aren't
> doing anything on the filesystem.
>
> I have done another test where I stream data info a file as fast as
> the processor can put it there.
> (for (i=0; i < 1000000001; i++){ fprintf (out_file, "I is : %d\n",i);}
> ) and it is faster than the PanFS. CephFS 16GB in 105 seconds with the
> above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
> seconds for CephFS although the first build did it in 130 seconds
> without any tuning.
>
> This leads me to believe the bottleneck is the mds. Does anybody have
> any thoughts on this?
> Are there any tuning parameters that I would need to speed up the mds?

Could you enable mds debugging for a few seconds (ceph daemon mds.x
config set debug_mds 10; sleep 10; ceph daemon mds.x config set
debug_mds 0) and upload /var/log/ceph/mds.x.log somewhere?

Regards
Yan, Zheng

>
> On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum <greg@gregs42.com> wrote:
>> On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
>> <almightybeeij@gmail.com> wrote:
>>> Yes it's the exact same hardware except for the MDS server (although I
>>> tried using the MDS on the old node).
>>> I have not tried moving the MON back to the old node.
>>>
>>> My default cache size is "mds cache size = 10000000"
>>> The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
>>> I created 2048 for data and metadata:
>>> ceph osd pool create cephfs_data 2048 2048
>>> ceph osd pool create cephfs_metadata 2048 2048
>>>
>>>
>>> To your point on clients competing against each other... how would I check that?
>>
>> Do you have multiple clients mounted? Are they both accessing files in
>> the directory(ies) you're testing? Were they accessing the same
>> pattern of files for the old cluster?
>>
>> If you happen to be running a hammer rc or something pretty new you
>> can use the MDS admin socket to explore a bit what client sessions
>> there are and what they have permissions on and check; otherwise
>> you'll have to figure it out from the client side.
>> -Greg
>>
>>>
>>> Thanks for the input!
>>>
>>>
>>> On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg@gregs42.com> wrote:
>>>> So this is exactly the same test you ran previously, but now it's on
>>>> faster hardware and the test is slower?
>>>>
>>>> Do you have more data in the test cluster? One obvious possibility is
>>>> that previously you were working entirely in the MDS' cache, but now
>>>> you've got more dentries and so it's kicking data out to RADOS and
>>>> then reading it back in.
>>>>
>>>> If you've got the memory (you appear to) you can pump up the "mds
>>>> cache size" config option quite dramatically from it's default 100000.
>>>>
>>>> Other things to check are that you've got an appropriately-sized
>>>> metadata pool, that you've not got clients competing against each
>>>> other inappropriately, etc.
>>>> -Greg
>>>>
>>>> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
>>>> <almightybeeij@gmail.com> wrote:
>>>>> Opps I should have said that I am not just writing the data but copying it :
>>>>>
>>>>> time cp Small1/* Small2/*
>>>>>
>>>>> Thanks,
>>>>>
>>>>> BJ
>>>>>
>>>>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>>>>> <almightybeeij@gmail.com> wrote:
>>>>>> I did a Ceph cluster install 2 weeks ago where I was getting great
>>>>>> performance (~= PanFS) where I could write 100,000 1MB files in 61
>>>>>> Mins (Took PanFS 59 Mins). I thought I could increase the performance
>>>>>> by adding a better MDS server so I redid the entire build.
>>>>>>
>>>>>> Now it takes 4 times as long to write the same data as it did before.
>>>>>> The only thing that changed was the MDS server. (I even tried moving
>>>>>> the MDS back on the old slower node and the performance was the same.)
>>>>>>
>>>>>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>>>>>> and it's the same results.
>>>>>> I use the same scripts to install the OSDs (which I created because I
>>>>>> can never get ceph-deploy to behave correctly. Although, I did use
>>>>>> ceph-deploy to create the MDS and MON and initial cluster creation.)
>>>>>>
>>>>>> I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
>>>>>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>>>>>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>>>>>>
>>>>>> Could anybody think of a reason as to why I am now getting a huge regression.
>>>>>>
>>>>>> Hardware Setup:
>>>>>> [OSDs]
>>>>>> 64 GB 2133 MHz
>>>>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>>>>> 40Gb Mellanox NIC
>>>>>>
>>>>>> [MDS/MON new]
>>>>>> 128 GB 2133 MHz
>>>>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>>>>> 40Gb Mellanox NIC
>>>>>>
>>>>>> [MDS/MON old]
>>>>>> 32 GB 800 MHz
>>>>>> Dual Proc E5472  @ 3.00GHz (8 Cores)
>>>>>> 10Gb Intel NIC


* Re: CephFS Slow writes with 1MB files
       [not found]                 ` <CAMzumdZG4Zqv5SWGTRTS_FTEUe3EwVbbBThjXVTP2gQadMhFsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-04-02 11:03                   ` Yan, Zheng
  2015-04-02 15:18                     ` [ceph-users] " Barclay Jameson
  0 siblings, 1 reply; 15+ messages in thread
From: Yan, Zheng @ 2015-04-02 11:03 UTC (permalink / raw)
  To: Barclay Jameson; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users

On Wed, Apr 1, 2015 at 12:31 AM, Barclay Jameson
<almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Here is the mds output from the command you requested. I did this
> during the small data run . ( time cp small1/* small2/ )
> It is 20MB in size so I couldn't find a place online that would accept
> that much data.
>
> Please find attached file.
>
> Thanks,

In the log file, each 'create' request is followed by several
'getattr' requests. I guess these 'getattr' requests resulted from
some kind of permission check, but I can't reproduce this situation
locally.

Which version of ceph/kernel are you using? Do you use ceph-fuse or
the kernel client, and what are the mount options?

Regards
Yan, Zheng


>
> Beeij
>
>
> On Mon, Mar 30, 2015 at 10:59 PM, Yan, Zheng <ukernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson
>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> I redid my entire Ceph build going back to to CentOS 7 hoping to the
>>> get the same performance I did last time.
>>> The rados bench test was the best I have ever had with a time of 740
>>> MB wr and 1300 MB rd. This was even better than the first rados bench
>>> test that had performance equal to PanFS. I find that this does not
>>> translate to my CephFS. Even with the following tweaking it still at
>>> least twice as slow as PanFS and my first *Magical* build (that had
>>> absolutely no tweaking):
>>>
>>> OSD
>>>  osd_op_treads 8
>>>  /sys/block/sd*/queue/nr_requests 4096
>>>  /sys/block/sd*/queue/read_ahead_kb 4096
>>>
>>> Client
>>>  rsize=16777216
>>>  readdir_max_bytes=16777216
>>>  readdir_max_entries=16777216
>>>
>>> ~160 mins to copy 100000 (1MB) files for CephFS vs ~50 mins for PanFS.
>>> Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s.
>>>
>>> Strange thing is none of the resources are taxed.
>>> CPU, ram, network, disks, are not even close to being taxed on either
>>> the client,mon/mds, or the osd nodes.
>>> The PanFS client node was a 10Gb network the same as the CephFS client
>>> but you can see the huge difference in speed.
>>>
>>> As per Gregs questions before:
>>> There is only one client reading and writing (time cp Small1/*
>>> Small2/.) but three clients have cephfs mounted, although they aren't
>>> doing anything on the filesystem.
>>>
>>> I have done another test where I stream data info a file as fast as
>>> the processor can put it there.
>>> (for (i=0; i < 1000000001; i++){ fprintf (out_file, "I is : %d\n",i);}
>>> ) and it is faster than the PanFS. CephFS 16GB in 105 seconds with the
>>> above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
>>> seconds for CephFS although the first build did it in 130 seconds
>>> without any tuning.
>>>
>>> This leads me to believe the bottleneck is the mds. Does anybody have
>>> any thoughts on this?
>>> Are there any tuning parameters that I would need to speed up the mds?
>>
>> could you enable mds debugging for a few seconds (ceph daemon mds.x
>> config set debug_mds 10; sleep 10; ceph daemon mds.x config set
>> debug_mds 0). and upload /var/log/ceph/mds.x.log to somewhere.
>>
>> Regards
>> Yan, Zheng
>>
>>>
>>> On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum <greg-3KCAGdo1P2hBDgjK7y7TUQ@public.gmane.org> wrote:
>>>> On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
>>>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>> Yes it's the exact same hardware except for the MDS server (although I
>>>>> tried using the MDS on the old node).
>>>>> I have not tried moving the MON back to the old node.
>>>>>
>>>>> My default cache size is "mds cache size = 10000000"
>>>>> The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
>>>>> I created 2048 for data and metadata:
>>>>> ceph osd pool create cephfs_data 2048 2048
>>>>> ceph osd pool create cephfs_metadata 2048 2048
>>>>>
>>>>>
>>>>> To your point on clients competing against each other... how would I check that?
>>>>
>>>> Do you have multiple clients mounted? Are they both accessing files in
>>>> the directory(ies) you're testing? Were they accessing the same
>>>> pattern of files for the old cluster?
>>>>
>>>> If you happen to be running a hammer rc or something pretty new you
>>>> can use the MDS admin socket to explore a bit what client sessions
>>>> there are and what they have permissions on and check; otherwise
>>>> you'll have to figure it out from the client side.
>>>> -Greg
>>>>
>>>>>
>>>>> Thanks for the input!
>>>>>
>>>>>
>>>>> On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg-3KCAGdo1P2hBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>>> So this is exactly the same test you ran previously, but now it's on
>>>>>> faster hardware and the test is slower?
>>>>>>
>>>>>> Do you have more data in the test cluster? One obvious possibility is
>>>>>> that previously you were working entirely in the MDS' cache, but now
>>>>>> you've got more dentries and so it's kicking data out to RADOS and
>>>>>> then reading it back in.
>>>>>>
>>>>>> If you've got the memory (you appear to) you can pump up the "mds
>>>>>> cache size" config option quite dramatically from it's default 100000.
>>>>>>
>>>>>> Other things to check are that you've got an appropriately-sized
>>>>>> metadata pool, that you've not got clients competing against each
>>>>>> other inappropriately, etc.
>>>>>> -Greg
>>>>>>
>>>>>> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
>>>>>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>>> Opps I should have said that I am not just writing the data but copying it :
>>>>>>>
>>>>>>> time cp Small1/* Small2/*
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> BJ
>>>>>>>
>>>>>>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>>>>>>> <almightybeeij-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>>>> I did a Ceph cluster install 2 weeks ago where I was getting great
>>>>>>>> performance (~= PanFS) where I could write 100,000 1MB files in 61
>>>>>>>> Mins (Took PanFS 59 Mins). I thought I could increase the performance
>>>>>>>> by adding a better MDS server so I redid the entire build.
>>>>>>>>
>>>>>>>> Now it takes 4 times as long to write the same data as it did before.
>>>>>>>> The only thing that changed was the MDS server. (I even tried moving
>>>>>>>> the MDS back on the old slower node and the performance was the same.)
>>>>>>>>
>>>>>>>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>>>>>>>> and it's the same results.
>>>>>>>> I use the same scripts to install the OSDs (which I created because I
>>>>>>>> can never get ceph-deploy to behave correctly. Although, I did use
>>>>>>>> ceph-deploy to create the MDS and MON and initial cluster creation.)
>>>>>>>>
>>>>>>>> I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read
>>>>>>>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>>>>>>>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>>>>>>>>
>>>>>>>> Could anybody think of a reason as to why I am now getting a huge regression.
>>>>>>>>
>>>>>>>> Hardware Setup:
>>>>>>>> [OSDs]
>>>>>>>> 64 GB 2133 MHz
>>>>>>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>>>>>>> 40Gb Mellanox NIC
>>>>>>>>
>>>>>>>> [MDS/MON new]
>>>>>>>> 128 GB 2133 MHz
>>>>>>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>>>>>>> 40Gb Mellanox NIC
>>>>>>>>
>>>>>>>> [MDS/MON old]
>>>>>>>> 32 GB 800 MHz
>>>>>>>> Dual Proc E5472  @ 3.00GHz (8 Cores)
>>>>>>>> 10Gb Intel NIC


* Re: [ceph-users] CephFS Slow writes with 1MB files
  2015-04-02 11:03                   ` Yan, Zheng
@ 2015-04-02 15:18                     ` Barclay Jameson
       [not found]                       ` <CAMzumdYdeTv0qF9VdNHb4CM=DNV_HC2nQ7sVnqJ+5w6=TrvCiQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-04-03  2:12                       ` [ceph-users] " Yan, Zheng
  0 siblings, 2 replies; 15+ messages in thread
From: Barclay Jameson @ 2015-04-02 15:18 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: Gregory Farnum, ceph-devel, ceph-users

I am using the Giant release. The OSDs and MON/MDS are using the
default RHEL 7 kernel. The client is using the elrepo 3.19 kernel. I am
also using cephaux.
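
For completeness, the exact mount options in effect on the client can
be read back with:

  grep ceph /proc/mounts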

I may have found something.
I did the build manually, and as such I did _NOT_ set up these config settings:
filestore xattr use omap = false
filestore max inline xattr size = 65536
filestore_max_inline_xattr_size_xfs = 65536
filestore_max_inline_xattr_size_other = 512
filestore_max_inline_xattrs_xfs = 10
I just changed these settings to see if they will make a difference.
I copied data from a directory whose files were created before I set
these values (time cp small1/* small2/.) and it takes 2 min 30 secs
to copy 1600 files.
If I take the files I just copied into small2 and copy them to a
different directory (time cp small2/* small3/.), it only takes 5 mins
to copy 10000 files!

Could this be part of the problem?
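
One way to confirm which filestore xattr values an OSD is actually
running with (osd.0 is a placeholder):

  ceph daemon osd.0 config show | grep -E 'filestore.*xattr'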



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: CephFS Slow writes with 1MB files
       [not found]                       ` <CAMzumdYdeTv0qF9VdNHb4CM=DNV_HC2nQ7sVnqJ+5w6=TrvCiQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-04-02 21:36                         ` Barclay Jameson
  0 siblings, 0 replies; 15+ messages in thread
From: Barclay Jameson @ 2015-04-02 21:36 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users

Nope.
I redid the cluster with the above config options and it did not fix it.
The faster second copy must have come from the files already being
cached after the first copy.

Any thoughts on this?
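
To rule client-side caching in or out between runs, something like this
on the client before each copy should help (a sketch; it assumes the
kernel client, root access, and /mnt/cephfs as the mount point):

    # flush dirty data and drop the page cache, dentries and inodes
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches

    # or, more thorough: cycle the mount so the client cache starts empty
    # (assumes an fstab entry for /mnt/cephfs)
    sudo umount /mnt/cephfs && sudo mount /mnt/cephfs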


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] CephFS Slow writes with 1MB files
  2015-04-02 15:18                     ` [ceph-users] " Barclay Jameson
       [not found]                       ` <CAMzumdYdeTv0qF9VdNHb4CM=DNV_HC2nQ7sVnqJ+5w6=TrvCiQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-04-03  2:12                       ` Yan, Zheng
  2015-04-03 20:57                         ` Barclay Jameson
  1 sibling, 1 reply; 15+ messages in thread
From: Yan, Zheng @ 2015-04-03  2:12 UTC (permalink / raw)
  To: Barclay Jameson; +Cc: Gregory Farnum, ceph-devel, ceph-users

[-- Attachment #1: Type: text/plain, Size: 9251 bytes --]

On Thu, Apr 2, 2015 at 11:18 PM, Barclay Jameson
<almightybeeij@gmail.com> wrote:
> I am using the Giant release. The OSDs and MON/MDS are using the
> default RHEL 7 kernel, and the client is using the elrepo 3.19 kernel.
> I am also using cephaux.

I reproduced this issue using the Giant release. It's a bug in the MDS
code. Could you try the newest development version of Ceph (it includes
the fix), or apply the attached patch to the Giant release source?
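
If you apply the patch by hand, something along these lines against a
Giant source checkout should do it (the file name, branch name and
rebuild step are assumptions; adjust for however you build packages):

    # save the attachment as xattr-version.patch, then in the ceph tree:
    git checkout giant
    git apply --stat xattr-version.patch   # preview the files it touches
    git apply xattr-version.patch          # or: patch -p1 < xattr-version.patch
    # rebuild the ceph-mds package and restart the MDS afterwards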

Regards
Yan, Zheng


[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 1216 bytes --]

commit 306fb2f5e9661a8b85238f065a0dedcf06c4e725
Author: Yan, Zheng <zyan@redhat.com>
Date:   Mon Sep 15 21:39:26 2014 +0800

    mds: set new inode's xattr version to 1
    
    set new inode's xattr version to 1 even if it has no xattr. This allow
    client to differentiate no xattr in inode from MDS skips sending xattr
    to client (because MDS think client already has uptodate xattr).
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index ea5bb36..6b9d6c7 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -332,6 +332,7 @@ CInode *MDCache::create_system_inode(inodeno_t ino, int mode)
   CInode *in = new CInode(this);
   in->inode.ino = ino;
   in->inode.version = 1;
+  in->inode.xattr_version = 1;
   in->inode.mode = 0500 | mode;
   in->inode.size = 0;
   in->inode.ctime = 
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index beb4696..bf5b98a 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -2005,6 +2005,7 @@ CInode* Server::prepare_new_inode(MDRequestRef& mdr, CDir *dir, inodeno_t useino
   }
 
   in->inode.version = 1;
+  in->inode.xattr_version = 1;
   in->inode.nlink = 1;   // FIXME
 
   in->inode.mode = mode;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [ceph-users] CephFS Slow writes with 1MB files
  2015-04-03  2:12                       ` [ceph-users] " Yan, Zheng
@ 2015-04-03 20:57                         ` Barclay Jameson
  0 siblings, 0 replies; 15+ messages in thread
From: Barclay Jameson @ 2015-04-03 20:57 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: Gregory Farnum, ceph-devel, ceph-users

I pulled down the gitbuilder package (ceph version 0.93-223-g5c2ecc3
(5c2ecc3b8901e6491f1fde8858b51794ffa892e2)) and redid the cluster.
The small-file test ( time cp small1/* small2/. ) went from 2 min
30 secs to 1 min 40 secs. With some initial tuning I was able to
get it down to 1 min 22 secs.
That is much better, but I still have more work to do to get it to
1 min for 1600 files.
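
For anyone who wants to repeat the comparison, the test set can be
generated and timed roughly like this (file contents, counts and paths
are illustrative, not my exact script):

    # create 1600 x 1MB files, then time the copy through the cephfs mount
    mkdir -p small1 small2
    for i in $(seq 1 1600); do
        dd if=/dev/zero of=small1/file$i bs=1M count=1 2>/dev/null
    done
    time cp small1/* small2/.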

Thank you for the pointer, Zheng; it was _extremely_ helpful :)

There is still more work to do on performance and speed (the goal is
to replace the HPC tmp space we have on Panasas).

Thanks again,

Beeij


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-04-03 20:57 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-27 16:40 CephFS Slow writes with 1MB files Barclay Jameson
2015-03-27 16:47 ` Barclay Jameson
     [not found]   ` <CAMzumdbezcb-p1_MpcSL-h8tTR0RKATt93Om6NPejtn1G6yPeQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-27 16:50     ` Mark Nelson
2015-03-27 20:04   ` [ceph-users] " Gregory Farnum
2015-03-27 21:46     ` Barclay Jameson
2015-03-27 21:50       ` Gregory Farnum
     [not found]         ` <CAC6JEv-4D2kF7rrnGMncbE-_63+hdDzecqpB+HrMOKp1YGwv_w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-28 17:12           ` Barclay Jameson
     [not found]             ` <CAMzumda7VserizM5PEoT8mwYnTHeE4CFiUiLjnjqN2xXaeNVQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-30 18:30               ` Gregory Farnum
     [not found]                 ` <CAC6JEv_mMbKn+=nHjQmwxAk8T7=MjpJ7bXtHTeWKda7ogJ1GPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-30 18:46                   ` Barclay Jameson
2015-03-31  3:59             ` [ceph-users] " Yan, Zheng
     [not found]               ` <CAMzumdZG4Zqv5SWGTRTS_FTEUe3EwVbbBThjXVTP2gQadMhFsw@mail.gmail.com>
     [not found]                 ` <CAMzumdZG4Zqv5SWGTRTS_FTEUe3EwVbbBThjXVTP2gQadMhFsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-04-02 11:03                   ` Yan, Zheng
2015-04-02 15:18                     ` [ceph-users] " Barclay Jameson
     [not found]                       ` <CAMzumdYdeTv0qF9VdNHb4CM=DNV_HC2nQ7sVnqJ+5w6=TrvCiQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-04-02 21:36                         ` Barclay Jameson
2015-04-03  2:12                       ` [ceph-users] " Yan, Zheng
2015-04-03 20:57                         ` Barclay Jameson
