* read performance not perfect
@ 2011-07-18  4:51 huang jun
  2011-07-18 17:14 ` Sage Weil
  0 siblings, 1 reply; 19+ messages in thread
From: huang jun @ 2011-07-18  4:51 UTC (permalink / raw)
  To: ceph-devel

Hi all,
We tested Ceph's read performance last week and found something weird.
We use ceph v0.30 on Linux 2.6.37.
We mount Ceph on a back end consisting of 2 OSDs, 1 mon and 1 MDS:
$mount -t ceph 192.168.1.103:/ /mnt -vv
$ dd if=/dev/zero of=/mnt/test bs=4M count=200
$ cd .. && umount /mnt
$mount -t ceph 192.168.1.103:/ /mnt -vv
$dd if=test of=/dev/zero bs=4M
  200+0 records in
  200+0 records out
  838860800 bytes (839 MB) copied, 16.2327 s, 51.7 MB/s
But if we use rados to test it:
$ rados -m 192.168.1.103:6789 -p data bench 60 write
$ rados -m 192.168.1.103:6789 -p data bench 60 seq
  the result is:
  Total time run:        24.733935
  Total reads made:     438
  Read size:            4194304
  Bandwidth (MB/sec):    70.834

  Average Latency:       0.899429
  Max latency:           1.85106
  Min latency:           0.128017
This phenomenon caught our attention, so we began to analyse the
OSD debug log.
We found that:
1) the kernel client sends READ requests of 1MB at first, and 512KB
after that
2) from the rados bench log, the OSD receives READ ops covering 4MB of data each
We know the Ceph developers pay attention to read and write
performance, so I just want to confirm: does the communication between
the client and the OSD take more time than it should? Can we request a
bigger size for READ operations, such as the default 4MB object size?
Or is this related to OS management; if so, what can we do to improve
the performance?

Thanks very much!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-07-18  4:51 read performance not perfect huang jun
@ 2011-07-18 17:14 ` Sage Weil
  2011-07-20  0:21   ` huang jun
       [not found]   ` <CABAwU-YKmEC=umFLzDb-ykPbzQ9s3sKoUmQbkumExrXEwyveNA@mail.gmail.com>
  0 siblings, 2 replies; 19+ messages in thread
From: Sage Weil @ 2011-07-18 17:14 UTC (permalink / raw)
  To: huang jun; +Cc: ceph-devel

On Mon, 18 Jul 2011, huang jun wrote:
> hi,all
> We test ceph's read performance last week, and find something weird
> we use ceph v0.30 on linux 2.6.37
> mount ceph on back-platform consist of 2 osds \1 mon \1 mds
> $mount -t ceph 192.168.1.103:/ /mnt -vv
> $ dd if=/dev/zero of=/mnt/test bs=4M count=200
> $ cd .. && umount /mnt
> $mount -t ceph 192.168.1.103:/ /mnt -vv
> $dd if=test of=/dev/zero bs=4M
>   200+0 records in
>   200+0 records out
>   838860800 bytes (839 MB) copied, 16.2327 s, 51.7 MB/s
> but if we use rados to test it
> $ rados -m 192.168.1.103:6789 -p data bench 60 write
> $ rados -m 192.168.1.103:6789 -p data bench 60 seq
>   the result is:
>   Total time run:        24.733935
>   Total reads made:     438
>   Read size:            4194304
>   Bandwidth (MB/sec):    70.834
> 
>   Average Latency:       0.899429
>   Max latency:           1.85106
>   Min latency:           0.128017
> this phenomenon attracts our attention, then we begin to analysis the
> osd debug log.
> we find that :
> 1) the kernel client send READ request, at first it requests 1MB, and
> after that it is 512KB
> 2) from rados test cmd log, OSD recept the READ op with 4MB data to handle
> we know the ceph developers pay their attention to read and write
> performance, so i just want to confrim that
> if the communication between the client and OSD spend  more time than
> it should be? can we request  bigger size, just like default object
> size 4MB, when it occurs to READ operation? or this is related to OS
> management, if so, what can we do to promote the performance?

I think it's related to the way the Linux VFS is doing readahead, and how 
the ceph fs code is handling it.  It's issue #1122 in the tracker and I 
plan to look at it today or tomorrow!
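
If you want to see what window the VFS is actually using in the meantime,
one place to peek is the readahead setting of the mount's backing device.
This is only a sketch, under the assumption that the kernel client
registers a BDI under /sys/class/bdi with a ceph-* name; the exact path
may differ on your kernel:

 $ cat /sys/class/bdi/ceph-*/read_ahead_kb    # effective readahead window, in KB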

Thanks-
sage

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-07-18 17:14 ` Sage Weil
@ 2011-07-20  0:21   ` huang jun
       [not found]   ` <CABAwU-YKmEC=umFLzDb-ykPbzQ9s3sKoUmQbkumExrXEwyveNA@mail.gmail.com>
  1 sibling, 0 replies; 19+ messages in thread
From: huang jun @ 2011-07-20  0:21 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Thanks for your reply.
Now we find two points that confuse us:
1) The kernel client executes sequential reads through the aio_read
function, but from the OSD log the dispatch_queue length on the OSD is
always 0, which means the OSD can't get the next READ message until the
client sends it. It seems that async_read degenerates into sync_read;
the OSD can't read data in parallel, so it cannot make the most of its
resources. What was the original purpose when you designed this part?
Perfect reliability?
2) In the single-read case, while the OSD reads data from its disk it
does nothing but wait for the read to finish. We think this is a result
of 1): the OSD has nothing else to do, so it just waits.

2011/7/19 Sage Weil <sage@newdream.net>:
> On Mon, 18 Jul 2011, huang jun wrote:
>> hi,all
>> We test ceph's read performance last week, and find something weird
>> we use ceph v0.30 on linux 2.6.37
>> mount ceph on back-platform consist of 2 osds \1 mon \1 mds
>> $mount -t ceph 192.168.1.103:/ /mnt -vv
>> $ dd if=/dev/zero of=/mnt/test bs=4M count=200
>> $ cd .. && umount /mnt
>> $mount -t ceph 192.168.1.103:/ /mnt -vv
>> $dd if=test of=/dev/zero bs=4M
>>   200+0 records in
>>   200+0 records out
>>   838860800 bytes (839 MB) copied, 16.2327 s, 51.7 MB/s
>> but if we use rados to test it
>> $ rados -m 192.168.1.103:6789 -p data bench 60 write
>> $ rados -m 192.168.1.103:6789 -p data bench 60 seq
>>   the result is:
>>   Total time run:        24.733935
>>   Total reads made:     438
>>   Read size:            4194304
>>   Bandwidth (MB/sec):    70.834
>>
>>   Average Latency:       0.899429
>>   Max latency:           1.85106
>>   Min latency:           0.128017
>> this phenomenon attracts our attention, then we begin to analysis the
>> osd debug log.
>> we find that :
>> 1) the kernel client send READ request, at first it requests 1MB, and
>> after that it is 512KB
>> 2) from rados test cmd log, OSD recept the READ op with 4MB data to handle
>> we know the ceph developers pay their attention to read and write
>> performance, so i just want to confrim that
>> if the communication between the client and OSD spend  more time than
>> it should be? can we request  bigger size, just like default object
>> size 4MB, when it occurs to READ operation? or this is related to OS
>> management, if so, what can we do to promote the performance?
>
> I think it's related to the way the Linux VFS is doing readahead, and how
> the ceph fs code is handling it.  It's issue #1122 in the tracker and I
> plan to look at it today or tomorrow!
>
> Thanks-
> sage
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
       [not found]   ` <CABAwU-YKmEC=umFLzDb-ykPbzQ9s3sKoUmQbkumExrXEwyveNA@mail.gmail.com>
@ 2011-08-04 15:51     ` Sage Weil
  2011-08-04 19:36       ` Fyodor Ustinov
  2011-08-09  3:56       ` huang jun
  0 siblings, 2 replies; 19+ messages in thread
From: Sage Weil @ 2011-08-04 15:51 UTC (permalink / raw)
  To: huang jun; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3738 bytes --]

Hi,

I've just pushed a wip-readahead branch to ceph-client.git that rewrites 
ceph_readpages (used for readahead) to be fully asynchronous.  This should 
let us take full advantage of whatever the readahead window is.  I'm still 
doing some testing on this end, but things look good so far.

There are two relevant mount options:

 rasize=NN    - max readahead window size (bytes)
 rsize=MM     - max read size

rsize defaults to 0 (no limit), which means it effectively maxes out at 
the stripe size (one object, 4MB by default).

rasize now defaults to 8 MB.  This is probably what you'll want to 
experiment with.  In practice I think something on the order of 8-12 MB 
will be best, as it will start loading things off disk ~2 objects ahead of 
the current position.
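
For example, to try a larger window (the numbers here are only
illustrative, not a recommendation):

 $ mount -t ceph 192.168.1.103:/ /mnt -o rasize=12582912           # 12 MB readahead window
 $ mount -t ceph 192.168.1.103:/ /mnt -o rasize=8388608,rsize=0    # current defaults, spelled out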

Can you give it a go and see if this helps in your environment?
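
When you re-run the dd read test, make sure the client page cache is cold
first -- either remount as you did before, or drop the cache (as root) so
the reads really hit the OSDs:

 $ sync; echo 3 > /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes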

Thanks!
sage


On Tue, 19 Jul 2011, huang jun wrote:
> thanks for you reply
> now we find two points confused us:
> 1) the kernel client execute sequence read though aio_read function,
> but from OSD log,
>    the dispatch_queue length in OSD is always 0, it means OSD can't
> got next READ message until client send to it. It seems that
> async_read changes to sync_read, OSD can't parallely read data, so can
> not make the most of  resources.What are the original purposes when
> you design this part? perfect realiablity?

Right.  The old ceph_readpages was synchronous, which slowed things down in 
a couple of different ways.

> 2) In singleness read circumstance,during OSD read data from it disk,
> the OSD doesn't do anything but to wait it finish.We think it was the
> result of 1), OSD have nothing to do,so just to wait.
> 
> 
> 2011/7/19 Sage Weil <sage@newdream.net>:
> > On Mon, 18 Jul 2011, huang jun wrote:
> >> hi,all
> >> We test ceph's read performance last week, and find something weird
> >> we use ceph v0.30 on linux 2.6.37
> >> mount ceph on back-platform consist of 2 osds \1 mon \1 mds
> >> $mount -t ceph 192.168.1.103:/ /mnt -vv
> >> $ dd if=/dev/zero of=/mnt/test bs=4M count=200
> >> $ cd .. && umount /mnt
> >> $mount -t ceph 192.168.1.103:/ /mnt -vv
> >> $dd if=test of=/dev/zero bs=4M
> >>   200+0 records in
> >>   200+0 records out
> >>   838860800 bytes (839 MB) copied, 16.2327 s, 51.7 MB/s
> >> but if we use rados to test it
> >> $ rados -m 192.168.1.103:6789 -p data bench 60 write
> >> $ rados -m 192.168.1.103:6789 -p data bench 60 seq
> >>   the result is:
> >>   Total time run:        24.733935
> >>   Total reads made:     438
> >>   Read size:            4194304
> >>   Bandwidth (MB/sec):    70.834
> >>
> >>   Average Latency:       0.899429
> >>   Max latency:           1.85106
> >>   Min latency:           0.128017
> >> this phenomenon attracts our attention, then we begin to analysis the
> >> osd debug log.
> >> we find that :
> >> 1) the kernel client send READ request, at first it requests 1MB, and
> >> after that it is 512KB
> >> 2) from rados test cmd log, OSD recept the READ op with 4MB data to handle
> >> we know the ceph developers pay their attention to read and write
> >> performance, so i just want to confrim that
> >> if the communication between the client and OSD spend  more time than
> >> it should be? can we request  bigger size, just like default object
> >> size 4MB, when it occurs to READ operation? or this is related to OS
> >> management, if so, what can we do to promote the performance?
> >
> > I think it's related to the way the Linux VFS is doing readahead, and how
> > the ceph fs code is handling it.  It's issue #1122 in the tracker and I
> > plan to look at it today or tomorrow!
> >
> > Thanks-
> > sage
> >
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-04 15:51     ` Sage Weil
@ 2011-08-04 19:36       ` Fyodor Ustinov
  2011-08-04 19:53         ` Sage Weil
  2011-08-09  3:56       ` huang jun
  1 sibling, 1 reply; 19+ messages in thread
From: Fyodor Ustinov @ 2011-08-04 19:36 UTC (permalink / raw)
  To: ceph-devel

Sage Weil <sage <at> newdream.net> writes:

> 
> Hi,
> 
> I've just pushed a wip-readahead branch to ceph-client.git that rewrites 
> ceph_readpages (used for readahead) to be fully asynchronous.  This should 
> let us take full advantage of whatever the readahead window is.  I'm still 
> doing some testing on this end, but things look good so far.

As I understand it, this is available only in kernel 3.1?

WBR,
    Fyodor.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-04 19:36       ` Fyodor Ustinov
@ 2011-08-04 19:53         ` Sage Weil
  2011-08-04 23:38           ` Fyodor Ustinov
  0 siblings, 1 reply; 19+ messages in thread
From: Sage Weil @ 2011-08-04 19:53 UTC (permalink / raw)
  To: Fyodor Ustinov; +Cc: ceph-devel

On Thu, 4 Aug 2011, Fyodor Ustinov wrote:
> Sage Weil <sage <at> newdream.net> writes:
> 
> > 
> > Hi,
> > 
> > I've just pushed a wip-readahead branch to ceph-client.git that rewrites 
> > ceph_readpages (used for readahead) to be fully asynchronous.  This should 
> > let us take full advantage of whatever the readahead window is.  I'm still 
> > doing some testing on this end, but things look good so far.
> 
> As I understand it's available only in kernel 3.1 ?

The current patches are on top of v3.0, but you should be able to rebase 
the readahead stuff on top of anything reasonably recent.
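
A rough sketch of what that could look like (the repo URL, branch and tag
names here are illustrative -- adjust them to whatever you actually run):

 $ git clone git://ceph.newdream.net/git/ceph-client.git
 $ cd ceph-client
 $ git checkout -b readahead-test origin/wip-readahead   # branch carrying the readahead patches
 $ git rebase v3.0.1                                     # or any reasonably recent tag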

sage

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-04 19:53         ` Sage Weil
@ 2011-08-04 23:38           ` Fyodor Ustinov
  2011-08-05  1:26             ` Sage Weil
  0 siblings, 1 reply; 19+ messages in thread
From: Fyodor Ustinov @ 2011-08-04 23:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 08/04/2011 10:53 PM, Sage Weil wrote:
>
> The current patches are on top of v3.0, but you should be able to rebase
> the readahead stuff on top of anything reasonably recent.
>
> sage

As usual:
cluster - latest 0.32 from your Ubuntu repository.
client - latest git-pulled kernel.

dd a file from the cluster to /dev/null and press Ctrl-C. In syslog:

[   12.950114] libceph: mon0 10.5.51.230:6789 connection failed
[   19.971512] libceph: client4119 fsid af9be081-9777-e2cc-8988-ba02fff0f390
[   19.971845] libceph: mon0 10.5.51.230:6789 session established
[   92.891202] libceph: try_read bad con->in_tag = -108
[   92.891258] libceph: osd5 10.5.51.145:6801 protocol error, garbage tag
[  114.508350] libceph: try_read bad con->in_tag = 122
[  114.508406] libceph: osd1 10.5.51.141:6800 protocol error, garbage tag
[  119.077246] libceph: try_read bad con->in_tag = -39
[  119.077301] libceph: osd7 10.5.51.147:6801 protocol error, garbage tag

WBR,
     Fyodor.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-04 23:38           ` Fyodor Ustinov
@ 2011-08-05  1:26             ` Sage Weil
  2011-08-05  6:34               ` Fyodor Ustinov
  0 siblings, 1 reply; 19+ messages in thread
From: Sage Weil @ 2011-08-05  1:26 UTC (permalink / raw)
  To: Fyodor Ustinov; +Cc: ceph-devel

On Fri, 5 Aug 2011, Fyodor Ustinov wrote:
> On 08/04/2011 10:53 PM, Sage Weil wrote:
> > 
> > The current patches are on top of v3.0, but you should be able to rebase
> > the readahead stuff on top of anything reasonably recent.
> > 
> > sage
> 
> As usual.
> cluster - latest 0.32 from your ubuntu rep.
> client - latest git-pulled kernel.
> 
> dd file from cluster to /dev/null and press ctrl-c. In syslog:
> 
> [   12.950114] libceph: mon0 10.5.51.230:6789 connection failed
> [   19.971512] libceph: client4119 fsid af9be081-9777-e2cc-8988-ba02fff0f390
> [   19.971845] libceph: mon0 10.5.51.230:6789 session established
> [   92.891202] libceph: try_read bad con->in_tag = -108
> [   92.891258] libceph: osd5 10.5.51.145:6801 protocol error, garbage tag
> [  114.508350] libceph: try_read bad con->in_tag = 122
> [  114.508406] libceph: osd1 10.5.51.141:6800 protocol error, garbage tag
> [  119.077246] libceph: try_read bad con->in_tag = -39
> [  119.077301] libceph: osd7 10.5.51.147:6801 protocol error, garbage tag

Hmm, this is something new.  Can you confirm which commit you're running?

Have you seen this before?  It may be in the batch of stuff on top of 
3.0.

sage

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-05  1:26             ` Sage Weil
@ 2011-08-05  6:34               ` Fyodor Ustinov
  2011-08-05 16:07                 ` Sage Weil
  0 siblings, 1 reply; 19+ messages in thread
From: Fyodor Ustinov @ 2011-08-05  6:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 08/05/2011 04:26 AM, Sage Weil wrote:
> On Fri, 5 Aug 2011, Fyodor Ustinov wrote:
>> On 08/04/2011 10:53 PM, Sage Weil wrote:
>>> The current patches are on top of v3.0, but you should be able to rebase
>>> the readahead stuff on top of anything reasonably recent.
>>>
>>> sage
>> As usual.
>> cluster - latest 0.32 from your ubuntu rep.
>> client - latest git-pulled kernel.
>>
>> dd file from cluster to /dev/null and press ctrl-c. In syslog:
>>
>> [   12.950114] libceph: mon0 10.5.51.230:6789 connection failed
>> [   19.971512] libceph: client4119 fsid af9be081-9777-e2cc-8988-ba02fff0f390
>> [   19.971845] libceph: mon0 10.5.51.230:6789 session established
>> [   92.891202] libceph: try_read bad con->in_tag = -108
>> [   92.891258] libceph: osd5 10.5.51.145:6801 protocol error, garbage tag
>> [  114.508350] libceph: try_read bad con->in_tag = 122
>> [  114.508406] libceph: osd1 10.5.51.141:6800 protocol error, garbage tag
>> [  119.077246] libceph: try_read bad con->in_tag = -39
>> [  119.077301] libceph: osd7 10.5.51.147:6801 protocol error, garbage tag
> Hmm, this is something new.  Can you confirm which commit you're running?
Well, in more detail.

1. Cluster: 8 physical servers running 14 OSD servers (fs - xfs) + 1
physical server with mon+mds. Ceph version - 0.32 from the repository on
all servers and clients.
2. Fresh ceph fs. (Really fresh - I made this fs from scratch.)
3. One client via cfuse is slowly filling the cluster with data (7T).
Really slowly (about 1G per minute).

But we are talking about another client.

The kernel for this client was git-pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git (it's
the latest kernel).
On the client, ceph is mounted via fstab:

10.5.51.230:/dcvolia/bacula /bacula ceph _netdev,rw 0 0

Now, a demonstration:

root@amanda:/bacula/archive/zab.servers.dcv# cd /bacula/archive/zab.servers.dcv
root@amanda:/bacula/archive/zab.servers.dcv# ls -alh
total 100G
drwxr-xr-x 1 bacula tape 100G 2011-07-31 00:05 .
drwxr-xr-x 1 bacula tape 253G 2011-07-18 15:21 ..
-rw-r----- 1 bacula tape  23G 2011-08-05 00:40 zab.servers.dcv-daily-20110719-000519
-rw-r----- 1 bacula tape  28G 2011-07-25 00:39 zab.servers.dcv-daily-20110719-003333
-rw-r----- 1 bacula tape  32G 2011-08-01 00:42 zab.servers.dcv-daily-20110726-000515
-rw-r----- 1 bacula tape 6.2G 2011-07-18 12:29 zab.servers.dcv-monthly-20110718-111036
-rw-r----- 1 bacula tape 6.1G 2011-07-24 01:22 zab.servers.dcv-weekly-20110724-000518
-rw-r----- 1 bacula tape 6.1G 2011-07-31 01:22 zab.servers.dcv-weekly-20110731-000522
root@amanda:/bacula/archive/zab.servers.dcv# dd if=zab.servers.dcv-daily-20110719-000519 of=/dev/null bs=8M
^C34+1 records in
34+0 records out
285212672 bytes (285 MB) copied, 5.04607 s, 56.5 MB/s

[24983.180068] libceph: get_reply unknown tid 6215 from osd6

root@amanda:/bacula/archive/zab.servers.dcv# dd if=zab.servers.dcv-daily-20110719-000519 of=/dev/null bs=8M
^C24+1 records in
24+0 records out
201326592 bytes (201 MB) copied, 2.4007 s, 83.9 MB/s

[25035.656266] libceph: get_reply unknown tid 7025 from osd1

root@amanda:/bacula/archive/zab.servers.dcv# dd if=zab.servers.dcv-daily-20110719-000519 of=/dev/null bs=8M
^C130+1 records in
130+0 records out
1090519040 bytes (1.1 GB) copied, 14.9645 s, 72.9 MB/s

root@amanda:/bacula/archive/zab.servers.dcv#

[25088.452033] libceph: try_read bad con->in_tag = 106
[25088.452087] libceph: osd13 10.5.51.146:6800 protocol error, garbage tag

root@amanda:/bacula/archive/zab.servers.dcv# dd if=zab.servers.dcv-daily-20110719-000519 of=/dev/null bs=8M
^C104+1 records in
104+0 records out
872415232 bytes (872 MB) copied, 10.5863 s, 82.4 MB/s

[25166.344264] libceph: try_read bad con->in_tag = 122
[25166.344317] libceph: osd4 10.5.51.144:6800 protocol error, garbage tag

and so on.


> Have you seen this before?
Never.
> It may be in the batch of stuff on top of
> 3.0.
>
Maybe.

BTW, I do not see a dramatic increase in read speed. :(

WBR,
     Fyodor.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-05  6:34               ` Fyodor Ustinov
@ 2011-08-05 16:07                 ` Sage Weil
  2011-08-05 19:30                   ` Fyodor Ustinov
  2011-08-06 11:03                   ` Fyodor Ustinov
  0 siblings, 2 replies; 19+ messages in thread
From: Sage Weil @ 2011-08-05 16:07 UTC (permalink / raw)
  To: Fyodor Ustinov; +Cc: ceph-devel

On Fri, 5 Aug 2011, Fyodor Ustinov wrote:
> On 08/05/2011 04:26 AM, Sage Weil wrote:
> > On Fri, 5 Aug 2011, Fyodor Ustinov wrote:
> > > On 08/04/2011 10:53 PM, Sage Weil wrote:
> > > > The current patches are on top of v3.0, but you should be able to rebase
> > > > the readahead stuff on top of anything reasonably recent.
> > > > 
> > > > sage
> > > As usual.
> > > cluster - latest 0.32 from your ubuntu rep.
> > > client - latest git-pulled kernel.
> > > 
> > > dd file from cluster to /dev/null and press ctrl-c. In syslog:
> > > 
> > > [   12.950114] libceph: mon0 10.5.51.230:6789 connection failed
> > > [   19.971512] libceph: client4119 fsid
> > > af9be081-9777-e2cc-8988-ba02fff0f390
> > > [   19.971845] libceph: mon0 10.5.51.230:6789 session established
> > > [   92.891202] libceph: try_read bad con->in_tag = -108
> > > [   92.891258] libceph: osd5 10.5.51.145:6801 protocol error, garbage tag
> > > [  114.508350] libceph: try_read bad con->in_tag = 122
> > > [  114.508406] libceph: osd1 10.5.51.141:6800 protocol error, garbage tag
> > > [  119.077246] libceph: try_read bad con->in_tag = -39
> > > [  119.077301] libceph: osd7 10.5.51.147:6801 protocol error, garbage tag
> > Hmm, this is something new.  Can you confirm which commit you're running?
> Well. More detailed.
> 
> 1. Cluster: 8 physical servers with 14 osd servers (fs - xfs) + 1 physical
> server with mon+mds. Ceph version - 0.32 from repository on all servers and
> clients.
> 2. Fresh ceph fs. (Really fresh - I made this fs from scratch)
> 3. One client via cfuse slowly fills the cluster by some data (7T). Really
> slowly (about 1G in minute).
> 
> But we are talking about another client.
> 
> Kernel for this client git pulled from
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git (it's latest
> kernel).

This is the problem.  The readahead patches are in the master branch of 
git://ceph.newdream.net/git/ceph-client.git.  They're not upstream yet.  
Sorry that wasn't clear!

> On client ceph mounted via fstab:
> 
> 10.5.51.230:/dcvolia/bacula /bacula ceph _netdev,rw 0 0
> 
> Now make show:
> 
> root@amanda:/bacula/archive/zab.servers.dcv# cd
> /bacula/archive/zab.servers.dcv
> root@amanda:/bacula/archive/zab.servers.dcv# ls -alh
> total 100G
> drwxr-xr-x 1 bacula tape 100G 2011-07-31 00:05 .
> drwxr-xr-x 1 bacula tape 253G 2011-07-18 15:21 ..
> -rw-r----- 1 bacula tape  23G 2011-08-05 00:40
> zab.servers.dcv-daily-20110719-000519
> -rw-r----- 1 bacula tape  28G 2011-07-25 00:39
> zab.servers.dcv-daily-20110719-003333
> -rw-r----- 1 bacula tape  32G 2011-08-01 00:42
> zab.servers.dcv-daily-20110726-000515
> -rw-r----- 1 bacula tape 6.2G 2011-07-18 12:29
> zab.servers.dcv-monthly-20110718-111036
> -rw-r----- 1 bacula tape 6.1G 2011-07-24 01:22
> zab.servers.dcv-weekly-20110724-000518
> -rw-r----- 1 bacula tape 6.1G 2011-07-31 01:22
> zab.servers.dcv-weekly-20110731-000522
> root@amanda:/bacula/archive/zab.servers.dcv# dd
> if=zab.servers.dcv-daily-20110719-000519 of=/dev/null bs=8M
> ^C34+1 records in
> 34+0 records out
> 285212672 bytes (285 MB) copied, 5.04607 s, 56.5 MB/s
> 
> [24983.180068] libceph: get_reply unknown tid 6215 from osd6

This message is normal.  We should probably turn down the debug level, 
or try to detect whether it is expected or not.

> root@amanda:/bacula/archive/zab.servers.dcv# dd
> if=zab.servers.dcv-daily-20110719-000519 of=/dev/null bs=8M
> ^C24+1 records in
> 24+0 records out
> 201326592 bytes (201 MB) copied, 2.4007 s, 83.9 MB/s
> 
> [25035.656266] libceph: get_reply unknown tid 7025 from osd1
> 
> root@amanda:/bacula/archive/zab.servers.dcv# dd
> if=zab.servers.dcv-daily-20110719-000519 of=/dev/null bs=8M
> ^C130+1 records in
> 130+0 records out
> 1090519040 bytes (1.1 GB) copied, 14.9645 s, 72.9 MB/s
> 
> root@amanda:/bacula/archive/zab.servers.dcv#
> 
> [25088.452033] libceph: try_read bad con->in_tag = 106
> [25088.452087] libceph: osd13 10.5.51.146:6800 protocol error, garbage tag

This is not.  I'll open a bug and try to track this one down.  It looks 
new.

Thanks!
sage


> 
> root@amanda:/bacula/archive/zab.servers.dcv# dd
> if=zab.servers.dcv-daily-20110719-000519 of=/dev/null bs=8M
> ^C104+1 records in
> 104+0 records out
> 872415232 bytes (872 MB) copied, 10.5863 s, 82.4 MB/s
> 
> [25166.344264] libceph: try_read bad con->in_tag = 122
> [25166.344317] libceph: osd4 10.5.51.144:6800 protocol error, garbage tag
> 
> and so on.
> 
> 
> > Have you seen this before?
> Never.
> > It may be in the batch of stuff on top of
> > 3.0.
> > 
> May be.
> 
> BTW, dramatically increase read speed I do not see. :(
> 
> WBR,
>     Fyodor.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-05 16:07                 ` Sage Weil
@ 2011-08-05 19:30                   ` Fyodor Ustinov
  2011-08-05 19:35                     ` Gregory Farnum
  2011-08-05 20:17                     ` Sage Weil
  2011-08-06 11:03                   ` Fyodor Ustinov
  1 sibling, 2 replies; 19+ messages in thread
From: Fyodor Ustinov @ 2011-08-05 19:30 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 08/05/2011 07:07 PM, Sage Weil wrote:
>
> This is the problem.  The readahead patches in the master branch of
> git://ceph.newdream.net/git/ceph-client.git.  They're not upstream yet.
> Sorry that wasn't clear!
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=e9852227431a0ed6ceda064f33e4218757acab6c 
- isn't it this patch?

WBR,
     Fyodor.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-05 19:30                   ` Fyodor Ustinov
@ 2011-08-05 19:35                     ` Gregory Farnum
  2011-08-05 20:17                     ` Sage Weil
  1 sibling, 0 replies; 19+ messages in thread
From: Gregory Farnum @ 2011-08-05 19:35 UTC (permalink / raw)
  To: Fyodor Ustinov; +Cc: Sage Weil, ceph-devel

On Fri, Aug 5, 2011 at 12:30 PM, Fyodor Ustinov <ufm@ufm.su> wrote:
> On 08/05/2011 07:07 PM, Sage Weil wrote:
>>
>> This is the problem.  The readahead patches in the master branch of
>> git://ceph.newdream.net/git/ceph-client.git.  They're not upstream yet.
>> Sorry that wasn't clear!
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=e9852227431a0ed6ceda064f33e4218757acab6c
> - it's not this patch?
Nope, that patch essentially just adjusted a preference setting. The
full set of patches is much more extensive.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-05 19:30                   ` Fyodor Ustinov
  2011-08-05 19:35                     ` Gregory Farnum
@ 2011-08-05 20:17                     ` Sage Weil
  2011-08-05 21:12                       ` Fyodor Ustinov
  1 sibling, 1 reply; 19+ messages in thread
From: Sage Weil @ 2011-08-05 20:17 UTC (permalink / raw)
  To: Fyodor Ustinov; +Cc: ceph-devel

On Fri, 5 Aug 2011, Fyodor Ustinov wrote:
> On 08/05/2011 07:07 PM, Sage Weil wrote:
> > 
> > This is the problem.  The readahead patches in the master branch of
> > git://ceph.newdream.net/git/ceph-client.git.  They're not upstream yet.
> > Sorry that wasn't clear!
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=e9852227431a0ed6ceda064f33e4218757acab6c
> - it's not this patch?

Nope, it's ebd62c49c0a71a9af6b92b4f0cedfd2b1d46c16e, in ceph-client.git.  
Then d0a287e18a81a0314a9aa82b6f54eb7f5ecabd60 bumps up the default rasize
window.
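
If you want to check whether the tree you're running already includes
them, a quick sketch:

 $ git branch --contains ebd62c49c0a71a9af6b92b4f0cedfd2b1d46c16e   # async ceph_readpages rewrite
 $ git branch --contains d0a287e18a81a0314a9aa82b6f54eb7f5ecabd60   # rasize default bump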

FWIW I saw a big jump in read speed on my cluster (now fully saturates the 
client interface).

sage

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-05 20:17                     ` Sage Weil
@ 2011-08-05 21:12                       ` Fyodor Ustinov
  2011-08-08 17:52                         ` Fyodor Ustinov
  0 siblings, 1 reply; 19+ messages in thread
From: Fyodor Ustinov @ 2011-08-05 21:12 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 08/05/2011 11:17 PM, Sage Weil wrote:
> On Fri, 5 Aug 2011, Fyodor Ustinov wrote:
>> On 08/05/2011 07:07 PM, Sage Weil wrote:
>>> This is the problem.  The readahead patches in the master branch of
>>> git://ceph.newdream.net/git/ceph-client.git.  They're not upstream yet.
>>> Sorry that wasn't clear!
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=e9852227431a0ed6ceda064f33e4218757acab6c
>> - it's not this patch?
> Nope, it's ebd62c49c0a71a9af6b92b4f0cedfd2b1d46c16e, in ceph-client.git.
> Then d0a287e18a81a0314a9aa82b6f54eb7f5ecabd60 bumps up the default rasize
> window.
>
> FWIW I saw a big jump in read spead on my cluster (now fully saturates the
> client interface).
>
> sage
Well. I'll wait for the next kernel. :)

WBR,
     Fyodor.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-05 16:07                 ` Sage Weil
  2011-08-05 19:30                   ` Fyodor Ustinov
@ 2011-08-06 11:03                   ` Fyodor Ustinov
  2011-08-06 19:08                     ` Sage Weil
  1 sibling, 1 reply; 19+ messages in thread
From: Fyodor Ustinov @ 2011-08-06 11:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 08/05/2011 07:07 PM, Sage Weil wrote:
>
> This is not.  I'll open a bug and try to track this one down.  It looks
> new.
In your kernel version I do not see this problem.

WBR,
     Fyodor.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-06 11:03                   ` Fyodor Ustinov
@ 2011-08-06 19:08                     ` Sage Weil
  0 siblings, 0 replies; 19+ messages in thread
From: Sage Weil @ 2011-08-06 19:08 UTC (permalink / raw)
  To: Fyodor Ustinov; +Cc: ceph-devel

On Sat, 6 Aug 2011, Fyodor Ustinov wrote:
> On 08/05/2011 07:07 PM, Sage Weil wrote:
> > 
> > This is not.  I'll open a bug and try to track this one down.  It looks
> > new.
> In yours kernel version I do not see this trouble.

Oh, this might have been the bug Jim was seeing a few weeks back, fixed by 
0da5d70369e87f80adf794080cfff1ca15a34198 (merged into 3.0-rc1).  In any 
case, if you see this again with the current code, let us know!

sage

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-05 21:12                       ` Fyodor Ustinov
@ 2011-08-08 17:52                         ` Fyodor Ustinov
  2011-08-08 19:14                           ` Sage Weil
  0 siblings, 1 reply; 19+ messages in thread
From: Fyodor Ustinov @ 2011-08-08 17:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 08/06/2011 12:12 AM, Fyodor Ustinov wrote:
> On 08/05/2011 11:17 PM, Sage Weil wrote:
>> On Fri, 5 Aug 2011, Fyodor Ustinov wrote:
>>> On 08/05/2011 07:07 PM, Sage Weil wrote:
>>>> This is the problem.  The readahead patches in the master branch of
>>>> git://ceph.newdream.net/git/ceph-client.git.  They're not upstream 
>>>> yet.
>>>> Sorry that wasn't clear!
>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=e9852227431a0ed6ceda064f33e4218757acab6c 
>>>
>>> - it's not this patch?
>> Nope, it's ebd62c49c0a71a9af6b92b4f0cedfd2b1d46c16e, in ceph-client.git.
>> Then d0a287e18a81a0314a9aa82b6f54eb7f5ecabd60 bumps up the default 
>> rasize
>> window.
>>
>> FWIW I saw a big jump in read spead on my cluster (now fully 
>> saturates the
>> client interface).
>>
>> sage
> Well. I wait net kernel. :)
Sage, 3.1-rc1 has been released. Does this release have the necessary patches?

WBR,
     Fyodor.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-08 17:52                         ` Fyodor Ustinov
@ 2011-08-08 19:14                           ` Sage Weil
  0 siblings, 0 replies; 19+ messages in thread
From: Sage Weil @ 2011-08-08 19:14 UTC (permalink / raw)
  To: Fyodor Ustinov; +Cc: ceph-devel

On Mon, 8 Aug 2011, Fyodor Ustinov wrote:
> On 08/06/2011 12:12 AM, Fyodor Ustinov wrote:
> > On 08/05/2011 11:17 PM, Sage Weil wrote:
> > > On Fri, 5 Aug 2011, Fyodor Ustinov wrote:
> > > > On 08/05/2011 07:07 PM, Sage Weil wrote:
> > > > > This is the problem.  The readahead patches in the master branch of
> > > > > git://ceph.newdream.net/git/ceph-client.git.  They're not upstream
> > > > > yet.
> > > > > Sorry that wasn't clear!
> > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=e9852227431a0ed6ceda064f33e4218757acab6c 
> > > > - it's not this patch?
> > > Nope, it's ebd62c49c0a71a9af6b92b4f0cedfd2b1d46c16e, in ceph-client.git.
> > > Then d0a287e18a81a0314a9aa82b6f54eb7f5ecabd60 bumps up the default rasize
> > > window.
> > > 
> > > FWIW I saw a big jump in read spead on my cluster (now fully saturates the
> > > client interface).
> > > 
> > > sage
> > Well. I wait net kernel. :)
> Sage, 3.1-rc1 released. This release has the necessary patches?

Not readahead, no.  I didn't have the patches done and tested in time.  
That'll go into 3.2.

sage

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: read performance not perfect
  2011-08-04 15:51     ` Sage Weil
  2011-08-04 19:36       ` Fyodor Ustinov
@ 2011-08-09  3:56       ` huang jun
  1 sibling, 0 replies; 19+ messages in thread
From: huang jun @ 2011-08-09  3:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,
We ran a test recently with 5 OSDs on v0.30; the OS is Linux 2.6.39.
The read speed increased to 79MB/s on the first read,
and the average rose to 85MB/s~90MB/s, about twice our former result; it
improves read performance very much.
But we don't know whether it lives up to your expectations.

2011/8/4 Sage Weil <sage@newdream.net>:
> Hi,
>
> I've just pushed a wip-readahead branch to ceph-client.git that rewrites
> ceph_readpages (used for readahead) to be fully asynchronous.  This should
> let us take full advantage of whatever the readahead window is.  I'm still
> doing some testing on this end, but things look good so far.
>
> There are two relevant mount options:
>
>  rasize=NN    - max readahead window size (bytes)
>  rsize=MM     - max read size
>
> rsize defaults to 0 (no limit), which means it effectively maxes out at
> the stripe size (one object, 4MB by default).
>
> rasize now defaults to 8 MB.  This is probably what you'll want to
> experiment with.  In practice I think something on the order of 8-12 MB
> will be best, as it will start loading things of disk ~2 objects ahead of
> the current position.
>
> Can you give it a go and see if this helps in your environment?
>
> Thanks!
> sage
>
>
> On Tue, 19 Jul 2011, huang jun wrote:
>> thanks for you reply
>> now we find two points confused us:
>> 1) the kernel client execute sequence read though aio_read function,
>> but from OSD log,
>>    the dispatch_queue length in OSD is always 0, it means OSD can't
>> got next READ message until client send to it. It seems that
>> async_read changes to sync_read, OSD can't parallely read data, so can
>> not make the most of  resources.What are the original purposes when
>> you design this part? perfect realiablity?
>
> Right.  The old ceph_readpages was synhronous, which slowed things down in
> a couple of different ways.
>
>> 2) In singleness read circumstance,during OSD read data from it disk,
>> the OSD doesn't do anything but to wait it finish.We think it was the
>> result of 1), OSD have nothing to do,so just to wait.
>>
>>
>> 2011/7/19 Sage Weil <sage@newdream.net>:
>> > On Mon, 18 Jul 2011, huang jun wrote:
>> >> hi,all
>> >> We test ceph's read performance last week, and find something weird
>> >> we use ceph v0.30 on linux 2.6.37
>> >> mount ceph on back-platform consist of 2 osds \1 mon \1 mds
>> >> $mount -t ceph 192.168.1.103:/ /mnt -vv
>> >> $ dd if=/dev/zero of=/mnt/test bs=4M count=200
>> >> $ cd .. && umount /mnt
>> >> $mount -t ceph 192.168.1.103:/ /mnt -vv
>> >> $dd if=test of=/dev/zero bs=4M
>> >>   200+0 records in
>> >>   200+0 records out
>> >>   838860800 bytes (839 MB) copied, 16.2327 s, 51.7 MB/s
>> >> but if we use rados to test it
>> >> $ rados -m 192.168.1.103:6789 -p data bench 60 write
>> >> $ rados -m 192.168.1.103:6789 -p data bench 60 seq
>> >>   the result is:
>> >>   Total time run:        24.733935
>> >>   Total reads made:     438
>> >>   Read size:            4194304
>> >>   Bandwidth (MB/sec):    70.834
>> >>
>> >>   Average Latency:       0.899429
>> >>   Max latency:           1.85106
>> >>   Min latency:           0.128017
>> >> this phenomenon attracts our attention, then we begin to analysis the
>> >> osd debug log.
>> >> we find that :
>> >> 1) the kernel client send READ request, at first it requests 1MB, and
>> >> after that it is 512KB
>> >> 2) from rados test cmd log, OSD recept the READ op with 4MB data to handle
>> >> we know the ceph developers pay their attention to read and write
>> >> performance, so i just want to confrim that
>> >> if the communication between the client and OSD spend  more time than
>> >> it should be? can we request  bigger size, just like default object
>> >> size 4MB, when it occurs to READ operation? or this is related to OS
>> >> management, if so, what can we do to promote the performance?
>> >
>> > I think it's related to the way the Linux VFS is doing readahead, and how
>> > the ceph fs code is handling it.  It's issue #1122 in the tracker and I
>> > plan to look at it today or tomorrow!
>> >
>> > Thanks-
>> > sage
>> >
>>
>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2011-08-09  3:56 UTC | newest]

Thread overview: 19+ messages
-- links below jump to the message on this page --
2011-07-18  4:51 read performance not perfect huang jun
2011-07-18 17:14 ` Sage Weil
2011-07-20  0:21   ` huang jun
     [not found]   ` <CABAwU-YKmEC=umFLzDb-ykPbzQ9s3sKoUmQbkumExrXEwyveNA@mail.gmail.com>
2011-08-04 15:51     ` Sage Weil
2011-08-04 19:36       ` Fyodor Ustinov
2011-08-04 19:53         ` Sage Weil
2011-08-04 23:38           ` Fyodor Ustinov
2011-08-05  1:26             ` Sage Weil
2011-08-05  6:34               ` Fyodor Ustinov
2011-08-05 16:07                 ` Sage Weil
2011-08-05 19:30                   ` Fyodor Ustinov
2011-08-05 19:35                     ` Gregory Farnum
2011-08-05 20:17                     ` Sage Weil
2011-08-05 21:12                       ` Fyodor Ustinov
2011-08-08 17:52                         ` Fyodor Ustinov
2011-08-08 19:14                           ` Sage Weil
2011-08-06 11:03                   ` Fyodor Ustinov
2011-08-06 19:08                     ` Sage Weil
2011-08-09  3:56       ` huang jun
