* The max single write IOPS on single RBD
From: Zhi Zhang @ 2015-12-11 10:00 UTC
  To: sage, ceph-devel

Hi Guys,

We have a small 4-node cluster. Here is the hardware configuration:

11 x 300GB SSDs, 24 cores, and 32GB memory per node.
All nodes are connected over a single 1Gb/s network.

So we have one monitor and 44 OSDs for testing kernel RBD IOPS using
fio. Here are the major fio options:

-direct=1
-rw=randwrite
-ioengine=psync
-size=1000M
-bs=4k
-numjobs=1
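
For reference, here is a sketch of the same setup as a fio job file
(the /dev/rbd0 path is a placeholder for wherever the RBD image is
mapped):

    [rbd-single-writer]
    filename=/dev/rbd0
    direct=1
    rw=randwrite
    ioengine=psync
    size=1000M
    bs=4k
    numjobs=1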

The max IOPS we can achieve for a single writer (numjobs=1) is close to
1000, which means each IO from RBD takes 1.x ms.

From the OSD logs, we can also observe that most osd_ops take 1.x ms,
including op processing, journal writing, replication, etc., before the
commit is sent back to the client.

The network RTT is around 0.04 ms.
Most osd_ops on the primary OSD take around 0.5~0.7 ms, and the journal
write takes ~0.3 ms.
Most osd_repops, including the journal write on the peer OSD, take
around 0.5 ms.
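
Adding these up as if the stages were serial is only a rough sketch,
since the repop to the peer partly overlaps the primary's own journal
write, but the pieces roughly account for the observed latency:

    # Rough per-IO latency accounting using the numbers above.
    client_rtt_ms = 0.04   # client <-> primary round trip
    primary_op_ms = 0.6    # osd_op on primary, midpoint of 0.5~0.7 ms
    repop_ms = 0.5         # osd_repop incl. journal write on the peer
    total_ms = client_rtt_ms + primary_op_ms + repop_ms
    print(total_ms)           # ~1.14 ms per IO
    print(1000 / total_ms)    # ~880 IOPS, close to what we observe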

We even tried modifying the journal to write to the page cache only, but
didn't see a significant improvement. Does this mean this is the best
result we can get for a single write on a single RBD?

Thanks.

-- 
Regards,
Zhi Zhang (David)


* Re: The max single write IOPS on single RBD
From: Ning Yao @ 2015-12-11 11:56 UTC
  To: Zhi Zhang; +Cc: ceph-devel

Currently, yes, until we can improve the OSD code efficiency further.
You can achieve better performance by using the client writeback cache,
if your application allows it.
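
For example, a minimal ceph.conf sketch for the librbd writeback cache
(note these settings apply to librbd clients only; the kernel RBD
driver does not use them):

    [client]
    rbd cache = true
    rbd cache writethrough until flush = true
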
Regards
Ning Yao


2015-12-11 18:00 GMT+08:00 Zhi Zhang <zhang.david2011@gmail.com>:
> [...]
> Does this mean this is the best
> result we can get for a single write on a single RBD?


* Re: The max single write IOPS on single RBD
From: Sage Weil @ 2015-12-11 13:15 UTC
  To: Zhi Zhang; +Cc: ceph-devel

On Fri, 11 Dec 2015, Zhi Zhang wrote:
> [...]
> 
> The max IOPS we can achieve for a single writer (numjobs=1) is close to
> 1000, which means each IO from RBD takes 1.x ms.
> 
> The network RTT is around 0.04 ms.
> Most osd_ops on the primary OSD take around 0.5~0.7 ms, and the journal
> write takes ~0.3 ms.
> Most osd_repops, including the journal write on the peer OSD, take
> around 0.5 ms.
> 
> We even tried modifying the journal to write to the page cache only, but
> didn't see a significant improvement. Does this mean this is the best
> result we can get for a single write on a single RBD?

What version is this?  There have been a few recent changes that will 
reduce the wall clock time spent preparing/processing a request.  There is 
still a fair bit of work to do here, though--the theoretical lower bound 
is the SSD write time + 2x RTT (client <-> primary osd <-> replica osd <-> 
replica ssd).
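
As a rough sketch, plugging in the numbers from your post (the ~0.3 ms
journal write standing in for the SSD write time, RTT ~0.04 ms):

    # Back-of-the-envelope lower bound for a single 4k write.
    ssd_write_ms = 0.3    # reported journal write latency
    rtt_ms = 0.04         # reported network RTT
    lower_bound_ms = ssd_write_ms + 2 * rtt_ms
    print(lower_bound_ms)           # 0.38 ms per IO
    print(1000 / lower_bound_ms)    # ~2600 IOPS theoretical ceiling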

sage



* Re: The max single write IOPS on single RBD
From: Zhi Zhang @ 2015-12-14  3:10 UTC
  To: Sage Weil; +Cc: ceph-devel

On Fri, Dec 11, 2015 at 9:15 PM, Sage Weil <sage@newdream.net> wrote:
> On Fri, 11 Dec 2015, Zhi Zhang wrote:
>> [...]
>
> What version is this?  There have been a few recent changes that will
> reduce the wall clock time spent preparing/processing a request.  There is
> still a fair bit of work to do here, though--the theoretical lower bound
> is the SSD write time + 2x RTT (client <-> primary osd <-> replica osd <->
> replica ssd).
>

The Ceph version is 0.94.1 with a few backports.

I have already seen some related changes. I will try a newer version and
keep you all updated.

Thanks.

> sage
>


* Re: The max single write IOPS on single RBD
From: Jason Dillaman @ 2015-12-14 15:39 UTC
  To: Zhi Zhang; +Cc: Sage Weil, ceph-devel

If you are testing with "iodepth=1", I'd recommend testing with "rbd non blocking aio = false" in your Ceph config file to see if that improves your single-threaded IO performance.
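
For example, a minimal sketch of the change (the [client] section is
the usual place for client-side options; adjust to match your setup):

    [client]
    rbd non blocking aio = false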

-- 

Jason Dillaman 


----- Original Message -----
> From: "Zhi Zhang" <zhang.david2011@gmail.com>
> To: "Sage Weil" <sage@newdream.net>
> Cc: ceph-devel@vger.kernel.org
> Sent: Sunday, December 13, 2015 10:10:58 PM
> Subject: Re: The max single write IOPS on single RBD
> 
> [...]
> 
> The Ceph version is 0.94.1 with a few backports.
> 
> I have already seen some related changes. I will try a newer version and
> keep you all updated.
> 
> Thanks.

