* 4x write amplification?
@ 2013-07-10  2:08 Li Wang
  2013-07-10 17:43 ` Gregory Farnum
  0 siblings, 1 reply; 3+ messages in thread

From: Li Wang @ 2013-07-10 2:08 UTC (permalink / raw)
To: ceph-devel; +Cc: Sage Weil

Hi,
  We did a simple throughput test on Ceph with 2 OSD nodes configured
with a one-replica policy. For each OSD node, the throughput measured by
'dd' run locally is 117 MB/s. In theory, then, the two OSDs could provide
200+ MB/s of aggregate throughput. However, using 'iozone' from clients we
only get a peak throughput of around 40 MB/s. Is that because each write
incurs 2 replica writes * 2 journal writes? That is, writing to the
journal and to the replica each double the traffic, resulting in a total
4x write amplification. Is that true, or is our understanding wrong? Any
performance tuning hints?

Cheers,
Li Wang
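[Editorial note: the 4x hypothesis above can be checked with back-of-envelope arithmetic. The sketch below is a simplified model, not Ceph code; the `expected_client_throughput` helper and the assumption that the journal shares the data disk are illustrative only, with the 117 MB/s and ~40 MB/s figures taken from the message.]

```python
# Back-of-envelope check of the 4x write-amplification hypothesis.
# Model: every client byte is written `replicas` times, and each of
# those writes hits the disk `journal_factor` times (journal + data
# store when the journal lives on the same disk).

def expected_client_throughput(disk_mb_s, n_osds, replicas, journal_factor):
    """Aggregate client throughput if every disk is write-bound."""
    amplification = replicas * journal_factor
    return disk_mb_s * n_osds / amplification

# 2 OSDs at 117 MB/s each, 2 replicas, journal colocated with data (2x):
print(expected_client_throughput(117, n_osds=2, replicas=2, journal_factor=2))
# -> 58.5
```

So 4x amplification alone would cap this cluster near 58 MB/s; the observed ~40 MB/s suggests some additional overhead or a shallow request queue on top of the amplification.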
* Re: 4x write amplification?
  2013-07-10  2:08 4x write amplification? Li Wang
@ 2013-07-10 17:43 ` Gregory Farnum
  2013-07-11  8:23 ` Oleg Krasnianskiy
  0 siblings, 1 reply; 3+ messages in thread

From: Gregory Farnum @ 2013-07-10 17:43 UTC (permalink / raw)
To: Li Wang; +Cc: ceph-devel, Sage Weil

On Tue, Jul 9, 2013 at 7:08 PM, Li Wang <liwang@ubuntukylin.com> wrote:
> Hi,
>   We did a simple throughput test on Ceph with 2 OSD nodes configured
> with a one-replica policy. For each OSD node, the throughput measured
> by 'dd' run locally is 117 MB/s. In theory, then, the two OSDs could
> provide 200+ MB/s of aggregate throughput. However, using 'iozone'
> from clients we only get a peak throughput of around 40 MB/s. Is that
> because each write incurs 2 replica writes * 2 journal writes? That
> is, writing to the journal and to the replica each double the traffic,
> resulting in a total 4x write amplification. Is that true, or is our
> understanding wrong? Any performance tuning hints?

Well, that write amplification is certainly happening. However, I'd
expect to get a better ratio out of the disk than that. A couple of
things to check:
1) Is your benchmark dispatching multiple requests at a time, or just
one? (The latency on a single request is going to make throughput
numbers come out badly.)
2) How do your disks handle two simultaneous write streams? (Most disks
should be fine, but sometimes they struggle.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
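[Editorial note: Greg's first point, that a single outstanding request is latency-bound regardless of disk speed, can be sketched numerically. The model and the 100 ms end-to-end latency below are hypothetical illustrations, not measured values from this thread.]

```python
# With one request in flight, throughput is object_size / latency,
# no matter how fast the disks are. Pipelining more requests raises
# throughput until the disk (or amplification) limit is hit.

def throughput_mb_s(object_mb, latency_s, queue_depth, disk_limit_mb_s):
    """Throughput with `queue_depth` requests in flight, capped by disk."""
    pipelined = queue_depth * object_mb / latency_s
    return min(pipelined, disk_limit_mb_s)

# A 4 MB object with a hypothetical 100 ms end-to-end write latency
# (network round trip + replication + journal commit), against the
# ~58.5 MB/s amplification-limited disk budget:
print(throughput_mb_s(4, 0.1, queue_depth=1, disk_limit_mb_s=58.5))  # 40.0
print(throughput_mb_s(4, 0.1, queue_depth=4, disk_limit_mb_s=58.5))  # 58.5
```

Under these (assumed) numbers a queue depth of one lands right around the ~40 MB/s the original poster observed, which is why Greg asks whether the benchmark dispatches multiple requests at a time.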
* Re: 4x write amplification?
  2013-07-10 17:43 ` Gregory Farnum
@ 2013-07-11  8:23 ` Oleg Krasnianskiy
  0 siblings, 0 replies; 3+ messages in thread

From: Oleg Krasnianskiy @ 2013-07-11 8:23 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Li Wang, ceph-devel, Sage Weil

2013/7/10 Gregory Farnum <greg@inktank.com>:
> On Tue, Jul 9, 2013 at 7:08 PM, Li Wang <liwang@ubuntukylin.com> wrote:
>> Hi,
>>   We did a simple throughput test on Ceph with 2 OSD nodes configured
>> with a one-replica policy. For each OSD node, the throughput measured
>> by 'dd' run locally is 117 MB/s. In theory, then, the two OSDs could
>> provide 200+ MB/s of aggregate throughput. However, using 'iozone'
>> from clients we only get a peak throughput of around 40 MB/s. Is that
>> because each write incurs 2 replica writes * 2 journal writes? That
>> is, writing to the journal and to the replica each double the
>> traffic, resulting in a total 4x write amplification. Is that true,
>> or is our understanding wrong? Any performance tuning hints?
>
> Well, that write amplification is certainly happening. However, I'd
> expect to get a better ratio out of the disk than that. A couple of
> things to check:
> 1) Is your benchmark dispatching multiple requests at a time, or just
> one? (The latency on a single request is going to make throughput
> numbers come out badly.)
> 2) How do your disks handle two simultaneous write streams? (Most
> disks should be fine, but sometimes they struggle.)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com

I have the same problem. My setup:
- 2 machines with several OSDs each
- default replica count (2)
- btrfs on the OSD partitions
- default journal location (a file on an OSD partition), journal size 2 GB
- ceph 0.65
- 1 Gbit network between the machines, shared with the client
A dd benchmark shows ~110 MB/s on an OSD partition. Object writes via
librados: 1 object - 43 MB/s, 3 parallel objects - 66 MB/s. At the same
time I monitor the OSD partitions with iostat; the partitions are loaded
at 85-110 MB/s.

If I shut down one node, writes go to only one partition on one machine.
Same results: the disk is loaded at 85-110 MB/s and I get the same
throughput on the client - 49 MB/s for a single object and 66 MB/s for 3
parallel objects.

We are deciding whether to use ceph in production and we are stuck due to
our weak understanding of this situation.
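[Editorial note: the figures reported above admit a quick sanity check. Two limits are in play: per-disk write amplification (journal as a file on the same partition suggests roughly 2x per node, an assumption) and the shared 1 Gbit link, which must carry both client writes and inter-OSD replication traffic. All throughput figures below come from the message; the model is a sketch.]

```python
# Which limit is being hit: the disks or the shared 1 Gbit link?

GBIT_MB_S = 1000 / 8  # ~125 MB/s theoretical line rate, less in practice

def disk_ratio(disk_mb_s, client_mb_s):
    """Disk bytes written per client byte on one node (from iostat)."""
    return disk_mb_s / client_mb_s

def link_load(client_mb_s):
    """Traffic on the shared link: client writes plus replication
    to the second node, assuming both cross the same 1 Gbit network."""
    return 2 * client_mb_s

# Single-object case: 43 MB/s from the client, ~85-110 MB/s on disk.
print(round(disk_ratio(95, 43), 1))  # ~2.2: consistent with journal+data on one partition
# Three-object case: 66 MB/s from the client.
print(link_load(66), GBIT_MB_S)      # 132 vs ~125: the shared link is saturated
```

Under these assumptions the iostat ratio (~2.2x) matches journal-on-data-disk amplification, and at 66 MB/s the combined client-plus-replication traffic already exceeds the shared gigabit link, so the plateau may be the network rather than the disks.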