From: Marcus Granado
Subject: Re: [PATCH 3/3] xen/block: add multi-page ring support
Date: Tue, 23 Jun 2015 13:51:09 +0100
Message-ID: <5589563D.8070005@citrix.com>
In-Reply-To: <558762C4.2000002@oracle.com>
To: Bob Liu, Roger Pau Monné
Cc: Rafal Mielniczuk, Jonathan Davies, linux-kernel@vger.kernel.org,
 xen-devel@lists.xen.org, Julien Grall, justing@spectralogic.com,
 Paul Durrant, David Vrabel
List-Id: xen-devel@lists.xenproject.org

On 22/06/15 02:20, Bob Liu wrote:
>
> On 06/09/2015 10:07 PM, Roger Pau Monné wrote:
>> On 09/06/15 at 15.39, Konrad Rzeszutek Wilk wrote:
> ...
>>> Roger, I put them (patches) on devel/for-jens-4.2 on
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
>>>
>>> I think these two patches:
>>> drivers: xen-blkback: delay pending_req allocation to connect_ring
>>> xen/block: add multi-page ring support
>>>
>>> are the only ones that haven't been Acked by you (or maybe they
>>> have and I missed the Ack?)
>>
>> Hello,
>>
>> I was waiting to Ack those because the XenServer storage performance
>> folks found out that these patches cause a performance regression on
>> some of their tests. I'm adding them to the conversation so they can
>> provide more details about the issues they found, and whether we should
>> hold pushing these patches or not.
>>
>
> Hey,
>
> Are there any updates? What's the performance regression problem?
>

Hi,

We spent the last two weeks finishing measurements on the multi-page
ring v5 patches under a range of conditions.

The measurements were obtained under the following conditions:

- blkback as the dom0 backend, with the multi-page ring v5 series
  back-ported to our dom0 kernel 3.10.

- a recent Ubuntu 15.04 kernel 3.19 with the v5 frontend patches
  applied, used as the guest.

- a Micron RealSSD P320h as the underlying local storage, on a Dell
  PowerEdge R720 with two Xeon E5-2643 v2 CPUs.

- fio 2.2.7-22-g36870 as the generator of synthetic load in the guest.
  We used direct I/O to skip caching in the guest and ran fio for 60s
  for a number of block sizes ranging from 512 bytes to 4MiB. We tried
  both pure random and pure sequential reads; random reads were used
  to counteract read-ahead prefetching at the underlying storage.

We noticed that using large (>16) queue depths in fio would saturate
individual vcpus in the guest, so to better utilise the cpu resources
in the guest we chose to (a) fix the queue depth at 4 for each fio
thread, (b) increase the guest vcpus to 16, and (c) vary the number of
fio threads from 1 to 64.

We were interested in observing storage IOPS and throughput for
different values of in-flight requests (= io depth * fio threads)
generated by the guest.
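For concreteness, the sweep can be approximated with a small driver
like the sketch below. This is only an illustration of the methodology,
not the exact job files we ran; the device path, ioengine and the
precise set of block sizes are assumptions.

    # Rough sketch of the fio sweep described above (not our exact jobs).
    # Assumptions: guest block device /dev/xvdb, libaio engine, and this
    # particular subset of block sizes between 512 B and 4 MiB.
    import subprocess

    BLOCK_SIZES = ["512", "4k", "16k", "64k", "256k", "1m", "4m"]
    FIO_THREADS = [1, 4, 8, 16, 32, 64]

    for rw in ("randread", "read"):        # pure random and pure sequential
        for bs in BLOCK_SIZES:
            for jobs in FIO_THREADS:
                subprocess.run([
                    "fio", "--name=mpring-sweep",
                    "--filename=/dev/xvdb",   # assumed guest block device
                    "--ioengine=libaio",
                    "--direct=1",             # bypass the guest page cache
                    "--rw=" + rw,
                    "--bs=" + bs,
                    "--iodepth=4",            # fixed queue depth per fio thread
                    "--numjobs=%d" % jobs,    # in-flight = iodepth * numjobs
                    "--runtime=60", "--time_based",
                    "--group_reporting",
                ], check=True)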
Our expectation was that IOPS and throughput with single-page and
multi-page rings would be the same up to 32 in-flight requests (the
number of requests that fit in a single-page ring), and that the
single-page ring case would then flat-line beyond 32 in-flight
requests, whereas the multi-page ring case would continue to improve
until hitting some other bottleneck. The effect should be more visible
with smaller block sizes, because those measurements are less likely
to be affected by memory-copy delays or large data transfers to
storage.

These are the results we got for the conditions above with 4KiB blocks
and random reads:

fio_threads  io_depth  in_flight  1-page IOPS  8-page IOPS
     1           4          4          19K          19K
     4           4         16          89K          89K
     8           4         32         149K         149K
    16           4         64         131K         198K
    32           4        128         127K         208K
    64           4        256         132K         209K

We believe this data shows a clear improvement from multi-page rings
once there are more than 32 in-flight requests. We observed similar
improvements when writing, and across all small block sizes. For block
sizes >= 16KiB the results were similar between single- and multi-page
rings, and we attribute that to bottlenecks in transferring large
amounts of data that are not present with smaller block sizes.

Another reason for using random reads in the synthetic fio tests above
is that with sequential reads we noticed some anomalies that we
believe would prevent a fair comparison:

(A) In some situations with sequential reads and small block sizes
(<= 4KiB), we observed a decreasing number of merges in the guest
(according to 'iostat -x -m 1') as the number of ring pages increased.
There were no merges whenever in_flight < ring_pages * 32. With larger
numbers of in-flight requests (>= 128) -- visible with both 8
fio_threads x 32 io_depth and 32 fio_threads x 8 io_depth -- storage
throughput with 1 page was around 25% better than with 8 pages. This
is the regression that Roger was talking about earlier in this
discussion. It seems related to request merges occurring much more
frequently with 1 page than with 8 pages. During the measurements, the
average request queue size in iostat always had a value similar to the
number of requests in the ring. I would appreciate potential
explanations of why the guest kernel behaves like that. We believe
this regression is a corner case that would be difficult to spot in a
real-world load, where random reads are interspersed with sequential
reads of many different block sizes and io depths; we only spotted it
because our synthetic fio load used a wide range of parameters with
sequential reads. It may also be specific to the way Linux handles
this situation.

(B) In other situations with sequential reads (block sizes between
8KiB and 128KiB), we observed that storage throughput with 1 page was
around 50% worse than with 8 pages. Again, this seems related to the
presence of merges with 1 page but not with 8 pages, and I would
appreciate potential explanations.

For sequential reads, arguably the performance difference spotted in
(A) is counterbalanced by the performance difference in (B), and they
cancel each other out if all block sizes are considered together. For
random reads, 8-page rings were similar or superior to 1-page rings in
all tested conditions.
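As a side note on where the 32-requests-per-page figure above comes
from, the sketch below reproduces the shared-ring capacity arithmetic.
The entry and header sizes (112 and 64 bytes) are assumptions based on
my reading of the public io/blkif.h and io/ring.h headers for a 64-bit
guest, so the exact values should be double-checked there.

    # Back-of-the-envelope blkif ring capacity, mirroring the
    # power-of-two rounding done by the __RING_SIZE() macro in io/ring.h.
    # Assumed sizes (64-bit guest): 64-byte shared-ring header and
    # 112 bytes per request/response union entry -- check io/blkif.h.

    PAGE_SIZE = 4096
    RING_HEADER = 64
    ENTRY_SIZE = 112

    def ring_slots(nr_pages):
        """Usable ring entries, rounded down to a power of two."""
        raw = (nr_pages * PAGE_SIZE - RING_HEADER) // ENTRY_SIZE
        return 1 << (raw.bit_length() - 1)

    for pages in (1, 2, 4, 8):
        print("%d page(s): %d in-flight requests" % (pages, ring_slots(pages)))
    # 1 page -> 32 slots, 8 pages -> 256 slots, matching the point where
    # the 1-page numbers flat-line in the table above.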
All things considered, we believe that the multi-page ring patches
improve storage performance (apart from case (A)) and therefore should
be good to merge.

Marcus