From: Marcus Granado
Subject: Re: [PATCH 3/3] xen/block: add multi-page ring support
Date: Tue, 23 Jun 2015 13:51:09 +0100
Message-ID: <5589563D.8070005@citrix.com>
In-Reply-To: <558762C4.2000002@oracle.com>
To: Bob Liu, Roger Pau Monné
Cc: Rafal Mielniczuk, Jonathan Davies, linux-kernel@vger.kernel.org,
 xen-devel@lists.xen.org, Julien Grall, justing@spectralogic.com,
 Paul Durrant, David Vrabel
List-Id: xen-devel@lists.xenproject.org

On 22/06/15 02:20, Bob Liu wrote:
>
> On 06/09/2015 10:07 PM, Roger Pau Monné wrote:
>> On 09/06/15 at 15.39, Konrad Rzeszutek Wilk wrote:
> ...
>>> Roger, I put them (patches) on devel/for-jens-4.2 on
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
>>>
>>> I think these two patches:
>>> drivers: xen-blkback: delay pending_req allocation to connect_ring
>>> xen/block: add multi-page ring support
>>>
>>> are the only ones that haven't been Acked by you (or maybe they
>>> have and I missed the Ack?)
>>
>> Hello,
>>
>> I was waiting to Ack those because the XenServer storage performance
>> folks found out that these patches cause a performance regression on
>> some of their tests. I'm adding them to the conversation so they can
>> provide more details about the issues they found, and whether we should
>> hold pushing these patches or not.
>>
>
> Hey,
>
> Are there any updates? What's the performance regression problem?
>

Hi,

We spent the last two weeks finishing measurements on the multi-page
ring v5 patches under a range of conditions.

The measurements were obtained under the following conditions:

- blkback as the dom0 backend, with the multi-page ring v5 series
  back-ported to our dom0 kernel 3.10.

- a recent Ubuntu 15.04 kernel 3.19 with the v5 frontend patches
  applied, used as the guest.

- a Micron RealSSD P320h as the underlying local storage, on a Dell
  PowerEdge R720 with two Xeon E5-2643 v2 CPUs.

- fio 2.2.7-22-g36870 as the generator of synthetic load in the guest.
  We used direct I/O to skip caching in the guest and ran fio for 60s
  for a number of block sizes ranging from 512 bytes to 4MiB. We tried
  both pure random and pure sequential reads; random reads were used
  to counteract read-ahead prefetching at the underlying storage.

We noticed that using large (>16) queue depths in fio would saturate
individual vcpus in the guest, so to better utilise the cpu resources
in the guest we chose to (a) fix the queue depth at 4 for each fio
thread, (b) increase the guest vcpus to 16, and (c) vary the number of
fio threads from 1 to 64.

We were interested in observing storage IOPS and throughput for
different values of in-flight requests (= io depth * fio threads)
generated by the guest.
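For concreteness, the sweep can be approximated with a small driver
like the sketch below. This is only an illustration of the methodology,
not the exact job files we ran; the device path, ioengine and the
precise set of block sizes are assumptions.

    # Rough sketch of the fio sweep described above (not our exact jobs).
    # Assumptions: guest block device /dev/xvdb, libaio engine, and this
    # particular subset of block sizes between 512 B and 4 MiB.
    import subprocess

    BLOCK_SIZES = ["512", "4k", "16k", "64k", "256k", "1m", "4m"]
    FIO_THREADS = [1, 4, 8, 16, 32, 64]

    for rw in ("randread", "read"):        # pure random and pure sequential
        for bs in BLOCK_SIZES:
            for jobs in FIO_THREADS:
                subprocess.run([
                    "fio", "--name=mpring-sweep",
                    "--filename=/dev/xvdb",   # assumed guest block device
                    "--ioengine=libaio",
                    "--direct=1",             # bypass the guest page cache
                    "--rw=" + rw,
                    "--bs=" + bs,
                    "--iodepth=4",            # fixed queue depth per fio thread
                    "--numjobs=%d" % jobs,    # in-flight = iodepth * numjobs
                    "--runtime=60", "--time_based",
                    "--group_reporting",
                ], check=True)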
Our expectation was that IOPS and throughput with single-page and
multi-page rings would be the same up to 32 in-flight requests (the
number of requests that fit in a single-page ring), and that the
single-page ring case would then flat-line beyond 32 in-flight
requests, whereas the multi-page ring case would continue to improve
until hitting some other bottleneck. The effect should be more visible
with smaller block sizes, because those measurements are less likely
to be affected by memory-copy delays or large data transfers to
storage.

These are the results we got for the conditions above with 4KiB blocks
and random reads:

fio_threads  io_depth  in_flight  1-page IOPS  8-page IOPS
     1           4          4          19K          19K
     4           4         16          89K          89K
     8           4         32         149K         149K
    16           4         64         131K         198K
    32           4        128         127K         208K
    64           4        256         132K         209K

We believe this data shows a clear improvement from multi-page rings
once there are more than 32 in-flight requests. We observed similar
improvements when writing, and across all small block sizes. For block
sizes >= 16KiB the results were similar between single- and multi-page
rings, and we attribute that to bottlenecks in transferring large
amounts of data that are not present with smaller block sizes.

Another reason for using random reads in the synthetic fio tests above
is that with sequential reads we noticed some anomalies that we
believe would prevent a fair comparison:

(A) In some situations with sequential reads and small block sizes
(<= 4KiB), we observed a decreasing number of merges in the guest
(according to 'iostat -x -m 1') as the number of ring pages increased.
There were no merges whenever in_flight < ring_pages * 32. With larger
numbers of in-flight requests (>= 128) -- visible with both 8
fio_threads x 32 io_depth and 32 fio_threads x 8 io_depth -- storage
throughput with 1 page was around 25% better than with 8 pages. This
is the regression that Roger was talking about earlier in this
discussion. It seems related to request merges occurring much more
frequently with 1 page than with 8 pages. During the measurements, the
average request queue size in iostat always had a value similar to the
number of requests in the ring. I would appreciate potential
explanations of why the guest kernel behaves like that. We believe
this regression is a corner case that would be difficult to spot in a
real-world load, where random reads are interspersed with sequential
reads of many different block sizes and io depths; we only spotted it
because our synthetic fio load used a wide range of parameters with
sequential reads. It may also be specific to the way Linux handles
this situation.

(B) In other situations with sequential reads (block sizes between
8KiB and 128KiB), we observed that storage throughput with 1 page was
around 50% worse than with 8 pages. Again, this seems related to the
presence of merges with 1 page but not with 8 pages, and I would
appreciate potential explanations.

For sequential reads, arguably the performance difference spotted in
(A) is counterbalanced by the performance difference in (B), and they
cancel each other out if all block sizes are considered together. For
random reads, 8-page rings were similar or superior to 1-page rings in
all tested conditions.
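As a side note on where the 32-requests-per-page figure above comes
from, the sketch below reproduces the shared-ring capacity arithmetic.
The entry and header sizes (112 and 64 bytes) are assumptions based on
my reading of the public io/blkif.h and io/ring.h headers for a 64-bit
guest, so the exact values should be double-checked there.

    # Back-of-the-envelope blkif ring capacity, mirroring the
    # power-of-two rounding done by the __RING_SIZE() macro in io/ring.h.
    # Assumed sizes (64-bit guest): 64-byte shared-ring header and
    # 112 bytes per request/response union entry -- check io/blkif.h.

    PAGE_SIZE = 4096
    RING_HEADER = 64
    ENTRY_SIZE = 112

    def ring_slots(nr_pages):
        """Usable ring entries, rounded down to a power of two."""
        raw = (nr_pages * PAGE_SIZE - RING_HEADER) // ENTRY_SIZE
        return 1 << (raw.bit_length() - 1)

    for pages in (1, 2, 4, 8):
        print("%d page(s): %d in-flight requests" % (pages, ring_slots(pages)))
    # 1 page -> 32 slots, 8 pages -> 256 slots, matching the point where
    # the 1-page numbers flat-line in the table above.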
All things considered, we believe that the multi-page ring patches
improve storage performance (apart from case (A)) and therefore should
be good to merge.

Marcus