From: Felipe Franciosi
Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct
Date: Thu, 19 Feb 2015 12:06:24 +0000
Message-ID: <9F2C4E7DFB7839489C89757A66C5AD629EDBBA@AMSPEX01CL03.citrite.net>
References: <1423988345-4005-1-git-send-email-bob.liu@oracle.com> <1423988345-4005-5-git-send-email-bob.liu@oracle.com> <54E4CBD1.1000802@citrix.com> <20150218173746.GF8152@l.oracle.com> <9F2C4E7DFB7839489C89757A66C5AD629EB997@AMSPEX01CL03.citrite.net> <54E544CC.4080007@oracle.com> <54E5C444.4050100@citrix.com> <54E5C59F.2060300@citrix.com>
In-Reply-To: <54E5C59F.2060300@citrix.com>
To: David Vrabel, Roger Pau Monne, Bob Liu
Cc: hch@infradead.org, linux-kernel@vger.kernel.org, xen-devel@lists.xen.org, axboe@fb.com, avanzini.arianna@gmail.com
List-Id: xen-devel@lists.xenproject.org

> -----Original Message-----
> From: David Vrabel
> Sent: 19 February 2015 11:15
> To: Roger Pau Monne; Bob Liu; Felipe Franciosi
> Cc: 'Konrad Rzeszutek Wilk'; xen-devel@lists.xen.org; linux-kernel@vger.kernel.org; axboe@fb.com; hch@infradead.org; avanzini.arianna@gmail.com
> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct
>
> On 19/02/15 11:08, Roger Pau Monné wrote:
> > On 19/02/15 at 3.05, Bob Liu wrote:
> >>
> >>
> >> On 02/19/2015 02:08 AM, Felipe Franciosi wrote:
> >>>> -----Original Message-----
> >>>> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> >>>> Sent: 18 February 2015 17:38
> >>>> To: Roger Pau Monne
> >>>> Cc: Bob Liu; xen-devel@lists.xen.org; David Vrabel; linux-kernel@vger.kernel.org; Felipe Franciosi; axboe@fb.com; hch@infradead.org; avanzini.arianna@gmail.com
> >>>> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct
> >>>>
> >>>> On Wed, Feb 18, 2015 at 06:28:49PM +0100, Roger Pau Monné wrote:
> >>>>> On 15/02/15 at 9.18, Bob Liu wrote:
> >>>>> AFAICT you seem to have a list of persistent grants, indirect
> >>>>> pages and a grant table callback for each ring, isn't this
> >>>>> supposed to be shared between all rings?
> >>>>>
> >>>>> I don't think we should be going down that route, or else we can
> >>>>> hoard a large amount of memory and grants.
> >>>>
> >>>> It does remove the lock that would have to be accessed by each ring
> >>>> thread to access those. Those values (grants) can be limited to a
> >>>> smaller value such that the overall number is the same as it was
> >>>> with the previous version. As in:
> >>>> each ring has = MAX_GRANTS / nr_online_cpus().
> >>>>>
> >>>
> >>> We should definitely be concerned with the amount of memory consumed
> >>> on the backend for each plugged virtual disk. We have faced several
> >>> problems in XenServer around this area before; it drastically affects
> >>> VBD scalability per host.
> >>>
> >>
> >> Right, so we have to keep both the lock and the amount of memory
> >> consumed in mind.
> >>
> >>> This makes me think that all the persistent grants work was done as a
> >>> workaround while we were facing several performance problems around
> >>> concurrent grant un/mapping operations.
> >>> Given all the recent submissions made around this (grant ops) area,
> >>> is this something we should perhaps revisit and discuss whether we
> >>> want to continue offering persistent grants as a feature?
> >>>
> >>
> >> Agreed, life would be easier if we could remove the persistent grant feature.
> >
> > I was thinking about this yesterday, and IMHO I think we should remove
> > persistent grants now while it's not too entangled; leaving it for
> > later will just make our lives more miserable.
> >
> > While it's true that persistent grants provide a throughput increase
> > by preventing grant table operations and TLB flushes, they have several
> > problems that cannot be avoided:
> >
> > - Memory/grant hoarding: we need to reserve the same amount of
> > memory as the amount of data that we want to have in flight. While
> > this is not so critical for memory, it is for grants, since using too
> > many grants can basically deadlock other PV interfaces. There's no way
> > to avoid this, since it's the design behind persistent grants.
> >
> > - Memcopy: the guest needs to perform a memcopy of all data that goes
> > through blkfront. While not so critical, Felipe found systems where
> > memcopy was more expensive than grant map/unmap in the backend (IIRC
> > those were AMD systems).
> >
> > - Complexity/interactions: when persistent grants were designed, the
> > number of requests was limited to 32 and each request could only
> > contain 11 pages. This meant we had to use 352 pages/grants, which was
> > fine. Now that we have indirect I/O and multiqueue on the horizon this
> > number has gone up by orders of magnitude; I don't think this is
> > viable/useful any more.
> >
> > If Konrad/Bob agree I would like to send a patch to remove persistent
> > grants and then have the multiqueue series rebased on top of that.
>
> I agree with this.
>
> I think we can get better performance/scalability gains with
> improvements to grant table locking and TLB flush avoidance.
>
> David

It doesn't change the fact that persistent grants (as well as the grant copy
implementation we did for tapdisk3) were alternatives that allowed aggregate
storage performance to increase drastically. Before committing to removing
something that allows Xen users to scale their deployments, I think we need
to revisit whether the recent improvements to the whole grant mechanisms
(grant table locking, TLB flushing, batched calls, etc.) are performing as we
would (now) expect.

What I think should be done prior to committing to either direction is a
proper performance assessment of grant mapping vs. persistent grants vs.
grant copy for single and aggregate workloads. We need to test a meaningful
set of host architectures, workloads and storage types. Last year at the
XenDevelSummit, for example, we showed how grant copy scaled better than
persistent grants at the cost of doing the copy on the back end.

I don't mean to propose tests that will delay innovation by weeks or months.
However, it is very easy to find changes that improve this or that synthetic
workload while ignoring the fact that they might damage several (possibly
very realistic) others. I think this is the time to run performance tests
objectively, without trying to dig too much into debugging, and go from there.

Felipe
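
As a rough illustration of the grant accounting discussed in this thread, the
sketch below works through the numbers: the 32-request ring and the 11
segments per request come from Roger's mail, while the indirect-segment
count, the number of rings and the MAX_GRANTS budget are assumptions made up
for the example, not values taken from the blkfront/blkback code.

/*
 * Illustrative grant accounting for the numbers quoted in this thread.
 * Only ring_size and segs_per_req come from the discussion; the other
 * constants are assumptions for the example.
 */
#include <stdio.h>

int main(void)
{
	unsigned int ring_size = 32;      /* in-flight requests per ring (original design) */
	unsigned int segs_per_req = 11;   /* segments per request without indirect descriptors */
	unsigned int indirect_segs = 256; /* assumed segments per request with indirect descriptors */
	unsigned int nr_rings = 8;        /* assumed number of queues/rings */

	unsigned long classic = (unsigned long)ring_size * segs_per_req;
	unsigned long indirect = (unsigned long)ring_size * indirect_segs;
	unsigned long multiqueue = indirect * nr_rings;

	printf("classic:             %lu grants per device\n", classic);    /* 352 */
	printf("indirect:            %lu grants per device\n", indirect);   /* 8192 */
	printf("indirect+multiqueue: %lu grants per device\n", multiqueue); /* 65536 */

	/*
	 * Konrad's suggestion: keep the device-wide total constant by capping
	 * each ring at MAX_GRANTS / number-of-rings.  MAX_GRANTS is a
	 * hypothetical device-wide budget here, set to the single-ring figure.
	 */
	unsigned long max_grants = indirect;
	printf("per-ring cap:        %lu grants\n", max_grants / nr_rings);

	return 0;
}

With these assumed values the per-device reservation grows from 352 grants to
65536, which is the "orders of magnitude" increase Roger describes; the
per-ring cap keeps the device-wide total at whatever budget is chosen.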