From: Felipe Franciosi
Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct
Date: Thu, 19 Feb 2015 12:06:24 +0000
Message-ID: <9F2C4E7DFB7839489C89757A66C5AD629EDBBA@AMSPEX01CL03.citrite.net>
References: <1423988345-4005-1-git-send-email-bob.liu@oracle.com> <1423988345-4005-5-git-send-email-bob.liu@oracle.com> <54E4CBD1.1000802@citrix.com> <20150218173746.GF8152@l.oracle.com> <9F2C4E7DFB7839489C89757A66C5AD629EB997@AMSPEX01CL03.citrite.net> <54E544CC.4080007@oracle.com> <54E5C444.4050100@citrix.com> <54E5C59F.2060300@citrix.com>
In-Reply-To: <54E5C59F.2060300@citrix.com>
To: David Vrabel, Roger Pau Monne, Bob Liu
Cc: hch@infradead.org, linux-kernel@vger.kernel.org, xen-devel@lists.xen.org, axboe@fb.com, avanzini.arianna@gmail.com
List-Id: xen-devel@lists.xenproject.org

> -----Original Message-----
> From: David Vrabel
> Sent: 19 February 2015 11:15
> To: Roger Pau Monne; Bob Liu; Felipe Franciosi
> Cc: 'Konrad Rzeszutek Wilk'; xen-devel@lists.xen.org; linux-kernel@vger.kernel.org; axboe@fb.com; hch@infradead.org; avanzini.arianna@gmail.com
> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct
>
> On 19/02/15 11:08, Roger Pau Monné wrote:
> > On 19/02/15 at 3.05, Bob Liu wrote:
> >>
> >>
> >> On 02/19/2015 02:08 AM, Felipe Franciosi wrote:
> >>>> -----Original Message-----
> >>>> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> >>>> Sent: 18 February 2015 17:38
> >>>> To: Roger Pau Monne
> >>>> Cc: Bob Liu; xen-devel@lists.xen.org; David Vrabel; linux-kernel@vger.kernel.org; Felipe Franciosi; axboe@fb.com; hch@infradead.org; avanzini.arianna@gmail.com
> >>>> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct
> >>>>
> >>>> On Wed, Feb 18, 2015 at 06:28:49PM +0100, Roger Pau Monné wrote:
> >>>>> On 15/02/15 at 9.18, Bob Liu wrote:
> >>>>> AFAICT you seem to have a list of persistent grants, indirect
> >>>>> pages and a grant table callback for each ring, isn't this
> >>>>> supposed to be shared between all rings?
> >>>>>
> >>>>> I don't think we should be going down that route, or else we can
> >>>>> hoard a large amount of memory and grants.
> >>>>
> >>>> It does remove the lock that would have to be accessed by each ring
> >>>> thread to access those. Those values (grants) can be limited to a
> >>>> smaller value such that the overall number is the same as it was
> >>>> with the previous version. As in:
> >>>> each ring has = MAX_GRANTS / nr_online_cpus().
> >>>>>
> >>>
> >>> We should definitely be concerned with the amount of memory consumed
> >>> on the backend for each plugged virtual disk. We have faced several
> >>> problems in XenServer around this area before; it drastically affects
> >>> VBD scalability per host.
> >>>
> >>
> >> Right, so we have to keep both the lock and the amount of memory
> >> consumed in mind.
> >>
> >>> This makes me think that all the persistent grants work was done as a
> >>> workaround while we were facing several performance problems around
> >>> concurrent grant un/mapping operations.
> >>> Given all the recent submissions made around this (grant ops) area,
> >>> is this something we should perhaps revisit and discuss whether we
> >>> want to continue offering persistent grants as a feature?
> >>>
> >>
> >> Agreed, life would be easier if we could remove the persistent grant feature.
> >
> > I was thinking about this yesterday, and IMHO I think we should remove
> > persistent grants now while it's not too entangled; leaving it for
> > later will just make our lives more miserable.
> >
> > While it's true that persistent grants provide a throughput increase
> > by preventing grant table operations and TLB flushes, they have several
> > problems that cannot be avoided:
> >
> > - Memory/grant hoarding: we need to reserve the same amount of
> > memory as the amount of data that we want to have in flight. While
> > this is not so critical for memory, it is for grants, since using too
> > many grants can basically deadlock other PV interfaces. There's no way
> > to avoid this, since it's the design behind persistent grants.
> >
> > - Memcopy: the guest needs to perform a memcopy of all data that goes
> > through blkfront. While not so critical, Felipe found systems where
> > memcopy was more expensive than grant map/unmap in the backend (IIRC
> > those were AMD systems).
> >
> > - Complexity/interactions: when persistent grants were designed, the
> > number of requests was limited to 32 and each request could only
> > contain 11 pages. This meant we had to use 352 pages/grants, which was
> > fine. Now that we have indirect I/O and multiqueue on the horizon this
> > number has gone up by orders of magnitude; I don't think this is
> > viable/useful any more.
> >
> > If Konrad/Bob agree I would like to send a patch to remove persistent
> > grants and then have the multiqueue series rebased on top of that.
>
> I agree with this.
>
> I think we can get better performance/scalability gains with
> improvements to grant table locking and TLB flush avoidance.
>
> David

It doesn't change the fact that persistent grants (as well as the grant copy
implementation we did for tapdisk3) were alternatives that allowed aggregate
storage performance to increase drastically. Before committing to removing
something that allows Xen users to scale their deployments, I think we need
to revisit whether the recent improvements to the whole grant mechanisms
(grant table locking, TLB flushing, batched calls, etc.) are performing as we
would (now) expect.

What I think should be done prior to committing to either direction is a
proper performance assessment of grant mapping vs. persistent grants vs.
grant copy for single and aggregate workloads. We need to test a meaningful
set of host architectures, workloads and storage types. Last year at the
XenDevelSummit, for example, we showed how grant copy scaled better than
persistent grants at the cost of doing the copy on the back end.

I don't mean to propose tests that will delay innovation by weeks or months.
However, it is very easy to find changes that improve this or that synthetic
workload while ignoring the fact that they might damage several (possibly
very realistic) others. I think this is the time to run performance tests
objectively, without trying to dig too much into debugging, and go from there.

Felipe
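
As a rough illustration of the grant accounting discussed in this thread, the
sketch below works through the numbers: the 32-request ring and the 11
segments per request come from Roger's mail, while the indirect-segment
count, the number of rings and the MAX_GRANTS budget are assumptions made up
for the example, not values taken from the blkfront/blkback code.

/*
 * Illustrative grant accounting for the numbers quoted in this thread.
 * Only ring_size and segs_per_req come from the discussion; the other
 * constants are assumptions for the example.
 */
#include <stdio.h>

int main(void)
{
	unsigned int ring_size = 32;      /* in-flight requests per ring (original design) */
	unsigned int segs_per_req = 11;   /* segments per request without indirect descriptors */
	unsigned int indirect_segs = 256; /* assumed segments per request with indirect descriptors */
	unsigned int nr_rings = 8;        /* assumed number of queues/rings */

	unsigned long classic = (unsigned long)ring_size * segs_per_req;
	unsigned long indirect = (unsigned long)ring_size * indirect_segs;
	unsigned long multiqueue = indirect * nr_rings;

	printf("classic:             %lu grants per device\n", classic);    /* 352 */
	printf("indirect:            %lu grants per device\n", indirect);   /* 8192 */
	printf("indirect+multiqueue: %lu grants per device\n", multiqueue); /* 65536 */

	/*
	 * Konrad's suggestion: keep the device-wide total constant by capping
	 * each ring at MAX_GRANTS / number-of-rings.  MAX_GRANTS is a
	 * hypothetical device-wide budget here, set to the single-ring figure.
	 */
	unsigned long max_grants = indirect;
	printf("per-ring cap:        %lu grants\n", max_grants / nr_rings);

	return 0;
}

With these assumed values the per-device reservation grows from 352 grants to
65536, which is the "orders of magnitude" increase Roger describes; the
per-ring cap keeps the device-wide total at whatever budget is chosen.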