From mboxrd@z Thu Jan  1 00:00:00 1970
From: Wei Liu
Subject: Re: Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles
Date: Thu, 27 Feb 2014 15:15:39 +0000
Message-ID: <20140227151538.GG16241@zion.uk.xensource.com>
References: <1772884781.20140218222513@eikelenboom.it>
 <5305CFC6.3080502@oracle.com>
 <587238484.20140220121842@eikelenboom.it>
 <5306F2E8.5090509@oracle.com>
 <824074181.20140226101442@eikelenboom.it>
 <59358334.20140226161123@eikelenboom.it>
 <20140227141812.GE16241@zion.uk.xensource.com>
 <529743590.20140227154351@eikelenboom.it>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
In-Reply-To: <529743590.20140227154351@eikelenboom.it>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Sander Eikelenboom
Cc: annie li, Paul Durrant, Wei Liu, Zoltan Kiss, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On Thu, Feb 27, 2014 at 03:43:51PM +0100, Sander Eikelenboom wrote:
[...]
> > As far as I can tell netfront has a pool of grant references and it
> > will BUG_ON() if there's no grefs in the pool when you request one.
> > Since your DomU didn't crash, I suspect the book-keeping is still
> > intact.
>
> >> > Domain 1 seems to have increased its nr_grant_entries from 2048 to 3072 somewhere this night.
> >> > Domain 7 is the domain that happens to give the netfront messages.
> >>
> >> > I also don't get why it is reporting the "Bad grant reference" for domain 0, which seems to have 0 active entries ..
> >> > Also, is this amount of grant entries "normal", or could it be a leak somewhere?
> >
> > I suppose Dom0 expanding its maptrack is normal. I see that as well
> > when I increase the number of domains. But if it keeps increasing while
> > the number of DomUs stays the same then it is not normal.
>
> It keeps increasing (without (re)starting domains), although eventually it looks like it is settling at around a maptrack size of 31/256 frames.
>

Then I guess that's reasonable. You have 15 DomUs after all...

> > Presumably you only have netfront and blkfront using the grant table,
> > and your workload as described below involved both, so it would be
> > hard to tell which one is faulty.
>
> > There are no immediate functional changes regarding slot counting in
> > this dev cycle for the network driver. But there are some changes to
> > blkfront/back which seem interesting (memory related).
>
> Hmm, all the times I get a "Bad grant reference" are related to that one specific guest.
> And it's not doing much blkback/front I/O (it provides webdav and rsync on top of network-based storage (glusterfs)).
>

OK. I had misunderstood and thought you were rsync'ing from / to your
VM disk. What does webdav do anyway? Does it have a specific traffic
pattern?

> Added some more printk's:
>
> @@ -2072,7 +2076,11 @@ __gnttab_copy(
>                                        &s_frame, &s_pg,
>                                        &source_off, &source_len, 1);
>          if ( rc != GNTST_okay )
> -            goto error_out;
> +            PIN_FAIL(error_out, GNTST_general_error,
> +                     "?!?!? src_is_gref: acquire grant for copy failed current_dom_id:%d src_dom_id:%d dest_dom_id:%d\n",
> +                     current->domain->domain_id, op->source.domid, op->dest.domid);
> +
> +
>          have_s_grant = 1;
>          if ( op->source.offset < source_off ||
>               op->len > source_len )
> @@ -2096,7 +2104,11 @@ __gnttab_copy(
>                            current->domain->domain_id, 0,
>                            &d_frame, &d_pg, &dest_off, &dest_len, 1);
>          if ( rc != GNTST_okay )
> -            goto error_out;
> +            PIN_FAIL(error_out, GNTST_general_error,
> +                     "?!?!? dest_is_gref: acquire grant for copy failed current_dom_id:%d src_dom_id:%d dest_dom_id:%d\n",
> +                     current->domain->domain_id, op->source.domid, op->dest.domid);
> +
> +
>          have_d_grant = 1;
>
> this comes out:
>
> (XEN) [2014-02-27 02:34:37] grant_table.c:2109:d0 ?!?!? dest_is_gref: acquire grant for copy failed current_dom_id:0 src_dom_id:32752 dest_dom_id:7
>

If it fails in gnttab_copy then I very much suspect this is a network
driver problem, as persistent grants in the blk driver don't use grant
copy.
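For reference, PIN_FAIL in xen/common/grant_table.c is (roughly, from
memory, modulo the exact spelling in your tree) just a gdprintk plus a
goto, so your change trades the silent goto for a warning but also
overwrites the original rc with GNTST_general_error before jumping:

    /* Sketch of the macro; it relies on a local "rc" being in scope. */
    #define PIN_FAIL(_lbl, _rc, _f, _a...)          \
        do {                                        \
            gdprintk(XENLOG_WARNING, _f, ## _a );   \
            rc = (_rc);    /* clobbers the rc from __acquire_grant_for_copy */ \
            goto _lbl;                              \
        } while ( 0 )

Also note that src_dom_id 32752 is 0x7ff0, i.e. DOMID_SELF, which is
what you'd expect on this path: the copy source is Dom0's own page and
the failing reference is the destination gref in domain 7.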

> > My suggestion is, if you have a working baseline, you can try to set
> > up different frontend / backend combinations to help narrow down the
> > problem.
>
> Will see what I can do after the weekend.
> Thanks

Wei.
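P.S. re the gref pool I mentioned at the top, the 3.13-era netfront
pattern is roughly the following (a sketch from memory, not the exact
code):

    /* drivers/net/xen-netfront.c: claim a gref from the per-device
     * private free list (np->gref_tx_head). */
    ref = gnttab_claim_grant_reference(&np->gref_tx_head);
    BUG_ON((signed short)ref < 0);  /* empty pool -> -ENOSPC -> BUG */

gnttab_claim_grant_reference() returns -ENOSPC when the private free
list is empty, so an exhausted pool would have taken the DomU down with
a BUG rather than producing "Bad grant reference" in the hypervisor
log. That's why I think the frontend's book-keeping is intact and the
bad reference comes from somewhere else.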