From mboxrd@z Thu Jan  1 00:00:00 1970
From: Wei Liu
Subject: Re: Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles
Date: Thu, 27 Feb 2014 15:15:39 +0000
Message-ID: <20140227151538.GG16241@zion.uk.xensource.com>
References: <1772884781.20140218222513@eikelenboom.it>
 <5305CFC6.3080502@oracle.com>
 <587238484.20140220121842@eikelenboom.it>
 <5306F2E8.5090509@oracle.com>
 <824074181.20140226101442@eikelenboom.it>
 <59358334.20140226161123@eikelenboom.it>
 <20140227141812.GE16241@zion.uk.xensource.com>
 <529743590.20140227154351@eikelenboom.it>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
In-Reply-To: <529743590.20140227154351@eikelenboom.it>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Sander Eikelenboom
Cc: annie li, Paul Durrant, Wei Liu, Zoltan Kiss, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On Thu, Feb 27, 2014 at 03:43:51PM +0100, Sander Eikelenboom wrote:
[...]
> > As far as I can tell netfront has a pool of grant references and it
> > will BUG_ON() if there's no grefs in the pool when you request one.
> > Since your DomU didn't crash, I suspect the book-keeping is still
> > intact.
>
> >> > Domain 1 seems to have increased its nr_grant_entries from 2048 to 3072 somewhere this night.
> >> > Domain 7 is the domain that happens to give the netfront messages.
> >>
> >> > I also don't get why it is reporting the "Bad grant reference" for domain 0, which seems to have 0 active entries ..
> >> > Also, is this amount of grant entries "normal", or could it be a leak somewhere?
> >
> > I suppose Dom0 expanding its maptrack is normal. I see that as well
> > when I increase the number of domains. But if it keeps increasing while
> > the number of DomUs stays the same then it is not normal.
>
> It keeps increasing (without (re)starting domains), although eventually it looks like it is settling at around a maptrack size of 31/256 frames.
>

Then I guess that's reasonable. You have 15 DomUs after all...

> > Presumably you only have netfront and blkfront using the grant table,
> > and your workload as described below involved both, so it would be
> > hard to tell which one is faulty.
>
> > There are no immediate functional changes regarding slot counting in
> > this dev cycle for the network driver. But there are some changes to
> > blkfront/back which seem interesting (memory related).
>
> Hmm, all the times I get a "Bad grant reference" are related to that one specific guest.
> And it's not doing much blkback/front I/O (it provides webdav and rsync on top of network-based storage (glusterfs)).
>

OK. I had misunderstood and thought you were rsync'ing from / to your
VM disk. What does webdav do anyway? Does it have a specific traffic
pattern?

> Added some more printk's:
>
> @@ -2072,7 +2076,11 @@ __gnttab_copy(
>                                        &s_frame, &s_pg,
>                                        &source_off, &source_len, 1);
>          if ( rc != GNTST_okay )
> -            goto error_out;
> +            PIN_FAIL(error_out, GNTST_general_error,
> +                     "?!?!? src_is_gref: acquire grant for copy failed current_dom_id:%d src_dom_id:%d dest_dom_id:%d\n",
> +                     current->domain->domain_id, op->source.domid, op->dest.domid);
> +
> +
>          have_s_grant = 1;
>          if ( op->source.offset < source_off ||
>               op->len > source_len )
> @@ -2096,7 +2104,11 @@ __gnttab_copy(
>                            current->domain->domain_id, 0,
>                            &d_frame, &d_pg, &dest_off, &dest_len, 1);
>          if ( rc != GNTST_okay )
> -            goto error_out;
> +            PIN_FAIL(error_out, GNTST_general_error,
> +                     "?!?!? dest_is_gref: acquire grant for copy failed current_dom_id:%d src_dom_id:%d dest_dom_id:%d\n",
> +                     current->domain->domain_id, op->source.domid, op->dest.domid);
> +
> +
>          have_d_grant = 1;
>
> this comes out:
>
> (XEN) [2014-02-27 02:34:37] grant_table.c:2109:d0 ?!?!? dest_is_gref: acquire grant for copy failed current_dom_id:0 src_dom_id:32752 dest_dom_id:7
>

If it fails in gnttab_copy then I very much suspect this is a network
driver problem, as persistent grants in the blk driver don't use grant
copy.
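For reference, PIN_FAIL in xen/common/grant_table.c is (roughly, from
memory, modulo the exact spelling in your tree) just a gdprintk plus a
goto, so your change trades the silent goto for a warning but also
overwrites the original rc with GNTST_general_error before jumping:

    /* Sketch of the macro; it relies on a local "rc" being in scope. */
    #define PIN_FAIL(_lbl, _rc, _f, _a...)          \
        do {                                        \
            gdprintk(XENLOG_WARNING, _f, ## _a );   \
            rc = (_rc);    /* clobbers the rc from __acquire_grant_for_copy */ \
            goto _lbl;                              \
        } while ( 0 )

Also note that src_dom_id 32752 is 0x7ff0, i.e. DOMID_SELF, which is
what you'd expect on this path: the copy source is Dom0's own page and
the failing reference is the destination gref in domain 7.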

> > My suggestion is, if you have a working baseline, you can try to set
> > up different frontend / backend combinations to help narrow down the
> > problem.
>
> Will see what I can do after the weekend.
> Thanks

Wei.
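P.S. re the gref pool I mentioned at the top, the 3.13-era netfront
pattern is roughly the following (a sketch from memory, not the exact
code):

    /* drivers/net/xen-netfront.c: claim a gref from the per-device
     * private free list (np->gref_tx_head). */
    ref = gnttab_claim_grant_reference(&np->gref_tx_head);
    BUG_ON((signed short)ref < 0);  /* empty pool -> -ENOSPC -> BUG */

gnttab_claim_grant_reference() returns -ENOSPC when the private free
list is empty, so an exhausted pool would have taken the DomU down with
a BUG rather than producing "Bad grant reference" in the hypervisor
log. That's why I think the frontend's book-keeping is intact and the
bad reference comes from somewhere else.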