From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sander Eikelenboom
Subject: Re: Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"
Date: Tue, 11 Mar 2014 00:00:26 +0100
Message-ID: <9610144106.20140311000026@eikelenboom.it>
References: <587238484.20140220121842@eikelenboom.it>
 <5306F2E8.5090509@oracle.com>
 <824074181.20140226101442@eikelenboom.it>
 <59358334.20140226161123@eikelenboom.it>
 <20140227141812.GE16241@zion.uk.xensource.com>
 <529743590.20140227154351@eikelenboom.it>
 <20140227151538.GG16241@zion.uk.xensource.com>
 <1982379440.20140227162655@eikelenboom.it>
 <20140227155726.GI16241@zion.uk.xensource.com>
 <716618617.20140307113321@eikelenboom.it>
 <20140307111929.GL19620@zion.uk.xensource.com>
 <1554992598.20140307125518@eikelenboom.it>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <1554992598.20140307125518@eikelenboom.it>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Paul Durrant
Cc: annie li, Wei Liu, Zoltan Kiss, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

Friday, March 7, 2014, 12:55:18 PM, you wrote:

> Friday, March 7, 2014, 12:19:29 PM, you wrote:

>> On Fri, Mar 07, 2014 at 11:33:21AM +0100, Sander Eikelenboom wrote:
>> [...]
>>>
>>> >> >>
>>> >> >> > My suggestion is, if you have a working base line, you can try to setup
>>> >> >> > different frontend / backend combination to help narrow down the
>>> >> >> > problem.
>>> >> >>
>>> >> >> Will see what i can do after the weekend
>>> >> >>
>>> A small update
>>>
>>> I tried reverting the latest netback / netfront patches .. but to no avail ..
>>> Also tried if i could trigger it somehow by using netperf and generating a lot
>>> of frags (as that would make it more easily reproduceable).
>>> But that was also to no avail .. it seems to only trigger sometimes with my
>>> specific workload.
>>>
>>> So i took a flight forward by trying out Zoltan's series v6
>>> (since it also had changes to the way the network code uses the granttables),
>>> got that running overnight applying the same workload as before and
>>> i haven't triggered anything yet .. looking good so far :-)
>>>

>> Thanks for letting us know. If there's any update don't hesitate to post
>> to xen-devel.

> *sigh* .. it seems posting to xen-devel triggers things ;-)
> back to square one again:

> Guest kernel:
> Mar 7 11:45:29 backup kernel: [49954.928062] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:45:29 backup kernel: [49954.928081] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:45:29 backup kernel: [49954.928086] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:45:29 backup kernel: [49954.928092] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:45:29 backup kernel: [49954.928096] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:45:29 backup kernel: [49954.928101] net eth0: Need more slots
> Mar 7 11:45:29 backup kernel: [49954.928196] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:45:29 backup kernel: [49954.928202] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:45:29 backup kernel: [49954.928206] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:45:29 backup kernel: [49954.928210] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:50:42 backup kernel: [50267.397350] net_ratelimit: 14 callbacks suppressed
> Mar 7 11:50:42 backup kernel: [50267.397366] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:50:42 backup kernel: [50267.397372] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:50:42 backup kernel: [50267.397377] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:50:42 backup kernel: [50267.397381] net eth0: rx->offset: 0, size: 4294967295
> Mar 7 11:50:42 backup kernel: [50267.397386] net eth0: rx->offset: 0, size: 4294967295

> Xen:
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 20316163
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 4325377
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 6684675
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 13238275
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 20054019
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 4325377
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 3538945
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 3538945
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 3538945
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 3538945
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 3538945
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 4325377
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 7471105
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 4325377
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 4325377
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 107085839
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 107085839
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 268435460
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 268435460
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 268435460
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 268435460
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 268435460
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 268435460
> (XEN) [2014-03-07 10:45:29] grant_table.c:1857:d0v3 Bad grant reference 268435460
> (XEN) [2014-03-07 10:50:42] grant_table.c:1857:d0v4 Bad grant reference 4325379

> Will be testing
> 3.13 vanilla .. see how that works out and if there is a baseline somewhere.

>> Wei.

Hi Paul,

It seems a commit by you:
"ca2f09f2b2c6c25047cfc545d057c4edfcfe561c xen-netback: improve guest-receive-side flow control"
is the first that gives the Bad grant references.

It seems later patches partly prevent or mask the issue, so it is harder to trigger.
With only this commit applied I can trigger it quite fast.

This is the result of:
- First testing a baseline that worked OK for several days (3.13.6 for both dom0 and domU).
- Testing domU 3.14-rc5 and dom0 3.13.6: this worked OK.
- Testing dom0 3.14-rc5 and domU 3.13.6: this failed.
- After that, taking 3.13.6 as the base and first applying all the general Xen-related patches for the dom0 kernel: that works OK.
- After that, applying the netback changes for 3.14 one by one: that failed after the commit stated above.

So I'm quite confident I'm reporting the right thing now :-)
If you would like me to run debug patches on top of this commit, don't hesitate to send them!

--
Sander
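
A note on reading the numbers in the quoted logs: the guest's "size: 4294967295" is 0xFFFFFFFF, which is what -1 looks like when printed as an unsigned 32-bit integer. The minimal Python sketch below shows that conversion and also prints the logged grant references in hex, split into 16-bit halves. The helper names are hypothetical and not taken from any Xen or kernel tool. The split is only a convenient way to eyeball the values; the logs do not establish that the halves correspond to real structure fields.

```python
def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def split_hi_lo16(value):
    """Return (upper 16 bits, lower 16 bits) of a 32-bit value."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


# Values copied from the quoted guest and Xen logs (distinct values only).
LOGGED_SIZE = 4294967295
BAD_GRANT_REFS = [
    20316163, 4325377, 6684675, 13238275, 20054019,
    3538945, 7471105, 107085839, 268435460, 4325379,
]

if __name__ == "__main__":
    print(f"size {LOGGED_SIZE} = {LOGGED_SIZE:#010x} = {as_signed32(LOGGED_SIZE)} as signed 32-bit")
    for gref in BAD_GRANT_REFS:
        hi, lo = split_hi_lo16(gref)
        print(f"gref {gref:>10} = {gref:#010x}  hi16={hi:<5} lo16={lo}")
```

Every logged reference has a non-zero upper half, so the values look more like misread or stale data than small table indices. That reading is an assumption, not something the logs prove.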