Tuesday, March 11, 2014, 4:36:16 PM, you wrote:

> On Tue, Mar 11, 2014 at 02:00:41PM +0100, Sander Eikelenboom wrote:
> [...]
>> >> the issue when using 3.13.6 as a base and ..
>> >> - pull all 3.14 patches from the git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git tree
>> >> - apply paul's commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c xen-netback: improve guest-receive-side flow control"
>> >> - applying annie's v2 patch
>> >> - applying your patch
>> >> as dom0 and using a 3.14-rc5 as domU kernel.
>> >>
>> >> Unfortunately i'm still getting the Bad grant references ..
>> >>
>>
>> > :-( That's bad news.
>>
>> > I guess you always have the same DomU kernel when testing? That means we
>> > can narrow down the bug to netback only.
>>
>> Yes my previous tests (from my previous mail):
>>
>> - First testing a baseline that worked o.k. for several days (3.13.6 for both dom0 and domU)
>> - Testing domU 3.14-rc5 and dom0 3.13.6, this worked ok.
>> - Testing dom0 3.14-rc5 and domU 3.13.6, this failed.
>> - After that took 3.13.6 as base and first applied all the general xen related patches for the dom0 kernel, that works ok.
>> - After that started to apply the netback changes for 3.14 and that failed after the commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c xen-netback: improve guest-receive-side flow control".
>>
>> Also seem to indicate just that, although it could also be something in this netback commit that triggers a latent bug in netfront, can't rule that one out completely.
>>
>> But the trigger is in that commit &&
>> annie's and your patch seem to have no effect at all (on this issue) &&
>> later commits in 3.14 do seem to mask it / make it less likely to trigger, but do not fix it.
>>

> Unfortunately I've stared at the same piece of code for some time but
> don't have immediate clue. Later commits don't look suspicious either.
> I also looked at netfront code, but there's no slot counting change
> between 3.13 and 3.14.
> Do you have some straight setup instructions so that I can try to
> reproduce.

Ok you asked for it .. so here we go .. ;-) :

- Xen-unstable
- DomU:
     - PV guest
     - Debian Wheezy
     - Kernel: 3.14-rc5 (.config see dom0 .config)
     - Running netserver
- Dom0:
     - Debian Wheezy
     - Kernel: 3.13.6 + additional patches (see attached git log and .config)
     - 2 physical NIC's
     - Autoballooning is prevented by using dom0_mem=1536M,max:1536M for xen and mem=1536M for dom0 kernel stanzas in grub
     - vcpu 0 is pinned on pcpu 0 and exclusively for dom0

- Networking:
     - Routed bridge
     - eth0 = internet
     - eth1 = lan 172.16.x.x
     - xen_bridge = bridge for VM's 192.168.1.x
     - iptables NAT and routing
     - attached dom0 and domU ifconfig output
     - attached ethtool -k output for the bridge, vif and guest eth0

Triggering workload:
- Well that's where the problem is :-)
- The Guest has a normal disk and swap (phy/lvm) and shared storage with glusterfs (glusterfs server is on dom0)
- The Guest exposes this storage via webdavs

What triggers it is:
- The Guest runs its rsync of the shared storage to a remote client on the internet
  So this causes traffic from Dom0 to domU (reading storage) .. back from domU to dom0 and via iptables NAT on to eth0 (actual rsync)
  or vice versa when syncing the other way around
- At the same time do a backup from a windows machine from the lan to the webdavs server
  So this causes traffic from eth1 to domU (webdav) .. back from domU to dom0 (writing to the shared storage)

So in essence it is doing quite some netback && netfront stress testing :-)

It seems to only trigger when doing both the rsync and the webdav simultaneously.

I tried my best to emulate any of this with netperf (and multiple instances),
i tried with various (odd) packet sizes and the packet / byte rates transmitted are higher than with the workload above ...
but it doesn't seem to trigger with netperf.

So i don't think it will be easy to replicate ...

Perhaps running through the available logging again ..
and try to answer some questions ... this is just with one guest running kernels as before,
only added debugging to netfront and xen (diffs attached):

Mar 12 02:00:44 backup kernel: [ 496.840646] net eth0: rx->offset: 0, size: 4294967295
Mar 12 02:00:44 backup kernel: [ 496.840665] net eth0: cons:1346005 slots:1 rp:1346013 max:18 err:0 rx->id:212 rx->offset:0 size:4294967295 ref:572 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840680] net eth0: rx->offset: 0, size: 4294967295
Mar 12 02:00:44 backup kernel: [ 496.840687] net eth0: cons:1346005 slots:2 rp:1346013 max:18 err:-22 rx->id:214 rx->offset:0 size:4294967295 ref:657 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840701] net eth0: rx->offset: 0, size: 4294967295
Mar 12 02:00:44 backup kernel: [ 496.840712] net eth0: cons:1346005 slots:3 rp:1346013 max:18 err:-22 rx->id:215 rx->offset:0 size:4294967295 ref:667 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840733] net eth0: rx->offset: 0, size: 4294967295
Mar 12 02:00:44 backup kernel: [ 496.840740] net eth0: cons:1346005 slots:4 rp:1346013 max:18 err:-22 rx->id:216 rx->offset:0 size:4294967295 ref:716 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840757] net eth0: rx->offset: 0, size: 4294967295
Mar 12 02:00:44 backup kernel: [ 496.840764] net eth0: cons:1346005 slots:5 rp:1346013 max:18 err:-22 rx->id:217 rx->offset:0 size:4294967295 ref:755 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840778] net eth0: rx->offset: 0, size: 4294967295
Mar 12 02:00:44 backup kernel: [ 496.840784] net eth0: cons:1346005 slots:6 rp:1346013 max:18 err:-22 rx->id:218 rx->offset:0 size:4294967295 ref:592 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.840801] net eth0: rx->offset: 0, size: 4294967295
Mar 12 02:00:44 backup kernel: [ 496.840807] net eth0: cons:1346005 slots:7 rp:1346013 max:18 err:-22 rx->id:219 rx->offset:0 size:4294967295 ref:633 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.841043] net eth0: rx->offset: 0, size: 4294967295
Mar 12 02:00:44 backup kernel: [ 496.841052] net eth0: cons:1346025 slots:1 rp:1346038 max:18 err:0 rx->id:232 rx->offset:0 size:4294967295 ref:-131941395332491 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:13 cons_changed:1 cons_before:1346024 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.841067] net eth0: rx->offset: 0, size: 4294967295
Mar 12 02:00:44 backup kernel: [ 496.841074] net eth0: cons:1346025 slots:2 rp:1346038 max:18 err:-22 rx->id:234 rx->offset:0 size:4294967295 ref:-131941395332579 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1 cons_before:1346024 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [ 496.841092] net eth0: rx->offset: 0, size: 4294967295
Mar 12 02:00:44 backup kernel: [ 496.841101] net eth0: cons:1346025 slots:3 rp:1346038 max:18 err:-22 rx->id:235 rx->offset:0 size:4294967295 ref:-131941395332408 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1 cons_before:1346024 xennet_get_extras_err:0

(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478146 source_off:0 source_len:4096 op->source.offset:0 op->len:1168
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 source_off:0 source_len:4096 op->source.offset:0 op->len:2476
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282 source_off:0 source_len:4096 op->source.offset:0 op->len:1634
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 source_off:0 source_len:4096 op->source.offset:1634 op->len:1620
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497609 source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1

- Sometimes (but not always) netfront also spits out:
  dev_warn(dev, "Invalid extra type: %d\n", extra->type);
  where the extra type seems to be a random value (seen 196, 31 ..)
- Sometimes (but not always) netfront also spits out:
  dev_warn(dev, "Need more slots\n");
- Sometimes (but not always) netfront also spits out:
  dev_warn(dev, "Missing extra info\n");

First questions that come to my mind:
- Are the warnings netfront spits out the cause of xen reporting the bad grant reference ?
  Or are the Bad grant references Xen is reporting .. causing netfront to spit out the warnings ?
- What is that "if (rx->flags & XEN_NETRXF_extra_info) {}" part in xen-netfront.c doing there, and why does it change cons midway ?
  The commit message from f942dc2552b8bfdee607be867b12a8971bb9cd85 that introduced the if says:

    One major change from xen.git is that the guest transmit path (i.e. what
    looks like receive to netback) has been significantly reworked to remove the
    dependency on the out of tree PageForeign page flag (a core kernel patch which
    enables a per page destructor callback on the final put_page). This page flag
    was used in order to implement a grant map based transmit path (where guest
    pages are mapped directly into SKB frags). Instead this version of netback
    uses grant copy operations into regular memory belonging to the backend
    domain. Reinstating the grant map functionality is something which I would
    like to revisit in the future.

  It *is* using grant copy now .. so should this part have been removed ?
  And could Paul's commit, which seems to be the first to trigger these events, affect this ?

--
Sander

> Wei.

>> > Paul, do you have any idea what might go wrong?
>>
>> > Wei.
>>
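
PS: two of the numbers in the logs above decode in a way that may or may not be meaningful, so here is a small sketch of the arithmetic. It is plain userspace C, not code from netfront, netback or Xen, and the names status_as_u32, split_u32, print_split and struct halves are made up for this mail. It assumes the "size:" value is the signed 16-bit rx->status printed with %u, and it simply splits each "Bad grant reference" value into two 16-bit halves.

```c
#include <stdint.h>
#include <stdio.h>

/* A signed 16-bit status as it appears once promoted to int and
 * printed with %u (assumption: this is what the "size:" field shows). */
uint32_t status_as_u32(int16_t status)
{
    return (uint32_t)(int32_t)status;
}

/* Hypothetical helper type: the two 16-bit halves of a 32-bit value. */
struct halves {
    uint16_t lo;
    uint16_t hi;
};

/* Split a 32-bit value, e.g. a reported grant reference, into halves. */
struct halves split_u32(uint32_t v)
{
    struct halves h;

    h.lo = (uint16_t)(v & 0xffffu);
    h.hi = (uint16_t)(v >> 16);
    return h;
}

/* Print one value in decimal, hex and as its two halves. */
void print_split(uint32_t v)
{
    struct halves h = split_u32(v);

    printf("%u = 0x%08x -> lo16=%u hi16=%u\n",
           (unsigned)v, (unsigned)v, (unsigned)h.lo, (unsigned)h.hi);
}
```

With status -1 (XEN_NETIF_RSP_ERROR) the first helper gives 4294967295, which matches every "size:" in the netfront log, and err:-22 is -EINVAL. Feeding 4325377, 4325379 and 19791875 to split_u32() gives lo16/hi16 of 1/66, 3/66 and 3/302. If I read xen/interface/io/netif.h correctly, an rx request is { u16 id; u16 pad; grant_ref_t gref; } and an rx response is { u16 id; u16 offset; u16 flags; s16 status; } in the same ring slot, so on little-endian x86 a response with flags 1 or 3 and status 66 or 302 would read back as exactly these "grant references". That is only a guess, but it might mean netback is picking up slots that already hold responses instead of fresh requests.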