From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sander Eikelenboom Subject: Re: Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected" Date: Wed, 12 Mar 2014 15:23:36 +0100 Message-ID: <751560446.20140312152336@eikelenboom.it> References: <1554992598.20140307125518@eikelenboom.it> <9610144106.20140311000026@eikelenboom.it> <20140311101948.GX19620@zion.uk.xensource.com> <191977479.20140311133142@eikelenboom.it> <20140311123807.GB19620@zion.uk.xensource.com> <1068180642.20140311140041@eikelenboom.it> <20140311153616.GE19620@zion.uk.xensource.com> <1751914199.20140312024243@eikelenboom.it> <20140312113522.GG19620@zion.uk.xensource.com> <659214230.20140312124516@eikelenboom.it> <20140312120444.GH19620@zion.uk.xensource.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20140312120444.GH19620@zion.uk.xensource.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Wei Liu Cc: annie li , Paul Durrant , Zoltan Kiss , xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org Wednesday, March 12, 2014, 1:04:44 PM, you wrote: > On Wed, Mar 12, 2014 at 12:45:16PM +0100, Sander Eikelenboom wrote: > [...] >> >> >> >> - Sometimes (but not always) netfront also spits out: >> >> dev_warn(dev, "Invalid extra type: %d\n", extra->type); >> >> where the extra type seems a random value (seen 196, 31 ..) >> >> - Sometimes (but not always) netfront also spits out: >> >> dev_warn(dev, "Need more slots\n"); >> >> - Sometimes (but not always) netfront also spits out: >> >> dev_warn(dev, "Missing extra info\n"); >> >> >> >> > That's because garbage is pushed to frontend. It's trying to decode some >> > random values. >> >> >> >> >> First question that comes to my mind: >> >> - Are the warnings netfront spits out the cause of xen reporting the bad grant reference ? >> >> Or >> >> Are the Bad grant references Xen is reporting .. causing netfront to spit out the warnings ? >> >> > No. They are not directly connected. >> >> > 0. backend breaks down a skb into slots. >> > 1. backend gets a request from the rx ring (a slot with gref provided by >> > frontend), until all slots in skb are handled. >> > 2. backend grant-copies data to that slot (with gref). >> > 3. backend pushes a response to frontend, whose content depends on the >> > result of previous step. >> > 4. frontend handles that response. >> >> > So the grant copy error in hypervisor you're seeing is in 2. The >> > complaint from frontend is in 4, when backend pushes a response with >> > error status in it. The bug is most likely in 0, when the calculation >> > goes wrong - either in the breakdown process, or in the macro that >> > returns usable slots (this can lead to backend thinks there's enough >> > slots while there's not). >> >> Ok so perhaps a debug patch which does more sanity checking there ? >> Because i don't see any errors/warnings from netback in dom0. >> > From the look of the code, 0 looks correct to me. Netback won't complain > about "bad gref" because it has no idea what a "good gref" looks like. > Only Xen has the knowledge whether a gref is legit. All netback sees is > the hypercall fails and it tries to push corresponding response to > netfront. But you can probably safely assume a gref larger than several > thousands be bad. OK it's indeed netback setting the status to -1: [ 976.431585] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 meta_slots_used:1 flags:3 size:958 [ 1030.855057] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 meta_slots_used:2 flags:7 size:2974 [ 1030.861759] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? status:-1 i:0 nr_meta_slots:1 flags:0 size:1460 [ 1030.868499] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 meta_slots_used:1 flags:3 size:958 [ 1030.875278] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 meta_slots_used:1 flags:3 size:2974 [ 1068.199650] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 meta_slots_used:8 flags:7 size:1634 [ 1068.206479] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? status:-1 i:0 nr_meta_slots:7 flags:4 size:2848 [ 1068.213158] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? status:-1 i:1 nr_meta_slots:7 flags:4 size:4096 [ 1068.219701] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? status:-1 i:2 nr_meta_slots:7 flags:4 size:4096 [ 1068.226194] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? status:-1 i:3 nr_meta_slots:7 flags:4 size:4096 [ 1068.232728] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? status:-1 i:4 nr_meta_slots:7 flags:4 size:4096 [ 1068.239163] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? status:-1 i:5 nr_meta_slots:7 flags:4 size:4096 [ 1068.245397] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? status:-1 i:6 nr_meta_slots:7 flags:0 size:1168 [ 1068.251457] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 meta_slots_used:1 flags:3 size:958 Now to find out why .. >> >> - Why is that "if (rx->flags & XEN_NETRXF_extra_info) {}" part in xen-netfront.c doing there and changing cons midway ? >> >> The commit message from f942dc2552b8bfdee607be867b12a8971bb9cd85 that introduced the if says: >> >> >> >> One major change from xen.git is that the guest transmit path (i.e. what >> >> looks like receive to netback) has been significantly reworked to remove >> >> the dependency on the out of tree PageForeign page flag (a core kernel >> >> patch which enables a per page destructor callback on the final >> >> put_page). This page flag was used in order to implement a grant map >> >> based transmit path (where guest pages are mapped directly into SKB >> >> frags). Instead this version of netback uses grant copy operations into >> >> regular memory belonging to the backend domain. Reinstating the grant >> >> map functionality is something which I would like to revisit in the >> >> future. >> >> >> >> It *is* using grant copy now .. so should this part have been removed ? >> >> And >> >> Could Paul's commit that seems to be the first to trigger these events affect this ? >> >> >> >> > This depicts guest *transmit* path, but the issue you're seeing is in >> > guest *receive* path. So that's totally different thing. >> >> Errr yes well if the (rx->flags & XEN_NETRXF_extra_info) {}) should only be doing something on the transmit path .. >> why does it seem to result to true on my issue ? (see cons_changed=1 in the debub output.. and you see cons being cons + 1 after that) > First thing is the response goes wild and probably contains random value > so it's possible the flag is "set". > For a legit response, if you see that flag it means there's an "extra > info" slot in the response. That code snippet is to extract that extra > info slot (could be several in a row, that's why there's a loop in > xennet_get_extras). The request / response format is documented in > include/xen/interface/io/netif.h -- it's only the tx one, but rx one is > more or less the same. > Wei. >> >>