From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wei Liu
Subject: Re: Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"
Date: Wed, 19 Mar 2014 11:35:32 +0000
Message-ID: <20140319113532.GD16807@zion.uk.xensource.com>
References: <20140312154501.GQ19620@zion.uk.xensource.com>
 <1189397636.20140312174729@eikelenboom.it>
 <20140317103524.GH16807@zion.uk.xensource.com>
 <1563694965.20140317233318@eikelenboom.it>
 <20140318120401.GX16807@zion.uk.xensource.com>
 <1744594108.20140318162127@eikelenboom.it>
 <20140318160412.GB16807@zion.uk.xensource.com>
 <1701035622.20140318211402@eikelenboom.it>
 <722971844.20140318221859@eikelenboom.it>
 <1688396550.20140319001104@eikelenboom.it>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
In-Reply-To: <1688396550.20140319001104@eikelenboom.it>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Sander Eikelenboom
Cc: annie li, Paul Durrant, Wei Liu, Zoltan Kiss, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On Wed, Mar 19, 2014 at 12:11:04AM +0100, Sander Eikelenboom wrote:
[...]
> 
> >>> Before Paul's change, we always reserve very large room for an incoming
> >>> SKB. After Paul's change, we only reserve just enough room. Probably
> >>> some extra room prevents this bug being triggered.
> 
> >> [ 599.970745] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506387 vif->rx.sring->req_event:504174
> >> [ 599.972487] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:1 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506387 vif->rx.sring->req_event:506388
> >> [ 600.044322] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 req->gref:165543936 req->id:19 vif->rx.sring->req_event:506388
> >> [ 600.081167] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2 npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 npo->copy_gref:165543936 npo->copy_off:0 MAX_BUFFER_OFFSET:4096 bytes:1168 size:1168 i:6 vif->rx.sring->req_event:506388 estimated_slots_needed:8
> >> [ 600.118268] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 npo->copy_gref:165543936 npo->copy_off:1168 MAX_BUFFER_OFFSET:4096 bytes:1168 size:0 i:7 vif->rx.sring->req_event:506388 estimated_slots_needed:8
> >> [ 600.155378] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570 req->id:11 estimated_slots_needed:8 i(frag): 0
> >> [ 600.192438] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 5 npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570 req->id:11 estimated_slots_needed:8
> >> [ 600.229395] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 2 .. vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 sco->meta_slots_used:8 max_upped_gso:1 skb_is_gso(skb):1 max_slots_needed:8 j:3 is_gso:1 nr_frags:1 firstpart:1 secondpart:6 min_slots_needed:3
> >> [ 600.266518] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 1 .. vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 max_upped_gso:1 skb_is_gso(skb):0 max_slots_needed:1 j:4 is_gso:0 nr_frags:0 firstpart:1 secondpart:0 min_slots_needed:1
> 
> >> It seems to estimate 8 slots and need 8 slots ... however .. shouldn't the queue have been stopped before getting here ..
> 
> > *hrmm please just ignore* .. got to get some sleep i guess ..
> 
> Just went the empirical way .. and unconditionally upped the calculated "max_slots_needed" by one in "xenvif_rx_action" just before checking the "xenvif_rx_ring_slots_available",
> this has prevented all non-fatal and fatal occurrences of "cons > prod" so far. I will leave my tests running overnight, see if it survives the pounding.
> 
> From other net drivers i see different calculations .. seems it is kind of voodoo to determine the value .. one of which (drivers/net/ethernet/marvell/sky2.c)
> seems to unconditionally reserve a slot for both GSO and CKSUM. Don't know if that makes any sense though:
> 
> /* This is the worst case number of transmit list elements for a single skb:
>    VLAN:GSO + CKSUM + Data + skb_frags * DMA */
> #define MAX_SKB_TX_LE (2 + (sizeof(dma_addr_t)/sizeof(u32))*(MAX_SKB_FRAGS+1))
> 

Xen network "wire" format doesn't have a CKSUM metadata type though.

> 
> >>> Wei.
> 
> >>>> 
> >>>> >> The second time it does get to the code after the RING_GET_REQUEST in 'get_next_rx_buffer' and that leads to mayhem ...
> >>>> >> 
> >>>> >> So added a netdev_warn to xenvif_start_xmit for the "stop queue" case .. unfortunately it now triggers net_ratelimit at the end:
> >>>> >> 
> >>>> >> [ 402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:7 vif->rx.sring->req_prod:189228 vif->rx.req_cons:189222
> >>>> 
> >>>> > I think xenvif_rx_ring_slots_available is doing its job. If req_prod -
> >>>> > req_cons < needed, the queue is stopped.
> 
> >> So it seems .. most of the time .. but if i look at the calculation of "min_slots_needed" in this function it seems completely different from the one in
> >> xenvif_rx_action for max_slots_needed .. though both seem to be used for the same thing .. to calculate how many slots the broken-down SKB would need to fit in ..
> >> So i added the calculation method from xenvif_start_xmit to xenvif_rx_action .. in the error case you see min_slots_needed was 3 ..
> >> but max_slots_needed and max_slots_used both were 8.

Those are different estimations. max_slots_needed estimates the worst
case while min_slots_needed estimates the best case. When
min_slots_needed is met, netback puts the SKB into its internal queue.
xenvif_rx_action will then dequeue that packet and check
max_slots_needed; if that is met, it breaks the SKB down.

What I would expect is that, even if they yield different values,
checking ring availability is enough before breaking the SKB down. In
your case above, max_slots_needed was met and the SKB was broken down.
Given that, as you said, bumping max_slots_needed by 1 in your
empirical test prevented the issues, we might have a problem
calculating max_slots_needed. If you can work out what went wrong, that
would be very useful.

Wei.
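For what it's worth, the off-by-one behaviour reported above would be consistent with a worst-case estimate that ignores where a frag's data starts within its page. A minimal sketch of that arithmetic (plain Python, illustrative names and a 4096-byte PAGE_SIZE assumed; this is not the actual xen-netback code):

```python
# Illustrative sketch only: how a per-frag worst-case slot estimate of
# ceil(size / PAGE_SIZE) can undercount by one. If the copy actually
# begins at a nonzero offset within a page, the data straddles one
# extra page boundary and consumes one extra ring slot.

PAGE_SIZE = 4096  # matches MAX_BUFFER_OFFSET:4096 in the logs above

def div_round_up(n, d):
    return -(-n // d)

def estimated_slots(size):
    """Optimistic worst case: ignores where the frag starts in its page."""
    return div_round_up(size, PAGE_SIZE)

def consumed_slots(size, offset):
    """Slots actually consumed when the frag's data begins at `offset`
    within its first page: the data spans [offset, offset + size)."""
    return div_round_up(offset % PAGE_SIZE + size, PAGE_SIZE)

# A page-sized frag starting 100 bytes into a page crosses a boundary:
# the estimate says 1 slot, but 2 are consumed.
print(estimated_slots(4096), consumed_slots(4096, 100))  # prints: 1 2
```

If the real calculation undercounts in this way, unconditionally adding one slot would mask the problem in most cases, which would match the empirical result above without explaining which frag layout triggers it.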