From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sander Eikelenboom
Subject: Re: Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"
Date: Wed, 19 Mar 2014 00:11:04 +0100
Message-ID: <1688396550.20140319001104@eikelenboom.it>
References: <20140312145915.GM19620@zion.uk.xensource.com>
 <309265573.20140312160156@eikelenboom.it>
 <20140312150435.GO19620@zion.uk.xensource.com>
 <1934414370.20140312162003@eikelenboom.it>
 <20140312154501.GQ19620@zion.uk.xensource.com>
 <1189397636.20140312174729@eikelenboom.it>
 <20140317103524.GH16807@zion.uk.xensource.com>
 <1563694965.20140317233318@eikelenboom.it>
 <20140318120401.GX16807@zion.uk.xensource.com>
 <1744594108.20140318162127@eikelenboom.it>
 <20140318160412.GB16807@zion.uk.xensource.com>
 <1701035622.20140318211402@eikelenboom.it>
 <722971844.20140318221859@eikelenboom.it>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <722971844.20140318221859@eikelenboom.it>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Wei Liu, Paul Durrant
Cc: annie li, Zoltan Kiss, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

Hello Sander,

Tuesday, March 18, 2014, 10:18:59 PM, you wrote:

> Tuesday, March 18, 2014, 9:14:02 PM, you wrote:

>> Tuesday, March 18, 2014, 5:04:12 PM, you wrote:

>>> On Tue, Mar 18, 2014 at 04:21:27PM +0100, Sander Eikelenboom wrote:
>>> [...]
>>>>
>>>> Added even more warns ...
>>>>
>>>> [ 297.885969] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:21764 vif->rx.req_cons:21762
>>>> [ 298.760555] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:3 vif->rx.sring->req_prod:22488 vif->rx.req_cons:22486
>>>>
>>>> [ 306.376176] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:30 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28313
>>>> [ 306.376556] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:1 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28313
>>>> [ 306.391863] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:30 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 req->gref:4325377 req->id:153
>>>>
>>>> [ 306.407599] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2 npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 npo->copy_gref:4325377 npo->copy_off:0 MAX_BUFFER_OFFSET:4096 bytes:640 size:640 i:4
>>>> [ 306.423913] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 npo->copy_gref:4325377 npo->copy_off:640 MAX_BUFFER_OFFSET:4096 bytes:640 size:0 i:5
>>>>
>>>> [ 306.440941] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 npo->meta_prod:31 old_meta_prod:25 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 gso_type:1 gso_size:1448 nr_frags:1 req->gref:638 req->id:147
>>>> [ 306.458334] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 gso_type:0 gso_size:0 nr_frags:0
>>>> [ 306.476097] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325377 req->id:154
>>>> [ 306.494462] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 before npo->meta_prod:32 old_meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325377 req->id:154 j:0
>>>> [ 306.513424] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here start npo->meta_prod:32 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 npo->copy_gref:4325377 npo->copy_off:0 MAX_BUFFER_OFFSET:4096 bytes:0 size:66 i:0
>>>> [ 311.390883] net_ratelimit: 317 callbacks suppressed
>>>> [ 311.400901] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:3 vif->rx.sring->req_prod:32386 vif->rx.req_cons:32322
>>>>
>>>> - So in this case we are in the 3rd iteration of the loop in xenvif_gop_frag_copy ...
>>>> - xenvif_start_xmit stops the queue since it has detected it needs one more slot, which is unavailable at that time.

>>> Yes.

>>>> - The current rx thread however doesn't know and doesn't check (neither in the loop in xenvif_gop_frag_copy nor in get_next_rx_buffer) that the ring is full .. so while prod == cons now .. it consumes another one ..

>>> It does check -- but not in xenvif_gop_frag_copy -- see
>>> xenvif_rx_action, which calls xenvif_rx_ring_slots_available before
>>> queueing an skb to be broken down. That is, when you call xenvif_gop_skb
>>> there should be enough room to accommodate that SKB.
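For reference, the check Wei refers to looks roughly like this in the 3.14
sources (paraphrased from my reading of xen-netback, so treat it as a
sketch rather than a verbatim quote):

static bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed)
{
        RING_IDX prod, cons;

        do {
                prod = vif->rx.sring->req_prod;
                cons = vif->rx.req_cons;

                /* Enough unconsumed requests on the ring? */
                if (prod - cons >= needed)
                        return true;

                /* Ask the frontend to notify us when more requests
                 * have been posted.
                 */
                vif->rx.sring->req_event = prod + 1;

                /* Make sure the event is visible before we re-check
                 * prod.
                 */
                mb();
        } while (vif->rx.sring->req_prod != prod);

        return false;
}

So by the time xenvif_gop_skb runs, prod - cons should already have been
checked against the slot estimate for the whole SKB.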
>>>> - That ring request leads to the bad grant references reported by the hypervisor
>>>>
>>>> (XEN) [2014-03-18 15:02:58.928] grant_table.c:1857:d0v2 Bad grant reference 4325377
>>>>
>>>> So should there be a check added there ... or should the callers "xenvif_gop_frag_copy" and the caller of that one "xenvif_gop_skb" already have anticipated that what they were about
>>>> to do wasn't going to fit anyway ?

>>> No, see above.

>>>> And of course .. what made Paul's change trigger this ?

>>> Before Paul's change, we always reserved a very large amount of room for
>>> an incoming SKB. After Paul's change, we only reserve just enough room.
>>> Probably the extra room prevented this bug from being triggered.

>> [ 599.970745] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506387 vif->rx.sring->req_event:504174
>> [ 599.972487] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:1 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506387 vif->rx.sring->req_event:506388
>> [ 600.044322] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 req->gref:165543936 req->id:19 vif->rx.sring->req_event:506388
>> [ 600.081167] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2 npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 npo->copy_gref:165543936 npo->copy_off:0 MAX_BUFFER_OFFSET:4096 bytes:1168 size:1168 i:6 vif->rx.sring->req_event:506388 estimated_slots_needed:8
>> [ 600.118268] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 npo->copy_gref:165543936 npo->copy_off:1168 MAX_BUFFER_OFFSET:4096 bytes:1168 size:0 i:7 vif->rx.sring->req_event:506388 estimated_slots_needed:8
>> [ 600.155378] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570 req->id:11 estimated_slots_needed:8 i(frag): 0
>> [ 600.192438] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 5 npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570 req->id:11 estimated_slots_needed:8
>> [ 600.229395] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 2 .. vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 sco->meta_slots_used:8 max_upped_gso:1 skb_is_gso(skb):1 max_slots_needed:8 j:3 is_gso:1 nr_frags:1 firstpart:1 secondpart:6 min_slots_needed:3
>> [ 600.266518] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 1 .. vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 max_upped_gso:1 skb_is_gso(skb):0 max_slots_needed:1 j:4 is_gso:0 nr_frags:0 firstpart:1 secondpart:0 min_slots_needed:1

>> It seems to estimate 8 slots and need 8 slots ... however .. shouldn't the queue have been stopped before getting here ..

> *hrmm please just ignore* .. got to get some sleep i guess ..

Just went the empirical way .. and unconditionally upped the calculated
"max_slots_needed" by one in "xenvif_rx_action", just before the
"xenvif_rx_ring_slots_available" check. This has prevented all non-fatal
and fatal occurrences of "cons > prod" so far. I will leave my tests
running overnight to see if it survives the pounding.
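In code terms what i did amounts to something like this in xenvif_rx_action
(a sketch against my reading of the 3.14 code, not the literal patch --
only the extra increment is the actual change):

        /* We need a cheap worst case estimate for the number of
         * slots we'll use.
         */
        max_slots_needed = DIV_ROUND_UP(offset_in_page(skb->data) +
                                        skb_headlen(skb), PAGE_SIZE);
        for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
                unsigned int size;

                size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
                max_slots_needed += DIV_ROUND_UP(size, PAGE_SIZE);
        }
        if (skb_is_gso(skb))
                max_slots_needed++;

        /* Empirical workaround: reserve one slot more than calculated,
         * which gives back some of the slack the old code always had.
         */
        max_slots_needed++;

        /* If the skb may not fit then bail out now */
        if (!xenvif_rx_ring_slots_available(vif, max_slots_needed)) {
                skb_queue_head(&vif->rx_queue, skb);
                need_to_notify = true;
                break;
        }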
From other net drivers i see different calculations .. it seems to be kind
of voodoo to determine the value .. one of them
(drivers/net/ethernet/marvell/sky2.c) seems to unconditionally reserve a
slot for both GSO and CKSUM. Don't know if that makes any sense though:

/* This is the worst case number of transmit list elements for a single skb:
   VLAN:GSO + CKSUM + Data + skb_frags * DMA */
#define MAX_SKB_TX_LE   (2 + (sizeof(dma_addr_t)/sizeof(u32))*(MAX_SKB_FRAGS+1))

>>> Wei.

>>>> >> The second time it does get to the code after the RING_GET_REQUEST in 'get_next_rx_buffer' and that leads to mayhem ...
>>>> >>
>>>> >> So i added a netdev_warn to xenvif_start_xmit for the "stop queue" case .. unfortunately it now triggers net_ratelimit at the end:
>>>> >>
>>>> >> [ 402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:7 vif->rx.sring->req_prod:189228 vif->rx.req_cons:189222
>>>>
>>>> > I think xenvif_rx_ring_slots_available is doing its job. If req_prod -
>>>> > req_cons < needed, the queue is stopped.

>> So it seems .. most of the time .. but if i look at the calculation of "min_slots_needed" in that function, it seems completely different from the one in
>> xenvif_rx_action for max_slots_needed .. though both seem to be used for the same thing .. to calculate how many slots the broken-down SKB will need to fit in ..
>> So i added the calculation method from xenvif_start_xmit to xenvif_rx_action .. in the error case you see min_slots_needed was 3 .. but max_slots_needed and meta_slots_used were both 8.
>> The main difference between these calculation methods seems to be that min_slots_needed doesn't take the PAGE_SIZE into account when working out how many slots the frags need.

>> So Paul .. why was the "xenvif_count_skb_slots(vif, skb)" function dropped and replaced by two separate and different calculations ?

>> --
>> Sander

>>>> > Wei.

--
Best regards,
 Sander                                 mailto:linux@eikelenboom.it
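P.S. For comparison, the estimate that xenvif_start_xmit uses to decide
when to stop the queue looks roughly like this (again paraphrased from the
3.14 code, so details may be off):

        /* At best we'll need one slot for the header and one for each
         * frag, plus one more if the skb is GSO.
         */
        min_slots_needed = 1 + skb_shinfo(skb)->nr_frags;
        if (skb_is_gso(skb))
                min_slots_needed++;

So a frag larger than PAGE_SIZE still counts as a single slot here, while
xenvif_rx_action rounds the head and every frag up to whole pages. That is
exactly the difference between min_slots_needed:3 and max_slots_needed:8
in the warnings above.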