From: Sander Eikelenboom
Subject: Re: Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"
Date: Tue, 18 Mar 2014 16:21:27 +0100
Message-ID: <1744594108.20140318162127@eikelenboom.it>
In-Reply-To: <20140318120401.GX16807@zion.uk.xensource.com>
References: <20140312144826.GK19620@zion.uk.xensource.com>
 <1241369584.20140312154946@eikelenboom.it>
 <20140312145915.GM19620@zion.uk.xensource.com>
 <309265573.20140312160156@eikelenboom.it>
 <20140312150435.GO19620@zion.uk.xensource.com>
 <1934414370.20140312162003@eikelenboom.it>
 <20140312154501.GQ19620@zion.uk.xensource.com>
 <1189397636.20140312174729@eikelenboom.it>
 <20140317103524.GH16807@zion.uk.xensource.com>
 <1563694965.20140317233318@eikelenboom.it>
 <20140318120401.GX16807@zion.uk.xensource.com>
To: Wei Liu
Cc: annie li, Paul Durrant, Zoltan Kiss, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

Tuesday, March 18, 2014, 1:04:01 PM, you wrote:

> On Mon, Mar 17, 2014 at 11:33:18PM +0100, Sander Eikelenboom wrote:
>>
>> Monday, March 17, 2014, 11:35:24 AM, you wrote:
>>
>> > On Wed, Mar 12, 2014 at 05:47:29PM +0100, Sander Eikelenboom wrote:
>> >>
>> >> Wednesday, March 12, 2014, 4:45:01 PM, you wrote:
>> >>
>> >> > On Wed, Mar 12, 2014 at 04:20:03PM +0100, Sander Eikelenboom wrote:
>> >> > [...]
>> >> >>
>> >> >> > Sorry, remove the trailing "S". Actually you only need to look at netback.c.
>> >> >>
>> >> >> What producer index to compare with .. there are quite some RING_GET_REQUESTS .. and I see:
>> >> >> npo->meta_prod
>> >> >> vif->rx.sring->req_prod
>> >> >> vif->pending_prod
>> >> >>
>> >> >> to name a few ..
>> >> >> Any particular RING_GET_REQUESTS call and particular producer index you are interested in?
>> >> >>
>> >> > There are two macros you can use
>> >> > RING_REQUEST_CONS_OVERFLOW and RING_REQUEST_PROD_OVERFLOW.
>> >>
>> >> Ah, I already produced my own .. diff to netback is attached ..
>> >>
>> >> Netback:
>> >> Mar 12 17:41:26 serveerstertje kernel: [ 464.778614] vif vif-7-0 vif7.0: ?!? npo->meta_prod:37 vif->rx.sring->req_prod:431006 vif->rx.req_cons:431007
>> >> Mar 12 17:41:26 serveerstertje kernel: [ 464.786203] vif vif-7-0 vif7.0: ?!? npo->meta_prod:38 vif->rx.sring->req_prod:431006 vif->rx.req_cons:431008
>>
>> > req_prod < req_cons, so there's an overflow. I'm actually curious how
>> > this could happen.
>>
>> > Back to the code: before netback enqueues an SKB to its internal queue, it
>> > will check if there's enough room in the ring. Before Paul's changeset,
>> > it checked against a static number (the possible maximum slots that can
>> > be consumed by an SKB). Paul's changeset made it check against the
>> > actual slots the incoming SKB consumes. See
>> > interface.c:xenvif_start_xmit. Another interesting site would be when
>> > the SKB is broken down later on in the internal queue. See
>> > netback.c:xenvif_rx_action. The routine to break down an SKB is
>> > xenvif_gop_skb.
>>
>> > Although they look alright to me, you might want to instrument them
>> > a bit more to see what triggers that overflow. It's a bit frustrating,
>> > but a bug that cannot be easily reproduced is indeed extremely hard to
>> > fix.
>>
>> [ 554.166914] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:7 vif->rx.sring->req_prod:301750 vif->rx.req_cons:301751
>> [ 571.143788] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823
>> [ 571.150762] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 req->gref:19791875 req->id:22
>> [ 571.164691] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2 npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 npo->copy_gref:19791875
>> [ 571.179474] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 gso_type:1 gso_size:1448 nr_frags:1 req->gref:576 req->id:14
>> [ 571.194844] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 gso_type:0 gso_size:0 nr_frags:0
>> [ 571.211215] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 req->id:23
>> [ 571.228471] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 npo->meta_prod:39 old_meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 req->id:23
>> [ 571.246412] vif vif-7-0 vif7.0: Bad status -3 from copy to DOM7.
>> [ 571.255531] vif vif-7-0 vif7.0: ?!? xenvif_check_gop status err? status:-1 i:7 nr_meta_slots:8 copy_op->status:-3 npo->copy_cons:38
>> [ 571.264804] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 meta_slots_used:8 flags:7 size:1 ip_summed:1 is_gso:1 skb_shinfo(skb)->gso_type:1634 vif->meta[npo.meta_cons].gso_type:3 nr_frags:1 npo.copy_prod:39, npo.meta_cons:30
>>
>> It seems in 'get_next_rx_buffer' you can get the situation that cons > prod, but the first time somehow the code after the actual RING_GET_REQUEST isn't run (why?!? queue stop?) .. and it doesn't lead to an immediate problem ..

> You've rate limited the output. get_next_rx_buffer won't sleep
> whatsoever. Stopping the queue won't affect a running RX kthread. You
> can think of start_xmit and the kthread as running in parallel.
>
> get_next_rx_buffer doesn't check if the ring is overflowed. It's the
> caller's fault. It might be worth checking start_new_rx_buffer.
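
Just to check I understand what you mean by "doesn't check if the ring is
overflowed": the missing guard would be something along these lines, right?
(Rough, untested sketch on top of my instrumented netback.c -- the helper
name and warning text are mine; the ring fields are the ones from the
warnings above.)

/* Untested sketch, not a patch: detect that the RX path is about to
 * consume a request slot the frontend has not produced yet.
 * req_prod is written by the frontend, req_cons is how far netback has
 * consumed; RING_IDX arithmetic is unsigned, so the subtraction also
 * behaves across index wrap-around.
 */
static bool xenvif_rx_ring_exhausted(struct xenvif *vif)
{
        return vif->rx.sring->req_prod - vif->rx.req_cons == 0;
}

and then, right before the RING_GET_REQUEST in get_next_rx_buffer():

        if (xenvif_rx_ring_exhausted(vif))
                netdev_warn(vif->dev,
                            "?!? consuming unproduced rx slot: req_prod:%u req_cons:%u\n",
                            vif->rx.sring->req_prod, vif->rx.req_cons);

Roughly that condition is what the extra warnings below are trying to catch.
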
Added even more warns ...

[ 297.885969] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:21764 vif->rx.req_cons:21762
[ 298.760555] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:3 vif->rx.sring->req_prod:22488 vif->rx.req_cons:22486
[ 306.376176] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:30 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28313
[ 306.376556] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:1 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28313
[ 306.391863] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:30 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 req->gref:4325377 req->id:153
[ 306.407599] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2 npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 npo->copy_gref:4325377 npo->copy_off:0 MAX_BUFFER_OFFSET:4096 bytes:640 size:640 i:4
[ 306.423913] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 npo->copy_gref:4325377 npo->copy_off:640 MAX_BUFFER_OFFSET:4096 bytes:640 size:0 i:5
[ 306.440941] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 npo->meta_prod:31 old_meta_prod:25 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 gso_type:1 gso_size:1448 nr_frags:1 req->gref:638 req->id:147
[ 306.458334] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 gso_type:0 gso_size:0 nr_frags:0
[ 306.476097] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325377 req->id:154
[ 306.494462] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 before npo->meta_prod:32 old_meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325377 req->id:154 j:0
[ 306.513424] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here start npo->meta_prod:32 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 npo->copy_gref:4325377 npo->copy_off:0 MAX_BUFFER_OFFSET:4096 bytes:0 size:66 i:0
[ 311.390883] net_ratelimit: 317 callbacks suppressed
[ 311.400901] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:3 vif->rx.sring->req_prod:32386 vif->rx.req_cons:32322

- So in this case we are in the 3rd iteration of the loop in xenvif_gop_frag_copy ...
- xenvif_start_xmit stops the queue since it has detected it needs one more slot, which is unavailable at that time.
- The current RX thread however doesn't know and doesn't check (neither in the loop in xenvif_gop_frag_copy nor in get_next_rx_buffer) that the ring is full .. while prod == cons now .. it consumes another one ..
- That ring request leads to the bad grant reference reported by the hypervisor:
  (XEN) [2014-03-18 15:02:58.928] grant_table.c:1857:d0v2 Bad grant reference 4325377

So should there be a check added there ... or should the callers "xenvif_gop_frag_copy" and the caller of that one, "xenvif_gop_skb", already have anticipated that what they were about to do wasn't going to fit anyway?
And of course .. how did Paul's change trigger this?
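
To make that question a bit more concrete: the kind of check I have in mind
would sit right where xenvif_gop_frag_copy decides it needs another slot,
roughly like this (untested sketch against my instrumented netback.c;
variable names are the ones from the copy loop there, and what to actually do
when it triggers -- drop the skb, or make xenvif_gop_skb bail out -- is
exactly the open question):

        if (start_new_rx_buffer(npo->copy_off, bytes, *head)) {
                if (vif->rx.sring->req_prod == vif->rx.req_cons) {
                        /* The frontend has not produced the slot we are
                         * about to consume; grabbing it anyway is what ends
                         * in the "Bad grant reference" from the hypervisor.
                         */
                        netdev_warn(vif->dev,
                                    "rx ring exhausted mid-skb: req_prod:%u req_cons:%u\n",
                                    vif->rx.sring->req_prod, vif->rx.req_cons);
                        return; /* or report an error up to xenvif_gop_skb */
                }
                meta = get_next_rx_buffer(vif, npo);
        }
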
>> The second time it does get to the code after the RING_GET_REQUEST in 'get_next_rx_buffer' and that leads to mayhem ...
>>
>> So I added a netdev_warn to xenvif_start_xmit for the "stop queue" case .. unfortunately it now triggers net_ratelimit at the end:
>>
>> [ 402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:7 vif->rx.sring->req_prod:189228 vif->rx.req_cons:189222

> I think xenvif_rx_ring_slots_available is doing its job. If req_prod -
> req_cons < needed, the queue is stopped.
>
> Wei.
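
For reference, my reading of that check (paraphrased, not the literal
xenvif_rx_ring_slots_available code):

/* Paraphrase of the headroom test start_xmit does before queueing an skb;
 * names follow the warnings above, and RING_IDX arithmetic is unsigned, so
 * the subtraction also works across index wrap-around.
 */
static bool rx_slots_available(struct xenvif *vif, int needed)
{
        RING_IDX prod = vif->rx.sring->req_prod;  /* written by the frontend */
        RING_IDX cons = vif->rx.req_cons;         /* netback's consumer index */

        return prod - cons >= needed;
}

That indeed looks fine for the skb being queued at start_xmit time; what
worries me is the skb the RX kthread is already breaking down when the ring
runs dry, which is where the warnings above fire.
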