Monday, March 17, 2014, 11:35:24 AM, you wrote: > On Wed, Mar 12, 2014 at 05:47:29PM +0100, Sander Eikelenboom wrote: >> >> Wednesday, March 12, 2014, 4:45:01 PM, you wrote: >> >> > On Wed, Mar 12, 2014 at 04:20:03PM +0100, Sander Eikelenboom wrote: >> > [...] >> >> >> >> > Sorry, remove the trailing "S". Actually you only need to look at netback.c. >> >> >> >> What producer index to compare with .. there are quite some RING_GET_REQUESTS .. and i see: >> >> npo->meta_prod >> >> vif->rx.sring->req_prod >> >> vif->pending_prod >> >> >> >> to name a few .. >> >> Any particular RING_GET_REQUESTS call and particular producer index you are interested in ? >> >> >> >> > There are two macros you can use >> >> > RING_REQUEST_CONS_OVERFLOW and RING_REQUEST_PROD_OVERFLOW. >> >> Ah i already produced my own .. diff to netback is attached .. >> >> Netback: >> Mar 12 17:41:26 serveerstertje kernel: [ 464.778614] vif vif-7-0 vif7.0: ?!? npo->meta_prod:37 vif->rx.sring->req_prod:431006 vif->rx.req_cons:431007 >> Mar 12 17:41:26 serveerstertje kernel: [ 464.786203] vif vif-7-0 vif7.0: ?!? npo->meta_prod:38 vif->rx.sring->req_prod:431006 vif->rx.req_cons:431008 > req_prod < req_cons, so there's an overflow. I'm actually curious how > this could happen. > Back to the code, before netback enqueues SKB to its internal queue, it > will check if there's enough room in the ring. Before Paul's changeset, > it checks against a static number (the possible maximum slots that can > be consumed by an SKB). Paul's changeset made it check against the > actual slots the incoming SKB consumes. See > interface.c:xenvif_start_xmit. Another interesting site would be when > the SKB is broken down later on in internal queue. See > netback.c:xenvif_rx_action. The routine to break down an SKB is > xenvif_gop_skb. > Although they look alright to me, but you might want to instrument them > a bit more to see what triggers that overflow. It's a bit frustrating, > but a bug that cannot be easily reproduced is indeed extremely hard to > fix. [ 554.166914] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:7 vif->rx.sring->req_prod:301750 vif->rx.req_cons:301751 [ 571.143788] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 [ 571.150762] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 req->gref:19791875 req->id:22 [ 571.164691] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2 npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 npo->copy_gref:19791875 [ 571.179474] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 gso_type:1 gso_size:1448 nr_frags:1 req->gref:576 req->id:14 [ 571.194844] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 gso_type:0 gso_size:0 nr_frags:0 [ 571.211215] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 req->id:23 [ 571.228471] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 npo->meta_prod:39 old_meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 req->id:23 [ 571.246412] vif vif-7-0 vif7.0: Bad status -3 from copy to DOM7. [ 571.255531] vif vif-7-0 vif7.0: ?!? xenvif_check_gop status err? status:-1 i:7 nr_meta_slots:8 copy_op->status:-3 npo->copy_cons:38 [ 571.264804] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 meta_slots_used:8 flags:7 size:1 ip_summed:1 is_gso:1 skb_shinfo(skb)->gso_type:1634 vif->meta[npo.meta_cons].gso_type:3 nr_frags:1 npo.copy_prod:39, npo.meta_cons:30 It seems in 'get_next_rx_buffer' you can get the situation that cons > prod, but the first time somehow the code after the actual RING_GET_REQUEST isn't ran (why?!? queue stop ?) .. and it doesn't lead to an immediate problem .. The second time it does get to the code after the RING_GET_REQUEST in 'get_next_rx_buffer' and that leads to mayhem ... So added a netdev_warn to xenvif_start_xmit for the "stop queue" case .. unfortunately it now triggers net_ratelimit at the end: [ 402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:7 vif->rx.sring->req_prod:189228 vif->rx.req_cons:189222 [ 403.409624] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:191612 vif->rx.req_cons:191609 [ 403.658975] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:2 vif->rx.sring->req_prod:192860 vif->rx.req_cons:192860 [ 406.417504] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:2 vif->rx.sring->req_prod:204870 vif->rx.req_cons:204869 [ 408.156112] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:1 vif->rx.sring->req_prod:212862 vif->rx.req_cons:212862 [ 408.407375] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:7 vif->rx.sring->req_prod:213613 vif->rx.req_cons:213609 [ 408.899250] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:2 vif->rx.sring->req_prod:215352 vif->rx.req_cons:215352 [ 413.655552] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:2 vif->rx.sring->req_prod:237250 vif->rx.req_cons:237249 [ 417.649867] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:2 vif->rx.sring->req_prod:252580 vif->rx.req_cons:252580 [ 418.400194] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:255828 vif->rx.req_cons:255826 [ 419.147431] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:37 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258746 [ 419.149966] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:1 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258745 [ 419.153145] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:37 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258746 req->gref:19791875 req->id:185 [ 419.159518] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2 npo->meta_prod:38 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258746 npo->copy_gref:19791875 [ 419.166980] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258746 gso_type:1 gso_size:1448 nr_frags:1 req->gref:686 req->id:177 [ 419.175597] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req npo->meta_prod:38 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258747 gso_type:0 gso_size:0 nr_frags:0 [ 419.185284] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req npo->meta_prod:38 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258747 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 req->id:186 [ 422.717242] net_ratelimit: 158 callbacks suppressed Hope that this time it does shed some light :-) diff to netback is attached. -- Sander > Wei.