From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sander Eikelenboom Subject: Re: Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected" Date: Fri, 21 Mar 2014 18:27:27 +0100 Message-ID: <1334202265.20140321182727@eikelenboom.it> References: <20140317103524.GH16807@zion.uk.xensource.com> <1563694965.20140317233318@eikelenboom.it> <20140318120401.GX16807@zion.uk.xensource.com> <1744594108.20140318162127@eikelenboom.it> <20140318160412.GB16807@zion.uk.xensource.com> <1701035622.20140318211402@eikelenboom.it> <722971844.20140318221859@eikelenboom.it> <1688396550.20140319001104@eikelenboom.it> <20140319113532.GD16807@zion.uk.xensource.com> <246793256.20140319220752@eikelenboom.it> <20140321164958.GA31766@zion.uk.xensource.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20140321164958.GA31766@zion.uk.xensource.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Wei Liu Cc: annie li , Paul Durrant , Zoltan Kiss , xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org Friday, March 21, 2014, 5:49:58 PM, you wrote: > [...] >> > Those are different estimations. max_slots_needed estimates the worst >> > case while min_slots_needed estimates the best case. When >> > min_slots_needed is met, netback puts the SKB into its internal queue. >> > xenvif_rx_action then will dequeue that packet, check max_slots_needed, >> > if met, break SKB down. >> >> I realised that, though if you don't count the cost of calculation .. the best would be >> to have min_slots and max_slots the same .. and being correct. But cost of the calculation count .. >> on the other hand min_slots seems to be far too low in this case. Is the nr_frags the best thing to check for ? >> Wouldn't something like skbsize / PAGE_SIZE be a better estimate ? >> > It doesn't matter. It won't cause breakage at that point. That was more from a performance point of view ... if the estimate can be that much off it would do the more costly check .. and still bail out. So the better the first estimation (at about the same cost of course) the better. >> >> > What I would expect is, even if they yield different values, checking >> > the ring availability is enough before breaking SKB down. >> >> > In your above case, max_slots_needed was met and SKB broken down. And as >> > you said in your empirical test, bumping max_slots_needed by 1 prevented >> > issues, then we might have problem calculating max_slots_needed. If you >> > can work out what went wrong that can be very useful. >> >> OK ran the whole night with "max_slots_needed+1" and i saw no triggering of both: >> - the non-fatal oddity (only a temporary cons > prod ..) >> - the fatal case (leading to the grantref error) >> >> So it seems we are off by one .. somewhere >> >> >> At first the non-fatal oddity .. this one i really can't grasp: >> >> [ 567.880282] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:562972 vif->rx.req_cons:562969 vif->rx.sring->req_event:562973 >> [ 569.893801] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:571188 vif->rx.req_cons:571187 vif->rx.sring->req_event:571189 >> [ 571.892419] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:577141 vif->rx.req_cons:577138 vif->rx.sring->req_event:577142 >> [ 575.451383] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:6 vif->rx.sring->req_prod:590603 vif->rx.req_cons:590603 vif->rx.sring->req_event:589615, reserved_slots_left:0 >> [ 575.487721] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:6 vif->rx.sring->req_prod:590660 vif->rx.req_cons:590604 req->gref:569 req->id:11 vif->rx.sring->req_event:589615 reserved_slots_left:-1 >> [ 575.524427] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:590660 vif->rx.req_cons:590660 vif->rx.sring->req_event:590661 >> [ 576.508063] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:16 vif->rx.sring->req_prod:594343 vif->rx.req_cons:594343 vif->rx.sring->req_event:591619, reserved_slots_left:0 >> [ 576.544708] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:16 vif->rx.sring->req_prod:594389 vif->rx.req_cons:594344 req->gref:668 req->id:167 vif->rx.sring->req_event:591619 reserved_slots_left:-1 >> [ 576.581595] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:594389 vif->rx.req_cons:594389 vif->rx.sring->req_event:594390 >> [ 577.325826] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:2 vif->rx.sring->req_prod:597248 vif->rx.req_cons:597247 vif->rx.sring->req_event:597249 >> [ 577.576973] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:598146 vif->rx.req_cons:598143 vif->rx.sring->req_event:598147 >> >> What i don't get is: >> - i do warn just before the RING_GET_REQUEST on vif->rx.sring->req_prod:590603 vif->rx.req_cons:590603 .. so cons == prod >> - i warn because req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++); is going to add one to cons .. leading to a situation that should not exist .. >> - however in the warn just after this RING_GET_REQUEST .. prod is somehow increased a lot .. probably the "next_rx_buffer" aspect of things .. so cons is much smaller than prod again .. > Frontend is also accessing the ring. So it's normal that some index > changed. Ah ok .. so here the frontend is "saving us" by putting more stuff on the ring so we don't run dry .. >> - How everi don't see anything in the RING_GET_REQUEST macro fiddling with vif->rx.sring->req_prod .. >> - so who has been fiddling with vif->rx.sring->req_prod in the mean time .. my very limited c skills fail to see how this works :-) >> >> >> Then the fatal case ... >> >> [ 1108.566486] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:1 vif->rx.sring->req_prod:1843278 vif->rx.req_cons:1843278 vif->rx.sring->req_event:1843279 >> [ 1109.775757] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:1847779 vif->rx.req_cons:1847776 vif->rx.sring->req_event:1847780 >> [ 1110.041661] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:1848795 vif->rx.req_cons:1848794 vif->rx.sring->req_event:1848796 >> [ 1110.301778] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:1849755 vif->rx.req_cons:1849754 vif->rx.sring->req_event:1849756 >> [ 1111.565412] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:35 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854291 vif->rx.sring->req_event:1853482, reserved_slots_left:0 >> [ 1111.565973] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:1 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854291 vif->rx.sring->req_event:1854292 >> [ 1111.638425] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:35 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 req->gref:4325377 req->id:83 vif->rx.sring->req_event:1854292 reserved_slots_left:-1 >> [ 1111.675042] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2 npo->meta_prod:36 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 npo->copy_gref:4325377 npo->copy_off:0 MAX_BUFFER_OFFSET:4096 bytes:1168 size:1168 i:5 vif->rx.sring->req_event:1854292 estimated_slots_needed:7 reserved_slots_left:-1 >> [ 1111.711833] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end npo->meta_prod:36 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 npo->copy_gref:4325377 npo->copy_off:1168 MAX_BUFFER_OFFSET:4096 bytes:1168 size:0 i:6 vif->rx.sring->req_event:1854292 gso_gaps:0 estimated_slots_needed:7 reserved_slots_left:-1 >> [ 1111.766836] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 after npo->meta_prod:36 old_meta_prod:29 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 meta->gso_type:1 meta->gso_size:1448 nr_frags:1 req->gref:674 req->id:76 estimated_slots_needed:7 i(frag):0 j(data):1 reserved_slots_left:-1 >> [ 1111.803702] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 5 npo->meta_prod:36 old_meta_prod:29 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 meta->gso_type:1 meta->gso_size:1448 nr_frags:1 req->gref:674 req->id:76 estimated_slots_needed:7 j(data):1 reserved_slots_left:-1 used in funcstart: 0 + 1 .. used_dataloop:1 .. used_fragloop:6 >> [ 1111.858602] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 2 .. vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 sco->meta_slots_used:7 max_upped_gso:1 skb_is_gso(skb):1 max_slots_needed:7 j:3 is_gso:1 nr_frags:1 firstpart:1 secondpart:5 reserved_slots_left:-1 >> >> Ok so if i try to pull this apart (from bottom to top): >> >> - it starts out with "xenvif_rx_action" and calculated we would need: >> 1 slot for the DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb), PAGE_SIZE); >> 5 slots for 1 frag >> 1 slot for GSO >> So 7 slots in total .. >> >> - it checks if there are enough slots available if (!xenvif_rx_ring_slots_available(vif, max_slots_needed)) and aparently there are .. so it continues >> >> - We end up in "xenvif_gop_skb" processing .. and everything seems to go well .. >> - In the start of this function we do one unconditional RING_GET_REQUEST .. and potentially one conditional (if gso) >> - Question: in "xenvif_rx_action" we only reserved ONE slot for GSO stuff .. so this could cost TWO slots .. is that right ? > The slot reserved for GSO is for meta data, so 1 is enough. >> - But in this case we only use one ... see "used in funcstart: 0 + 1" >> >> - Now we loop 1 time through the "while (data < skb_tail_pointer(skb))" loop ... since j(data)=1 >> - and call "xenvif_gop_frag_copy" from inside this loop here .. which potentially calls .. get_next_rx_buffer >> - If i understand it correctly the number of slots this loop should take at a maximum should be what we calculated before with the >> DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb), PAGE_SIZE) ? (1 slot in this case) >> - Nothing is logged yet because cons is still smaller than prod at this time, but from the aftermath we see this was "used_dataloop:1" >> - So that seems correct .. >> >> - Now we are at the loop through "nr_frags" and loop the 1 frag we have .. >> - This all goes well until the sixth iteration (i from 0..5) .. and call "xenvif_gop_frag_copy" from inside this loop ... > Isn't there only one frag? Yes there is only one frag .. but it seems to be much larger than PAGE_SIZE .. and xenvif_gop_frag_copy brakes that frag down into smaller bits .. hence the calculation in xenvif_rx_action determining the slots needed by doing: for (i = 0; i < nr_frags; i++) { unsigned int size; size = skb_frag_size(&skb_shinfo(skb)->frags[i]); max_slots_needed += DIV_ROUND_UP(size, PAGE_SIZE); } But the code in xenvif_gop_frag_copy .. seems to be needing one more slot (from the emperical test) .. and calling "get_next_rx_buffer" seems involved in that .. >> - "xenvif_gop_frag_copy" calls "get_next_rx_buffer" .. cons becomes larger than prod ..and the "miracle" from the non-fatal case of prod suddenly becoming larger does not happen >> >> - So the calculation of how many slots are required for the frags in "xenvif_rx_action" does not match wath the code in "xenvif_gop_frag_copy" seems to need, >> I don't see why .. it's rounding up .. should behave correct if the left over happens to be 0 size .. >> >> >> >> > Wei. >>