Subject: Re: [PATCH v2] xen-netfront: Fix Rx stall during network stress and OOM
To: Vineeth Remanan Pillai, David Miller, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Wei Liu, Paul Durrant, xen-devel
References: <1484771149-12699-1-git-send-email-vineethp@u480fcf3b67f557f68df1.ant.amazon.com> <66b10c64-936a-8001-6855-2ff1ed626642@amazon.com>
From: Boris Ostrovsky
Message-ID: <38ccfaea-0a65-a6f3-c19a-e6f9c0d4ef76@oracle.com>
Date: Sun, 29 Jan 2017 18:09:21 -0500
In-Reply-To: <66b10c64-936a-8001-6855-2ff1ed626642@amazon.com>

On 01/19/2017 11:35 AM, Vineeth Remanan Pillai wrote:
> From: Vineeth Remanan Pillai
>
> During an OOM scenario, request slots could not be created as skb
> allocation fails. So the netback cannot pass in packets and netfront
> wrongly assumes that there is no more work to be done and it disables
> polling. This causes Rx to stall.
>
> The issue is with the retry logic which schedules the timer if the
> created slots are less than NET_RX_SLOTS_MIN. The count of new request
> slots to be pushed are calculated as a difference between new req_prod
> and rsp_cons which could be more than the actual slots, if there are
> unconsumed responses.
>
> The fix is to calculate the count of newly created slots as the
> difference between new req_prod and old req_prod.
>
> Signed-off-by: Vineeth Remanan Pillai
> Reviewed-by: Juergen Gross
> ---
> Changes in v2:
> - Removed the old implementation of enabling polling on
>   skb allocation error.
> - Corrected the refill timer logic to schedule when newly
>   created slots since last push is less than NET_RX_SLOTS_MIN.

There are a couple of problems with this patch.

1. The 'if' clause now evaluates to true on pretty much every call to
xennet_alloc_rx_buffers().

2. It tickles a latent bug during resume where the timer triggers before
we re-connect. The trouble is that we now try to dereference
queue->rx.sring, which is NULL since we disconnect in netfront_resume().
(Curiously, I only observe it with 32-bit guests.)

I'll send a patch later that will delete the timer, since it looks like a
bug to me in any case, but the first problem seems to be more serious
than the problem that this patch addresses.

-boris

>
>  drivers/net/xen-netfront.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 40f26b6..2c7c29f 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -321,7 +321,7 @@ static void xennet_alloc_rx_buffers(struct netfront_queue *queue)
>      queue->rx.req_prod_pvt = req_prod;
>
>      /* Not enough requests? Try again later. */
> -    if (req_prod - queue->rx.rsp_cons < NET_RX_SLOTS_MIN) {
> +    if (req_prod - queue->rx.sring->req_prod < NET_RX_SLOTS_MIN) {
>          mod_timer(&queue->rx_refill_timer, jiffies + (HZ/10));
>          return;
>      }
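
To make problem 1 concrete, here is a minimal standalone sketch of the two tests in the hunk above. It is not driver code: struct sketch_rx_ring, shared_req_prod (a stand-in for queue->rx.sring->req_prod), SKETCH_RX_SLOTS_MIN and its value of 19 are all assumptions made for illustration, and the reading that a typical refill only tops up a handful of slots is one plausible explanation of the behaviour, not something stated in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed threshold for illustration only; the driver defines its own
 * NET_RX_SLOTS_MIN. */
#define SKETCH_RX_SLOTS_MIN 19u

/* Hypothetical, simplified stand-in for the Rx ring indices in the hunk. */
struct sketch_rx_ring {
    uint32_t rsp_cons;        /* responses already consumed by the frontend */
    uint32_t shared_req_prod; /* stand-in for sring->req_prod (last push) */
};

/* Pre-patch test: counts every request outstanding beyond rsp_cons. */
static bool retry_before_patch(const struct sketch_rx_ring *r, uint32_t req_prod)
{
    /* Unsigned subtraction keeps the distance correct across index wrap. */
    return (uint32_t)(req_prod - r->rsp_cons) < SKETCH_RX_SLOTS_MIN;
}

/* Patched test: counts only the slots created since the last push. */
static bool retry_after_patch(const struct sketch_rx_ring *r, uint32_t req_prod)
{
    return (uint32_t)(req_prod - r->shared_req_prod) < SKETCH_RX_SLOTS_MIN;
}

/* Example: a well-stocked ring that is topped up by 6 new slots. */
void sketch_example(void)
{
    struct sketch_rx_ring r = { .rsp_cons = 1000, .shared_req_prod = 1250 };
    uint32_t req_prod = r.shared_req_prod + 6;

    assert(!retry_before_patch(&r, req_prod)); /* 256 outstanding: no timer */
    assert(retry_after_patch(&r, req_prod));   /* 6 new slots: timer armed */
}
```

In this example the ring holds 256 outstanding requests, so the pre-patch test leaves the timer alone, while the patched test sees only 6 newly created slots and would call mod_timer() even though nothing is wrong. Under the stated assumption that most calls add fewer slots than the threshold, the patched condition fires on nearly every call.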