From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753842AbdA3Qsy (ORCPT ); Mon, 30 Jan 2017 11:48:54 -0500
Received: from smtp-fw-33001.amazon.com ([207.171.190.10]:48773 "EHLO
        smtp-fw-33001.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751637AbdA3Qr7 (ORCPT );
        Mon, 30 Jan 2017 11:47:59 -0500
X-IronPort-AV: E=Sophos;i="5.33,312,1477958400"; d="scan'208";a="652142186"
Subject: Re: [PATCH v2] xen-netfront: Fix Rx stall during network stress and OOM
To: Boris Ostrovsky ,
References: <1484771149-12699-1-git-send-email-vineethp@u480fcf3b67f557f68df1.ant.amazon.com>
        <66b10c64-936a-8001-6855-2ff1ed626642@amazon.com>
        <38ccfaea-0a65-a6f3-c19a-e6f9c0d4ef76@oracle.com>
CC: David Miller , , Wei Liu , Paul Durrant , xen-devel
From: Vineeth Remanan Pillai
Message-ID: <989bd104-13a9-f25f-b857-24ec49781f9c@amazon.com>
Date: Mon, 30 Jan 2017 08:47:52 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.7.0
MIME-Version: 1.0
In-Reply-To: <38ccfaea-0a65-a6f3-c19a-e6f9c0d4ef76@oracle.com>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.43.162.40]
X-ClientProxiedBy: EX13D01UWB003.ant.amazon.com (10.43.161.94) To EX13D08UWC003.ant.amazon.com (10.43.162.21)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 01/29/2017 03:09 PM, Boris Ostrovsky wrote:
>
> There are couple of problems with this patch.
> 1. The 'if' clause now evaluates to true on pretty much every call to
> xennet_alloc_rx_buffers().

Thanks for catching this. I did not notice it in my testing, mostly
because of the nature of my workload.

> 2. It tickles a latent bug during resume where the timer triggers
> before we re-connect. The trouble is that we now try to dereference
> queue->rx.sring which is NULL since we disconnect in
> netfront_resume().
> (Curiously, I only observe it with 32-bit guests)

I think we may hit this bug even after removing the timer. We call
RING_PUSH_REQUESTS_AND_CHECK_NOTIFY soon after, which also dereferences
queue->rx.sring.

Thanks,
Vineeth
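
To illustrate the concern about the second problem, here is a minimal
userspace sketch. It is not the driver code and not a proposed fix. The
names mock_sring, mock_front_ring, mock_push_requests_and_check_notify()
and mock_refill() are hypothetical stand-ins; the push step only models
the fact that the real macro reads and writes fields of the shared ring
through the sring pointer, and it omits the memory barriers. The NULL
check is an assumption about one possible way to guard the path while
the queue is disconnected.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared ring page. */
struct mock_sring {
    unsigned int req_prod;   /* producer index visible to the backend */
    unsigned int req_event;  /* backend asks for a notify at this index */
};

/* Hypothetical stand-in for the frontend's private ring state. */
struct mock_front_ring {
    unsigned int req_prod_pvt;  /* private producer index */
    struct mock_sring *sring;   /* NULL while the queue is disconnected */
};

/*
 * Simplified model of a push-and-check-notify step. It dereferences
 * ring->sring unconditionally, so calling it while disconnected would
 * be a NULL pointer dereference.
 */
static bool mock_push_requests_and_check_notify(struct mock_front_ring *ring)
{
    unsigned int old_prod = ring->sring->req_prod;
    unsigned int new_prod = ring->req_prod_pvt;

    ring->sring->req_prod = new_prod;

    /* Notify only if the backend's event index falls in (old, new]. */
    return (unsigned int)(new_prod - ring->sring->req_event) <
           (unsigned int)(new_prod - old_prod);
}

/*
 * Hypothetical refill path: queue nr requests and push them.
 * Returns 0 if requests were pushed, -1 if the ring is disconnected.
 */
static int mock_refill(struct mock_front_ring *ring, unsigned int nr,
                       bool *notify)
{
    /* Assumed guard: skip the shared ring entirely while disconnected. */
    if (ring->sring == NULL)
        return -1;

    ring->req_prod_pvt += nr;
    *notify = mock_push_requests_and_check_notify(ring);
    return 0;
}
```

In this model, the refill path reaches the shared ring whether it is
entered from a timer or from any other caller, so removing the timer
alone does not remove the dereference; only the check on sring does.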