From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Roger B. Melton" <rmelton@cisco.com>
Subject: Re: [PATCH] net/i40e: Improve i40evf buffer cleanup in
 tx vector mode
Date: Fri, 6 Oct 2017 10:06:02 -0400
Message-ID: <4c446975-687e-1a84-e39b-00d43c909205@cisco.com>
References: <20171005191111.27557-1-rmelton@cisco.com>
 <20171006085403.GA24124@bricha3-MOBL3.ger.corp.intel.com>
 <b0218719-1adb-eec7-4377-f413035c1d4b@cisco.com>
 <20171006135152.GA31796@bricha3-MOBL3.ger.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Cc: jingjing.wu@intel.com, dev@dpdk.org
To: Bruce Richardson <bruce.richardson@intel.com>
Return-path: <dev-bounces@dpdk.org>
Received: from rcdn-iport-2.cisco.com (rcdn-iport-2.cisco.com [173.37.86.73])
 by dpdk.org (Postfix) with ESMTP id B42171B1A4
 for <dev@dpdk.org>; Fri,  6 Oct 2017 16:06:11 +0200 (CEST)
In-Reply-To: <20171006135152.GA31796@bricha3-MOBL3.ger.corp.intel.com>
Content-Language: en-US
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

On 10/6/17 9:51 AM, Bruce Richardson wrote:
> On Fri, Oct 06, 2017 at 08:38:01AM -0400, Roger B. Melton wrote:
>> On 10/6/17 4:54 AM, Bruce Richardson wrote:
>>> On Thu, Oct 05, 2017 at 03:11:11PM -0400, Roger B Melton wrote:
>>>> ---
>>>>
>>>> i40evf tx vector logic frees mbufs, but it does not remove the
>>>> mbufs from software rings which leads to double frees.  This change
>>>>    corrects that oversight.  We've validated this fix within our application.
>>>>
>>> Hi Roger,
>>>
>>> I'm a little concerned here by this driver fix, since if we are getting
>>> double frees of mbufs there must be another bug somewhere in tracking
>>> the free ring elements. Clearing the entries to NULL introduces extra
>>> writes to the vector code path which will likely have a performance
>>> impact, but also should be unnecessary, given proper tracking of the
>>> ring status.
>>>
>>> Can you provide us with some details as to how to reproduce the issue,
>>> especially with a public sample app? I'd really like to look more into
>>> why this is happening, and if other fixes may work.
>>>
>>> Thanks,
>>> /Bruce
>>>
>>> .
>>>
>> Hey Bruce,
>>
>> I've not attempted to reproduce the issue using sample apps.  It was
>> initially difficult for us to reproduce with our application until we
>> stumbled on a recipe.  The symptoms of the double free are two fold:
>>
>>   * Application crashes: Corrupted packet headers, walking a chain of rx
>>     segments and loosing one in the middle,...
>>   * i40e adminq lockup - Meaning VF sends an OP and PF does not process it
>>
>> The former has been directly correlated to tx vector mode.  We still don't
>> understand how this could lead to adminq lockup, but we have not observed
>> the issue applied the patch I submitted.  We have questions out to Intel
>> i40e team on this.
>>
>> Here's a high level view of the scenario under which the issues are observed
>> and how we concluded that there were issues with tx vector free. We provided
>> this information to the Intel i40e DPDK team, they reviewed the tx vector
>> logic and suggested changes.  With the changes suggested by Intel (the patch
>> I submitted) we have re-enabled tx vector when MTU is < MBUF size and
>> observed no crashes.
>>
>> Below that you will find additional detail on the procedure within our
>> application for changing MTU, including DPDK API calls.
>>
>> Let me know if you have additional questions.
>>
>> Regards,
>> Roger
>>
>>       o Scenario:
>>
>>           + Intel i40e-da4 (4x10G) or Cisco branded equivalent
>>           + KVM or ESXi 6.5 host
>>           + In a single VM we have at least 1 VF from all 4 PFs.
>>           + We are running traffic on all interfaces.
>>           + We stop traffic
>>           + We stop a device (VF)
>>           + We start the device
>>           + We start traffic
>>           + At any point after we start the device we can crash
>>
>>       o Experiment (Jumping over some of the work to get to the point
>>         where we believed that i40e driver was doing double frees):
>>
>>           + Our application attaches userdata to mbufs and we put that
>>             userdata on a linked list.
>>           + We observed that when we processed the userdata it had been
>>             corrupted which lead to crashes.
>>           + This gave us a hint that the mbuf was on multiple lists.
>>           + We reviewed our application code and could not find a cause.
>>           + We began to suspect a double free in the i40evf PMD.
>>
>>               # We disabled rx free logic and observed crashes
>>                 (intentionally leaking mbufs in search of the double free).
>>               # We disabled tx free logic and observed no crashes
>>               # This gave us a hint that the double frees were coming
>>                 from the i40evf PMD tx logic.
>>
>>           + We had also observed that if we forced MTU to large always
>>             that there were no crashes
>>
>>               # A side effect of forcing large MTU is that multi-segment
>>                 is enabled.
>>               # This gave us a hint that enabling multi-segment was
>>                 somehow avoiding the double free.
>>
>>           + We forced multi-segments regardless of MTU and permitted MTU
>>             changes and observed no crashes.
>>           + We reviewed the i40evf mbuf free logic to see the effect of
>>             enabling multi-segment and observed that when multi-segment
>>             is enabled, rx vector was enabled but tx vector was not.
>>           + This lead us to examine RX vector mode free logic vs TX
>>             vector mode free logic.
>>
>>               # RX free logic has special handling for vector mode free
>>               # TX free logic does not have any special handling for
>>                 vector free
>>
>>       o By enabling multiple segments always (even if MTU does not
>>         require multiple segments), we effectively disabled tx vector
>>         mode and this avoided the double free.
>>       o Our application no longer crashed, but it could not take
>>         advantage of tx vector optimizations.
>>
>>
>>
>>
>>
>> CP == Control Plane
>> DP == Data Plane
>>
>>   * CP sends admin down to DP
>>   * DP disables RX/TX
>>       o Block all future tx burst calls
>>       o rte_eth_dev_set_link_down() invoked
>>       o Block all future rx burst calls
>>   * DP notifies CP admin down action is complete
>>   * CP sends MTU change
>>   * DP processes MTU change
>>       o For each rxq:
>>           + rte_eth_rx_queue_info_get()
>>           + if not rxq_deferred_start rte_eth_dev_rx_queue_stop()
>>       o For each txq:
>>           + rte_eth_tx_queue_info_get()
>>           + if not txq_deferred_start rte_eth_dev_tx_queue_stop()
>>       o rte_eth_dev_stop()
>>       o Re-configure the port: (Note this is original code, not new code
>>         which is forcing multisegs always)
>>           + Set rx_mode.jumbo_frame if MTU > 1518
>>           + Set rx_mode.enable_scatter if MTU > 2048
>>           + txq_flags = ETH_TXQ_NOOFLOADS
>>           + if MTU > 2048, txq_flags |= ETH_TXQ_FLAGS_NOMULTSEGS
>>           + rte_eth_promiscuous_get()
>>           + rte_eth_dev_info_get()
>>           + rte_eth_dev_configure()
>>               # Init tx_vec_allowed and rx_vec_allowed to TRUE.
>>           + rte_eth_dev_info_get()
>>           + For each txq: rte_eth_tx_queue_setup()
>>               # If new MTU > 2048, ETH_TXQ_FLAGS_NOMULTSEGS was set in
>>                 txq_flags & tx_vec_allowed will be cleared.
>>           + For each rxq: rte_eth_rx_queue_setup()
>>       o rte_eth_dev_set_mtu()
>>       o rte_eth_dev_start()
>>       o rte_eth_dev_info_get()
>>       o For each rxq:
>>           + if not rxq_deferred_start rte_eth_dev_rx_queue_start()
>>       o For each txq:
>>           + if not txq_deferred_start rte_eth_dev_tx_queue_start()
>>   * DP notifies CP MTU change applied.
>>   * CP sends admin up to DP
>>   * DP enables RX/TX
>>       o Enable all future tx burst calls
>>       o rte_eth_dev_set_link_up() invoked
>>       o Enable all future rx burst calls
>>   * DP notifies CP admin up action is complete
>>
>>
>>
> Thanks for the explanation.
>
> I'll follow-up with our i40e driver team to see if they can give me more
> detail on their investigation. I'd like to know more details about the
> ultimate root cause of this double free - presumably it's something to
> do with the behaviour when stopping and starting the queue or port - and
> find a solution that does not involve more writes on the fast-path. If
> we don't come up with a better solution, this fix will do, though.
>
> Regards,
> /Bruce
> .
>

Thanks Bruce,  I'll go ahead and submit V2 properly signed off with same 
content.

Regards,
Roger