From: Paweł Staszewski <pstaszewski@itcare.pl>
To: intel-wired-lan@osuosl.org
Subject: [Intel-wired-lan] Linux 4.12+ memory leak on router with i40e NICs
Date: Sun, 15 Oct 2017 00:00:04 +0200
Message-ID: <227d17ae-b040-07d0-3c57-e9acd1a3b5b4@itcare.pl>
In-Reply-To: <1507180753.20182.8.camel@cohaesio.com>

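Same problem here.

The only difference on my side is that I replaced the Intel 82599 NICs
with X710, and now I have a memory leak.

Memory over time with the ixgbe driver (same config, same kernel):

[graph attached as image]

Then I changed the NICs to X710 with the i40e driver (this is the only
change), and memory over time looks like this:

[graph attached as image]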

There is no process that is eating memory, so it looks like there is
some problem with the i40e driver. That is not a surprise :) This driver
is really buggy in many ways. Most tickets that I opened on the e1000e
SourceForge tracker have had no reply for a year or more, or, if somebody
does reply after a year, they close the ticket a day later citing no
activity :)
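
One rough way to back up the "no process is eating memory" observation is to track how much of MemTotal is not explained by the usual /proc/meminfo categories. Pages that a driver takes straight from the page allocator generally do not appear under Slab or in any process's RSS, so a leak of that kind tends to show up as a growing unexplained remainder. The sketch below is only an approximation: the list of fields treated as accounted for, and the helper names, are assumptions made for illustration and are not taken from this thread.

```python
import time

# Fields from /proc/meminfo that this sketch treats as "accounted for".
# The selection is an assumption, not an exact model of kernel accounting.
ACCOUNTED = ("MemFree", "Buffers", "Cached", "Slab",
             "AnonPages", "PageTables", "KernelStack")


def parse_meminfo(text):
    """Parse the text of /proc/meminfo into a {field: value} dict (kB)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def unaccounted_kb(info):
    """MemTotal minus the categories listed in ACCOUNTED."""
    return info["MemTotal"] - sum(info.get(k, 0) for k in ACCOUNTED)


def growth_kb_per_hour(samples):
    """Slope between the first and last (unix_seconds, kB) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need two samples with increasing timestamps")
    return (v1 - v0) * 3600.0 / (t1 - t0)


def watch(count=30, interval_s=60, path="/proc/meminfo"):
    """Sample the unaccounted remainder and return its growth in kB/hour."""
    samples = []
    for i in range(count):
        with open(path) as f:
            info = parse_meminfo(f.read())
        samples.append((time.time(), unaccounted_kb(info)))
        if i + 1 < count:
            time.sleep(interval_s)
    return growth_kb_per_hour(samples)
```

Calling `watch()` on the router samples once a minute for about half an hour. A leak of about 1 GB every half hour, as reported in the quoted message, would correspond to roughly 2,000,000 kB per hour.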



On 2017-10-05 at 07:19, Anders K. Pedersen | Cohaesio wrote:
> On Wed, 2017-10-04 at 08:32 -0700, Alexander Duyck wrote:
>> On Wed, Oct 4, 2017 at 5:56 AM, Anders K. Pedersen | Cohaesio
>> <akp@cohaesio.com> wrote:
>>> Hello,
>>>
>>> After updating one of our Linux based routers to kernel 4.13 it began
>>> leaking memory quite fast (about 1 GB every half hour). To narrow it
>>> down we tried various kernel versions and found that 4.11.12 is okay,
>>> while 4.12 also leaks, so we did a bisection between 4.11 and 4.12.
>>>
>>> The first bisection ended at
>>> "[6964e53f55837b0c49ed60d36656d2e0ee4fc27b] i40e: fix handling of HW
>>> ATR eviction", which fixes some flag handling that was broken by
>>> 47994c119a36 "i40e: remove hw_disabled_flags in favor of using
>>> separate flag bits", so I did a second bisection, where I added
>>> 6964e53f5583 "i40e: fix handling of HW ATR eviction" to the steps that
>>> had 47994c119a36 "i40e: remove hw_disabled_flags in favor of using
>>> separate flag bits" in them.
>>>
>>> The second bisection ended at
>>> "[0e626ff7ccbfc43c6cc4aeea611c40b899682382] i40e: Fix support for flow
>>> director programming status", where I don't see any obvious problems,
>>> so I'm hoping for some assistance.
>>>
>>> The router is a PowerEdge R730 server (Haswell based) with three Intel
>>> NICs (all using the i40e driver):
>>>
>>> X710 quad port 10 GbE SFP+: eth0 eth1 eth2 eth3
>>> X710 quad port 10 GbE SFP+: eth4 eth5 eth6 eth7
>>> XL710 dual port 40 GbE QSFP+: eth8 eth9
>>>
>>> The NICs are aggregated with LACP with the team driver:
>>>
>>> team0: eth9 (40 GbE selected primary), and eth3, eth7 (10 GbE
>>> non-selected backups)
>>> team1: eth0, eth1, eth4, eth5 (all 10 GbE selected)
>>>
>>> team0 is used for internal networks and has one untagged and four
>>> tagged VLAN interfaces, while team1 has an external uplink connection
>>> without any VLANs.
>>>
>>> The router runs an eBGP session on team1 to one of our uplinks, and
>>> iBGP via team0 to our other border routers. It also runs OSPF on the
>>> internal VLANs on team0. One thing I've noticed is that when OSPF is
>>> not announcing a default gateway to the internal networks, so there is
>>> almost no traffic coming in on team0 and out on team1, but still
>>> plenty of traffic coming in on team1 and out via team0, there's no
>>> memory leak (or at least it is so small that we haven't detected it).
>>> But as soon as we configure OSPF to announce a default gateway to the
>>> internal VLANs, so we get traffic from team0 to team1 the leaking
>>> begins. Stopping the OSPF default gateway announcement again also
>>> stops the leaking, but does not release already leaked memory.
>>>
>>> So this leads me to suspect that the leaking is related to RX on team0
>>> (where XL710 eth9 is normally the only active interface) or TX on
>>> team1 (X710 eth0, eth1, eth4, eth5). The first bad commit is related
>>> to RX cleaning, which suggests RX on team0. Since we're only seeing
>>> the leak for our outbound traffic, I suspect either a difference
>>> between the X710 vs. XL710 NICs, or that the inbound traffic is for
>>> relatively few destination addresses (only our own systems) while the
>>> outbound traffic is for many different addresses on the internet. But
>>> I'm just guessing here.
>>>
>>> I've tried kmemleak, but it only found a few kB of suspected memory
>>> leaks (several of which disappeared again after a while).
>>>
>>> Below I've included more details - git bisect logs, ethtool -i, dmesg,
>>> Kernel .config, and various memory related /proc files. Any help or
>>> suggestions would be much appreciated, and please let me know if more
>>> information is needed or there's something I should try.
>>>
>>> Regards,
>>> Anders K. Pedersen
>>>
>> Hi Anders,
>>
>> I think I see the problem and should have a patch submitted shortly to
>> address it. From what I can tell it looks like the issue is that we
>> weren't properly recycling the pages associated with descriptors that
>> contained an Rx programming status. For now the workaround would be to
>> try disabling ATR via the "ethtool --set-priv-flags" command. I should
>> have a patch out in the next hour or so that you can try testing to
>> verify if it addresses the issue.
>>
>> Thanks.
>>
>> - Alex
> Thanks Alex,
>
> I will test the patch in our next service window on Tuesday morning.
>
> Regards,
> Anders
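
The workaround Alex mentions (disabling ATR through "ethtool --set-priv-flags") could be scripted roughly as follows. This is a sketch under assumptions: the flag name `flow-director-atr` is not stated anywhere in this thread and should be confirmed with `ethtool --show-priv-flags <iface>` on the affected system, and the helper names are hypothetical. The interface names in the usage note are the ones from Anders' report.

```python
import subprocess

# Assumption: the i40e private flag that controls ATR is named
# "flow-director-atr". Confirm with `ethtool --show-priv-flags <iface>`.
ATR_FLAG = "flow-director-atr"


def atr_off_command(iface):
    """Build the ethtool argv that turns the ATR private flag off."""
    return ["ethtool", "--set-priv-flags", iface, ATR_FLAG, "off"]


def disable_atr(ifaces, dry_run=True):
    """Print (dry run) or execute the command for each interface."""
    commands = [atr_off_command(iface) for iface in ifaces]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Needs root; raises CalledProcessError if ethtool fails.
            subprocess.run(cmd, check=True)
    return commands
```

For example, `disable_atr(["eth0", "eth1", "eth4", "eth5", "eth9"])` prints the commands for the active team members without changing anything. Passing `dry_run=False` applies them.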

