Linux-WPAN Archive on
 help / color / Atom feed
From: (Eric W. Biederman)
To: David Ahern <>
Cc: Cong Wang <>,
	David Miller <>,
	Linux Kernel Network Developers <>,,
	Roopa Prabhu <>,
	Stephen Hemminger <>,
	Ido Schimmel <>,
	Jiri Pirko <>,
	Saeed Mahameed <>,
	Alexander Aring <>,,
	NetFilter <>,
	LKML <>
Subject: Re: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace
Date: Wed, 25 Jul 2018 12:38:01 -0500
Message-ID: <> (raw)
In-Reply-To: <> (David Ahern's message of "Wed, 25 Jul 2018 08:06:10 -0600")

David Ahern <> writes:

> On 7/25/18 6:33 AM, Eric W. Biederman wrote:
>> Cong Wang <> writes:
>>> On Tue, Jul 24, 2018 at 8:14 AM David Ahern <> wrote:
>>>> On 7/19/18 11:12 AM, Cong Wang wrote:
>>>>> On Thu, Jul 19, 2018 at 9:16 AM David Ahern <> wrote:
>>>>>> Chatting with Nikolay about this and he brought up a good corollary - ip
>>>>>> fragmentation. It really is a similar problem in that memory is consumed
>>>>>> as a result of packets received from an external entity. The ipfrag
>>>>>> sysctls are per namespace with a limit that non-init_net namespaces can
>>>>>> not set high_thresh > the current value of init_net. Potential memory
>>>>>> consumed by fragments scales with the number of namespaces which is the
>>>>>> primary concern with making neighbor tables per namespace.
>>>>> Nothing new, already discussed:
>>>>> :)
>>>> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
>>>> local memory resources due to received packets. bridge and vxlan fdb's
>>>> are fairly straightforward analogs to neighbor entries; they are per
>>>> device with no limits on the number of entries. Fragments have memory
>>>> limits per namespace. So neighbor tables are the only ones with this
>>>> strict limitation and concern on memory consumption.
>>>> I get the impression there is no longer a strong resistance against
>>>> moving the tables to per namespace, but deciding what is the right
>>>> approach to handle backwards compatibility. Correct? Changing the
>>>> accounting is inevitably going to be noticeable to some use case(s), but
>>>> with sysctl settings it is a simple runtime update once the user knows
>>>> to make the change.
>>> This question definitely should go to Eric Biederman who was against
>>> my proposal.
>>> Let's add Eric into CC.
>> Given that the entries are per device and the devices are per-namespace,
>> semantically neighbours are already kept in a per-namespace manner.  So
>> this is all about making the code not honoring global resource limits.
>> Making the code not honor gc_thresh3.
>> Skimming through the code today the default for gc_thresh3 is 1024.
>> Which means that we limit the neighbour tables to 1024 entries per
>> protocol type.
>> There are some pretty compelling reasons especially with ipv4 to keep
>> the subnet size down.  Arp storms are a real thing.
>> I don't know off the top of my head what the reasons for limiting the
>> neighbour table sizes.  I would be much more comfortable with a patchset
>> like this if we did some research and figured out the reasons why
>> we have a global limit.  Then changed the code to remove those limits.
>> When the limits are gone.  When the code can support large subnets
>> without tuning.  We we don't have to worry about someone scanning an all
>> addresses in an ipv6 subnet and causing a DOS on working machines.
>> I think it is completely appropriate to look to see if something per
>> network namespace needs to happen.
>> So please let's address the limits, not the fact that some specific
>> corner case ran into them.
>> If we are going to neuter gc_thresh3 let's go as far as removing it
>> entirely.  If we are going to make the neighbour table per something
>> let's make it per network device.  If we can afford the multiple hash
>> tables then a hash table per device is better.   Perhaps we want to move
>> to rhash tables while we look at this, instead of an old hand grown
>> version of resizable hash table.
> Given the uses cases with increasing number of devices (> 10,000),
> per-device tables will have more problems than per namespace - in
> reference to your concern in the last paragraph below.
>> Unless I misread something all your patchset did is reshuffle code and
>> data structures so that gc_thresh3 does not apply accross namespaces.
>> That does not feel like it really fixes anything.  That just lies to
>> people.
> This patch set fixes the lie that network namespaces provide complete
> isolation when in fact one namespace can evict neighbor entries from
> another. An arp storm you are concerned about in one namespace impacts
> all containers.

Network namespaces can not provide complete isolation.  They share the
same kernel and they do not dedicate resources to each other.
Namespaces in general are about the names.  They are about sharing a
machine efficiently.

I humbly suggest that anyone who wants ``complete'' isolation to use vm
at the very least.

I do think the limits on the neighbour table are quite likely too
strict.  We should be able to relax them and continue to have a
networking stack that works for everyone.

> It starts by removing the proliferation of open coded references to
> arp_tbl and nd_tbl, moving them behind the existing neigh_find_table.
> From there (patches 14-16) it makes the tables per-namespace and hence
> makes the gc_thresh parameters which are per-table now
> per-table-per-namespace.
> So it removes the global thresholds because the global ones are just
> wrong given the meaning of a network namespace and provides the more
> appropriate per-namespace limits.

Absolutely NOT.  Global thresholds are exactly correct given the fact
you are running on a single kernel.

Memory is not free (Even though we are swimming in enough of it memory
rarely matters).  One of the few remaining challenges is for containers
is finding was to limit resources in such a way that one application
does not mess things up for another container during ordinary usage.

It looks like the neighbour tables absolutely are that kind of problem,
because the artificial limits are too strict.   Completely giving up on
limits does not seem right approach either.  We need to fix the limits
we have (perhaps making them go away entirely), not just apply a
band-aid.  Let's get to the bottom of this and make the system better.

>> Further unless I misread something you are increasing the number of
>> timers to 3 per namespace.  If I create create a thousand network
>> namespaces that feels like it will hurt system performance overall.
> It seems to me the timers are per neighbor entry not table. The per
> table ones are for proxies.

It seems I misread that bit when I was refreshing my memory on what
everything is doing.  If we can already have 1024 timers that makes
timers not a concern.


  reply index

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-17 12:06 dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 01/17] net/ipv4: rename ipv4_neigh_lookup to ipv4_dst_neigh_lookup dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 02/17] net/neigh: export neigh_find_table dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 03/17] net/ipv4: wrappers for arp table references dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 04/17] net/ipv4: Remove open coded use of arp table dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 05/17] net/ipv6: wrappers for neighbor table references dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 06/17] net/ipv6: Remove open coded use of neighbor table dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 07/17] drivers/net: remove open coding of neighbor tables dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 08/17] net: Remove nd_tbl from ipv6 stub dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 09/17] net: Remove arp_tbl and nd_tbl from headers dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 10/17] net: Add key_len to neighbor constructor dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 11/17] net: Change neigh_table_init and neigh_table_clear signature dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 12/17] net/neigh: Change neigh_xmit to take an address family dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 13/17] net/neighbor: Convert internal functions away from neigh_tables dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 14/17] net/ipv4: Convert arp table to per namespace dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 15/17] net/ipv6: Convert neighbor table to per-namespace dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 16/17] net/decnet: Move " dsahern
2018-07-17 12:06 ` [PATCH RFC/RFT net-next 17/17] net/neighbor: Remove neigh_tables and NEIGH enum dsahern
2018-07-17 17:40 ` [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace Cong Wang
2018-07-17 17:43   ` David Ahern
2018-07-17 17:53     ` Cong Wang
2018-07-17 19:02       ` David Ahern
2018-07-17 20:37         ` Cong Wang
2018-07-18  3:59         ` David Miller
2018-07-19 16:16           ` David Ahern
2018-07-19 17:12             ` Cong Wang
2018-07-24 15:14               ` David Ahern
2018-07-24 17:14                 ` David Miller
2018-07-25 18:23                   ` David Ahern
2018-07-24 22:09                 ` Cong Wang
2018-07-25 12:33                   ` Eric W. Biederman
2018-07-25 14:06                     ` David Ahern
2018-07-25 17:38                       ` Eric W. Biederman [this message]
2018-07-25 18:13                         ` David Ahern
2018-07-25 19:17                           ` Eric W. Biederman
2018-08-13 21:48                             ` David Ahern
2018-08-15  4:36                               ` Eric W. Biederman
2018-07-26 11:12                         ` David Laight
2018-07-27 16:27                           ` Eric W. Biederman
2018-07-19  0:54 ` Michael Richardson
2018-07-19 15:49   ` David Ahern
2018-08-12  6:46 ` [RFC/RFT, net-next, " Vasily Averin
2018-08-12 17:37   ` David Ahern

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-WPAN Archive on

Archives are clonable:
	git clone --mirror linux-wpan/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-wpan linux-wpan/ \
	public-inbox-index linux-wpan

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone