All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: Netlink NLM_F_DUMP_INTR flag lost
       [not found] <20220615171113.7d93af3e@pirotess>
@ 2022-06-15 16:00 ` Jakub Kicinski
  2022-06-16 15:10   ` Ismael Luceno
  0 siblings, 1 reply; 19+ messages in thread
From: Jakub Kicinski @ 2022-06-15 16:00 UTC (permalink / raw)
  To: Ismael Luceno; +Cc: David S. Miller, Paolo Abeni, David Ahern, netdev

CC: netdev ML

On Wed, 15 Jun 2022 17:11:13 +0200 Ismael Luceno wrote:
> It seems a RTM_GETADDR request with AF_UNSPEC has a corner case where
> the NLM_F_DUMP_INTR flag is lost.
> 
> After a change in an address table, if a packet has been fully filled
> just previous, and if the end of the table is found at the same time,
> then the next packet should be flagged, which works fine when it's
> NLMSG_DONE, but gets clobbered when another table is to be dumped next.

Could you describe how it gets clobbered? You mean that prev_seq gets
updated somewhere without setting the flag or something overwrites
nlmsg_flags? Or we set _INTR on an empty skb which never ends up
getting sent? Or..

> A customer noticed the issue using kubernetes, when a large
> number of short-lived containers would push the system constantly
> towards this corner case.
> 
> I'm entertaining the following options:
> 
> 1) introduce a new packet type just to convey flags in cases like this.
> 2) preserve the flag and apply it to the NLMSG_DONE packet.
> 3) flag the first packet of the following table.
> 
> I don't like option 2 and 3 because we can't tell which table is
> affected, which I'm guessing programs might be relying on.
> 
> Option 1 adds a little bit of overhead, but enables us to tell which
> table is affected, and can be ignored by existing software that doesn't
> understand it, so IMHO it's the least disruptive option.
> 
> I want to have a little discussion before introducing a patch, since
> option 1 might have other implications I'm not aware of...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-15 16:00 ` Netlink NLM_F_DUMP_INTR flag lost Jakub Kicinski
@ 2022-06-16 15:10   ` Ismael Luceno
  2022-06-17  0:16     ` Jakub Kicinski
  0 siblings, 1 reply; 19+ messages in thread
From: Ismael Luceno @ 2022-06-16 15:10 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: David S. Miller, Paolo Abeni, David Ahern, netdev

On Wed, 15 Jun 2022 09:00:44 -0700
Jakub Kicinski <kuba@kernel.org> wrote:
> On Wed, 15 Jun 2022 17:11:13 +0200 Ismael Luceno wrote:
> > It seems a RTM_GETADDR request with AF_UNSPEC has a corner case
> > where the NLM_F_DUMP_INTR flag is lost.
> > 
> > After a change in an address table, if a packet has been fully
> > filled just previous, and if the end of the table is found at the
> > same time, then the next packet should be flagged, which works fine
> > when it's NLMSG_DONE, but gets clobbered when another table is to
> > be dumped next.
> 
> Could you describe how it gets clobbered? You mean that prev_seq gets
> updated somewhere without setting the flag or something overwrites
> nlmsg_flags? Or we set _INTR on an empty skb which never ends up
> getting sent? Or..

It seems to me that in most functions, but specifically in the case of
net/ipv4/devinet.c:in_dev_dump_addr or inet_netconf_dump_devconf,
nl_dump_check_consistent is called after each address/attribute is
dumped, meaning a hash table generation change happening just after it
adds an entry, if it also causes it to find the end of the table,
wouldn't be flagged.

Adding an extra call at the end of all these functions should fix that,
and spill the flag into the next packet, but would that be correct?

It seems this condition is flagged correctly when NLM_DONE is produced,
I couldn't see why, but I'm guessing another call to
nl_dump_check_consistent...

Also, I noticed that in net/core/rtnetlink.c:rtnl_dump_all: 

	if (idx > s_idx) {
		memset(&cb->args[0], 0, sizeof(cb->args));
		cb->prev_seq = 0;
		cb->seq = 0;
	}
	ret = dumpit(skb, cb);

This also prevents it to be detect the condition when dumping the next
table, but that seems desirable...

Am I grasping it correctly?

Some functions like net/core/rtnetlink.c:rtnl_dump_ifinfo do call
nl_dump_check_consistent when finishing, but I'm seeing most others
don't do that, instead doing it only when adding an entry to the packet
(another example is: rtnl_stats_dump).

Again, while adding the check at the end of each function would solve
these inconsistencies, it isn't so clear to me that spilling this flag
into the next packet when it's going to be from another table is a good
idea.

It might make more sense to emit a new packet type just for the flag,
that way, in the sequence of packets, the client can reliably tell the
dump of which tables was interrupted, and make some decision based on
that, vs having to deem all tables affected...

-- 
Ismael Luceno
SUSE L3 Support

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-16 15:10   ` Ismael Luceno
@ 2022-06-17  0:16     ` Jakub Kicinski
  2022-06-17 13:01       ` Ismael Luceno
  0 siblings, 1 reply; 19+ messages in thread
From: Jakub Kicinski @ 2022-06-17  0:16 UTC (permalink / raw)
  To: Ismael Luceno; +Cc: David S. Miller, Paolo Abeni, David Ahern, netdev

On Thu, 16 Jun 2022 17:10:16 +0200 Ismael Luceno wrote:
> On Wed, 15 Jun 2022 09:00:44 -0700
> Jakub Kicinski <kuba@kernel.org> wrote:
> > On Wed, 15 Jun 2022 17:11:13 +0200 Ismael Luceno wrote:  
> > > It seems a RTM_GETADDR request with AF_UNSPEC has a corner case
> > > where the NLM_F_DUMP_INTR flag is lost.
> > > 
> > > After a change in an address table, if a packet has been fully
> > > filled just previous, and if the end of the table is found at the
> > > same time, then the next packet should be flagged, which works fine
> > > when it's NLMSG_DONE, but gets clobbered when another table is to
> > > be dumped next.  
> > 
> > Could you describe how it gets clobbered? You mean that prev_seq gets
> > updated somewhere without setting the flag or something overwrites
> > nlmsg_flags? Or we set _INTR on an empty skb which never ends up
> > getting sent? Or..  
> 
> It seems to me that in most functions, but specifically in the case of
> net/ipv4/devinet.c:in_dev_dump_addr or inet_netconf_dump_devconf,
> nl_dump_check_consistent is called after each address/attribute is
> dumped, meaning a hash table generation change happening just after it
> adds an entry, if it also causes it to find the end of the table,
> wouldn't be flagged.
> 
> Adding an extra call at the end of all these functions should fix that,
> and spill the flag into the next packet, but would that be correct?

The whole _DUMP_INTR scheme does indeed seem to lack robustness.
The update side changes the atomic after the update, and the read
side validates it before. So there's plenty time for a race to happen.
But I'm not sure if that's what you mean or you see more issues.

> It seems this condition is flagged correctly when NLM_DONE is produced,
> I couldn't see why, but I'm guessing another call to
> nl_dump_check_consistent...
> 
> Also, I noticed that in net/core/rtnetlink.c:rtnl_dump_all: 
> 
> 	if (idx > s_idx) {
> 		memset(&cb->args[0], 0, sizeof(cb->args));
> 		cb->prev_seq = 0;
> 		cb->seq = 0;
> 	}
> 	ret = dumpit(skb, cb);
> 
> This also prevents it to be detect the condition when dumping the next
> table, but that seems desirable...

That's iterating over protocols, AFAICT, we don't guarantee consistency
across protocols.

> Am I grasping it correctly?
> 
> Some functions like net/core/rtnetlink.c:rtnl_dump_ifinfo do call
> nl_dump_check_consistent when finishing, but I'm seeing most others
> don't do that, instead doing it only when adding an entry to the packet
> (another example is: rtnl_stats_dump).
> 
> Again, while adding the check at the end of each function would solve
> these inconsistencies, it isn't so clear to me that spilling this flag
> into the next packet when it's going to be from another table is a good
> idea.
> 
> It might make more sense to emit a new packet type just for the flag,
> that way, in the sequence of packets, the client can reliably tell the
> dump of which tables was interrupted, and make some decision based on
> that, vs having to deem all tables affected...
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-17  0:16     ` Jakub Kicinski
@ 2022-06-17 13:01       ` Ismael Luceno
  2022-06-17 14:55         ` David Ahern
  2022-06-22 11:12         ` Ismael Luceno
  0 siblings, 2 replies; 19+ messages in thread
From: Ismael Luceno @ 2022-06-17 13:01 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: David S. Miller, Paolo Abeni, David Ahern, netdev

On Thu, 16 Jun 2022 17:16:12 -0700
Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 16 Jun 2022 17:10:16 +0200 Ismael Luceno wrote:
> > On Wed, 15 Jun 2022 09:00:44 -0700
> > Jakub Kicinski <kuba@kernel.org> wrote:  
> > > On Wed, 15 Jun 2022 17:11:13 +0200 Ismael Luceno wrote:    
> > > > It seems a RTM_GETADDR request with AF_UNSPEC has a corner case
> > > > where the NLM_F_DUMP_INTR flag is lost.
> > > > 
> > > > After a change in an address table, if a packet has been fully
> > > > filled just previous, and if the end of the table is found at
> > > > the same time, then the next packet should be flagged, which
> > > > works fine when it's NLMSG_DONE, but gets clobbered when
> > > > another table is to be dumped next.    
> > > 
> > > Could you describe how it gets clobbered? You mean that prev_seq
> > > gets updated somewhere without setting the flag or something
> > > overwrites nlmsg_flags? Or we set _INTR on an empty skb which
> > > never ends up getting sent? Or..    
> > 
> > It seems to me that in most functions, but specifically in the case
> > of net/ipv4/devinet.c:in_dev_dump_addr or inet_netconf_dump_devconf,
> > nl_dump_check_consistent is called after each address/attribute is
> > dumped, meaning a hash table generation change happening just after
> > it adds an entry, if it also causes it to find the end of the table,
> > wouldn't be flagged.
> > 
> > Adding an extra call at the end of all these functions should fix
> > that, and spill the flag into the next packet, but would that be
> > correct?  
> 
> The whole _DUMP_INTR scheme does indeed seem to lack robustness.
> The update side changes the atomic after the update, and the read
> side validates it before. So there's plenty time for a race to happen.
> But I'm not sure if that's what you mean or you see more issues.

No, I'm concerned that while in the dumping loop, the table might
change between iterations, and if this results in the loop not finding
more entries, because in most these functions there's no
consistency check after the loop, this will go undetected.

> > It seems this condition is flagged correctly when NLM_DONE is
> > produced, I couldn't see why, but I'm guessing another call to
> > nl_dump_check_consistent...
> > 
> > Also, I noticed that in net/core/rtnetlink.c:rtnl_dump_all: 
> > 
> > 	if (idx > s_idx) {
> > 		memset(&cb->args[0], 0, sizeof(cb->args));
> > 		cb->prev_seq = 0;
> > 		cb->seq = 0;
> > 	}
> > 	ret = dumpit(skb, cb);
> > 
> > This also prevents it to be detect the condition when dumping the
> > next table, but that seems desirable...  
> 
> That's iterating over protocols, AFAICT, we don't guarantee
> consistency across protocols.

That's reasonable, I was just wondering about it because it does seem
reasonable that the flags affect only the packets describing the table
whose dump got interrupted...

-- 
Ismael Luceno
SUSE L3 Support

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-17 13:01       ` Ismael Luceno
@ 2022-06-17 14:55         ` David Ahern
  2022-06-17 15:22           ` Jakub Kicinski
  2022-06-22 11:12         ` Ismael Luceno
  1 sibling, 1 reply; 19+ messages in thread
From: David Ahern @ 2022-06-17 14:55 UTC (permalink / raw)
  To: Ismael Luceno, Jakub Kicinski; +Cc: David S. Miller, Paolo Abeni, netdev

On 6/17/22 7:01 AM, Ismael Luceno wrote:
> On Thu, 16 Jun 2022 17:16:12 -0700
> Jakub Kicinski <kuba@kernel.org> wrote:
>> On Thu, 16 Jun 2022 17:10:16 +0200 Ismael Luceno wrote:
>>> On Wed, 15 Jun 2022 09:00:44 -0700
>>> Jakub Kicinski <kuba@kernel.org> wrote:  
>>>> On Wed, 15 Jun 2022 17:11:13 +0200 Ismael Luceno wrote:    
>>>>> It seems a RTM_GETADDR request with AF_UNSPEC has a corner case
>>>>> where the NLM_F_DUMP_INTR flag is lost.
>>>>>
>>>>> After a change in an address table, if a packet has been fully
>>>>> filled just previous, and if the end of the table is found at
>>>>> the same time, then the next packet should be flagged, which
>>>>> works fine when it's NLMSG_DONE, but gets clobbered when
>>>>> another table is to be dumped next.    
>>>>
>>>> Could you describe how it gets clobbered? You mean that prev_seq
>>>> gets updated somewhere without setting the flag or something
>>>> overwrites nlmsg_flags? Or we set _INTR on an empty skb which
>>>> never ends up getting sent? Or..    
>>>
>>> It seems to me that in most functions, but specifically in the case
>>> of net/ipv4/devinet.c:in_dev_dump_addr or inet_netconf_dump_devconf,
>>> nl_dump_check_consistent is called after each address/attribute is
>>> dumped, meaning a hash table generation change happening just after
>>> it adds an entry, if it also causes it to find the end of the table,
>>> wouldn't be flagged.
>>>
>>> Adding an extra call at the end of all these functions should fix
>>> that, and spill the flag into the next packet, but would that be
>>> correct?  
>>
>> The whole _DUMP_INTR scheme does indeed seem to lack robustness.
>> The update side changes the atomic after the update, and the read
>> side validates it before. So there's plenty time for a race to happen.
>> But I'm not sure if that's what you mean or you see more issues.
> 
> No, I'm concerned that while in the dumping loop, the table might
> change between iterations, and if this results in the loop not finding
> more entries, because in most these functions there's no
> consistency check after the loop, this will go undetected.

Specific example? e.g., fib dump and address dumps have a generation id
that gets recorded before the start of the dump and checked at the end
of the dump.

> 
>>> It seems this condition is flagged correctly when NLM_DONE is
>>> produced, I couldn't see why, but I'm guessing another call to
>>> nl_dump_check_consistent...
>>>
>>> Also, I noticed that in net/core/rtnetlink.c:rtnl_dump_all: 
>>>
>>> 	if (idx > s_idx) {
>>> 		memset(&cb->args[0], 0, sizeof(cb->args));
>>> 		cb->prev_seq = 0;
>>> 		cb->seq = 0;
>>> 	}
>>> 	ret = dumpit(skb, cb);
>>>
>>> This also prevents it to be detect the condition when dumping the
>>> next table, but that seems desirable...  
>>
>> That's iterating over protocols, AFAICT, we don't guarantee
>> consistency across protocols.
> 
> That's reasonable, I was just wondering about it because it does seem
> reasonable that the flags affect only the packets describing the table
> whose dump got interrupted...
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-17 14:55         ` David Ahern
@ 2022-06-17 15:22           ` Jakub Kicinski
  2022-06-17 16:17             ` David Ahern
  2022-06-17 16:28             ` David Ahern
  0 siblings, 2 replies; 19+ messages in thread
From: Jakub Kicinski @ 2022-06-17 15:22 UTC (permalink / raw)
  To: David Ahern; +Cc: Ismael Luceno, David S. Miller, Paolo Abeni, netdev

On Fri, 17 Jun 2022 08:55:53 -0600 David Ahern wrote:
> > No, I'm concerned that while in the dumping loop, the table might
> > change between iterations, and if this results in the loop not finding
> > more entries, because in most these functions there's no
> > consistency check after the loop, this will go undetected.  
> 
> Specific example? e.g., fib dump and address dumps have a generation id
> that gets recorded before the start of the dump and checked at the end
> of the dump.

FWIW what I think is strange is that we record the gen id before the
dump and then check if the recorded version was old. Like.. what's the
point of that? Nothing updates cb->seq while dumping AFAICS, so the
code is functionally equivalent to this right?

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 92b778e423df..0cd7482dc1f0 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2259,6 +2259,7 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
 		rcu_read_lock();
 		cb->seq = atomic_read(&net->ipv4.dev_addr_genid) ^
 			  net->dev_base_seq;
+		nl_dump_check_consistent(cb, nlmsg_hdr(skb));
 		hlist_for_each_entry_rcu(dev, head, index_hlist) {
 			if (idx < s_idx)
 				goto cont;
@@ -2276,7 +2277,6 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
 				rcu_read_unlock();
 				goto done;
 			}
-			nl_dump_check_consistent(cb, nlmsg_hdr(skb));
 cont:
 			idx++;
 		}



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-17 15:22           ` Jakub Kicinski
@ 2022-06-17 16:17             ` David Ahern
  2022-06-17 16:28             ` David Ahern
  1 sibling, 0 replies; 19+ messages in thread
From: David Ahern @ 2022-06-17 16:17 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Ismael Luceno, David S. Miller, Paolo Abeni, netdev

On 6/17/22 9:22 AM, Jakub Kicinski wrote:
> On Fri, 17 Jun 2022 08:55:53 -0600 David Ahern wrote:
>>> No, I'm concerned that while in the dumping loop, the table might
>>> change between iterations, and if this results in the loop not finding
>>> more entries, because in most these functions there's no
>>> consistency check after the loop, this will go undetected.  
>>
>> Specific example? e.g., fib dump and address dumps have a generation id
>> that gets recorded before the start of the dump and checked at the end
>> of the dump.
> 
> FWIW what I think is strange is that we record the gen id before the
> dump and then check if the recorded version was old. Like.. what's the
> point of that? Nothing updates cb->seq while dumping AFAICS, so the

while dumping, no, because the rtnl is locked.

The genid is used across syscalls when dumping a table that does not fit
within a 64kB message.


> code is functionally equivalent to this right?
> 
> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
> index 92b778e423df..0cd7482dc1f0 100644
> --- a/net/ipv4/devinet.c
> +++ b/net/ipv4/devinet.c
> @@ -2259,6 +2259,7 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
>  		rcu_read_lock();
>  		cb->seq = atomic_read(&net->ipv4.dev_addr_genid) ^
>  			  net->dev_base_seq;
> +		nl_dump_check_consistent(cb, nlmsg_hdr(skb));
>  		hlist_for_each_entry_rcu(dev, head, index_hlist) {
>  			if (idx < s_idx)
>  				goto cont;
> @@ -2276,7 +2277,6 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
>  				rcu_read_unlock();
>  				goto done;
>  			}
> -			nl_dump_check_consistent(cb, nlmsg_hdr(skb));
>  cont:
>  			idx++;
>  		}
> 
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-17 15:22           ` Jakub Kicinski
  2022-06-17 16:17             ` David Ahern
@ 2022-06-17 16:28             ` David Ahern
  2022-08-24 10:59               ` Ismael Luceno
  1 sibling, 1 reply; 19+ messages in thread
From: David Ahern @ 2022-06-17 16:28 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Ismael Luceno, David S. Miller, Paolo Abeni, netdev

On 6/17/22 9:22 AM, Jakub Kicinski wrote:
> On Fri, 17 Jun 2022 08:55:53 -0600 David Ahern wrote:
>>> No, I'm concerned that while in the dumping loop, the table might
>>> change between iterations, and if this results in the loop not finding
>>> more entries, because in most these functions there's no
>>> consistency check after the loop, this will go undetected.  
>>
>> Specific example? e.g., fib dump and address dumps have a generation id
>> that gets recorded before the start of the dump and checked at the end
>> of the dump.
> 
> FWIW what I think is strange is that we record the gen id before the
> dump and then check if the recorded version was old. Like.. what's the
> point of that? Nothing updates cb->seq while dumping AFAICS, so the
> code is functionally equivalent to this right?
> 
> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
> index 92b778e423df..0cd7482dc1f0 100644
> --- a/net/ipv4/devinet.c
> +++ b/net/ipv4/devinet.c
> @@ -2259,6 +2259,7 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
>  		rcu_read_lock();
>  		cb->seq = atomic_read(&net->ipv4.dev_addr_genid) ^
>  			  net->dev_base_seq;
> +		nl_dump_check_consistent(cb, nlmsg_hdr(skb));
>  		hlist_for_each_entry_rcu(dev, head, index_hlist) {
>  			if (idx < s_idx)
>  				goto cont;
> @@ -2276,7 +2277,6 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
>  				rcu_read_unlock();
>  				goto done;
>  			}
> -			nl_dump_check_consistent(cb, nlmsg_hdr(skb));
>  cont:
>  			idx++;
>  		}
> 
> 


FWIW, net/ipv4/nexthop.c sets cb->seq at the end of the loop and the
nl_dump_check_consistent follows that.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-17 13:01       ` Ismael Luceno
  2022-06-17 14:55         ` David Ahern
@ 2022-06-22 11:12         ` Ismael Luceno
  2022-06-22 23:55           ` Jakub Kicinski
  1 sibling, 1 reply; 19+ messages in thread
From: Ismael Luceno @ 2022-06-22 11:12 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: David S. Miller, Paolo Abeni, David Ahern, netdev

On Fri, 17 Jun 2022 15:01:10 +0200
Ismael Luceno <iluceno@suse.de> wrote:
> On Thu, 16 Jun 2022 17:16:12 -0700
> Jakub Kicinski <kuba@kernel.org> wrote:
<...>
> > That's iterating over protocols, AFAICT, we don't guarantee
> > consistency across protocols.  
> 
> That's reasonable, I was just wondering about it because it does seem
> reasonable that the flags affect only the packets describing the table
> whose dump got interrupted...

So, just for clarification:


Scenario 1:
- 64 KB packet is filled.
- protocol table shrinks
- Next iteration finds it's done
- next protocol clears the seq, so nothing is flaged
- ...
- NLMSG_DONE (not flagged)

Scenario 2:
- 64 KB packet is filled.
- protocol table shrinks
- Next iteration finds it's done
- NLMSG_DONE (flagged with NLM_F_DUMP_INTR)

So, in order to break as little as possible, I was thinking about
introducing a new packet iff it happens we have to signal INTR between
protocols.

Does that sound good?

-- 
Ismael Luceno
SUSE L3 Support

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-22 11:12         ` Ismael Luceno
@ 2022-06-22 23:55           ` Jakub Kicinski
  2022-06-23  4:01             ` David Ahern
  0 siblings, 1 reply; 19+ messages in thread
From: Jakub Kicinski @ 2022-06-22 23:55 UTC (permalink / raw)
  To: Ismael Luceno; +Cc: David S. Miller, Paolo Abeni, David Ahern, netdev

On Wed, 22 Jun 2022 13:12:18 +0200 Ismael Luceno wrote:
> So, just for clarification:
> 
> Scenario 1:
> - 64 KB packet is filled.
> - protocol table shrinks
> - Next iteration finds it's done
> - next protocol clears the seq, so nothing is flaged
> - ...
> - NLMSG_DONE (not flagged)
> 
> Scenario 2:
> - 64 KB packet is filled.
> - protocol table shrinks
> - Next iteration finds it's done
> - NLMSG_DONE (flagged with NLM_F_DUMP_INTR)
> 
> So, in order to break as little as possible, I was thinking about
> introducing a new packet iff it happens we have to signal INTR between
> protocols.
> 
> Does that sound good?

Right, the question is what message can we introduce here which would
not break old user space?

The alternative of not wiping the _DUMP_INTR flag as we move thru
protocols seems more and more appealing, even tho I was initially
dismissive.

We should make sure we do one last consistency check before we return 0
from the handlers. Or even at the end of the loop in rtnl_dump_all().

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-22 23:55           ` Jakub Kicinski
@ 2022-06-23  4:01             ` David Ahern
  2022-06-23 16:03               ` Jakub Kicinski
  0 siblings, 1 reply; 19+ messages in thread
From: David Ahern @ 2022-06-23  4:01 UTC (permalink / raw)
  To: Jakub Kicinski, Ismael Luceno; +Cc: David S. Miller, Paolo Abeni, netdev

On 6/22/22 5:55 PM, Jakub Kicinski wrote:
> On Wed, 22 Jun 2022 13:12:18 +0200 Ismael Luceno wrote:
>> So, just for clarification:
>>
>> Scenario 1:
>> - 64 KB packet is filled.
>> - protocol table shrinks
>> - Next iteration finds it's done
>> - next protocol clears the seq, so nothing is flaged
>> - ...
>> - NLMSG_DONE (not flagged)
>>
>> Scenario 2:
>> - 64 KB packet is filled.
>> - protocol table shrinks
>> - Next iteration finds it's done
>> - NLMSG_DONE (flagged with NLM_F_DUMP_INTR)
>>
>> So, in order to break as little as possible, I was thinking about
>> introducing a new packet iff it happens we have to signal INTR between
>> protocols.
>>
>> Does that sound good?
> 
> Right, the question is what message can we introduce here which would
> not break old user space?

I would hope a "normal" message with just the flags set is processed by
userspace. iproute2 does - lib/libnetlink.c, rtnl_dump_filter_l(). It
checks the nlmsg_flags first.

> 
> The alternative of not wiping the _DUMP_INTR flag as we move thru
> protocols seems more and more appealing, even tho I was initially
> dismissive.
> 
> We should make sure we do one last consistency check before we return 0
> from the handlers. Or even at the end of the loop in rtnl_dump_all().

Seems like netlink_dump_done should handle that for the last dump?

That said, in rtnl_dump_all how about a flags check after dumpit() and
send the message if INTR is set? would need to adjust the return code of
rtnl_dump_all so netlink_dump knows the dump is not done yet.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-23  4:01             ` David Ahern
@ 2022-06-23 16:03               ` Jakub Kicinski
  2022-06-23 16:17                 ` David Ahern
  0 siblings, 1 reply; 19+ messages in thread
From: Jakub Kicinski @ 2022-06-23 16:03 UTC (permalink / raw)
  To: David Ahern; +Cc: Ismael Luceno, David S. Miller, Paolo Abeni, netdev

On Wed, 22 Jun 2022 22:01:37 -0600 David Ahern wrote:
> > Right, the question is what message can we introduce here which would
> > not break old user space?  
> 
> I would hope a "normal"

"normal" == with no attributes?

> message with just the flags set is processed by
> userspace. iproute2 does - lib/libnetlink.c, rtnl_dump_filter_l(). It
> checks the nlmsg_flags first.

🤞

> > The alternative of not wiping the _DUMP_INTR flag as we move thru
> > protocols seems more and more appealing, even tho I was initially
> > dismissive.
> > 
> > We should make sure we do one last consistency check before we return 0
> > from the handlers. Or even at the end of the loop in rtnl_dump_all().  
> 
> Seems like netlink_dump_done should handle that for the last dump?

Yeah, the problem is:
 - it gets lost between families when dumping all, and
 - if the dump get truncated _DUMP_INTR never gets set because many
   places only check consistency when outputting an object.

> That said, in rtnl_dump_all how about a flags check after dumpit() and
> send the message if INTR is set? would need to adjust the return code of
> rtnl_dump_all so netlink_dump knows the dump is not done yet.

Yup, the question for me is what's the risk / benefit of sending 
the empty message vs putting the _DUMP_INTR on the next family.
I'm leaning towards putting it on the next family and treating 
the entire dump as interrupted, do you reckon that's suboptimal?
User space can always dump family by family if it cares about
breaking the entire dump.

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index ac45328607f7..c36874d192ef 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3879,6 +3879,7 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 			continue;
 
 		if (idx > s_idx) {
+			nl_dump_check_consistent(cb, nlmsg_hdr(skb));
 			memset(&cb->args[0], 0, sizeof(cb->args));
 			cb->prev_seq = 0;
 			cb->seq = 0;


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-23 16:03               ` Jakub Kicinski
@ 2022-06-23 16:17                 ` David Ahern
  2022-06-23 16:36                   ` Jakub Kicinski
  0 siblings, 1 reply; 19+ messages in thread
From: David Ahern @ 2022-06-23 16:17 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Ismael Luceno, David S. Miller, Paolo Abeni, netdev

On 6/23/22 10:03 AM, Jakub Kicinski wrote:
> On Wed, 22 Jun 2022 22:01:37 -0600 David Ahern wrote:
>>> Right, the question is what message can we introduce here which would
>>> not break old user space?  
>>
>> I would hope a "normal"
> 
> "normal" == with no attributes?
> 
>> message with just the flags set is processed by
>> userspace. iproute2 does - lib/libnetlink.c, rtnl_dump_filter_l(). It
>> checks the nlmsg_flags first.
> 
> 🤞
> 
>>> The alternative of not wiping the _DUMP_INTR flag as we move thru
>>> protocols seems more and more appealing, even tho I was initially
>>> dismissive.
>>>
>>> We should make sure we do one last consistency check before we return 0
>>> from the handlers. Or even at the end of the loop in rtnl_dump_all().  
>>
>> Seems like netlink_dump_done should handle that for the last dump?
> 
> Yeah, the problem is:
>  - it gets lost between families when dumping all, and
>  - if the dump get truncated _DUMP_INTR never gets set because many
>    places only check consistency when outputting an object.
> 
>> That said, in rtnl_dump_all how about a flags check after dumpit() and
>> send the message if INTR is set? would need to adjust the return code of
>> rtnl_dump_all so netlink_dump knows the dump is not done yet.
> 
> Yup, the question for me is what's the risk / benefit of sending 
> the empty message vs putting the _DUMP_INTR on the next family.
> I'm leaning towards putting it on the next family and treating 
> the entire dump as interrupted, do you reckon that's suboptimal?

I think it is going to be misleading; the INTR flag needs to be set on
the dump that is affected.

All of the dumps should be checking the consistency at the end of the
dump - regardless of any remaining entries on a particular round (e.g.,
I mentioned this what the nexthop dump does). Worst case then is DONE
and INTR are set on the same message with no data, but it tells
explicitly the set of data affected.

> User space can always dump family by family if it cares about
> breaking the entire dump.
> 
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index ac45328607f7..c36874d192ef 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -3879,6 +3879,7 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
>  			continue;
>  
>  		if (idx > s_idx) {
> +			nl_dump_check_consistent(cb, nlmsg_hdr(skb));
>  			memset(&cb->args[0], 0, sizeof(cb->args));
>  			cb->prev_seq = 0;
>  			cb->seq = 0;
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-23 16:17                 ` David Ahern
@ 2022-06-23 16:36                   ` Jakub Kicinski
  2022-06-23 17:31                     ` David Ahern
  0 siblings, 1 reply; 19+ messages in thread
From: Jakub Kicinski @ 2022-06-23 16:36 UTC (permalink / raw)
  To: David Ahern; +Cc: Ismael Luceno, David S. Miller, Paolo Abeni, netdev

On Thu, 23 Jun 2022 10:17:17 -0600 David Ahern wrote:
> > Yup, the question for me is what's the risk / benefit of sending 
> > the empty message vs putting the _DUMP_INTR on the next family.
> > I'm leaning towards putting it on the next family and treating 
> > the entire dump as interrupted, do you reckon that's suboptimal?  
> 
> I think it is going to be misleading; the INTR flag needs to be set on
> the dump that is affected.

Right, it's a bit of a philosophical discussion but dump is delineated
but NLMSG_DONE. PF_UNSPEC dump is a single dump, not a group of multiple
independent per-family dumps. If we think of a nlmsg as a representation
of an object having an empty one is awkward. What if someone does a dump
to just count objects? Too speculative?

I guess one can argue either way, no empty messages is a weaker promise
and hopefully lower risk, hence my preference. Do you feel strongly for
the message? Do we flip a coin? :)

> All of the dumps should be checking the consistency at the end of the
> dump - regardless of any remaining entries on a particular round (e.g.,
> I mentioned this what the nexthop dump does). Worst case then is DONE
> and INTR are set on the same message with no data, but it tells
> explicitly the set of data affected.

Okay, perhaps we should put a WARN_ON_ONCE(seq && seq != prev_seq)
in rtnl_dump_all() then, to catch those who get it wrong.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-23 16:36                   ` Jakub Kicinski
@ 2022-06-23 17:31                     ` David Ahern
  2022-06-23 19:03                       ` Jakub Kicinski
  0 siblings, 1 reply; 19+ messages in thread
From: David Ahern @ 2022-06-23 17:31 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Ismael Luceno, David S. Miller, Paolo Abeni, netdev

On 6/23/22 10:36 AM, Jakub Kicinski wrote:
> On Thu, 23 Jun 2022 10:17:17 -0600 David Ahern wrote:
>>> Yup, the question for me is what's the risk / benefit of sending 
>>> the empty message vs putting the _DUMP_INTR on the next family.
>>> I'm leaning towards putting it on the next family and treating 
>>> the entire dump as interrupted, do you reckon that's suboptimal?  
>>
>> I think it is going to be misleading; the INTR flag needs to be set on
>> the dump that is affected.
> 
> Right, it's a bit of a philosophical discussion but dump is delineated
> but NLMSG_DONE. PF_UNSPEC dump is a single dump, not a group of multiple
> independent per-family dumps. If we think of a nlmsg as a representation
> of an object having an empty one is awkward. What if someone does a dump
> to just count objects? Too speculative?
> 
> I guess one can argue either way, no empty messages is a weaker promise
> and hopefully lower risk, hence my preference. Do you feel strongly for
> the message? Do we flip a coin? :)

I do not; history suggests it is a toss up.

> 
>> All of the dumps should be checking the consistency at the end of the
>> dump - regardless of any remaining entries on a particular round (e.g.,
>> I mentioned this what the nexthop dump does). Worst case then is DONE
>> and INTR are set on the same message with no data, but it tells
>> explicitly the set of data affected.
> 
> Okay, perhaps we should put a WARN_ON_ONCE(seq && seq != prev_seq)
> in rtnl_dump_all() then, to catch those who get it wrong.

with '!(nlh->msg_flags & INTR)' to catch seq numbers not matching and
the message was not flagged?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-23 17:31                     ` David Ahern
@ 2022-06-23 19:03                       ` Jakub Kicinski
  2022-06-28 19:38                         ` Ismael Luceno
  0 siblings, 1 reply; 19+ messages in thread
From: Jakub Kicinski @ 2022-06-23 19:03 UTC (permalink / raw)
  To: David Ahern, Ismael Luceno; +Cc: David S. Miller, Paolo Abeni, netdev

On Thu, 23 Jun 2022 11:31:34 -0600 David Ahern wrote:
> >> All of the dumps should be checking the consistency at the end of the
> >> dump - regardless of any remaining entries on a particular round (e.g.,
> >> I mentioned this what the nexthop dump does). Worst case then is DONE
> >> and INTR are set on the same message with no data, but it tells
> >> explicitly the set of data affected.  
> > 
> > Okay, perhaps we should put a WARN_ON_ONCE(seq && seq != prev_seq)
> > in rtnl_dump_all() then, to catch those who get it wrong.  
> 
> with '!(nlh->msg_flags & INTR)' to catch seq numbers not matching and
> the message was not flagged?

Yup.

Ismael, do you want to send a patch for either version of the solution
or do you expect one of us to do it?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-23 19:03                       ` Jakub Kicinski
@ 2022-06-28 19:38                         ` Ismael Luceno
  0 siblings, 0 replies; 19+ messages in thread
From: Ismael Luceno @ 2022-06-28 19:38 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: David Ahern, David S. Miller, Paolo Abeni, netdev

On Thu, 23 Jun 2022 12:03:07 -0700
Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 23 Jun 2022 11:31:34 -0600 David Ahern wrote:
> > >> All of the dumps should be checking the consistency at the end
> > >> of the dump - regardless of any remaining entries on a
> > >> particular round (e.g., I mentioned this what the nexthop dump
> > >> does). Worst case then is DONE and INTR are set on the same
> > >> message with no data, but it tells explicitly the set of data
> > >> affected.  
> > > 
> > > Okay, perhaps we should put a WARN_ON_ONCE(seq && seq != prev_seq)
> > > in rtnl_dump_all() then, to catch those who get it wrong.  
> > 
> > with '!(nlh->msg_flags & INTR)' to catch seq numbers not matching
> > and the message was not flagged?
> 
> Yup.
> 
> Ismael, do you want to send a patch for either version of the solution
> or do you expect one of us to do it?

I'll prepare a patch; thanks for the feedback.

-- 
Ismael Luceno
SUSE L3 Support

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-06-17 16:28             ` David Ahern
@ 2022-08-24 10:59               ` Ismael Luceno
  2022-08-24 11:46                 ` Florian Westphal
  0 siblings, 1 reply; 19+ messages in thread
From: Ismael Luceno @ 2022-08-24 10:59 UTC (permalink / raw)
  To: David Ahern
  Cc: Jakub Kicinski, David S. Miller, Paolo Abeni, netdev, Florian Westphal

On 17/Jun/2022 10:28, David Ahern wrote:
> On 6/17/22 9:22 AM, Jakub Kicinski wrote:
<...>
> > FWIW what I think is strange is that we record the gen id before the
> > dump and then check if the recorded version was old. Like.. what's
> > the point of that? Nothing updates cb->seq while dumping AFAICS, so
> > the code is functionally equivalent to this right?
<...> 
> FWIW, net/ipv4/nexthop.c sets cb->seq at the end of the loop and the
> nl_dump_check_consistent follows that.

[CCing Florian Westphal because he gave a related talk at netdev 2.2,
 maybe he can add something]

It seems to me now that most of the calls to nl_dump_check_consistent
are redundant.

Either the rtnl lock is explicitly taken:
- ethnl_tunnel_info_dumpit

Or are implicitly called with the rtnl lock held:
- rtnl_dump_ifinfo
- rtnl_dump_all
- in_dev_dump_addr
- inet_netconf_dump_devconf
- rtm_dump_nexthop
- rtm_dump_nexthop_bucket
- mpls_netconf_dump_devconf

Except:
- inet6_netconf_dump_devconf
- inet6_dump_addr

I assume the ones that rely on rcu_read_lock are safe too.

Also, the following ones set cb->seq just before calling it:
- rtm_dump_nh_ctx
- rtm_dump_res_bucket_ctx

Does it make sense to remove these calls or is anyone looking forward
to convert the functions to run without the rtnl lock?

Am I missing something here?

-- 
Ismael Luceno
SUSE - Support Level 3

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Netlink NLM_F_DUMP_INTR flag lost
  2022-08-24 10:59               ` Ismael Luceno
@ 2022-08-24 11:46                 ` Florian Westphal
  0 siblings, 0 replies; 19+ messages in thread
From: Florian Westphal @ 2022-08-24 11:46 UTC (permalink / raw)
  To: Ismael Luceno
  Cc: David Ahern, Jakub Kicinski, David S. Miller, Paolo Abeni,
	netdev, Florian Westphal

Ismael Luceno <iluceno@suse.de> wrote:
> It seems to me now that most of the calls to nl_dump_check_consistent
> are redundant.
> 
> Either the rtnl lock is explicitly taken:
> - ethnl_tunnel_info_dumpit

That guarantess *this* invocation of ethnl_tunnel_info_dumpit doesn't
change in between, but it doesn't guarantee consistency of the dump.

rtn_lock();
cb->seq = 1
 dumping...
 skb full
unlock();

nl_dump_check_consistent();
return-to-userspace
 /* write happens, base seq increments */

next recv():
rtn_lock();
seq is now > 1
 resume dumping from pos h
unlock();
nl_dump_check_consistent(); -> prev_seq is 1 but seq > 1 -> set INTR

It doesn't really matter if dumpit() grabs RTNL or another lock or no
lock at all (rcu) unless there is a guarantee that everything will fit
in one recv() call. I am not aware of such a guarantee anywhere.

If you meant that there is another nl_dump_check_consistent() that
already covers this then I missed it.

> I assume the ones that rely on rcu_read_lock are safe too.
> 
> Also, the following ones set cb->seq just before calling it:
> - rtm_dump_nh_ctx
> - rtm_dump_res_bucket_ctx

I'm not sure what you mean, do you mean

3443 out_err:
3444         cb->seq = net->nexthop.seq;
3445         nl_dump_check_consistent(cb, nlmsg_hdr(skb));
3446         return err;
3447 }

I don't see why this is buggy.  rtm_dump_nexthop_bucket() is called
with RTNL held.  Things were different in case of RCU, because we'd miss
flagging INTR in case change happened while doing the first/initial dump
invocation.

i.e.:
- dump starts
   // parallel modification, seq increments
- set seq to *incremented* number
- prev_seq is 0,

End of next round sees seq == incremented && prev_seq == seq, no INTR
set.

For the RCU case, seq needs to be set before dump starts.
For RTNL, no parallel modifications can happen, so the above is fine.

Modification can only happen after unlock, so next dump will see
prev_seq != seq and sets the INTR flag.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-08-24 11:46 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20220615171113.7d93af3e@pirotess>
2022-06-15 16:00 ` Netlink NLM_F_DUMP_INTR flag lost Jakub Kicinski
2022-06-16 15:10   ` Ismael Luceno
2022-06-17  0:16     ` Jakub Kicinski
2022-06-17 13:01       ` Ismael Luceno
2022-06-17 14:55         ` David Ahern
2022-06-17 15:22           ` Jakub Kicinski
2022-06-17 16:17             ` David Ahern
2022-06-17 16:28             ` David Ahern
2022-08-24 10:59               ` Ismael Luceno
2022-08-24 11:46                 ` Florian Westphal
2022-06-22 11:12         ` Ismael Luceno
2022-06-22 23:55           ` Jakub Kicinski
2022-06-23  4:01             ` David Ahern
2022-06-23 16:03               ` Jakub Kicinski
2022-06-23 16:17                 ` David Ahern
2022-06-23 16:36                   ` Jakub Kicinski
2022-06-23 17:31                     ` David Ahern
2022-06-23 19:03                       ` Jakub Kicinski
2022-06-28 19:38                         ` Ismael Luceno

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.