All of lore.kernel.org
 help / color / mirror / Atom feed
* netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN
@ 2021-05-12 22:17 Dexuan Cui
  2021-05-13  4:19 ` Dexuan Cui
  0 siblings, 1 reply; 7+ messages in thread
From: Dexuan Cui @ 2021-05-12 22:17 UTC (permalink / raw)
  To: netfilter-devel, netdev; +Cc: Stephen Hemminger, Haiyang Zhang, linux-kernel

Hi,
I'm debugging an iptables-restore failure, which happens about 5% of the
time when I keep stopping and starting the Linux VM. The VM has only 1
CPU, and kernel version is 4.15.0-1098-azure, but I suspect the issue may
also exist in the mainline Linux kernel.

When the failure happens, it's always caused by line 27 of the rule file:

  1 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
  2 *raw
  3 :PREROUTING ACCEPT [0:0]
  4 :OUTPUT ACCEPT [0:0]
  5 -A PREROUTING ! -s 168.63.129.16/32 -p tcp -j NOTRACK
  6 -A OUTPUT ! -d 168.63.129.16/32 -p tcp -j NOTRACK
  7 COMMIT
  8 # Completed on Fri Apr 23 09:22:59 2021
  9 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
 10 *filter
 11 :INPUT ACCEPT [2407:79190058]
 12 :FORWARD ACCEPT [0:0]
 13 :OUTPUT ACCEPT [1648:2190051]
 14 -A OUTPUT -d 169.254.169.254/32 -m owner --uid-owner 33 -j DROP
 15 COMMIT
 16 # Completed on Fri Apr 23 09:22:59 2021
 17 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
 18 *security
 19 :INPUT ACCEPT [2345:79155398]
 20 :FORWARD ACCEPT [0:0]
 21 :OUTPUT ACCEPT [1504:2129015]
 22 -A OUTPUT -d 168.63.129.16/32 -p tcp -m owner --uid-owner 0 -j ACCEPT
 23 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
 24 -A OUTPUT -d 168.63.129.16/32 -p tcp -m owner --uid-owner 0 -j ACCEPT
 25 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
 26 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
 27 COMMIT

The related part of the strace log is:

  1 socket(PF_INET, SOCK_RAW, IPPROTO_RAW) = 3
  2 getsockopt(3, SOL_IP, IPT_SO_GET_INFO, "security\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., [84]) = 0
  3 getsockopt(3, SOL_IP, IPT_SO_GET_ENTRIES, "security\0\357B\16Z\177\0\0Pg\355\0\0\0\0\0Pg\355\0\0\0\0\0"..., [880]) = 0
  4 setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = -1 EAGAIN (Resource temporarily unavailable)
  5 close(3)                          = 0
  6 write(2, "iptables-restore: line 27 failed"..., 33) = 33

The -EAGAIN error comes from line 1240 of xt_replace_table():

  do_ipt_set_ctl
    do_replace
      __do_replace
        xt_replace_table

1216 xt_replace_table(struct xt_table *table,
1217               unsigned int num_counters,
1218               struct xt_table_info *newinfo,
1219               int *error)
1220 {
1221         struct xt_table_info *private;
1222         unsigned int cpu;
1223         int ret;
1224
1225         ret = xt_jumpstack_alloc(newinfo);
1226         if (ret < 0) {
1227                 *error = ret;
1228                 return NULL;
1229         }
1230
1231         /* Do the substitution. */
1232         local_bh_disable();
1233         private = table->private;
1234
1235         /* Check inside lock: is the old number correct? */
1236         if (num_counters != private->number) {
1237                 pr_debug("num_counters != table->private->number (%u/%u)\n",
1238                          num_counters, private->number);
1239                 local_bh_enable();
1240                 *error = -EAGAIN;
1241                 return NULL;
1242         }

When the function returns -EAGAIN, the 'num_counters' is 5 while
'private->number' is 6.

If I re-run the iptables-restore program upon the failure, the program
will succeed.

I checked the function xt_replace_table() in the recent mainline kernel and it
looks like the function is the same.

It looks like there is a race condition between iptables-restore calls
getsockopt() to get the number of table entries and iptables call
setsockopt() to replace the entries? Looks like some other program is
concurrently calling getsockopt()/setsockopt() -- but it looks like this is
not the case according to the messages I print via trace_printk() around
do_replace() in do_ipt_set_ctl(): when the -EAGAIN error happens, there is
no other program calling do_replace(); the table entry number was changed
to 5 by another program 'iptables' about 1.3 milliseconds ago, and then
this program 'iptables-restore' calls setsockopt() and the kernel sees
'num_counters' being 5 and the 'private->number' being 6 (how can this
happen??); the next setsockopt() call for the same 'security' table
happens in about 1 minute with both the numbers being 6.

Can you please shed some light on the issue? Thanks!

BTW, iptables does have a retry mechanism for getsockopt():
2f93205b375e ("Retry ruleset dump when kernel returns EAGAIN.")
(https://git.netfilter.org/iptables/commit/libiptc?id=2f93205b375e&context=10&ignorews=0&dt=0)

But it looks like this is enough? e.g. here getsockopt() returns 0, but
setsockopt() returns -EAGAIN.

Thanks,
Dexuan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN
  2021-05-12 22:17 netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN Dexuan Cui
@ 2021-05-13  4:19 ` Dexuan Cui
  2021-05-13  6:02   ` Dexuan Cui
  0 siblings, 1 reply; 7+ messages in thread
From: Dexuan Cui @ 2021-05-13  4:19 UTC (permalink / raw)
  To: 'netfilter-devel@vger.kernel.org',
	'netdev@vger.kernel.org'
  Cc: Stephen Hemminger, Haiyang Zhang, 'linux-kernel@vger.kernel.org'

> From: Dexuan Cui
> Sent: Wednesday, May 12, 2021 3:17 PM
> ...
> It looks like there is a race condition between iptables-restore calls
> getsockopt() to get the number of table entries and iptables call
> setsockopt() to replace the entries? Looks like some other program is
> concurrently calling getsockopt()/setsockopt() -- but it looks like this is
> not the case according to the messages I print via trace_printk() around
> do_replace() in do_ipt_set_ctl(): when the -EAGAIN error happens, there is
> no other program calling do_replace(); the table entry number was changed
> to 5 by another program 'iptables' about 1.3 milliseconds ago, and then
> this program 'iptables-restore' calls setsockopt() and the kernel sees
> 'num_counters' being 5 and the 'private->number' being 6 (how can this
> happen??); the next setsockopt() call for the same 'security' table
> happens in about 1 minute with both the numbers being 6.

Well, after tracing the code more closely, it turns out that the program
'iptables' indeed concurrently changes the number from 5 to 6, and this
causes the 'iptable-restores' program to get EAGAIN:

1. iptables-906 (906 is the process ID) calls IPT_SO_GET_ENTRIES and the
current num_entries is 4; the process calls IPT_SO_SET_REPLACE and the
private->number becomes 5.

2. iptables-restor-895 calls IPT_SO_GET_ENTRIES, and gets num_entries==5.

3. iptables-906 calls IPT_SO_GET_ENTRIES again, and also gets num_entries==5;
the process calls IPT_SO_SET_REPLACE and the private->number becomes 6.

4. iptables-restor-895 calls IPT_SO_SET_REPLACE and the kernel function
xt_replace_table() returns -EAGAIN due to num_counters == 5 and
private->number == 6.

I think the latest mainline kernel should also have the same race.
It looks like this by-design race exists since day one?

> BTW, iptables does have a retry mechanism for getsockopt():
> 2f93205b375e ("Retry ruleset dump when kernel returns EAGAIN.")
> (https://git.netfilter.org/iptables/commit/libiptc?id=2f93205b375e&context=10
> &ignorews=0&dt=0)
> 
> But it looks like this is enough? e.g. here getsockopt() returns 0, but
> setsockopt() returns -EAGAIN.

It looks like we need to add a retry mechanism to the iptables-restore
program: iptables/iptables-restore.c: ip46tables_restore_main():

if cb->ops->commit(handle) -> ... -> setsockopt(...,IPT_SO_SET_REPLACE, ...)
fails due to EAGAIN, it should start over from the very begining, i.e. call
create_handle() -> handle = cb->ops->init(tablename) again to get the new
num_entries, and retry the commit op. But I'm not sure how to easily
re-create the context associated with the old handle (should/can it re-parse
the rules?), as I'm not really familiar with iptables.

Or, is it possible to fix the race condition from the netfilter module?

Thanks,
-- Dexuan


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN
  2021-05-13  4:19 ` Dexuan Cui
@ 2021-05-13  6:02   ` Dexuan Cui
  2021-05-13  6:19     ` Dexuan Cui
  0 siblings, 1 reply; 7+ messages in thread
From: Dexuan Cui @ 2021-05-13  6:02 UTC (permalink / raw)
  To: 'netfilter-devel@vger.kernel.org',
	'netdev@vger.kernel.org'
  Cc: Stephen Hemminger, Haiyang Zhang, 'linux-kernel@vger.kernel.org'

> From: Dexuan Cui
> Sent: Wednesday, May 12, 2021 9:19 PM
> ...
> I think the latest mainline kernel should also have the same race.
> It looks like this by-design race exists since day one?

I indeed reproduced the issue with the latest stable tree (v5.12.3) as well.

> > BTW, iptables does have a retry mechanism for getsockopt():
> > 2f93205b375e ("Retry ruleset dump when kernel returns EAGAIN.")
> >
> (https://git.netfilter.org/iptables/commit/libiptc?id=2f93205b375e&context=10
> > &ignorews=0&dt=0)
> >
> > But it looks like this is enough? 
I missed a "not". IMO 2f93205b375e is not enough.

Thanks,
-- Dexuan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN
  2021-05-13  6:02   ` Dexuan Cui
@ 2021-05-13  6:19     ` Dexuan Cui
  2021-05-13  9:40       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 7+ messages in thread
From: Dexuan Cui @ 2021-05-13  6:19 UTC (permalink / raw)
  To: 'netfilter-devel@vger.kernel.org',
	'netdev@vger.kernel.org'
  Cc: Stephen Hemminger, Haiyang Zhang, 'linux-kernel@vger.kernel.org'

> From: Dexuan Cui
> Sent: Wednesday, May 12, 2021 11:02 PM

BTW, I found a similar report in 2019:

"
https://serverfault.com/questions/101022/error-applying-iptables-rules-using-iptables-restore
I stumbled upon this issue on Ubuntu 18.04. The netfilter-persistent
service failed randomly on boot while working ok when launched manually.
Turned out it was conflicting with sshguard service due to systemd trying
to load everything in parallel. What helped is to setting
ENABLE_FIREWALL=0 in /etc/default/sshguard and then adding sshguard chain
and rule manually to /etc/iptables/rules.v4 and /etc/iptables/rules.v6.
"

The above report provided a workaround. 
I think we need a real fix.

Thanks,
-- Dexuan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN
  2021-05-13  6:19     ` Dexuan Cui
@ 2021-05-13  9:40       ` Pablo Neira Ayuso
  2021-05-13 17:08         ` Phil Sutter
  0 siblings, 1 reply; 7+ messages in thread
From: Pablo Neira Ayuso @ 2021-05-13  9:40 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: 'netfilter-devel@vger.kernel.org',
	'netdev@vger.kernel.org',
	Stephen Hemminger, Haiyang Zhang,
	'linux-kernel@vger.kernel.org'

On Thu, May 13, 2021 at 06:19:38AM +0000, Dexuan Cui wrote:
> > From: Dexuan Cui
> > Sent: Wednesday, May 12, 2021 11:02 PM
> 
> BTW, I found a similar report in 2019:
> 
> "
> https://serverfault.com/questions/101022/error-applying-iptables-rules-using-iptables-restore
> I stumbled upon this issue on Ubuntu 18.04. The netfilter-persistent
> service failed randomly on boot while working ok when launched manually.
> Turned out it was conflicting with sshguard service due to systemd trying
> to load everything in parallel. What helped is to setting
> ENABLE_FIREWALL=0 in /etc/default/sshguard and then adding sshguard chain
> and rule manually to /etc/iptables/rules.v4 and /etc/iptables/rules.v6.
> "
> 
> The above report provided a workaround.

There's -w and -W to serialize ruleset updates. You could follow a
similar approach from userspace if you don't use iptables userspace
binary.

> I think we need a real fix.

iptables-nft already fixes this.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN
  2021-05-13  9:40       ` Pablo Neira Ayuso
@ 2021-05-13 17:08         ` Phil Sutter
  2021-05-13 20:40           ` Dexuan Cui
  0 siblings, 1 reply; 7+ messages in thread
From: Phil Sutter @ 2021-05-13 17:08 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Dexuan Cui, 'netfilter-devel@vger.kernel.org',
	'netdev@vger.kernel.org',
	Stephen Hemminger, Haiyang Zhang,
	'linux-kernel@vger.kernel.org'

Hi,

On Thu, May 13, 2021 at 11:40:47AM +0200, Pablo Neira Ayuso wrote:
> On Thu, May 13, 2021 at 06:19:38AM +0000, Dexuan Cui wrote:
> > > From: Dexuan Cui
> > > Sent: Wednesday, May 12, 2021 11:02 PM
> > 
> > BTW, I found a similar report in 2019:
> > 
> > "
> > https://serverfault.com/questions/101022/error-applying-iptables-rules-using-iptables-restore
> > I stumbled upon this issue on Ubuntu 18.04. The netfilter-persistent
> > service failed randomly on boot while working ok when launched manually.
> > Turned out it was conflicting with sshguard service due to systemd trying
> > to load everything in parallel. What helped is to setting
> > ENABLE_FIREWALL=0 in /etc/default/sshguard and then adding sshguard chain
> > and rule manually to /etc/iptables/rules.v4 and /etc/iptables/rules.v6.
> > "
> > 
> > The above report provided a workaround.
> 
> There's -w and -W to serialize ruleset updates. You could follow a
> similar approach from userspace if you don't use iptables userspace
> binary.

My guess is the xtables lock is not effective here, so waiting for it
probably won't help.

Dexuan, concurrent access is avoided in user space using a file-based
lock. So if multiple iptables(-restore) processes run in different
mount-namespaces, they might miss the other's /run/xtables.lock. Another
option would be if libiptc is used instead of calling iptables, but
that's more a shot in the dark - I don't know if libiptc doesn't support
obtaining the xtables lock.

> > I think we need a real fix.
> 
> iptables-nft already fixes this.

nftables (and therefore iptables-nft) implement transactional logic in
kernel, user space automatically retries if a transaction's commit
fails.

Cheers, Phil

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN
  2021-05-13 17:08         ` Phil Sutter
@ 2021-05-13 20:40           ` Dexuan Cui
  0 siblings, 0 replies; 7+ messages in thread
From: Dexuan Cui @ 2021-05-13 20:40 UTC (permalink / raw)
  To: Phil Sutter, Pablo Neira Ayuso
  Cc: 'netfilter-devel@vger.kernel.org',
	'netdev@vger.kernel.org',
	Stephen Hemminger, Haiyang Zhang,
	'linux-kernel@vger.kernel.org'

> From: n0-1@orbyte.nwl.cc <n0-1@orbyte.nwl.cc> On Behalf Of Phil Sutter
> Sent: Thursday, May 13, 2021 10:08 AM
> >
> > There's -w and -W to serialize ruleset updates. You could follow a
> > similar approach from userspace if you don't use iptables userspace
> > binary.
> 
> My guess is the xtables lock is not effective here, so waiting for it
> probably won't help.

Here iptables-restore v1.6.0 is used, and it does not support the -w and -W
options. :-)

New iptables-restore versions, e.g. 1.8.4-3ubuntu2, do support the -w/-W
options.
 
> Dexuan, concurrent access is avoided in user space using a file-based
> lock. So if multiple iptables(-restore) processes run in different
> mount-namespaces, they might miss the other's /run/xtables.lock. Another
> option would be if libiptc is used instead of calling iptables, but
> that's more a shot in the dark - I don't know if libiptc doesn't support
> obtaining the xtables lock.
> 
> > > I think we need a real fix.
> >
> > iptables-nft already fixes this.
>
> nftables (and therefore iptables-nft) implement transactional logic in
> kernel, user space automatically retries if a transaction's commit
> fails.
> 
> Cheers, Phil

Good to know. Thanks for the explanation!

It sounds like I need to either migrate to iptables-nft/nft or use a retry
workarouond (if iptables-restore-legacy fails due to EAGAIN).

Thanks,
-- Dexuan

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-05-13 20:40 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-12 22:17 netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN Dexuan Cui
2021-05-13  4:19 ` Dexuan Cui
2021-05-13  6:02   ` Dexuan Cui
2021-05-13  6:19     ` Dexuan Cui
2021-05-13  9:40       ` Pablo Neira Ayuso
2021-05-13 17:08         ` Phil Sutter
2021-05-13 20:40           ` Dexuan Cui

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.