netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN

* netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN
@ 2021-05-12 22:17 Dexuan Cui
  2021-05-13  4:19 ` Dexuan Cui
  0 siblings, 1 reply; 7+ messages in thread
From: Dexuan Cui @ 2021-05-12 22:17 UTC (permalink / raw)
  To: netfilter-devel, netdev; +Cc: Stephen Hemminger, Haiyang Zhang, linux-kernel

Hi,
I'm debugging an iptables-restore failure, which happens about 5% of the
time when I keep stopping and starting the Linux VM. The VM has only 1
CPU, and kernel version is 4.15.0-1098-azure, but I suspect the issue may
also exist in the mainline Linux kernel.

When the failure happens, it's always caused by line 27 of the rule file:

  1 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
  2 *raw
  3 :PREROUTING ACCEPT [0:0]
  4 :OUTPUT ACCEPT [0:0]
  5 -A PREROUTING ! -s 168.63.129.16/32 -p tcp -j NOTRACK
  6 -A OUTPUT ! -d 168.63.129.16/32 -p tcp -j NOTRACK
  7 COMMIT
  8 # Completed on Fri Apr 23 09:22:59 2021
  9 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
 10 *filter
 11 :INPUT ACCEPT [2407:79190058]
 12 :FORWARD ACCEPT [0:0]
 13 :OUTPUT ACCEPT [1648:2190051]
 14 -A OUTPUT -d 169.254.169.254/32 -m owner --uid-owner 33 -j DROP
 15 COMMIT
 16 # Completed on Fri Apr 23 09:22:59 2021
 17 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
 18 *security
 19 :INPUT ACCEPT [2345:79155398]
 20 :FORWARD ACCEPT [0:0]
 21 :OUTPUT ACCEPT [1504:2129015]
 22 -A OUTPUT -d 168.63.129.16/32 -p tcp -m owner --uid-owner 0 -j ACCEPT
 23 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
 24 -A OUTPUT -d 168.63.129.16/32 -p tcp -m owner --uid-owner 0 -j ACCEPT
 25 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
 26 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
 27 COMMIT

The related part of the strace log is:

  1 socket(PF_INET, SOCK_RAW, IPPROTO_RAW) = 3
  2 getsockopt(3, SOL_IP, IPT_SO_GET_INFO, "security\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., [84]) = 0
  3 getsockopt(3, SOL_IP, IPT_SO_GET_ENTRIES, "security\0\357B\16Z\177\0\0Pg\355\0\0\0\0\0Pg\355\0\0\0\0\0"..., [880]) = 0
  4 setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = -1 EAGAIN (Resource temporarily unavailable)
  5 close(3)                          = 0
  6 write(2, "iptables-restore: line 27 failed"..., 33) = 33

The -EAGAIN error comes from line 1240 of xt_replace_table():

  do_ipt_set_ctl
    do_replace
      __do_replace
        xt_replace_table

1216 xt_replace_table(struct xt_table *table,
1217               unsigned int num_counters,
1218               struct xt_table_info *newinfo,
1219               int *error)
1220 {
1221         struct xt_table_info *private;
1222         unsigned int cpu;
1223         int ret;
1224
1225         ret = xt_jumpstack_alloc(newinfo);
1226         if (ret < 0) {
1227                 *error = ret;
1228                 return NULL;
1229         }
1230
1231         /* Do the substitution. */
1232         local_bh_disable();
1233         private = table->private;
1234
1235         /* Check inside lock: is the old number correct? */
1236         if (num_counters != private->number) {
1237                 pr_debug("num_counters != table->private->number (%u/%u)\n",
1238                          num_counters, private->number);
1239                 local_bh_enable();
1240                 *error = -EAGAIN;
1241                 return NULL;
1242         }

When the function returns -EAGAIN, the 'num_counters' is 5 while
'private->number' is 6.

If I re-run the iptables-restore program upon the failure, the program
will succeed.

I checked the function xt_replace_table() in the recent mainline kernel and it
looks like the function is the same.

It looks like there is a race condition between iptables-restore calls
getsockopt() to get the number of table entries and iptables call
setsockopt() to replace the entries? Looks like some other program is
concurrently calling getsockopt()/setsockopt() -- but it looks like this is
not the case according to the messages I print via trace_printk() around
do_replace() in do_ipt_set_ctl(): when the -EAGAIN error happens, there is
no other program calling do_replace(); the table entry number was changed
to 5 by another program 'iptables' about 1.3 milliseconds ago, and then
this program 'iptables-restore' calls setsockopt() and the kernel sees
'num_counters' being 5 and the 'private->number' being 6 (how can this
happen??); the next setsockopt() call for the same 'security' table
happens in about 1 minute with both the numbers being 6.

Can you please shed some light on the issue? Thanks!

BTW, iptables does have a retry mechanism for getsockopt():
2f93205b375e ("Retry ruleset dump when kernel returns EAGAIN.")
(https://git.netfilter.org/iptables/commit/libiptc?id=2f93205b375e&context=10&ignorews=0&dt=0)

But it looks like this is enough? e.g. here getsockopt() returns 0, but
setsockopt() returns -EAGAIN.

Thanks,
Dexuan

^ permalink raw reply	[flat|nested] 7+ messages in thread