All of lore.kernel.org
* Process Hang in __read_seqcount_begin
@ 2012-07-28 20:43 Peter LaDow
  0 siblings, 0 replies; 18+ messages in thread
From: Peter LaDow @ 2012-07-28 20:43 UTC (permalink / raw)
  To: linux-rt-users

We are running 3.0.36-rt57 on a powerpc box.  During some testing with
heavy loads and interfaces coming up/going down (specifically PPP), we
have run into a case where iptables hangs and cannot be killed.  It
requires a reboot to fix the problem.

Connecting the BDI and debugging the kernel, we get:

#0  get_counters (t=0xdd5145a0, counters=0xe3458000)
    at include/linux/seqlock.h:66
#1  0xc026b4ac in do_ipt_get_ctl (sk=<value optimized out>,
    cmd=<value optimized out>, user=0x10612078, len=<value optimized out>)
    at net/ipv4/netfilter/ip_tables.c:918
#2  0xc022226c in nf_sockopt (sk=<value optimized out>, pf=2 '\002',
    val=<value optimized out>, opt=<value optimized out>, len=0xdd4c7d4c,
    get=1) at net/netfilter/nf_sockopt.c:109
#3  0xc0236b1c in ip_getsockopt (sk=0xdf071480, level=<value optimized out>,
    optname=65, optval=0x10612078 <Address 0x10612078 out of bounds>,
    optlen=0xbfbe0c2c) at net/ipv4/ip_sockglue.c:1308
#4  0xc02522a8 in raw_getsockopt (sk=0xdf071480, level=<value optimized out>,
    optname=<value optimized out>, optval=<value optimized out>,
    optlen=<value optimized out>) at net/ipv4/raw.c:811
#5  0xc01f4c38 in sock_common_getsockopt (sock=<value optimized out>,
    level=<value optimized out>, optname=<value optimized out>,
    optval=<value optimized out>, optlen=<value optimized out>)
    at net/core/sock.c:2157
#6  0xc01f2df8 in sys_getsockopt (fd=<value optimized out>, level=0,
    optname=65, optval=0x10612078 <Address 0x10612078 out of bounds>,
    optlen=0xbfbe0c2c) at net/socket.c:1839
#7  0xc01f45b4 in sys_socketcall (call=15, args=<value optimized out>)
    at net/socket.c:2421

It seems to be stuck in __read_seqcount_begin.  From include/linux/seqlock.h:

static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
        unsigned ret;

repeat:
        ret = ACCESS_ONCE(s->sequence);
        if (unlikely(ret & 1)) {
                cpu_relax();
  <----- It is always here
                goto repeat;
        }
        return ret;
}
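To make the failure mode concrete, here is a user-space sketch of the same
spin loop (illustrative only, not the kernel code; the names are made up):
a reader only returns once it observes an even sequence, so a counter left
permanently odd livelocks every reader, which matches the backtrace above.

```c
#include <assert.h>

/* Minimal user-space model of __read_seqcount_begin(). */
struct seqcount_model {
	volatile unsigned sequence;
};

/* Spin until the sequence is even (no writer in progress).  A retry
 * cap is added so this model cannot itself hang: it returns the
 * observed even sequence, or -1 if the counter stayed odd -- the
 * state our hung iptables process is stuck observing. */
static int read_begin_model(const struct seqcount_model *s, int max_spins)
{
	for (int i = 0; i < max_spins; i++) {
		unsigned ret = s->sequence;	/* ACCESS_ONCE() in the kernel */
		if (!(ret & 1))
			return (int)ret;	/* even: safe to read */
		/* odd: cpu_relax(); goto repeat; -- where we always are */
	}
	return -1;
}
```

With an even counter the loop exits immediately; with a counter stuck at an
odd value (a writer that never finished its section, or a corrupted count)
it never would.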

I've been scouring the mailing lists and Google searches trying to
find something, but thus far have come up with nothing.

Any tips would be appreciated.

Thanks,
Pete

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Process Hang in __read_seqcount_begin
  2012-10-30 23:09                       ` Peter LaDow
@ 2012-10-31  0:33                         ` Thomas Gleixner
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2012-10-31  0:33 UTC (permalink / raw)
  To: Peter LaDow; +Cc: Eric Dumazet, linux-kernel, linux-rt-users, netfilter

On Tue, 30 Oct 2012, Peter LaDow wrote:
> Anyway, based on earlier discussion, is there any reason not to use a
> lock (presuming any solution properly takes into account possible
> recursion)?  I understand that the mainline is protected, but perhaps
> in the RT version we can use seqlock (and prevent a recursive lock)?

Have you tried 3.6.3-rt9?


* Re: Process Hang in __read_seqcount_begin
  2012-10-26 21:54                     ` Thomas Gleixner
@ 2012-10-30 23:09                       ` Peter LaDow
  2012-10-31  0:33                         ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Peter LaDow @ 2012-10-30 23:09 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Eric Dumazet, linux-kernel, linux-rt-users, netfilter

Ok.  More of an update.  We've managed to create a scenario that
exhibits the problem much earlier.  We can now cause the lockup to
occur within a few hours (rather than the 12 to 24 hours in our other
scenario).

Our setup is to have a lot of traffic constantly being processed
by the netfilter code.  After about 2 hours, any external attempt to
read the table entries (such as with getsockopt and
IPT_SO_GET_ENTRIES) triggers the lockup.  What is strange is that this
does not appear until after a couple of hours of heavy traffic.  We
cannot trigger this problem in the first hour, rarely in the second
hour, and always after the second hour.

Now, our original setup did not have as much traffic.  But based upon
a quick, back-of-the-napkin computation, it seems to be occurring
after a certain amount of traffic.  I can try to get firmer numbers.
But this kind of behavior hints less at a race condition between two
writers, and is instead somehow dependent upon the amount of traffic.
Indeed, my test program only uses IPT_SO_GET_ENTRIES, which does not
trigger the second path to do_add_counters.  So I'm no longer thinking
the path through setsockopt is a cause of the problem.

So instead, it seems that the only way there could be multiple writers
(assuming that is the problem) is if there are multiple contexts
through which ipt_do_table() is called.  So far, my perusal of the
code indicates the only path is through the hooks in each of the
iptables modules.  And it isn't clear to me how these are called.  But
it does seem that even with the patch Eric provided (which fixes the
seqcount update), there is still a potential problem.

If indeed we have multiple contexts executing ipt_do_table(), it is
possible for more than just the seqcount to be corrupted.  Indeed, it
seems that any updates to the internal structures could cause
problems.  It isn't clear to me if there is anything modified here,
other than the counters, so I'm not sure if there are any other
issues.  But regardless, if the counters could become corrupted, then
it is possible to break any rules that use them.

Anyway, based on earlier discussion, is there any reason not to use a
lock (presuming any solution properly takes into account possible
recursion)?  I understand that the mainline is protected, but perhaps
in the RT version we can use seqlock (and prevent a recursive lock)?

Thanks,
Pete LaDow


* Re: Process Hang in __read_seqcount_begin
  2012-10-26 21:25                   ` Peter LaDow
@ 2012-10-26 21:54                     ` Thomas Gleixner
  2012-10-30 23:09                       ` Peter LaDow
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2012-10-26 21:54 UTC (permalink / raw)
  To: Peter LaDow; +Cc: Eric Dumazet, linux-kernel, linux-rt-users, netfilter

On Fri, 26 Oct 2012, Peter LaDow wrote:
> On Fri, Oct 26, 2012 at 2:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> If this were safe, we wouldn't be seeing this lockup and your patch
> wouldn't be needed.  So it seems that your patch doesn't really
> address the issue that we are not "sure a thread cannot be interrupted
> by a softirq, and cannot migrate to another cpu".  Well, we know it
> cannot migrate to another CPU, because there isn't another CPU.  So
> apparently, it can be interrupted by a softirq.  So local_bh_disable
> isn't doing anything useful in the RT patches with regard to this.

RT changes the semantics slightly.  And yes, it's not prepared for
stuff which relies on some of the implicit magic semantics of
mainline.

Let me have a look at the whole scenario, once I'm more awake.

Thanks,

	tglx


* Re: Process Hang in __read_seqcount_begin
  2012-10-26 21:05                 ` Eric Dumazet
@ 2012-10-26 21:25                   ` Peter LaDow
  2012-10-26 21:54                     ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Peter LaDow @ 2012-10-26 21:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, linux-rt-users, netfilter

On Fri, Oct 26, 2012 at 2:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Do you know what is per cpu data in linux kernel ?

I sorta did.  But since your response, I did more reading, and now I
see what you mean.  But I don't think this is a per cpu issue.  More
below.

> Because it's not needed.  Really, I don't know why you want that.
>
> Once you are sure a thread cannot be interrupted by a softirq, and
> cannot migrate to another CPU, access to per-cpu data doesn't need
> any other synchronization at all.

Because there are multiple entry points on the same CPU.  In
net/ipv4/netfilter/ip_tables.c, there are two entries to
xt_write_recseq_begin().  The first is in ipt_do_table and the other
is in get_counters.  Where we are seeing the lockup is with a
getsockopt syscall leading to get_counters.  The other path is through
ipt_do_table, which is installed as a hook.  I'm not sure from what
context the hooks are called, but it is certainly from a different
context than the syscall.
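To illustrate why preemption matters for these per-cpu updates, here is a
user-space sketch that interleaves two writers' non-atomic read-modify-write
steps by hand (a model only; the names are made up and the "preemption" is
simulated in sequence): writer A loads the sequence, is preempted while
writer B runs a complete begin/end pair, then A stores its stale value,
wiping out B's increments.

```c
/* Model of the plain read-modify-write behind __this_cpu_add() when
 * nothing prevents preemption between the load and the store. */
static unsigned seq_model;

static unsigned lost_update_demo(void)
{
	seq_model = 0;

	unsigned a = seq_model;		/* A: begin, loads 0, then is preempted */
	seq_model = seq_model + 1;	/* B: begin, 0 -> 1                     */
	seq_model = seq_model + 1;	/* B: end,   1 -> 2                     */
	seq_model = a + 1;		/* A: resumes, stores stale 0+1 = 1     */

	return seq_model;		/* 1, not the serialized value 3        */
}
```

On mainline this interleaving cannot happen, because local_bh_disable()
keeps the writer from being preempted between the load and the store; with
the RT semantics it apparently can.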

> Following sequence is safe :
>
> addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;
> /*
>  * This is kind of a write_seqcount_begin(), but addend is 0 or 1
>  * We dont check addend value to avoid a test and conditional jump,
>  * since addend is most likely 1
>  */
> __this_cpu_add(xt_recseq.sequence, addend);

If this were safe, we wouldn't be seeing this lockup and your patch
wouldn't be needed.  So it seems that your patch doesn't really
address the issue that we are not "sure a thread cannot be interrupted
by a softirq, and cannot migrate to another cpu".  Well, we know it
cannot migrate to another CPU, because there isn't another CPU.  So
apparently, it can be interrupted by a softirq.  So local_bh_disable
isn't doing anything useful in the RT patches with regard to this.

As I mentioned earlier, I think perhaps what your patch did was ensure
an atomic update of the sequence counter.  But it does nothing to
synchronize two writers.  If they were already synchronized (such as
via the call to local_bh_disable), then we wouldn't see sequence
counter corruption, no?

Pete


* Re: Process Hang in __read_seqcount_begin
  2012-10-26 18:51               ` Peter LaDow
  2012-10-26 20:53                 ` Thomas Gleixner
@ 2012-10-26 21:05                 ` Eric Dumazet
  2012-10-26 21:25                   ` Peter LaDow
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2012-10-26 21:05 UTC (permalink / raw)
  To: Peter LaDow; +Cc: linux-kernel, linux-rt-users, netfilter

On Fri, 2012-10-26 at 11:51 -0700, Peter LaDow wrote:
> (I've added netfilter and linux-rt-users to try to pull in more help).
> 
> On Fri, Oct 26, 2012 at 9:48 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Upstream kernel is fine, there is no race, as long as :
> >
> > local_bh_disable() disables BH and preemption.
> 
> Looking at the unpatched code in net/ipv4/netfilter/ip_tables.c, it
> doesn't appear that any of the code checks the return value for
> xt_write_receq_begin to determine if it is safe to write.  And neither
> does the newly patched code.  How did the mainline code prevent
> corruption of the tables it is updating?
> 

Do you know what is per cpu data in linux kernel ?

> Why isn't there something like
> 
>   while ( (addend = xt_write_recseq_begin()) == 0 );
> 
> To make sure that only one person has write access to the tables?
> Better yet, why not use a seqlock_t instead?
> 

Because it's not needed.  Really, I don't know why you want that.

Once you are sure a thread cannot be interrupted by a softirq, and
cannot migrate to another CPU, access to per-cpu data doesn't need any
other synchronization at all.

Following sequence is safe :

addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;
/*
 * This is kind of a write_seqcount_begin(), but addend is 0 or 1
 * We dont check addend value to avoid a test and conditional jump,
 * since addend is most likely 1
 */
__this_cpu_add(xt_recseq.sequence, addend);

Because any other thread will use different per-cpu data.
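For reference, the addend trick above can be modelled in user space like
this (a sketch that assumes begin and end run without preemption, which is
exactly the mainline guarantee; the names are made up):

```c
/* User-space model of xt_write_recseq_begin()/end(); illustrative only. */
static unsigned recseq_model;

/* addend is 1 if the count was even (first writer on this cpu),
 * 0 if it was odd (recursion detected). */
static unsigned write_recseq_begin_model(void)
{
	unsigned addend = (recseq_model + 1) & 1;

	recseq_model += addend;		/* __this_cpu_add() in the kernel */
	return addend;
}

static void write_recseq_end_model(unsigned addend)
{
	recseq_model += addend;		/* restores an even count */
}
```

The outer call gets addend 1 and makes the count odd; a recursive call sees
an odd count, gets addend 0, and the two ends together leave the count even
again.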




* Re: Process Hang in __read_seqcount_begin
  2012-10-26 18:51               ` Peter LaDow
@ 2012-10-26 20:53                 ` Thomas Gleixner
  2012-10-26 21:05                 ` Eric Dumazet
  1 sibling, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2012-10-26 20:53 UTC (permalink / raw)
  To: Peter LaDow; +Cc: Eric Dumazet, linux-kernel, linux-rt-users, netfilter

On Fri, 26 Oct 2012, Peter LaDow wrote:
> (I've added netfilter and linux-rt-users to try to pull in more help).
> 
> On Fri, Oct 26, 2012 at 9:48 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Upstream kernel is fine, there is no race, as long as :
> >
> > local_bh_disable() disables BH and preemption.
> 
> Looking at the unpatched code in net/ipv4/netfilter/ip_tables.c, it
> doesn't appear that any of the code checks the return value for
> xt_write_recseq_begin to determine if it is safe to write.  And neither
> does the newly patched code.  How did the mainline code prevent
> corruption of the tables it is updating?
> 
> Why isn't there something like:
> 
>   while ( (addend = xt_write_recseq_begin()) == 0 );
> 
> To make sure that only one person has write access to the tables?
> Better yet, why not use a seqlock_t instead?
> 
> > Apparently RT changes this, so RT needs to change the code.
> 
> The RT patch only touches local_bh_disable/enable, not the code in
> ip_tables.c.  Does the local_bh_disable/enable in the mainline code
> protect against multiple writers?
> 
> > cmpxchg() has strong guarantees (and is also slower than upstream code),
> > and seems a reasonable way to fix RT/iptables
> 
> I see this now.  And I agree that your patch would prevent corruption
> of the sequence counter.

Thanks for the reminder. I'll have a look.

       tglx


* Re: Process Hang in __read_seqcount_begin
  2012-10-26 16:48             ` Eric Dumazet
@ 2012-10-26 18:51               ` Peter LaDow
  2012-10-26 20:53                 ` Thomas Gleixner
  2012-10-26 21:05                 ` Eric Dumazet
  0 siblings, 2 replies; 18+ messages in thread
From: Peter LaDow @ 2012-10-26 18:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, linux-rt-users, netfilter

(I've added netfilter and linux-rt-users to try to pull in more help).

On Fri, Oct 26, 2012 at 9:48 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Upstream kernel is fine, there is no race, as long as :
>
> local_bh_disable() disables BH and preemption.

Looking at the unpatched code in net/ipv4/netfilter/ip_tables.c, it
doesn't appear that any of the code checks the return value for
xt_write_recseq_begin to determine if it is safe to write.  And neither
does the newly patched code.  How did the mainline code prevent
corruption of the tables it is updating?

Why isn't there something like:

  while ( (addend = xt_write_recseq_begin()) == 0 );

To make sure that only one person has write access to the tables?
Better yet, why not use a seqlock_t instead?

> Apparently RT changes this, so RT needs to change the code.

The RT patch only touches local_bh_disable/enable, not the code in
ip_tables.c.  Does the local_bh_disable/enable in the mainline code
protect against multiple writers?

> cmpxchg() has strong guarantees (and is also slower than upstream code),
> and seems a reasonable way to fix RT/iptables

I see this now.  And I agree that your patch would prevent corruption
of the sequence counter.

Thanks,
Pete


* Re: Process Hang in __read_seqcount_begin
  2012-10-26 16:15           ` Peter LaDow
@ 2012-10-26 16:48             ` Eric Dumazet
  2012-10-26 18:51               ` Peter LaDow
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2012-10-26 16:48 UTC (permalink / raw)
  To: Peter LaDow; +Cc: linux-kernel

On Fri, 2012-10-26 at 09:15 -0700, Peter LaDow wrote:
> On Tue, Oct 23, 2012 at 9:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Could you try following patch ?
> 
> So, I applied your patch.  And so far, it seems to have fixed the
> issue.  I've had my systems running for 48 hours, and no lockup in
> iptables.  Usually, I could get a lockup to occur within 12 to 24
> hours, and running this long tells me this patch may have things
> fixed.
> 
> Now, I have a couple of questions.
> 
> First, I'm not sure how this actually fixes things.  It does seem
> there was a race before.  But it isn't clear to me how this patch
> eliminates the race.  I fear that this patch only reduces the window
> in which a race could occur, but doesn't eliminate it completely.
>  

...

> Second, perhaps use seqlock_t instead of a seqcount_t for xt_recseq?
> This eliminates the problem.  But given that this code has used
> seqcount_t for a long time, and nobody else has had this issue
> appear, I wonder if this problem isn't something unique to the RT
> patches.  Perhaps only use seqlock_t in builds for PREEMPT_RT_FULL?
> 
> Third, recent RT patches (such as 3.6.2-rt4) have added
> preempt_disable_rt() and preempt_enable_rt() calls inside of
> read_seqcount_begin() and write_seqcount_end() respectively.  The call
> to local_bh_disable/local_bh_enable doesn't do anything to
> disable/enable preemption (in 3.6.2-rt4), but it does in 3.0.36-rt58.
> But in 3.0.36-rt58 it only did so if not in an atomic context.  And
> it doesn't appear to be an atomic context, since local_bh_disable
> increments preempt_count by SOFTIRQ_OFFSET.  And even when building
> with SMP enabled, where preempt_disable_rt() is called indirectly
> through local_bh_disable(), the lockup still occurred.
> 
> Finally, after more testing, if this patch proves a solution to the
> problem, we could apply it locally.  But what kind of testing would be
> required as part of a submission back to the general kernel/RT folk?
> The patch is easy enough to generate, but if it can't be proven that
> this patch actually fixes anything, I fail to see how it would be
> useful.
> 
> Thanks,
> Pete LaDow


Upstream kernel is fine, there is no race, as long as :

local_bh_disable() disables BH and preemption.

Apparently RT changes this, so RT needs to change the code.

cmpxchg() has strong guarantees (and is also slower than upstream code),
and seems a reasonable way to fix RT/iptables





* Re: Process Hang in __read_seqcount_begin
  2012-10-24  4:32         ` Eric Dumazet
  2012-10-24 16:30           ` Peter LaDow
@ 2012-10-26 16:15           ` Peter LaDow
  2012-10-26 16:48             ` Eric Dumazet
  1 sibling, 1 reply; 18+ messages in thread
From: Peter LaDow @ 2012-10-26 16:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel

On Tue, Oct 23, 2012 at 9:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Could you try following patch ?

So, I applied your patch.  And so far, it seems to have fixed the
issue.  I've had my systems running for 48 hours, and no lockup in
iptables.  Usually, I could get a lockup to occur within 12 to 24
hours, and running this long tells me this patch may have things
fixed.

Now, I have a couple of questions.

First, I'm not sure how this actually fixes things.  It does seem
there was a race before.  But it isn't clear to me how this patch
eliminates the race.  I fear that this patch only reduces the window
in which a race could occur, but doesn't eliminate it completely.

Second, perhaps use seqlock_t instead of a seqcount_t for xt_recseq?
This eliminates the problem.  But given that this code has used
seqcount_t for a long time, and nobody else has had this issue appear,
I wonder if this problem isn't something unique to the RT patches.
Perhaps only use seqlock_t in builds for PREEMPT_RT_FULL?

Third, recent RT patches (such as 3.6.2-rt4) have added
preempt_disable_rt() and preempt_enable_rt() calls inside of
read_seqcount_begin() and write_seqcount_end() respectively.  The call
to local_bh_disable/local_bh_enable doesn't do anything to
disable/enable preemption (in 3.6.2-rt4), but it does in 3.0.36-rt58.
But in 3.0.36-rt58 it only did so if not in an atomic context.  And
it doesn't appear to be an atomic context, since local_bh_disable
increments preempt_count by SOFTIRQ_OFFSET.  And even when building
with SMP enabled, where preempt_disable_rt() is called indirectly
through local_bh_disable(), the lockup still occurred.

Finally, after more testing, if this patch proves a solution to the
problem, we could apply it locally.  But what kind of testing would be
required as part of a submission back to the general kernel/RT folk?
The patch is easy enough to generate, but if it can't be proven that
this patch actually fixes anything, I fail to see how it would be
useful.

Thanks,
Pete LaDow


* Re: Process Hang in __read_seqcount_begin
  2012-10-24 16:30           ` Peter LaDow
@ 2012-10-24 16:44             ` Eric Dumazet
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2012-10-24 16:44 UTC (permalink / raw)
  To: Peter LaDow; +Cc: linux-kernel

On Wed, 2012-10-24 at 09:30 -0700, Peter LaDow wrote:
> On Tue, Oct 23, 2012 at 9:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Could you try following patch ?
> 
> Thanks for the suggestion. But I have a question about the patch below.
> 
> > +       /* Note : cmpxchg() is a memory barrier, we dont need smp_wmb() */
> > +       if (old != new && cmpxchg(&ptr->sequence, old, new) == old)
> > +               return 1;
> > +       return 0;
> 
> Looking at arch/powerpc/include/asm/system.h, cmpxchg is defined as a
> call to __cmpxchg_u32 (we are 32-bit, and I presume the size is 32
> bits):
> 
> static __always_inline unsigned long
> __cmpxchg_u32(volatile unsigned int *p, unsigned long old, unsigned long new)
> {
>         unsigned int prev;
> 
>         __asm__ __volatile__ (
>         PPC_RELEASE_BARRIER
> "1:     lwarx   %0,0,%2         # __cmpxchg_u32\n\
>         cmpw    0,%0,%3\n\
>         bne-    2f\n"
>         PPC405_ERR77(0,%2)
> "       stwcx.  %4,0,%2\n\
>         bne-    1b"
>         PPC_ACQUIRE_BARRIER
>         "\n\
> 2:"
>         : "=&r" (prev), "+m" (*p)
>         : "r" (p), "r" (old), "r" (new)
>         : "cc", "memory");
> 
>         return prev;
> }
> 
> And the interesting part is PPC_RELEASE_BARRIER and
> PPC_ACQUIRE_BARRIER.  Both of these are noops in non SMP systems.
> From arch/powerpc/include/asm/sync.h:
> 
> #ifdef CONFIG_SMP
> #define __PPC_ACQUIRE_BARRIER                           \
>         START_LWSYNC_SECTION(97);                       \
>         isync;                                          \
>         MAKE_LWSYNC_SECTION_ENTRY(97, __lwsync_fixup);
> #define PPC_ACQUIRE_BARRIER     "\n" stringify_in_c(__PPC_ACQUIRE_BARRIER)
> #define PPC_RELEASE_BARRIER     stringify_in_c(LWSYNC) "\n"
> #else
> #define PPC_ACQUIRE_BARRIER
> #define PPC_RELEASE_BARRIER
> #endif
> 
> So, if these are noops, does this really become an atomic operation?

On UP that would be an 'atomic operation' yes, because of lwarx/stwcx.
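A portable sketch of the same cmpxchg() contract using C11 atomics
(user-space model only, not the kernel code): the exchange itself is
atomic because of lwarx/stwcx. (a load-reserve/store-conditional pair
whose store fails if the word was modified in between); the compiled-out
barriers only add SMP ordering around it, which a single CPU does not
need.

```c
#include <stdatomic.h>

/* User-space model of the cmpxchg() contract: atomically replace *p
 * with newval iff it still equals oldval, and return the previous
 * value either way. */
static unsigned cmpxchg_model(_Atomic unsigned *p,
			      unsigned oldval, unsigned newval)
{
	unsigned expected = oldval;

	atomic_compare_exchange_strong(p, &expected, newval);
	return expected;	/* previous value, as cmpxchg() returns */
}
```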





* Re: Process Hang in __read_seqcount_begin
  2012-10-24  4:32         ` Eric Dumazet
@ 2012-10-24 16:30           ` Peter LaDow
  2012-10-24 16:44             ` Eric Dumazet
  2012-10-26 16:15           ` Peter LaDow
  1 sibling, 1 reply; 18+ messages in thread
From: Peter LaDow @ 2012-10-24 16:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel

On Tue, Oct 23, 2012 at 9:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Could you try following patch ?

Thanks for the suggestion. But I have a question about the patch below.

> +       /* Note : cmpxchg() is a memory barrier, we dont need smp_wmb() */
> +       if (old != new && cmpxchg(&ptr->sequence, old, new) == old)
> +               return 1;
> +       return 0;

Looking at arch/powerpc/include/asm/system.h, cmpxchg is defined as a
call to __cmpxchg_u32 (we are 32-bit, and I presume the size is 32
bits):

static __always_inline unsigned long
__cmpxchg_u32(volatile unsigned int *p, unsigned long old, unsigned long new)
{
        unsigned int prev;

        __asm__ __volatile__ (
        PPC_RELEASE_BARRIER
"1:     lwarx   %0,0,%2         # __cmpxchg_u32\n\
        cmpw    0,%0,%3\n\
        bne-    2f\n"
        PPC405_ERR77(0,%2)
"       stwcx.  %4,0,%2\n\
        bne-    1b"
        PPC_ACQUIRE_BARRIER
        "\n\
2:"
        : "=&r" (prev), "+m" (*p)
        : "r" (p), "r" (old), "r" (new)
        : "cc", "memory");

        return prev;
}

And the interesting part is PPC_RELEASE_BARRIER and
PPC_ACQUIRE_BARRIER.  Both of these are no-ops on non-SMP systems.
From arch/powerpc/include/asm/sync.h:

#ifdef CONFIG_SMP
#define __PPC_ACQUIRE_BARRIER                           \
        START_LWSYNC_SECTION(97);                       \
        isync;                                          \
        MAKE_LWSYNC_SECTION_ENTRY(97, __lwsync_fixup);
#define PPC_ACQUIRE_BARRIER     "\n" stringify_in_c(__PPC_ACQUIRE_BARRIER)
#define PPC_RELEASE_BARRIER     stringify_in_c(LWSYNC) "\n"
#else
#define PPC_ACQUIRE_BARRIER
#define PPC_RELEASE_BARRIER
#endif

So, if these are noops, does this really become an atomic operation?

Thanks,
Pete


* Re: Process Hang in __read_seqcount_begin
  2012-10-24  0:15       ` Peter LaDow
@ 2012-10-24  4:32         ` Eric Dumazet
  2012-10-24 16:30           ` Peter LaDow
  2012-10-26 16:15           ` Peter LaDow
  0 siblings, 2 replies; 18+ messages in thread
From: Eric Dumazet @ 2012-10-24  4:32 UTC (permalink / raw)
  To: Peter LaDow; +Cc: linux-kernel

On Tue, 2012-10-23 at 17:15 -0700, Peter LaDow wrote:
> (Sorry for the subject change, but I wanted to try and pull in those
> who work on RT issues, and the subject didn't make that obvious.
> Please search for the same subject without the RT Linux trailing
> text.)
> 
> Well, more information.  Even with SMP enabled (and presumably the
> migrate_enable having calls to preempt_disable), we still got the
> lockup in iptables-restore.  I did more digging, and it looks like the
> complete stack trace includes a call from iptables-restore (through
> setsockopt with IPT_SO_SET_ADD_COUNTERS).  This seems to be a
> potential multiple writer case where the counters are updated through
> the syscall and the kernel is updating the counters as it filters
> packets.
> 
> I think there might be a race on the update to xt_recseq.sequence,
> since the RT patches remove the spinlock in seqlock_t.  Thus multiple
> writers can corrupt the sequence count. 



>  And I thought the SMP
> configuration would disable preemption when local_bh_disable() is
> called.  And indeed, looking at the disassembly, I see
> preempt_disable() (though optimized, goes to add_preempt_count() ) is
> being called.
> 
> Yet we still see the lockup in the get_counters() in iptables.  Which,
> it seems, is because of some sort of problem with the sequence.  It
> doesn't appear to be related to the preemption, and perhaps there is
> some other corruption of the sequence counter happening.  But the only
> places I see it modified is in xt_write_recseq_begin and
> xt_write_recseq_end, which is only in the netfilter code
> (ip6_tables.c, ip_tables.c, and arp_tables.c).  And every single call
> is preceded by a call to local_bh_disable().
> 
> This problem is a huge one for us.  And so far I'm unable to track
> down how this is occurring.
> 
> Any other tips?  I presume this is the proper place for RT issues.

Hmm...

Could you try following patch ?

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 8d674a7..053f9e7 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -471,30 +471,24 @@ DECLARE_PER_CPU(seqcount_t, xt_recseq);
  *
  * Begin packet processing : all readers must wait the end
  * 1) Must be called with preemption disabled
- * 2) softirqs must be disabled too (or we should use this_cpu_add())
  * Returns :
  *  1 if no recursion on this cpu
  *  0 if recursion detected
  */
 static inline unsigned int xt_write_recseq_begin(void)
 {
-	unsigned int addend;
-
+	unsigned int old, new;
+	seqcount_t *ptr = &__get_cpu_var(xt_recseq);
 	/*
 	 * Low order bit of sequence is set if we already
 	 * called xt_write_recseq_begin().
 	 */
-	addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;
-
-	/*
-	 * This is kind of a write_seqcount_begin(), but addend is 0 or 1
-	 * We dont check addend value to avoid a test and conditional jump,
-	 * since addend is most likely 1
-	 */
-	__this_cpu_add(xt_recseq.sequence, addend);
-	smp_wmb();
-
-	return addend;
+	old = ptr->sequence;
+	new = old | 1;
+	/* Note : cmpxchg() is a memory barrier, we dont need smp_wmb() */
+	if (old != new && cmpxchg(&ptr->sequence, old, new) == old)
+		return 1;
+	return 0;
 }
 
 /**
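The patched begin can be modelled in user space like this (a sketch of the
patch's logic only, with a C11 compare-exchange standing in for the
kernel's cmpxchg() and a plain global standing in for the per-cpu counter):

```c
#include <stdatomic.h>

static _Atomic unsigned xt_recseq_model;

/* Model of the cmpxchg()-based xt_write_recseq_begin(): returns 1 and
 * sets the low bit if we won the slot, 0 if the bit was already set
 * (recursion) or another context changed the count underneath us. */
static unsigned int write_recseq_begin_cmpxchg_model(void)
{
	unsigned old = atomic_load(&xt_recseq_model);
	unsigned new_ = old | 1;

	if (old != new_ &&
	    atomic_compare_exchange_strong(&xt_recseq_model, &old, new_))
		return 1;
	return 0;
}
```

The first caller flips the count odd and wins; any caller that sees the
bit already set (old == new_) reports recursion, which is exactly the
addend-0 case of the original code.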




* Re: Process Hang in __read_seqcount_begin
  2012-10-22 19:24     ` Peter LaDow
@ 2012-10-24  0:15       ` Peter LaDow
  2012-10-24  4:32         ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Peter LaDow @ 2012-10-24  0:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: Eric Dumazet

(Sorry for the subject change, but I wanted to try and pull in those
who work on RT issues, and the subject didn't make that obvious.
Please search for the same subject without the RT Linux trailing
text.)

Well, more information.  Even with SMP enabled (and presumably the
migrate_enable having calls to preempt_disable), we still got the
lockup in iptables-restore.  I did more digging, and it looks like the
complete stack trace includes a call from iptables-restore (through
setsockopt with IPT_SO_SET_ADD_COUNTERS).  This seems to be a
potential multiple writer case where the counters are updated through
the syscall and the kernel is updating the counters as it filters
packets.

I think there might be a race on the update to xt_recseq.sequence,
since the RT patches remove the spinlock in seqlock_t.  Thus multiple
writers can corrupt the sequence count.  And I thought the SMP
configuration would disable preemption when local_bh_disable() is
called.  And indeed, looking at the disassembly, I see that
preempt_disable() (which, as optimized, resolves to a call to
add_preempt_count()) is being called.

Yet we still see the lockup in the get_counters() in iptables.  Which,
it seems, is because of some sort of problem with the sequence.  It
doesn't appear to be related to the preemption, and perhaps there is
some other corruption of the sequence counter happening.  But the only
places I see it modified is in xt_write_recseq_begin and
xt_write_recseq_end, which is only in the netfilter code
(ip6_tables.c, ip_tables.c, and arp_tables.c).  And every single call
is preceded by a call to local_bh_disable().

This problem is a huge one for us.  And so far I'm unable to track
down how this is occurring.

Any other tips?  I presume this is the proper place for RT issues.

Thanks,
Pete


* Re: Process Hang in __read_seqcount_begin
  2012-10-22 17:25   ` Peter LaDow
@ 2012-10-22 19:24     ` Peter LaDow
  2012-10-24  0:15       ` Peter LaDow
  0 siblings, 1 reply; 18+ messages in thread
From: Peter LaDow @ 2012-10-22 19:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel

> Now, is preemption required to be disabled in non-SMP systems?

I did more digging, and I found this.

In linux/netfilter/x_tables.h, there is the definition of
xt_write_recseq_begin.  This function updates the sequence number for
the sequence locks.  This is called in the iptables kernel code.  From
the header for the function:

/**
 * xt_write_recseq_begin - start of a write section
 *
 * Begin packet processing : all readers must wait the end
 * 1) Must be called with preemption disabled
 * 2) softirqs must be disabled too (or we should use irqsafe_cpu_add())
 * Returns :
 *  1 if no recursion on this cpu
 *  0 if recursion detected
 */

Note #1.  Your earlier response was that local_bh_disable should
disable preemption.  Indeed, looking at the calls to
xt_write_recseq_begin in the code, all are preceded by a call to
local_bh_disable().  That call is modified by the RT patch.  The
standard kernel goes through __local_bh_disable and its counterpart,
which do quite a bit of work; for example, the enable path in the
standard kernel:

static inline void _local_bh_enable_ip(unsigned long ip)
{
        WARN_ON_ONCE(in_irq() || irqs_disabled());
#ifdef CONFIG_TRACE_IRQFLAGS
        local_irq_disable();
#endif
        /*
         * Are softirqs going to be turned on now:
         */
        if (softirq_count() == SOFTIRQ_DISABLE_OFFSET)
                trace_softirqs_on(ip);
        /*
         * Keep preemption disabled until we are done with
         * softirq processing:
         */
        sub_preempt_count(SOFTIRQ_DISABLE_OFFSET - 1);

        if (unlikely(!in_interrupt() && local_softirq_pending()))
                do_softirq();

        dec_preempt_count();
#ifdef CONFIG_TRACE_IRQFLAGS
        local_irq_enable();
#endif
        preempt_check_resched();
}

But the patch changes things:

void local_bh_disable(void)
{
       migrate_disable();
       current->softirq_nestcnt++;
}

And the call to migrate_disable() is defined as a macro, depending
upon the configuration.  From the patch:

#ifdef CONFIG_PREEMPT_RT_FULL
# define preempt_disable_rt()           preempt_disable()
# define preempt_enable_rt()            preempt_enable()
# define preempt_disable_nort()         do { } while (0)
# define preempt_enable_nort()          do { } while (0)
# ifdef CONFIG_SMP
   extern void migrate_disable(void);
   extern void migrate_enable(void);
# else /* CONFIG_SMP */
#  define migrate_disable()             do { } while (0)
#  define migrate_enable()              do { } while (0)
# endif /* CONFIG_SMP */
#else
# define preempt_disable_rt()           do { } while (0)
# define preempt_enable_rt()            do { } while (0)
# define preempt_disable_nort()         preempt_disable()
# define preempt_enable_nort()          preempt_enable()
# define migrate_disable()              preempt_disable()
# define migrate_enable()               preempt_enable()
#endif

In our configuration, we have CONFIG_PREEMPT_RT_FULL defined and do
not have CONFIG_SMP defined, so migrate_disable() becomes a NOP.
This seems to violate requirement #1 of xt_write_recseq_begin().
Enabling SMP, however, does bring in the real migrate_disable()
function.

Should local_bh_disable() call the real function on non-SMP systems
as well, to meet the requirements of xt_write_recseq_begin()?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Process Hang in __read_seqcount_begin
  2012-10-22 17:01 ` Eric Dumazet
@ 2012-10-22 17:25   ` Peter LaDow
  2012-10-22 19:24     ` Peter LaDow
  0 siblings, 1 reply; 18+ messages in thread
From: Peter LaDow @ 2012-10-22 17:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel

On Mon, Oct 22, 2012 at 10:01 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> This looks like a corruption of s->sequence, and its value is odd
> even though no writer is alive.
>
> Does local_bh_disable() disable preemption on RT?

Hmmm....

With PREEMPT_RT_FULL defined (as we have):

void local_bh_disable(void)
{
        migrate_disable();
        current->softirq_nestcnt++;
}

And the RT patches add the following:

#ifdef CONFIG_PREEMPT_RT_FULL
# define preempt_disable_rt()          preempt_disable()
# define preempt_enable_rt()           preempt_enable()
# define preempt_disable_nort()                do { } while (0)
# define preempt_enable_nort()         do { } while (0)
# ifdef CONFIG_SMP
   extern void migrate_disable(void);
   extern void migrate_enable(void);
# else /* CONFIG_SMP */
#  define migrate_disable()            do { } while (0)
#  define migrate_enable()             do { } while (0)
# endif /* CONFIG_SMP */
#else
# define preempt_disable_rt()          do { } while (0)
# define preempt_enable_rt()           do { } while (0)
# define preempt_disable_nort()                preempt_disable()
# define preempt_enable_nort()         preempt_enable()
# define migrate_disable()             preempt_disable()
# define migrate_enable()              preempt_enable()
#endif

And since we are not SMP, local_bh_disable() essentially does nothing.
These definitions are consistent across all the RT patches, up to
3.6.2-rt4 (as far as I can tell).

Now, is preemption required to be disabled in non-SMP systems?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Process Hang in __read_seqcount_begin
  2012-10-22 16:46 Peter LaDow
@ 2012-10-22 17:01 ` Eric Dumazet
  2012-10-22 17:25   ` Peter LaDow
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2012-10-22 17:01 UTC (permalink / raw)
  To: Peter LaDow; +Cc: linux-kernel

On Mon, 2012-10-22 at 09:46 -0700, Peter LaDow wrote:
> I posted this problem some time back on the linux-rt-users and
> netfilter lists.  Since then, we thought we had a workaround to avoid
> this problem, so we dropped the issue.  But now 5 months later, the
> problem has reappeared.  And this time it is much more serious and
> much more difficult to re-create.  After perusing both those lists,
> I'm not sure if those were the proper places to post.  The netfilter
> list seems to be more focused on the user space side of things, and
> the RT page indicates that kernel side RT issues should go to lkml.
> 
> Anyway, here's a repost of that problem from July.  Perhaps somebody
> here can point us in the right direction.
> 
> We are running 3.0.36-rt57 on a powerpc box.  During some testing with
> heavy loads and interfaces coming up/going down (specifically PPP), we
> have run into a case where iptables hangs and cannot be killed.  It
> requires a reboot to fix the problem.
> 
> Connecting the BDI and debugging the kernel, we get:
> 
> #0  get_counters (t=0xdd5145a0, counters=0xe3458000)
>     at include/linux/seqlock.h:66
> #1  0xc026b4ac in do_ipt_get_ctl (sk=<value optimized out>,
>     cmd=<value optimized out>, user=0x10612078, len=<value optimized out>)
>     at net/ipv4/netfilter/ip_tables.c:918
> #2  0xc022226c in nf_sockopt (sk=<value optimized out>, pf=2 '\002',
>     val=<value optimized out>, opt=<value optimized out>, len=0xdd4c7d4c,
>     get=1) at net/netfilter/nf_sockopt.c:109
> #3  0xc0236b1c in ip_getsockopt (sk=0xdf071480, level=<value optimized out>,
>     optname=65, optval=0x10612078 <Address 0x10612078 out of bounds>,
>     optlen=0xbfbe0c2c) at net/ipv4/ip_sockglue.c:1308
> #4  0xc02522a8 in raw_getsockopt (sk=0xdf071480, level=<value optimized out>,
>     optname=<value optimized out>, optval=<value optimized out>,
>     optlen=<value optimized out>) at net/ipv4/raw.c:811
> #5  0xc01f4c38 in sock_common_getsockopt (sock=<value optimized out>,
>     level=<value optimized out>, optname=<value optimized out>,
>     optval=<value optimized out>, optlen=<value optimized out>)
>     at net/core/sock.c:2157
> #6  0xc01f2df8 in sys_getsockopt (fd=<value optimized out>, level=0,
>     optname=65, optval=0x10612078 <Address 0x10612078 out of bounds>,
>     optlen=0xbfbe0c2c) at net/socket.c:1839
> #7  0xc01f45b4 in sys_socketcall (call=15, args=<value optimized out>)
>     at net/socket.c:2421
> 
> It seems to be stuck in __read_seqcount_begin.  From include/linux/seqlock.h:
> 
> static inline unsigned __read_seqcount_begin(const seqcount_t *s)
> {
>         unsigned ret;
> 
> repeat:
>         ret = ACCESS_ONCE(s->sequence);
>         if (unlikely(ret & 1)) {
>                 cpu_relax();
>   <----- It is always here
>                 goto repeat;
>         }
>         return ret;
> }
> 
> I've been scouring the mailing lists and Google searches trying to
> find something, but thus far have come up with nothing.
> 
> Any tips would be appreciated.

This looks like a corruption of s->sequence, and its value is odd
even though no writer is alive.

Does local_bh_disable() disable preemption on RT?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Process Hang in __read_seqcount_begin
@ 2012-10-22 16:46 Peter LaDow
  2012-10-22 17:01 ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Peter LaDow @ 2012-10-22 16:46 UTC (permalink / raw)
  To: linux-kernel

I posted this problem some time back on the linux-rt-users and
netfilter lists.  Since then, we thought we had a workaround to avoid
this problem, so we dropped the issue.  But now 5 months later, the
problem has reappeared.  And this time it is much more serious and
much more difficult to re-create.  After perusing both those lists,
I'm not sure if those were the proper places to post.  The netfilter
list seems to be more focused on the user space side of things, and
the RT page indicates that kernel side RT issues should go to lkml.

Anyway, here's a repost of that problem from July.  Perhaps somebody
here can point us in the right direction.

We are running 3.0.36-rt57 on a powerpc box.  During some testing with
heavy loads and interfaces coming up/going down (specifically PPP), we
have run into a case where iptables hangs and cannot be killed.  It
requires a reboot to fix the problem.

Connecting the BDI and debugging the kernel, we get:

#0  get_counters (t=0xdd5145a0, counters=0xe3458000)
    at include/linux/seqlock.h:66
#1  0xc026b4ac in do_ipt_get_ctl (sk=<value optimized out>,
    cmd=<value optimized out>, user=0x10612078, len=<value optimized out>)
    at net/ipv4/netfilter/ip_tables.c:918
#2  0xc022226c in nf_sockopt (sk=<value optimized out>, pf=2 '\002',
    val=<value optimized out>, opt=<value optimized out>, len=0xdd4c7d4c,
    get=1) at net/netfilter/nf_sockopt.c:109
#3  0xc0236b1c in ip_getsockopt (sk=0xdf071480, level=<value optimized out>,
    optname=65, optval=0x10612078 <Address 0x10612078 out of bounds>,
    optlen=0xbfbe0c2c) at net/ipv4/ip_sockglue.c:1308
#4  0xc02522a8 in raw_getsockopt (sk=0xdf071480, level=<value optimized out>,
    optname=<value optimized out>, optval=<value optimized out>,
    optlen=<value optimized out>) at net/ipv4/raw.c:811
#5  0xc01f4c38 in sock_common_getsockopt (sock=<value optimized out>,
    level=<value optimized out>, optname=<value optimized out>,
    optval=<value optimized out>, optlen=<value optimized out>)
    at net/core/sock.c:2157
#6  0xc01f2df8 in sys_getsockopt (fd=<value optimized out>, level=0,
    optname=65, optval=0x10612078 <Address 0x10612078 out of bounds>,
    optlen=0xbfbe0c2c) at net/socket.c:1839
#7  0xc01f45b4 in sys_socketcall (call=15, args=<value optimized out>)
    at net/socket.c:2421

It seems to be stuck in __read_seqcount_begin.  From include/linux/seqlock.h:

static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
        unsigned ret;

repeat:
        ret = ACCESS_ONCE(s->sequence);
        if (unlikely(ret & 1)) {
                cpu_relax();
  <----- It is always here
                goto repeat;
        }
        return ret;
}

I've been scouring the mailing lists and Google searches trying to
find something, but thus far have come up with nothing.

Any tips would be appreciated.

Thanks,
Pete

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2012-10-31  0:33 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-28 20:43 Process Hang in __read_seqcount_begin Peter LaDow
2012-10-22 16:46 Peter LaDow
2012-10-22 17:01 ` Eric Dumazet
2012-10-22 17:25   ` Peter LaDow
2012-10-22 19:24     ` Peter LaDow
2012-10-24  0:15       ` Peter LaDow
2012-10-24  4:32         ` Eric Dumazet
2012-10-24 16:30           ` Peter LaDow
2012-10-24 16:44             ` Eric Dumazet
2012-10-26 16:15           ` Peter LaDow
2012-10-26 16:48             ` Eric Dumazet
2012-10-26 18:51               ` Peter LaDow
2012-10-26 20:53                 ` Thomas Gleixner
2012-10-26 21:05                 ` Eric Dumazet
2012-10-26 21:25                   ` Peter LaDow
2012-10-26 21:54                     ` Thomas Gleixner
2012-10-30 23:09                       ` Peter LaDow
2012-10-31  0:33                         ` Thomas Gleixner
