ERESTARTSYS escaping from sem_wait with RTLinux patch

* ERESTARTSYS escaping from sem_wait with RTLinux patch
@ 2009-10-10  9:09 Blaise Gassend
  2009-10-10 16:40 ` ERESTARTSYS escaping from sem_wait with Preempt-RT Blaise Gassend
  2009-10-10 17:59 ` ERESTARTSYS escaping from sem_wait with RTLinux patch Thomas Gleixner
  0 siblings, 2 replies; 11+ messages in thread
From: Blaise Gassend @ 2009-10-10  9:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jeremy Leibs

[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]

The attached python program, in which 500 threads spin with microsecond
sleeps, crashes with a "sem_wait: Unknown error 512" (conditions
described below). This appears to be due to an ERESTARTSYS generated
from futex_wait escaping to user space (libc). My understanding is that
this should never happen and I am trying to track down what is going on.
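
For reference, 512 is the value of the kernel-internal ERESTARTSYS constant
(include/linux/errno.h). It has no user-space errno name, which would explain
why libc can only report it as an unknown error. A minimal check of that,
where is_kernel_internal is a hypothetical helper name and not part of any
library:

```python
import errno
import os

ERESTARTSYS = 512  # kernel-internal value, not exported to user space


def is_kernel_internal(code):
    """Return True if `code` has no user-space errno name (hypothetical helper)."""
    return code not in errno.errorcode


print(is_kernel_internal(ERESTARTSYS))   # no symbolic name is known for 512
print(is_kernel_internal(errno.EINTR))   # EINTR is a regular user-space errno
print(os.strerror(ERESTARTSYS))          # the fallback text libc produces
```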

Questions that would help me make progress:
-------------------------------------------

1) Where is the ERESTARTSYS being prevented from getting to user space? 

The only likely place I see for preventing ERESTARTSYS from escaping to
user space is in arch/*/kernel/signal*.c. However, I don't see how the
code there is being called if there is no signal pending. Is that a path
for ERESTARTSYS to escape from the kernel?

The following comment in kernel/futex.c in futex_wait makes me wonder if
two threads are getting marked as ERESTARTSYS. The first one to leave
the kernel processes the signal and restarts. The second one doesn't
have a signal to handle, so it returns to user space without getting
into signal*.c and wreaks havoc.

    (...)
        /*
         * We expect signal_pending(current), but another thread may
         * have handled it for us already.
         */
        if (!abs_time)
                return -ERESTARTSYS;
    (...)
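
The suspected sequence can be sketched as a toy model. This is not kernel
code: return_to_user is a hypothetical stand-in for the exit path, and the
model simply assumes that the restart code is only rewritten when a signal is
still pending for that thread.

```python
ERESTARTSYS = 512  # kernel-internal restart code


def return_to_user(retval, signal_pending):
    """Toy model of the syscall exit path (hypothetical, not kernel code)."""
    if signal_pending and retval == -ERESTARTSYS:
        # Signal-delivery path runs and turns the restart code into a restart.
        return "restart"
    # No signal pending: the value reaches user space untouched.
    return retval


# Two waiters come out of futex_wait with -ERESTARTSYS, but only one signal exists.
first = return_to_user(-ERESTARTSYS, signal_pending=True)
second = return_to_user(-ERESTARTSYS, signal_pending=False)
print(first, second)
```

In this model the second thread is the one that would see 512 in user space.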

2) Why would this be happening only with RT kernels?

3) Any suggestions on the best place to patch/workaround this? 

My understanding is that if I were to treat ERESTARTSYS as an EAGAIN,
most applications would be perfectly happy. Would bad things happen if I
replaced the ERESTARTSYS in futex_wait with an EAGAIN?
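
A user-space variant of the same idea would be to treat a leaked 512 like
EINTR and retry the call. The sketch below only illustrates that retry
pattern; retry_on_restart and flaky_wait are hypothetical names, and whether
retrying is actually safe for sem_wait is part of the question.

```python
import errno

ERESTARTSYS = 512  # kernel-internal value that should never reach user space


def retry_on_restart(call, max_tries=100):
    """Call `call()` and retry on EINTR or a leaked ERESTARTSYS (hypothetical helper)."""
    for _ in range(max_tries):
        try:
            return call()
        except OSError as e:
            if e.errno not in (errno.EINTR, ERESTARTSYS):
                raise  # unrelated error: pass it on unchanged
    raise OSError(errno.EAGAIN, "gave up after repeated restarts")


# Stand-in for a wait call that leaks ERESTARTSYS twice before succeeding.
attempts = {"count": 0}


def flaky_wait():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError(ERESTARTSYS, "Unknown error 512")
    return 0


result = retry_on_restart(flaky_wait)
print(result, attempts["count"])
```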

Crash conditions:
-----------------

- RT (Preempt-RT) kernels only.
- More cores seem to make things worse. Lots of crashes on a dual
quad-core machine. None observed yet on dual core. At least one crash on a
dual quad-core machine when run with "taskset -c 1".
- Various versions, including 2.6.29.6-rt23, and whatever the latest was
earlier today.
- Seen on both ia64 and x86
- Ubuntu hardy and jaunty
- Sometimes happens within 2 seconds on a dual quad-core machine, other
times will go for up to 30 minutes to an hour without crashing. I
suspect a dependence on system activity, but haven't noticed an obvious
pattern.
- Time to crash appears to drop fast with more CPU cores.

[-- Attachment #2: threadprocs8.py --]
[-- Type: text/x-python, Size: 222 bytes --]

import threading
import time

exiting = False

def spin():
    while not exiting:
        time.sleep(0.000001)

for i in range(0,500):
    threading.Thread(target=spin).start()

try:
    spin()
finally:
    exiting = True
