linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* reproducible bug in 2.2.19 & 2.4.x
@ 2001-09-28 11:01 Balazs Scheidler
       [not found] ` <Pine.LNX.4.10.10109281121590.6506-100000@coffee.psychology.mcmaster.ca>
  2001-10-02  8:31 ` Balazs Scheidler
  0 siblings, 2 replies; 7+ messages in thread
From: Balazs Scheidler @ 2001-09-28 11:01 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2326 bytes --]

Hi,

We encountered SIGSEGV problems in our massively multithreaded application,
which we tracked down to a kernel issue. At least it seems to be a kernel
issue. I wrote a simple test program attached to this mail, which can be
used to reproduce this SIGSEGV.

How to reproduce it:
  - each new session starts a new thread (TCP sessions)
  - sometimes sessions break up causing a SIGPIPE to be delivered
  - each thread logs messages using syslog (libc changes SIGPIPE settings
    before really sending the message to avoid SIGPIPE terminating the
    application)
  - SIGPIPE is SIG_IGNed (it doesn't seem to matter whether I use SIG_IGN or
    an empty signal handler function)

The test program attached (stressthreads.c) opens a socket, binds to a port,
and starts listening for new connections. In each new thread, it sends a
syslog message, writes 1MB in 1024 byte chunks to the connection, closes the
connection and exits.

Also attached a python script starting up 100 threads, and connecting to the
given ip/port, reading 1024 bytes of data, then closing the connection. So
effectively we cause SIGPIPE in each thread in stresstest during their write
call. To run the test, compile stressthreads.c on one computer and run it,
it'll start listening on 0.0.0.0:10000. Modify the end of test-zorp.py, and
set the correct IP address of your first host in the script.

Start the script, and the other end will crash.

The program SEGFAULTs within a second when run on our PIII 800 SMP
test system. The SEGFAULT did not occur on any of our non-SMP systems. (even
with the same PIII with a non-SMP kernel)

I added a backtrace function to my test program to show where it
aborts, and here's the result:

Signal (11) received, stackdump follows; eax='ffffffe0', ebx='0000000c', ecx='be5ff96c', edx='00000400', eip='00000001'
retaddr=0x1, ebp=0xbe5ff944
retaddr=0x804892a, ebp=0xbe5ffd74
retaddr=0x4001bc9f, ebp=0xbe5ffe34

The program _always_ aborted at eip=0x1.

the program didn't abort if I removed the syslog() function call from the
thread.

The results are same for a 2.4.5 kernel (so I assume it affects later kernel
versions as well). 

Ideas, solutions, any help welcome.

PS: I'm not subscribed, so please CC me the replies.

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1

[-- Attachment #2: stressthreads.c --]
[-- Type: text/x-csrc, Size: 2059 bytes --]

#include <pthread.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <signal.h>
#include <syslog.h>
#include <stdio.h>

void inline 
z_dump_backtrace(unsigned long eip, unsigned long first_ebp)
{
  /* NOTE: this is i386 specific */
  unsigned long *ebp;
  
  fprintf(stderr, "retaddr=0x%lx, ebp=0x%lx\n", eip, first_ebp);
  
  ebp = (unsigned long *) first_ebp;
  while (ebp > (unsigned long *) &ebp && *ebp) 
    {
      fprintf(stderr, "retaddr=0x%lx, ebp=0x%lx\n", *(ebp+1), *ebp);
      ebp = (unsigned long *) *ebp;
    }
}

void
z_fatal_signal_handler(int signo)
{
  struct sigcontext *p = (struct sigcontext *) (((char *) &p) + 16);

  fprintf(stderr, "Signal (%d) received, stackdump follows; eax='%08lx', ebx='%08lx', ecx='%08lx', edx='%08lx', eip='%08lx'\n",
        signo, p->eax, p->ebx, p->ecx, p->edx, p->eip);
  z_dump_backtrace(p->eip, p->ebp);
  exit(1);
}

void *thread_func(void *arg)
{
  int fd = *(int *) arg;
  int i;
  char buf[1024];

  syslog(LOG_DEBUG, "thread started...%p\n", pthread_self());
  memset(buf, 'a', sizeof(buf));
  for (i = 0; i < 1024; i++)
    {
      write(fd, buf, sizeof(buf));
    }
  close(fd);
  //syslog(LOG_DEBUG, "thread stopped...%p\n", pthread_self());
}

int main()
{
  int fd;
  struct sockaddr_in sin;
  int tmp = 1;
  
  signal(SIGSEGV, z_fatal_signal_handler);
  signal(SIGPIPE, SIG_IGN);
  
  fd = socket(AF_INET, SOCK_STREAM, 0);
  
  sin.sin_family = AF_INET;
  sin.sin_port = htons(10000);
  sin.sin_addr.s_addr = INADDR_ANY;
  
  setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &tmp, sizeof(tmp));
  
  if (bind(fd, (struct sockaddr *) &sin, sizeof(sin)) < 0)
    {
      perror("bind");
      return 0;
    }
  
  listen(fd, 255);
  
  while (1)
    {
      int newfd;
      int tmplen;
      pthread_t t;
      
      tmplen = sizeof(sin);
      newfd = accept(fd, (struct sockaddr *) &sin, &tmplen);
      if (newfd == -1)
        {
          perror("accept");
        }
      else
        {
          pthread_create(&t, NULL, thread_func, &newfd);
        }
    }
}

[-- Attachment #3: test-zorp.py --]
[-- Type: text/plain, Size: 918 bytes --]

#!/usr/bin/python

from socket import *
from time import time, sleep


from thread import start_new_thread, get_ident
from os import system
import sys

def httptest(name,url):
        id = get_ident()
        j=0
        while 1:
                j=j+1
                i="%u,%u"%(id,j)
                print "%s,%s,started"%(name,i)
                t1=time()
                try:
                        f=socket(AF_INET, SOCK_STREAM)
			f.connect(url)
			f.read(1024)
	                f.close()
                except:
                        print "%s,%s,failed,%s"%(name,i,sys.exc_value)
                t2=time()
                print "%s,%s,elapsed,%f"%(name,i,(t2-t1))

def test(name,url,count):
        for i in range(1,count):
                print "starting %s thread(%u,%u)"%(name,i,count)
                start_new_thread(httptest,(name,url))

test("rovid", ("10.0.0.2", 10000), 100)
system("ping 10.0.0.2")

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: reproducible bug in 2.2.19 & 2.4.x
       [not found] ` <Pine.LNX.4.10.10109281121590.6506-100000@coffee.psychology.mcmaster.ca>
@ 2001-09-28 18:30   ` Balazs Scheidler
       [not found]     ` <Pine.LNX.4.10.10109281515050.6506-100000@coffee.psychology.mcmaster.ca>
  0 siblings, 1 reply; 7+ messages in thread
From: Balazs Scheidler @ 2001-09-28 18:30 UTC (permalink / raw)
  To: Mark Hahn; +Cc: linux-kernel

On Fri, Sep 28, 2001 at 11:22:45AM -0400, Mark Hahn wrote:
> >   - each thread logs messages using syslog (libc changes SIGPIPE settings
> 
> it does syslog from within the signal handler?  I doubt that's legal.

you are right, but the problem is that the handler gets called at all.

the sigsegv doesn't happen within the signal handler, if you look at the
backtrace, it happens during the write() syscall.

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: reproducible bug in 2.2.19 & 2.4.x
       [not found]     ` <Pine.LNX.4.10.10109281515050.6506-100000@coffee.psychology.mcmaster.ca>
@ 2001-09-28 22:12       ` Balazs Scheidler
  2001-10-01 10:03       ` Balazs Scheidler
  1 sibling, 0 replies; 7+ messages in thread
From: Balazs Scheidler @ 2001-09-28 22:12 UTC (permalink / raw)
  To: Mark Hahn; +Cc: linux-kernel

On Fri, Sep 28, 2001 at 03:15:32PM -0400, Mark Hahn wrote:
> 
> seems like an interaction between your signal code and pthread,
> no obvious reason to blame the kernel:
[snip]

If you check out the code again, you can see that I _don't_ have signal code
for SIGPIPE, it's SIG_IGN-ed. I only have signal handler for SIGSEGV, which
is executed when the bogus jump occurrs to 0x1.

I agree that this can be either a kernel, pthread or libc problem.

BTW: the SIGSEGV occurs on UP machine with SMP kernel as well.

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: reproducible bug in 2.2.19 & 2.4.x
       [not found]     ` <Pine.LNX.4.10.10109281515050.6506-100000@coffee.psychology.mcmaster.ca>
  2001-09-28 22:12       ` Balazs Scheidler
@ 2001-10-01 10:03       ` Balazs Scheidler
  1 sibling, 0 replies; 7+ messages in thread
From: Balazs Scheidler @ 2001-10-01 10:03 UTC (permalink / raw)
  To: Mark Hahn; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 921 bytes --]

On Fri, Sep 28, 2001 at 03:15:32PM -0400, Mark Hahn wrote:
> 
> seems like an interaction between your signal code and pthread,
> no obvious reason to blame the kernel:

I've updated my test program to address your issues:

1) I removed the SIGSEGV handler (you know with the fprintf())
2) I modified it to use separate memory for all thread initialization data

The segfault still occurs. (to reenable backtrace #define BACKTRACE to 1 at
the beginning of the program) core is not dumped, because threaded programs
do not drop core.

I still think it's a kernel issue, at least the SMP dependancy tells me so.
(remember SMP kernel on an UP is affected too)

I also attach the strace of the process, which also shows the segfault. (it
is made by strace 4.4 which can follow threads)

PS: again please Cc me, since I'm not subscribed.

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1

[-- Attachment #2: stressthreads.c --]
[-- Type: text/x-csrc, Size: 2279 bytes --]

#include <pthread.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <signal.h>
#include <syslog.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#define BACKTRACE 0


#if BACKTRACE
void inline 
z_dump_backtrace(unsigned long eip, unsigned long first_ebp)
{
  /* NOTE: this is i386 specific */
  unsigned long *ebp;
  
  fprintf(stderr, "retaddr=0x%lx, ebp=0x%lx\n", eip, first_ebp);
  
  ebp = (unsigned long *) first_ebp;
  while (ebp > (unsigned long *) &ebp && *ebp) 
    {
      fprintf(stderr, "retaddr=0x%lx, ebp=0x%lx\n", *(ebp+1), *ebp);
      ebp = (unsigned long *) *ebp;
    }
}

void
z_fatal_signal_handler(int signo)
{
  struct sigcontext *p = (struct sigcontext *) (((char *) &p) + 16);

  fprintf(stderr, "Signal (%d) received, stackdump follows; eax='%08lx', ebx='%08lx', ecx='%08lx', edx='%08lx', eip='%08lx'\n",
        signo, p->eax, p->ebx, p->ecx, p->edx, p->eip);
  z_dump_backtrace(p->eip, p->ebp);
  exit(1);
}
#endif

void *thread_func(void *arg)
{
  int fd = *(int *) arg;
  int i;
  char buf[1024];

  syslog(LOG_DEBUG, "thread started...%p\n", pthread_self());
  memset(buf, 'a', sizeof(buf));
  for (i = 0; i < 1024; i++)
    {
      write(fd, buf, sizeof(buf));
    }
  close(fd);
  //syslog(LOG_DEBUG, "thread stopped...%p\n", pthread_self());
  free(arg);
  return NULL;
}

int main()
{
  int fd;
  struct sockaddr_in sin;
  int tmp = 1;
  
#if BACKTRACE
  signal(SIGSEGV, z_fatal_signal_handler);
#endif
  signal(SIGPIPE, SIG_IGN);
  
  fd = socket(AF_INET, SOCK_STREAM, 0);
  
  sin.sin_family = AF_INET;
  sin.sin_port = htons(10000);
  sin.sin_addr.s_addr = INADDR_ANY;
  
  setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &tmp, sizeof(tmp));
  
  if (bind(fd, (struct sockaddr *) &sin, sizeof(sin)) < 0)
    {
      perror("bind");
      return 0;
    }
  
  listen(fd, 255);
  
  while (1)
    {
      int newfd;
      int tmplen;
      pthread_t t;
      
      tmplen = sizeof(sin);
      newfd = accept(fd, (struct sockaddr *) &sin, &tmplen);
      if (newfd == -1)
        {
          perror("accept");
        }
      else
        {
          int *state = (int *) malloc(sizeof(int));
          
          *state = newfd;
          pthread_create(&t, NULL, thread_func, state);
        }
    }
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: reproducible bug in 2.2.19 & 2.4.x
  2001-09-28 11:01 reproducible bug in 2.2.19 & 2.4.x Balazs Scheidler
       [not found] ` <Pine.LNX.4.10.10109281121590.6506-100000@coffee.psychology.mcmaster.ca>
@ 2001-10-02  8:31 ` Balazs Scheidler
       [not found]   ` <20011003125319.A32248@sci41.imag.fr>
  1 sibling, 1 reply; 7+ messages in thread
From: Balazs Scheidler @ 2001-10-02  8:31 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 928 bytes --]

On Fri, Sep 28, 2001 at 01:01:38PM +0200, Balazs Scheidler wrote:
> I added a backtrace function to my test program to show where it
> aborts, and here's the result:
> 
> Signal (11) received, stackdump follows; eax='ffffffe0', ebx='0000000c', ecx='be5ff96c', edx='00000400', eip='00000001'
> retaddr=0x1, ebp=0xbe5ff944
> retaddr=0x804892a, ebp=0xbe5ffd74
> retaddr=0x4001bc9f, ebp=0xbe5ffe34
> 
> The program _always_ aborted at eip=0x1.
> 
> the program didn't abort if I removed the syslog() function call from the
> thread.

I received an idea, suggesting that syslog() is not reentrant and this
causes problems.

I added mutexes around my syslog call and the problem still occurs, although
slower.

I'm trying to remove syslog() and add some sigaction calls instead (which I
think is the culprit) I attach my modified stressthreads.c

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1

[-- Attachment #2: stressthreads.c --]
[-- Type: text/x-csrc, Size: 2414 bytes --]

#include <pthread.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <signal.h>
#include <syslog.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#define BACKTRACE 0


#if BACKTRACE
void inline 
z_dump_backtrace(unsigned long eip, unsigned long first_ebp)
{
  /* NOTE: this is i386 specific */
  unsigned long *ebp;
  
  fprintf(stderr, "retaddr=0x%lx, ebp=0x%lx\n", eip, first_ebp);
  
  ebp = (unsigned long *) first_ebp;
  while (ebp > (unsigned long *) &ebp && *ebp) 
    {
      fprintf(stderr, "retaddr=0x%lx, ebp=0x%lx\n", *(ebp+1), *ebp);
      ebp = (unsigned long *) *ebp;
    }
}

void
z_fatal_signal_handler(int signo)
{
  struct sigcontext *p = (struct sigcontext *) (((char *) &p) + 16);

  fprintf(stderr, "Signal (%d) received, stackdump follows; eax='%08lx', ebx='%08lx', ecx='%08lx', edx='%08lx', eip='%08lx'\n",
        signo, p->eax, p->ebx, p->ecx, p->edx, p->eip);
  z_dump_backtrace(p->eip, p->ebp);
  exit(1);
}
#endif

pthread_mutex_t syslog_mutex = PTHREAD_MUTEX_INITIALIZER;

void *thread_func(void *arg)
{
  int fd = *(int *) arg;
  int i;
  char buf[1024];

  pthread_mutex_lock(&syslog_mutex);
  syslog(LOG_DEBUG, "thread started...%p\n", pthread_self());
  pthread_mutex_unlock(&syslog_mutex);
  memset(buf, 'a', sizeof(buf));
  for (i = 0; i < 1024; i++)
    {
      write(fd, buf, sizeof(buf));
    }
  close(fd);
  //syslog(LOG_DEBUG, "thread stopped...%p\n", pthread_self());
  free(arg);
  return NULL;
}

int main()
{
  int fd;
  struct sockaddr_in sin;
  int tmp = 1;
  
#if BACKTRACE
  signal(SIGSEGV, z_fatal_signal_handler);
#endif
  signal(SIGPIPE, SIG_IGN);
  
  fd = socket(AF_INET, SOCK_STREAM, 0);
  
  sin.sin_family = AF_INET;
  sin.sin_port = htons(10000);
  sin.sin_addr.s_addr = INADDR_ANY;
  
  setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &tmp, sizeof(tmp));
  
  if (bind(fd, (struct sockaddr *) &sin, sizeof(sin)) < 0)
    {
      perror("bind");
      return 0;
    }
  
  listen(fd, 255);
  
  while (1)
    {
      int newfd;
      int tmplen;
      pthread_t t;
      
      tmplen = sizeof(sin);
      newfd = accept(fd, (struct sockaddr *) &sin, &tmplen);
      if (newfd == -1)
        {
          perror("accept");
        }
      else
        {
          int *state = (int *) malloc(sizeof(int));
          
          *state = newfd;
          pthread_create(&t, NULL, thread_func, state);
        }
    }
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: reproducible bug in 2.2.19 & 2.4.x [lkml]
       [not found]   ` <20011003125319.A32248@sci41.imag.fr>
@ 2001-10-03 11:41     ` Balazs Scheidler
  0 siblings, 0 replies; 7+ messages in thread
From: Balazs Scheidler @ 2001-10-03 11:41 UTC (permalink / raw)
  To: pierre.lombard; +Cc: linux-kernel

On Wed, Oct 03, 2001 at 12:55:16PM +0200, pierre.lombard@imag.fr wrote:
> Hi,
> 
> I checked glibc syslog() code in misc/syslog.[hc]
> 
> A possible explanation:
> 1. you ignore SIGPIPE
> 2. syslog communicates with syslogd thru a pipe and mess with sigpipe
>   handler
> 3. this breaks things.
> 
> Can you test the patch I attached, it seems to work fine here (whereas your
> snippet crashes quickly) and send feedback if it works ?

The problem is that my the problem occurs in a program completely
independent from my stressthreads program (it is meant only for
demonstration purposes). This SIGSEGV occurs in our multithreaded proxy
firewall software called Zorp.

We are using a dummy signal handler (not SIG_IGN), and the SIGSEGV still
occurs.

A last addition is that the problem occurs on non-SMP kernels as well,
albeit much more rarely.

My workaround is now not to use syslog() from libc. I implemented my own
syslog writing routines, and the problem didn't occur (though it's been
running for two days now, so nothing is for sure yet).

For the time being I'm also trying to reproduce the problem without
syslog() as I think it is a kernel related problem.

I think it must be related to the fact syslog() is changing signal handler
routines very often.

thanks for the feedback.

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* reproducible bug in 2.2.19 & 2.4.x
@ 2001-09-28 11:38 Rainer Keller
  0 siblings, 0 replies; 7+ messages in thread
From: Rainer Keller @ 2001-09-28 11:38 UTC (permalink / raw)
  To: Balazs Scheidler, linux-kernel

Hi Balazs,

Well, reading & running (on UP, no SIGSEGVs) Your code, I get the
impression, that there might be a small bug in it.

Your while-loop looks like this:
 while (1)
    {
      int newfd;
      int tmplen;
      pthread_t t;
      tmplen = sizeof(sin);
      newfd = accept(fd, (struct sockaddr *) &sin, &tmplen);
      ...
      pthread_create(&t, NULL, thread_func, &newfd);
   }

Now, if the thread_func is not started immediately on Your SMP-machine,
the next iteration of the loop might already have started the accept
with a new file-descriptor, leaving two threads reading on the same fd.

This should be per-thread data.

Just my 2 cents.

Greetings,
raY
\x18\x13

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2001-10-03 11:42 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-09-28 11:01 reproducible bug in 2.2.19 & 2.4.x Balazs Scheidler
     [not found] ` <Pine.LNX.4.10.10109281121590.6506-100000@coffee.psychology.mcmaster.ca>
2001-09-28 18:30   ` Balazs Scheidler
     [not found]     ` <Pine.LNX.4.10.10109281515050.6506-100000@coffee.psychology.mcmaster.ca>
2001-09-28 22:12       ` Balazs Scheidler
2001-10-01 10:03       ` Balazs Scheidler
2001-10-02  8:31 ` Balazs Scheidler
     [not found]   ` <20011003125319.A32248@sci41.imag.fr>
2001-10-03 11:41     ` reproducible bug in 2.2.19 & 2.4.x [lkml] Balazs Scheidler
2001-09-28 11:38 reproducible bug in 2.2.19 & 2.4.x Rainer Keller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).