linux-kernel.vger.kernel.org archive mirror
* Re: PROBLEM: Bug in __pollwait() can cause select() and poll() to hang in 2.4.21
From: Andi Kleen @ 2003-06-19 16:42 UTC
  To: Ray Bryant; +Cc: linux-kernel

Ray Bryant <raybry@sgi.com> writes:
> 
>      select() and poll() call a common routine: __pollwait().  On the
> first call to __pollwait(), it calls __get_free_page(GFP_KERNEL) to
> allocate a table to hold wait queues.  In the natural course of things,
> this calls into __alloc_pages().  In low memory situations, the process
> can then end up in the rebalance code at the bottom of __alloc_pages()
> where there is a call to yield().  If the process makes this call, this
> is a bad thing [tm], since the process state at that point is
> TASK_INTERRUPTIBLE.  There is no wait queue yet for the process (that is
> done later in __pollwait()) and no schedule timeout event has yet been
> created (that is done later in select()) so the process will never
> return from the call to yield().

Nasty bug. How about adding a BUG() for current->state != TASK_RUNNING at 
the beginning of __alloc_pages unless GFP_ATOMIC is set?

-Andi
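
A rough userspace model of the check Andi proposes follows. Everything in it is hypothetical: the state names, the flag values, and the helpers alloc_state_would_bug() and model_alloc_entry_check() are made up for illustration. It also assumes the exemption should key on whether the allocation is allowed to sleep (the role __GFP_WAIT plays in 2.4) rather than on a literal GFP_ATOMIC test.

```c
#include <assert.h>

/* Hypothetical stand-ins for the 2.4 task states (values are not the kernel's). */
enum model_state {
    MODEL_TASK_RUNNING,
    MODEL_TASK_INTERRUPTIBLE,
    MODEL_TASK_UNINTERRUPTIBLE
};

/* Hypothetical "may sleep" bit, standing in for 2.4's __GFP_WAIT.
 * A GFP_ATOMIC-style request leaves it clear; a GFP_KERNEL-style one sets it. */
#define MODEL_GFP_WAIT   0x10u
#define MODEL_GFP_ATOMIC 0x00u
#define MODEL_GFP_KERNEL MODEL_GFP_WAIT

/* Returns 1 if the proposed debug check would fire: the caller is allowed
 * to sleep in the allocator but has already left TASK_RUNNING. */
static int alloc_state_would_bug(enum model_state state, unsigned int gfp_mask)
{
    if (!(gfp_mask & MODEL_GFP_WAIT))
        return 0;               /* non-sleeping requests are exempt in this model */
    return state != MODEL_TASK_RUNNING;
}

/* Stand-in for a BUG() at the top of __alloc_pages(): aborts via assert(). */
static void model_alloc_entry_check(enum model_state state, unsigned int gfp_mask)
{
    assert(!alloc_state_would_bug(state, gfp_mask));
}
```

For example, alloc_state_would_bug(MODEL_TASK_INTERRUPTIBLE, MODEL_GFP_KERNEL) returns 1, matching the __pollwait() case in the report, while the same state with MODEL_GFP_ATOMIC returns 0.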


* Re: PROBLEM: Bug in __pollwait() can cause select() and poll() to hang in 2.4.21
From: Ray Bryant @ 2003-06-19 17:18 UTC
  To: Andi Kleen; +Cc: linux-kernel

Andi Kleen wrote:
> 
> Nasty bug. How about adding a BUG() for current->state != TASK_RUNNING at
> the beginning of __alloc_pages unless GFP_ATOMIC is set?
> 
> -Andi

A good idea.  There may be other places like this that call into
__alloc_pages() with current->state != TASK_RUNNING.

However, adding the BUG() may break kernels that are running happily in
production today because they never get into low memory situations.
Might it not be better for me to run this in debug mode first and see
if we can find other places where this is happening?  If I don't find
any, then we can add the BUG() to keep this problem from being
reintroduced in the future.

-- 
Best Regards,
Ray
-----------------------------------------------
                  Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------
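
Ray's "debug mode" alternative could be modelled as a check that reports instead of crashing. This sketch is an assumption about how such a mode might work, not code from the thread: each offending call site is logged once to stderr and the allocation is allowed to proceed. The function report_nonrunning_alloc(), the site-name argument, and the fixed-size table are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model: 0 means running; any other value means the task
 * has already left TASK_RUNNING. */
#define MODEL_TASK_RUNNING 0
#define MAX_SITES 16

/* Distinct call sites seen allocating while not running. */
static const char *offending_sites[MAX_SITES];
static int num_offending_sites;

/* Debug-mode variant of the check: instead of BUG(), log each offending
 * call site once and let the allocation go ahead.
 * Returns 1 if this call recorded a new site, 0 otherwise. */
static int report_nonrunning_alloc(int state, const char *site)
{
    int i;

    assert(site != NULL);
    if (state == MODEL_TASK_RUNNING)
        return 0;
    for (i = 0; i < num_offending_sites; i++)
        if (strcmp(offending_sites[i], site) == 0)
            return 0;           /* already reported this site */
    if (num_offending_sites < MAX_SITES) {
        offending_sites[num_offending_sites++] = site;
        fprintf(stderr, "allocation with state %d from %s\n", state, site);
        return 1;
    }
    return 0;                   /* table full: drop the report */
}
```

Calling it twice with a non-running state and the same site name, such as "__pollwait", records the site on the first call (returning 1) and stays quiet on the second (returning 0), so a long test run yields a short list of distinct offenders.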


* PROBLEM: Bug in __pollwait() can cause select() and poll() to hang in 2.4.21
From: Ray Bryant @ 2003-06-19 15:35 UTC
  To: linux-kernel

[1.] One line summary of the problem:    

     In low memory situations, a process that issues a call to select()
or poll() can sleep forever in the kernel.

[2.] Full description of the problem/report:

     select() and poll() call a common routine: __pollwait().  On the
first call to __pollwait(), it calls __get_free_page(GFP_KERNEL) to
allocate a table to hold wait queues.  In the natural course of things,
this calls into __alloc_pages().  In low memory situations, the process
can then end up in the rebalance code at the bottom of __alloc_pages()
where there is a call to yield().  If the process makes this call, this
is a bad thing [tm], since the process state at that point is
TASK_INTERRUPTIBLE.  There is no wait queue yet for the process (that is
done later in __pollwait()) and no schedule timeout event has yet been
created (that is done later in select()) so the process will never
return from the call to yield().
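
The failure sequence can be captured in a toy model. This is not kernel code: the struct, its fields, and model_yield()/model_can_resume() are hypothetical, and the model simply encodes the report's premise that yielding while not TASK_RUNNING takes the task off the run queue. Signal delivery, which can also wake a TASK_INTERRUPTIBLE task, is deliberately left out.

```c
#include <assert.h>

/* Toy model of the sequence described above; not kernel code. */
enum model_state { MODEL_TASK_RUNNING, MODEL_TASK_INTERRUPTIBLE };

struct model_task {
    enum model_state state;
    int on_runqueue;    /* the scheduler can pick it again */
    int on_waitqueue;   /* a poll wait queue could wake it */
    int has_timeout;    /* a schedule timeout could wake it */
};

/* Model of yield() -> schedule(): per the report's premise, a task whose
 * state is not TASK_RUNNING is taken off the run queue. */
static void model_yield(struct model_task *t)
{
    assert(t->on_runqueue);
    if (t->state != MODEL_TASK_RUNNING)
        t->on_runqueue = 0;
}

/* Can anything in this model ever make the task runnable again?
 * Signals are intentionally not modelled. */
static int model_can_resume(const struct model_task *t)
{
    return t->on_runqueue || t->on_waitqueue || t->has_timeout;
}
```

A task entering model_yield() in MODEL_TASK_INTERRUPTIBLE with no wait queue and no timeout ends up with model_can_resume() returning 0, which is the hang described above; the same call in MODEL_TASK_RUNNING leaves it on the run queue.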

[3.] Keywords (i.e., modules, networking, kernel):

     Kernel

[4.] Kernel version (from /proc/version):

     This bug appears to be present in every 2.4 kernel from (at least)
2.4.13 through 2.4.21.  It is not present in 2.5.70, since
__alloc_pages() there uses a different method of waiting for memory to
be freed.

[5.] Output of Oops.. message (if applicable) with symbolic information 
     resolved (see Documentation/oops-tracing.txt)

     N/A.

[6.] A small shell script or example program which triggers the
     problem (if possible)

     We encountered this whilst running batch queue tests that are too
complex to include here.

[7.] Environment

[7.1.] Software (add the output of the ver_linux script here)

[7.2.] Processor information (from /proc/cpuinfo):

      We encountered this on ia64; however, this is machine-independent
code, so we believe the bug is present on all 2.4.21 platforms.

[7.3.] Module information (from /proc/modules):

[7.4.] Loaded driver and hardware information (/proc/ioports,
/proc/iomem)

[7.5.] PCI information ('lspci -vvv' as root)

[7.6.] SCSI information (from /proc/scsi/scsi)

[7.7.] Other information that might be relevant to the problem
       (please look in /proc and include all information that you
       think to be relevant):

[X.] Other notes, patches, fixes, workarounds:

     The simplest fix is just to set the current state back to
TASK_RUNNING for the duration of the call to __get_free_page(GFP_KERNEL)
in __pollwait(), and to restore TASK_INTERRUPTIBLE afterwards:

--- /usr/tmp/TmpDir.14764-0/linux/linux/fs/select.c        Mon Jun  2 10:29:37 2003
+++ linux/linux/fs/select.c     Mon Jun  2 08:02:45 2003
@@ -79,7 +79,9 @@
         if (!table || POLL_TABLE_FULL(table)) {
                 struct poll_table_page *new_table;
 
+                set_current_state(TASK_RUNNING);
                 new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
+                set_current_state(TASK_INTERRUPTIBLE);
                 if (!new_table) {
                         p->error = -ENOMEM;
                         __set_current_state(TASK_RUNNING);
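
The effect of the patch can be modelled the same way. In this sketch, current_state, model_get_free_page() and model_pollwait_alloc() are hypothetical stand-ins for current->state, __get_free_page(GFP_KERNEL) and the patched lines of __pollwait(), with malloc() in place of the page allocator. The point is only that the allocator now sees TASK_RUNNING and that the polling state is restored afterwards.

```c
#include <assert.h>
#include <stdlib.h>

enum model_state { MODEL_TASK_RUNNING, MODEL_TASK_INTERRUPTIBLE };

/* Hypothetical stand-in for current->state; do_select() has already
 * set TASK_INTERRUPTIBLE before __pollwait() runs. */
static enum model_state current_state = MODEL_TASK_INTERRUPTIBLE;

/* State the allocator observed on entry; starts as a non-running
 * sentinel so a missing call is detectable. */
static enum model_state state_seen_by_allocator = MODEL_TASK_INTERRUPTIBLE;

/* Stand-in for __get_free_page(GFP_KERNEL): it may sleep or yield, so it
 * records the caller's state. malloc() substitutes for the page allocator. */
static void *model_get_free_page(void)
{
    state_seen_by_allocator = current_state;
    return malloc(4096);
}

/* The patched sequence from the diff, in miniature. */
static void *model_pollwait_alloc(void)
{
    void *page;

    assert(current_state == MODEL_TASK_INTERRUPTIBLE);
    current_state = MODEL_TASK_RUNNING;        /* safe to sleep or yield now */
    page = model_get_free_page();
    current_state = MODEL_TASK_INTERRUPTIBLE;  /* back to the polling state */
    if (!page)
        current_state = MODEL_TASK_RUNNING;    /* error path, as in the patch */
    return page;
}
```

After a successful call, state_seen_by_allocator is MODEL_TASK_RUNNING and current_state is back to MODEL_TASK_INTERRUPTIBLE; on allocation failure the model leaves the task in MODEL_TASK_RUNNING, mirroring the error path in the patch.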


-- 
Best Regards,
Ray

