* scheduling problem?
@ 2001-01-02 8:27 Mike Galbraith
2001-01-02 14:01 ` Anton Blanchard
2001-01-03 2:39 ` Roger Larsson
0 siblings, 2 replies; 22+ messages in thread
From: Mike Galbraith @ 2001-01-02 8:27 UTC (permalink / raw)
To: linux-kernel; +Cc: Linus Torvalds, Andrew Morton
Hi,
I am seeing (what I believe is;) severe process CPU starvation in
2.4.0-prerelease. At first, I attributed it to semaphore troubles
as when I enable semaphore deadlock detection in IKD and set it to
5 seconds, it triggers 100% of the time on nscd when I do sequential
I/O (iozone eg). In the meantime, I've done a slew of tracing, and
I think the holder of the semaphore I'm timing out on just flat isn't
being scheduled so it can release it. In the usual case of nscd, I
_think_ it's another nscd holding the semaphore. In no trace can I
go back far enough to catch the taker of the semaphore or any user
task other than iozone running between __down() time and timeout 5
seconds later. (trace buffer covers ~8 seconds of kernel time)
I think the snippet below captures the gist of the problem.
c012f32e nr_free_pages +<e/4c> (0.16) pid(256)
c012f37a nr_inactive_clean_pages +<e/44> (0.22) pid(256)
c01377f2 wakeup_bdflush +<12/a0> (0.14) pid(256)
c011620a wake_up_process +<e/58> (0.29) pid(256)
c012eea4 __alloc_pages_limit +<10/b8> (0.28) pid(256)
c012eea4 __alloc_pages_limit +<10/b8> (0.30) pid(256)
c012e3fa wakeup_kswapd +<12/d4> (0.25) pid(256)
c0115613 __wake_up +<13/130> (0.41) pid(256)
c011527b schedule +<13/398> (0.66) pid(256->6)
c01077db __switch_to +<13/d0> (0.70) pid(6)
c01893c6 generic_unplug_device +<e/38> (0.25) pid(6)
c011527b schedule +<13/398> (0.50) pid(6->256)
c01077db __switch_to +<13/d0> (0.29) pid(256)
c012eea4 __alloc_pages_limit +<10/b8> (0.22) pid(256)
c012d267 reclaim_page +<13/408> (0.54) pid(256)
c012679e __remove_inode_page +<e/74> (0.54) pid(256)
c0126fe0 add_to_page_cache_unique +<10/e4> (0.23) pid(256)
c0126751 add_page_to_hash_queue +<d/4c> (0.16) pid(256)
c012c9fa lru_cache_add +<e/f4> (0.29) pid(256)
c0153ac5 ext2_prepare_write +<d/28> (0.15) pid(256)
c013697f block_prepare_write +<f/4c> (0.15) pid(256)
c0136233 __block_prepare_write +<13/214> (0.17) pid(256)
c0135fd4 create_empty_buffers +<10/7c> (0.17) pid(256)
c0135d13 create_buffers +<13/1bc> (0.14) pid(256)
(repeats zillion times)
Despite wakeup_kswapd(0), we never schedule kswapd until much
later when we hit a wakeup_kswapd(1).. in the interim, we
bounce back and forth between bdflush and iozone.. no other
task is scheduled through very many schedules.
(Per kdb, I had a few tasks which would have liked some CPU ;-)
In addition, kswapd is quite the CPU piggie when it's doing
page_launder() as this profiled snippet of one of kswapd's
running periods shows. (60ms with no schedule)
0.1093% 66.06 11.01 6 c010d9c2 timer_interrupt
0.1462% 88.39 0.28 313 c0134ea2 __remove_from_lru_list
0.1489% 90.02 11.25 8 c01b2077 do_rw_disk
0.1599% 96.66 32.22 3 c018a9e7 elevator_linus_merge
0.1624% 98.18 0.30 325 c0115613 __wake_up
0.2016% 121.88 0.42 290 c012c01f kmem_cache_free
0.2075% 125.43 0.66 189 c0189f1d end_buffer_io_sync
0.2089% 126.32 15.79 8 c01ad533 ide_build_sglist
34.8258% 21054.55 0.53 39486 c0137667 try_to_free_buffers
62.4743% 37769.90 0.95 39796 c012f2b9 __free_pages
Total entries: 81970 Total usecs: 60456.69 Idle: 0.00%
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: scheduling problem?
2001-01-02 8:27 scheduling problem? Mike Galbraith
@ 2001-01-02 14:01 ` Anton Blanchard
2001-01-02 14:59 ` Mike Galbraith
2001-01-03 2:39 ` Roger Larsson
1 sibling, 1 reply; 22+ messages in thread
From: Anton Blanchard @ 2001-01-02 14:01 UTC (permalink / raw)
To: Mike Galbraith; +Cc: linux-kernel, Linus Torvalds, Andrew Morton
Hi Mike,
> I am seeing (what I believe is;) severe process CPU starvation in
> 2.4.0-prerelease. At first, I attributed it to semaphore troubles
> as when I enable semaphore deadlock detection in IKD and set it to
> 5 seconds, it triggers 100% of the time on nscd when I do sequential
> I/O (iozone eg). In the meantime, I've done a slew of tracing, and
> I think the holder of the semaphore I'm timing out on just flat isn't
> being scheduled so it can release it. In the usual case of nscd, I
> _think_ it's another nscd holding the semaphore. In no trace can I
> go back far enough to catch the taker of the semaphore or any user
> task other than iozone running between __down() time and timeout 5
> seconds later. (trace buffer covers ~8 seconds of kernel time)
Did this just appear in recent kernels? Maybe bdflush was hiding the
situation in earlier kernels as it would cause io hogs to block when
things got only mildly interesting.
You might be able to get some useful information with ps axl and checking
the WCHAN value. Of course it won't be possible if, like nscd, you can't
get ps to schedule :)
Anton
* Re: scheduling problem?
2001-01-02 14:01 ` Anton Blanchard
@ 2001-01-02 14:59 ` Mike Galbraith
2001-01-02 19:02 ` Linus Torvalds
2001-01-02 23:13 ` Daniel Phillips
0 siblings, 2 replies; 22+ messages in thread
From: Mike Galbraith @ 2001-01-02 14:59 UTC (permalink / raw)
To: Anton Blanchard; +Cc: linux-kernel, Linus Torvalds, Andrew Morton
On Wed, 3 Jan 2001, Anton Blanchard wrote:
>
> Hi Mike,
>
> > I am seeing (what I believe is;) severe process CPU starvation in
> > 2.4.0-prerelease. At first, I attributed it to semaphore troubles
> > as when I enable semaphore deadlock detection in IKD and set it to
> > 5 seconds, it triggers 100% of the time on nscd when I do sequential
> > I/O (iozone eg). In the meantime, I've done a slew of tracing, and
> > I think the holder of the semaphore I'm timing out on just flat isn't
> > being scheduled so it can release it. In the usual case of nscd, I
> > _think_ it's another nscd holding the semaphore. In no trace can I
> > go back far enough to catch the taker of the semaphore or any user
> > task other than iozone running between __down() time and timeout 5
> > seconds later. (trace buffer covers ~8 seconds of kernel time)
>
> Did this just appear in recent kernels? Maybe bdflush was hiding the
> situation in earlier kernels as it would cause io hogs to block when
> things got only mildly interesting.
Yes and no. I've seen nasty stalls for quite a while now. (I think
that there is a wakeup problem lurking)
I found the change which triggers my horrid stalls. Nobody is going
to believe this...
diff -urN linux-2.4.0-test13-pre6/fs/buffer.c linux-2.4.0-test13-pre7/fs/buffer.c
--- linux-2.4.0-test13-pre6/fs/buffer.c Sat Dec 30 08:58:56 2000
+++ linux-2.4.0-test13-pre7/fs/buffer.c Sun Dec 31 06:22:31 2000
@@ -122,16 +122,17 @@
when trying to refill buffers. */
int interval; /* jiffies delay between kupdate flushes */
int age_buffer; /* Time for normal buffer to age before we flush it */
- int dummy1; /* unused, was age_super */
+ int nfract_sync; /* Percentage of buffer cache dirty to
+ activate bdflush synchronously */
int dummy2; /* unused */
int dummy3; /* unused */
} b_un;
unsigned int data[N_PARAM];
-} bdf_prm = {{40, 500, 64, 256, 5*HZ, 30*HZ, 5*HZ, 1884, 2}};
+} bdf_prm = {{40, 500, 64, 256, 5*HZ, 30*HZ, 80, 0, 0}};
/* These are the min and max parameter values that we will allow to be assigned */
-int bdflush_min[N_PARAM] = { 0, 10, 5, 25, 0, 1*HZ, 1*HZ, 1, 1};
-int bdflush_max[N_PARAM] = {100,50000, 20000, 20000,600*HZ, 6000*HZ, 6000*HZ, 2047, 5};
+int bdflush_min[N_PARAM] = { 0, 10, 5, 25, 0, 1*HZ, 0, 0, 0};
+int bdflush_max[N_PARAM] = {100,50000, 20000, 20000,600*HZ, 6000*HZ, 100, 0, 0};
/*
* Rewrote the wait-routines to use the "new" wait-queue functionality,
@@ -1032,9 +1034,9 @@
dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
tot = nr_free_buffer_pages();
- dirty *= 200;
+ dirty *= 100;
soft_dirty_limit = tot * bdf_prm.b_un.nfract;
- hard_dirty_limit = soft_dirty_limit * 2;
+ hard_dirty_limit = tot * bdf_prm.b_un.nfract_sync;
/* First, check for the "real" dirty limit. */
if (dirty > soft_dirty_limit) {
...but reversing this cures my semaphore timeouts. Don't say impossible
:) I didn't believe it either until I retested several times. I wager
that if I just fiddle with parameters I'll be able to make the problem
come and go at will. (means the real problem is gonna be a weird one:)
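The threshold arithmetic in the hunk above can be sketched in userspace. This is a hedged paraphrase of the pre7 limit computation, not the kernel function itself; the name dirty_state and the -1/0/1 return convention are assumptions modeled on balance_dirty_state():

```c
/* Hedged sketch of the pre7 dirty-limit arithmetic from the hunk above.
 * Returns -1 below the soft limit, 0 between the limits, 1 above the
 * hard limit; the exact return convention is an assumption. */
static int dirty_state(unsigned long dirty_pages, unsigned long total_pages,
                       unsigned int nfract, unsigned int nfract_sync)
{
        unsigned long dirty = dirty_pages * 100;        /* dirty *= 100  */
        unsigned long soft = total_pages * nfract;      /* default 40    */
        unsigned long hard = total_pages * nfract_sync; /* default 80    */

        if (dirty > soft) {
                if (dirty > hard)
                        return 1;       /* writer must flush synchronously */
                return 0;               /* wake bdflush, keep going */
        }
        return -1;                      /* nothing to do */
}
```

With the pre7 defaults (nfract = 40, nfract_sync = 80) the soft limit sits at 40% of buffer pages dirty and the synchronous limit at 80%, where pre6 effectively computed 20% and 40% (dirty *= 200 with hard = soft * 2).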
-Mike
* Re: scheduling problem?
2001-01-02 14:59 ` Mike Galbraith
@ 2001-01-02 19:02 ` Linus Torvalds
2001-01-02 20:09 ` Andrea Arcangeli
2001-01-03 4:48 ` Mike Galbraith
2001-01-02 23:13 ` Daniel Phillips
1 sibling, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2001-01-02 19:02 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Anton Blanchard, linux-kernel, Andrew Morton
On Tue, 2 Jan 2001, Mike Galbraith wrote:
>
> Yes and no. I've seen nasty stalls for quite a while now. (I think
> that there is a wakeup problem lurking)
>
> I found the change which triggers my horrid stalls. Nobody is going
> to believe this...
Hmm.. I can believe it. The code that waits on bdflush in wakeup_bdflush()
is somewhat suspicious. In particular, if/when that ever triggers, and
bdflush() is busy in flush_dirty_buffers(), then the process that is
trying to wake bdflush up is going to wait until flush_dirty_buffers() is
done.
Which, if there is a process dirtying pages, can basically be
pretty much forever.
This was probably hidden by the lower limits simply by virtue of bdflush
never being very active before.
What does the system feel like if you just change the "sleep for bdflush"
logic in wakeup_bdflush() to something like
wake_up_process(bdflush_tsk);
__set_current_state(TASK_RUNNING);
current->policy |= SCHED_YIELD;
schedule();
instead of trying to wait for bdflush to wake us up?
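In userspace terms, the suggestion replaces a sleep-until-acknowledged handshake with a fire-and-yield. A hedged sketch with POSIX threads (all names are illustrative, not kernel API; the condvar stands in for the bdflush wakeup):

```c
#include <pthread.h>
#include <sched.h>

/* Hedged userspace paraphrase of the suggested change: wake the flusher
 * thread, then give up the CPU once instead of sleeping until the
 * flusher signals completion. Names are illustrative. */
static pthread_mutex_t flusher_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flusher_idle = PTHREAD_COND_INITIALIZER;

static int wakeup_flusher_nonblocking(void)
{
        pthread_mutex_lock(&flusher_lock);
        pthread_cond_signal(&flusher_idle);  /* ~ wake_up_process(bdflush_tsk) */
        pthread_mutex_unlock(&flusher_lock);
        return sched_yield();                /* ~ SCHED_YIELD; schedule() */
}
```

The caller never blocks on the flusher's progress; it merely gives the scheduler one chance to run the flusher before continuing.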
Linus
* Re: scheduling problem?
2001-01-02 19:02 ` Linus Torvalds
@ 2001-01-02 20:09 ` Andrea Arcangeli
2001-01-02 21:02 ` Linus Torvalds
2001-01-03 4:48 ` Mike Galbraith
1 sibling, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2001-01-02 20:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mike Galbraith, Anton Blanchard, linux-kernel, Andrew Morton
On Tue, Jan 02, 2001 at 11:02:41AM -0800, Linus Torvalds wrote:
> What does the system feel like if you just change the "sleep for bdflush"
> logic in wakeup_bdflush() to something like
>
> wake_up_process(bdflush_tsk);
> __set_current_state(TASK_RUNNING);
> current->policy |= SCHED_YIELD;
> schedule();
>
> instead of trying to wait for bdflush to wake us up?
My bet is a `VM: killing' message.
Waiting for bdflush's back-wakeup is mandatory to do write throttling
correctly. The above will break write throttling, at least unless something
fundamental has changed recently, and that doesn't seem to be the case.
What I'd like to do there is to make bdflush the same thing for dirty
buffers that kswapd _should_ be for memory pressure (I say "should" because
it seems that's not the case anymore in 2.4.x, from some email I read
recently; I haven't checked that myself yet). I implemented that at some
point in my private local tree. I mean: bdflush only does the async
writeouts, and the task context calls something like flush_dirty_buffers
itself. The main reason I was doing that is to fix the case of
>bdf_prm.ndirty tasks all waiting on bdflush at the same time (that will
break write throttling even now in 2.2.x and in current 2.4.x). That's an
unlucky condition, very similar to the one in GFP that is fixed correctly
in 2.2.19pre2 by putting pages in a per-process freelist during memory
balancing.
Andrea
* Re: scheduling problem?
2001-01-02 20:09 ` Andrea Arcangeli
@ 2001-01-02 21:02 ` Linus Torvalds
2001-01-02 21:52 ` Andrea Arcangeli
0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2001-01-02 21:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mike Galbraith, Anton Blanchard, linux-kernel, Andrew Morton
On Tue, 2 Jan 2001, Andrea Arcangeli wrote:
> On Tue, Jan 02, 2001 at 11:02:41AM -0800, Linus Torvalds wrote:
> > What does the system feel like if you just change the "sleep for bdflush"
> > logic in wakeup_bdflush() to something like
> >
> > wake_up_process(bdflush_tsk);
> > __set_current_state(TASK_RUNNING);
> > current->policy |= SCHED_YIELD;
> > schedule();
> >
> > instead of trying to wait for bdflush to wake us up?
>
> My bet is a `VM: killing' message.
Maybe in 2.2.x, yes.
> Waiting for bdflush's back-wakeup is mandatory to do write throttling
> correctly. The above will break write throttling, at least unless something
> fundamental has changed recently, and that doesn't seem to be the case.
page_launder() should wait for the dirty pages, and that's not something
2.2.x ever did.
This way, the issue of dirty data in the VM is handled by the VM pressure,
not by trying to artificially throttle writers.
NOTE! I think that throttling writers is fine and good, but as it stands
now, the dirty buffer balancing will throttle anybody, not just the
writer. That's partly because of the 2.4.x mis-feature of doing the
balance_dirty call even for previously dirty buffers (fixed in my tree,
btw).
It's _really_ bad to wait for bdflush to finish if we hold on to things
like the superblock lock - which _does_ happen right now. That's why I'm
pretty convinced that we should NOT blindly do the dirty balance in
"mark_buffer_dirty()", but instead at more well-defined points (in places
like "generic_file_write()", for example).
Linus
* Re: scheduling problem?
2001-01-02 21:02 ` Linus Torvalds
@ 2001-01-02 21:52 ` Andrea Arcangeli
2001-01-02 22:01 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2001-01-02 21:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mike Galbraith, Anton Blanchard, linux-kernel, Andrew Morton
On Tue, Jan 02, 2001 at 01:02:30PM -0800, Linus Torvalds wrote:
>
>
> On Tue, 2 Jan 2001, Andrea Arcangeli wrote:
>
> > On Tue, Jan 02, 2001 at 11:02:41AM -0800, Linus Torvalds wrote:
> > > What does the system feel like if you just change the "sleep for bdflush"
> > > logic in wakeup_bdflush() to something like
> > >
> > > wake_up_process(bdflush_tsk);
> > > __set_current_state(TASK_RUNNING);
> > > current->policy |= SCHED_YIELD;
> > > schedule();
> > >
> > > instead of trying to wait for bdflush to wake us up?
> >
> > My bet is a `VM: killing' message.
>
> Maybe in 2.2.x, yes.
>
> > Waiting for bdflush's back-wakeup is mandatory to do write throttling
> > correctly. The above will break write throttling, at least unless something
> > fundamental has changed recently, and that doesn't seem to be the case.
>
> page_launder() should wait for the dirty pages, and that's not something
> 2.2.x ever did.
In late 2.2.x we have sync_page_buffers too, but I'm not sure how well it
behaves when the whole MM is constantly kept totally dirty and we don't have
swap. In fact, the 2.4.x implementation as well:
static void sync_page_buffers(struct buffer_head *bh, int wait)
{
struct buffer_head * tmp = bh;
do {
struct buffer_head *p = tmp;
tmp = tmp->b_this_page;
if (buffer_locked(p)) {
if (wait > 1)
__wait_on_buffer(p);
} else if (buffer_dirty(p))
ll_rw_block(WRITE, 1, &p);
} while (tmp != bh);
}
won't cope with memory being totally dirty. It will move the buffer from
dirty to locked, then it will wait for I/O completion on the second pass,
but it won't try a third time to free the page (when the page is finally
freeable):
if (wait) {
sync_page_buffers(bh, wait);
/* We waited synchronously, so we can free the buffers. */
if (wait > 1 && !loop) {
loop = 1;
goto cleaned_buffers_try_again;
}
Probably not a big deal.
The real point is that even if try_to_free_buffers deals perfectly with a
totally dirty VM, we'll end up waiting for I/O completion in the wrong place:
setiathome will end up waiting for I/O completion instead of `cp`. It's not
setiathome but `cp` that should do write throttling, and `cp` will block
again very soon even if setiathome blocks too. The whole point is that
write throttling must happen in balance_dirty(), _not_ in sync_page_buffers().
In fact, from 2.2.19pre2 there's a per-bh wait_io bitflag that remembers when
a dirty bh is very old and doesn't get flushed away automatically (by either
kupdate or kflushd). So we don't block in sync_page_buffers until it's
necessary, to avoid hurting non-I/O apps while I/O is going on.
> NOTE! I think that throttling writers is fine and good, but as it stands
> now, the dirty buffer balancing will throttle anybody, not just the
> writer. That's partly because of the 2.4.x mis-feature of doing the
How can it throttle everybody and not only the writers? _Only_ the
writers call balance_dirty.
> balance_dirty call even for previously dirty buffers (fixed in my tree,
> btw).
Yes, I've seen that; people overwriting dirty data were blocking too, which
was not necessary, but they were still writers.
> It's _really_ bad to wait for bdflush to finish if we hold on to things
> like the superblock lock - which _does_ happen right now. That's why I'm
> pretty convinced that we should NOT blindly do the dirty balance in
> "mark_buffer_dirty()", but instead at more well-defined points (in places
> like "generic_file_write()", for example).
The right way to avoid blocking with locks held is to replace
mark_buffer_dirty() with __mark_buffer_dirty() and to call balance_dirty()
later, when the locks are released. That's why it's exported to modules.
Everybody has always been allowed to optimize away the mark_buffer_dirty();
it's just that nobody did that yet. I think it's useful to keep providing
an interface that does the write throttling automatically.
Andrea
* Re: scheduling problem?
2001-01-02 21:52 ` Andrea Arcangeli
@ 2001-01-02 22:01 ` Linus Torvalds
2001-01-02 22:23 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2001-01-02 22:01 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mike Galbraith, Anton Blanchard, linux-kernel, Andrew Morton
On Tue, 2 Jan 2001, Andrea Arcangeli wrote:
>
> > NOTE! I think that throttling writers is fine and good, but as it stands
> > now, the dirty buffer balancing will throttle anybody, not just the
> > writer. That's partly because of the 2.4.x mis-feature of doing the
>
> How can it throttle everybody and not only the writers? _Only_ the
> writers call balance_dirty.
A lot of people call mark_buffer_dirty() on one or two buffers. Things
like file creation etc. Think about inode bitmap blocks that are marked
dirty with the superblock held.. Ugh.
> The right way to avoid blocking with locks held is to replace
> mark_buffer_dirty() with __mark_buffer_dirty() and to call balance_dirty()
> later, when the locks are released.
The point being that because _everybody_ should do this, we shouldn't have
the "mark_buffer_dirty()" that we have. There are no really valid uses of
the automatic rebalancing: either we're writing meta-data (which
definitely should balance on its own _after_ the fact), or we're writing
normal data (which already _does_ balance after the fact).
Right now, the automatic balancing only hurts. The stuff that hasn't been
converted is probably worse off doing balancing when it doesn't want to
than we would be if we left out the balancing altogether.
Which is why I don't like it.
Linus
* Re: scheduling problem?
2001-01-02 22:01 ` Linus Torvalds
@ 2001-01-02 22:23 ` Linus Torvalds
0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2001-01-02 22:23 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mike Galbraith, Anton Blanchard, linux-kernel, Andrew Morton
On Tue, 2 Jan 2001, Linus Torvalds wrote:
>
> Right now, the automatic balancing only hurts. The stuff that hasn't been
> converted is probably worse off doing balancing when it doesn't want to
> than we would be if we left out the balancing altogether.
>
> Which is why I don't like it.
Actually, there is right now another problem with the synchronous waiting,
which is completely different: because bdflush can be waited on
synchronously by various entities that hold various IO locks, bdflush
itself cannot do certain kinds of IO at all. In particular, it has to use
GFP_BUFFER when it calls down to page_launder(), because it cannot afford
to write out dirty pages which might deadlock on the locks that are held
by people waiting for bdflush..
The deadlock issue is the one I dislike the most: bdflush being
synchronously waited on is fundamentally always going to cripple it. In
comparison, the automatic rebalancing is just a latency issue (but the
automatic balancing _is_ the thing that brings on the fact that we call
rebalance with locks held, so they are certainly related).
Linus
* Re: scheduling problem?
2001-01-02 14:59 ` Mike Galbraith
2001-01-02 19:02 ` Linus Torvalds
@ 2001-01-02 23:13 ` Daniel Phillips
2001-01-03 4:46 ` Mike Galbraith
1 sibling, 1 reply; 22+ messages in thread
From: Daniel Phillips @ 2001-01-02 23:13 UTC (permalink / raw)
To: Mike Galbraith, linux-kernel
Mike Galbraith wrote:
>
> On Wed, 3 Jan 2001, Anton Blanchard wrote:
> >
> > > I am seeing (what I believe is;) severe process CPU starvation in
> > > 2.4.0-prerelease. At first, I attributed it to semaphore troubles
> > > as when I enable semaphore deadlock detection in IKD and set it to
> > > 5 seconds, it triggers 100% of the time on nscd when I do sequential
> > > I/O (iozone eg). In the meantime, I've done a slew of tracing, and
> > > I think the holder of the semaphore I'm timing out on just flat isn't
> > > being scheduled so it can release it. In the usual case of nscd, I
> > > _think_ it's another nscd holding the semaphore. In no trace can I
> > > go back far enough to catch the taker of the semaphore or any user
> > > task other than iozone running between __down() time and timeout 5
> > > seconds later. (trace buffer covers ~8 seconds of kernel time)
> >
> > Did this just appear in recent kernels? Maybe bdflush was hiding the
> > situation in earlier kernels as it would cause io hogs to block when
> > things got only mildly interesting.
>
> Yes and no. I've seen nasty stalls for quite a while now. (I think
> that there is a wakeup problem lurking)
Could you try this patch just to see what happens? It uses semaphores
for the bdflush synchronization instead of banging directly on the task
wait queues. It's supposed to be a drop-in replacement for the bdflush
wakeup/waitfor mechanism, but who knows, it may have subtly different
behaviour in your case.
--- 2.4.0.clean/fs/buffer.c Sat Dec 30 20:19:13 2000
+++ 2.4.0/fs/buffer.c Tue Jan 2 23:05:14 2001
@@ -2528,33 +2528,28 @@
* response to dirty buffers. Once this process is activated, we write back
* a limited number of buffers to the disks and then go back to sleep again.
*/
-static DECLARE_WAIT_QUEUE_HEAD(bdflush_done);
+
+/* Semaphore wakeups, Daniel Phillips, phillips@innominate.de, 2000/12 */
+
struct task_struct *bdflush_tsk = 0;
+DECLARE_MUTEX_LOCKED(bdflush_request);
+DECLARE_MUTEX_LOCKED(bdflush_waiter);
+atomic_t bdflush_waiters /*= 0*/;
void wakeup_bdflush(int block)
{
- DECLARE_WAITQUEUE(wait, current);
-
if (current == bdflush_tsk)
return;
- if (!block) {
- wake_up_process(bdflush_tsk);
+ if (!block)
+ {
+ up(&bdflush_request);
return;
}
- /* bdflush can wakeup us before we have a chance to
- go to sleep so we must be smart in handling
- this wakeup event from bdflush to avoid deadlocking in SMP
- (we are not holding any lock anymore in these two paths). */
- __set_current_state(TASK_UNINTERRUPTIBLE);
- add_wait_queue(&bdflush_done, &wait);
-
- wake_up_process(bdflush_tsk);
- schedule();
-
- remove_wait_queue(&bdflush_done, &wait);
- __set_current_state(TASK_RUNNING);
+ atomic_inc(&bdflush_waiters);
+ up(&bdflush_request);
+ down(&bdflush_waiter);
}
/* This is the _only_ function that deals with flushing async writes
@@ -2699,7 +2694,7 @@
int bdflush(void *sem)
{
struct task_struct *tsk = current;
- int flushed;
+ int flushed, waiters;
/*
* We have a bare-bones task_struct, and really should fill
* in a few more things so "top" and /proc/2/{exe,root,cwd}
@@ -2727,28 +2722,16 @@
if (free_shortage())
flushed += page_launder(GFP_BUFFER, 0);
- /* If wakeup_bdflush will wakeup us
- after our bdflush_done wakeup, then
- we must make sure to not sleep
- in schedule_timeout otherwise
- wakeup_bdflush may wait for our
- bdflush_done wakeup that would never arrive
- (as we would be sleeping) and so it would
- deadlock in SMP. */
- __set_current_state(TASK_INTERRUPTIBLE);
- wake_up_all(&bdflush_done);
- /*
- * If there are still a lot of dirty buffers around,
- * skip the sleep and flush some more. Otherwise, we
- * go to sleep waiting a wakeup.
- */
- if (!flushed || balance_dirty_state(NODEV) < 0) {
+ waiters = atomic_read(&bdflush_waiters);
+ atomic_sub(waiters, &bdflush_waiters);
+ while (waiters--)
+ up(&bdflush_waiter);
+
+ if (!flushed || balance_dirty_state(NODEV) < 0)
+ {
run_task_queue(&tq_disk);
- schedule();
+ down(&bdflush_request);
}
- /* Remember to mark us as running otherwise
- the next schedule will block. */
- __set_current_state(TASK_RUNNING);
}
}
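The request/waiter handshake above can be exercised in userspace with POSIX semaphores. This is a hedged, simplified analogue of the patch's mechanism, not the patch itself: the flusher here only counts passes instead of flushing, it always sleeps on the request semaphore between passes (where the real bdflush keeps flushing while there is work), and all names are illustrative:

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

/* request: posted by clients that want a flush pass.
 * waiter:  posted once per recorded waiter after each pass.
 * Mirrors the patch's bdflush_request/bdflush_waiter pair (simplified). */
static sem_t request, waiter;
static atomic_int waiters, passes;

static void wakeup_flusher(int block)
{
        if (!block) {
                sem_post(&request);
                return;
        }
        atomic_fetch_add(&waiters, 1);
        sem_post(&request);
        sem_wait(&waiter);      /* sleep until the flusher's next pass */
}

static void *flusher(void *unused)
{
        for (;;) {
                sem_wait(&request);              /* idle until asked */
                atomic_fetch_add(&passes, 1);    /* stand-in for a flush pass */
                int w = atomic_exchange(&waiters, 0);
                while (w--)
                        sem_post(&waiter);       /* release blocked callers */
        }
        return unused;
}

/* Start the flusher and do one blocking wakeup; returns completed passes. */
static int demo(void)
{
        pthread_t t;
        sem_init(&request, 0, 0);
        sem_init(&waiter, 0, 0);
        pthread_create(&t, NULL, flusher, NULL);
        wakeup_flusher(1);
        return atomic_load(&passes);
}
```

Because the waiter count is incremented before the request is posted, every blocked caller is guaranteed to be counted by the pass that its post triggers; treat this as an illustration of the mechanism only.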
* Re: scheduling problem?
2001-01-02 8:27 scheduling problem? Mike Galbraith
2001-01-02 14:01 ` Anton Blanchard
@ 2001-01-03 2:39 ` Roger Larsson
2001-01-03 5:17 ` Mike Galbraith
1 sibling, 1 reply; 22+ messages in thread
From: Roger Larsson @ 2001-01-03 2:39 UTC (permalink / raw)
To: Mike Galbraith, linux-kernel; +Cc: Andrew Morton
Hi,
I have played around with this code previously.
This is my current understanding.
[yield problem?]
On Tuesday 02 January 2001 09:27, Mike Galbraith wrote:
> Hi,
>
> I am seeing (what I believe is;) severe process CPU starvation in
> 2.4.0-prerelease. At first, I attributed it to semaphore troubles
> as when I enable semaphore deadlock detection in IKD and set it to
> 5 seconds, it triggers 100% of the time on nscd when I do sequential
> I/O (iozone eg). In the meantime, I've done a slew of tracing, and
> I think the holder of the semaphore I'm timing out on just flat isn't
> being scheduled so it can release it. In the usual case of nscd, I
> _think_ it's another nscd holding the semaphore. In no trace can I
> go back far enough to catch the taker of the semaphore or any user
> task other than iozone running between __down() time and timeout 5
> seconds later. (trace buffer covers ~8 seconds of kernel time)
>
> I think the snippet below captures the gist of the problem.
>
> c012f32e nr_free_pages +<e/4c> (0.16) pid(256)
> c012f37a nr_inactive_clean_pages +<e/44> (0.22) pid(256)
wakeup_bdflush (from beginning of __alloc_pages; page_alloc.c:324 )
> c01377f2 wakeup_bdflush +<12/a0> (0.14) pid(256)
> c011620a wake_up_process +<e/58> (0.29) pid(256)
> c012eea4 __alloc_pages_limit +<10/b8> (0.28) pid(256)
> c012eea4 __alloc_pages_limit +<10/b8> (0.30) pid(256)
Two __alloc_pages_limit
wakeup_kswapd(0) (from page_alloc.c:392 )
> c012e3fa wakeup_kswapd +<12/d4> (0.25) pid(256)
> c0115613 __wake_up +<13/130> (0.41) pid(256)
schedule() (from page_alloc.c:396 )
> c011527b schedule +<13/398> (0.66) pid(256->6)
> c01077db __switch_to +<13/d0> (0.70) pid(6)
bdflush is running!!!
> c01893c6 generic_unplug_device +<e/38> (0.25) pid(6)
bdflush is ready. (But how likely is it that it will run
long enough to get hit by a tick, i.e. current->counter--?
Unless it does, it will continue to be preferred over kswapd,
since only one process has yielded...)
> c011527b schedule +<13/398> (0.50) pid(6->256)
> c01077db __switch_to +<13/d0> (0.29) pid(256)
back to the client, not the additionally runnable kswapd...
Why not? Nothing remaining of its timeslice.
Note that the yield only yields one process, not all
in the runqueue - IMHO. [is this intended?]
Third __alloc_pages_limit; this time the direct_reclaim
tests are fulfilled
> c012eea4 __alloc_pages_limit +<10/b8> (0.22) pid(256)
> c012d267 reclaim_page +<13/408> (0.54) pid(256)
Some untested possibilities (against -prerelease):
* Be tougher when yielding.
wakeup_kswapd(0);
if (gfp_mask & __GFP_WAIT) {
__set_current_state(TASK_RUNNING);
current->policy |= SCHED_YIELD;
+ current->counter--; /* be faster to let kswapd run */
or
+ current->counter = 0; /* too fast? [not tested] */
schedule();
}
Might be too tough on a client not doing any actual work... think dbench...
* Be tougher on bdflush, decrement its counter now and then...
[naive, not tested]
* Move the wakeup of bdflush to kswapd. Somewhere after 'do_try_to_free_pages(..)'
has been run, before going to sleep...
[a variant tested with mixed results - this is likely a better one]
/*
* We go to sleep if either the free page shortage
* or the inactive page shortage is gone. We do this
* because:
* 1) we need no more free pages or
* 2) the inactive pages need to be flushed to disk,
* it wouldn't help to eat CPU time now ...
*
* We go to sleep for one second, but if it's needed
* we'll be woken up earlier...
*/
if (!free_shortage() || !inactive_shortage()) {
/*
* If we are about to get low on free pages and cleaning
* the inactive_dirty pages would fix the situation,
* wake up bdflush.
*/
if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
&& nr_inactive_dirty_pages >= freepages.high)
wakeup_bdflush(0);
interruptible_sleep_on_timeout(&kswapd_wait, HZ);
}
--
Home page:
http://www.norran.net/nra02596/
* Re: scheduling problem?
2001-01-02 23:13 ` Daniel Phillips
@ 2001-01-03 4:46 ` Mike Galbraith
2001-01-03 14:20 ` Daniel Phillips
2001-01-03 14:51 ` Daniel Phillips
0 siblings, 2 replies; 22+ messages in thread
From: Mike Galbraith @ 2001-01-03 4:46 UTC (permalink / raw)
To: Daniel Phillips; +Cc: linux-kernel
On Wed, 3 Jan 2001, Daniel Phillips wrote:
> Could you try this patch just to see what happens? It uses semaphores
> for the bdflush synchronization instead of banging directly on the task
> wait queues. It's supposed to be a drop-in replacement for the bdflush
> wakeup/waitfor mechanism, but who knows, it may have subtly different
behaviour in your case.
Semaphore timed out during boot, leaving bdflush as zombie.
-Mike
* Re: scheduling problem?
2001-01-02 19:02 ` Linus Torvalds
2001-01-02 20:09 ` Andrea Arcangeli
@ 2001-01-03 4:48 ` Mike Galbraith
2001-01-03 5:52 ` Linus Torvalds
1 sibling, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2001-01-03 4:48 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Anton Blanchard, linux-kernel, Andrew Morton
On Tue, 2 Jan 2001, Linus Torvalds wrote:
> On Tue, 2 Jan 2001, Mike Galbraith wrote:
> >
> > Yes and no. I've seen nasty stalls for quite a while now. (I think
> > that there is a wakeup problem lurking)
> >
> > I found the change which triggers my horrid stalls. Nobody is going
> > to believe this...
>
> Hmm.. I can believe it. The code that waits on bdflush in wakeup_bdflush()
> is somewhat suspicious. In particular, if/when that ever triggers, and
> bdflush() is busy in flush_dirty_buffers(), then the process that is
> trying to wake bdflush up is going to wait until flush_dirty_buffers() is
> done.
>
> Which, if there is a process dirtying pages, can basically be
> pretty much forever.
>
> This was probably hidden by the lower limits simply by virtue of bdflush
> never being very active before.
>
> What does the system feel like if you just change the "sleep for bdflush"
> logic in wakeup_bdflush() to something like
>
> wake_up_process(bdflush_tsk);
> __set_current_state(TASK_RUNNING);
> current->policy |= SCHED_YIELD;
> schedule();
>
> instead of trying to wait for bdflush to wake us up?
No difference (except more context switching as expected)
-Mike
* Re: scheduling problem?
2001-01-03 2:39 ` Roger Larsson
@ 2001-01-03 5:17 ` Mike Galbraith
0 siblings, 0 replies; 22+ messages in thread
From: Mike Galbraith @ 2001-01-03 5:17 UTC (permalink / raw)
To: Roger Larsson; +Cc: linux-kernel, Andrew Morton
On Wed, 3 Jan 2001, Roger Larsson wrote:
> Hi,
>
> I have played around with this code previously.
> This is my current understanding.
> [yield problem?]
Hmm.. this ~could be. I once dove into the VM waters (me=stone)
and changed __alloc_pages() to only yield instead of scheduling.
The results (along with many other strange changes) were.. weirdest
feeling kernel I ever ran. Damn fast, but very very weird ;-)
> Possible (in -prerelease) untested possibilities.
>
> * Be tougher when yielding.
>
>
> wakeup_kswapd(0);
> if (gfp_mask & __GFP_WAIT) {
> __set_current_state(TASK_RUNNING);
> current->policy |= SCHED_YIELD;
> + current->counter--; /* be faster to let kswapd run */
> or
> + current->counter = 0; /* too fast? [not tested] */
> schedule();
> }
That looks a lot like cheating.
> * Move wakeup of bdflush to kswapd. Somewhere after 'do_try_to_free_pages(..)'
> has been run. Before going to sleep...
> [a variant tested with mixed results - this is likely a better one]
I also did some things along this line.. also with mixed results.
:) the change I've made that I actually like best is to kill bdflush
graveyard dead. Did that twice and didn't miss it at all. (next time,
I think I'll erect a headstone)
-Mike
* Re: scheduling problem?
2001-01-03 4:48 ` Mike Galbraith
@ 2001-01-03 5:52 ` Linus Torvalds
2001-01-03 7:21 ` Mike Galbraith
0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2001-01-03 5:52 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Anton Blanchard, linux-kernel, Andrew Morton
On Wed, 3 Jan 2001, Mike Galbraith wrote:
>
> No difference (except more context switching as expected)
What about the current prerelease patch in testing? It doesn't switch to
bdflush at all, but instead does the buffer cleaning by hand.
Linus
* Re: scheduling problem?
2001-01-03 5:52 ` Linus Torvalds
@ 2001-01-03 7:21 ` Mike Galbraith
2001-01-03 11:30 ` Mike Galbraith
0 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2001-01-03 7:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Anton Blanchard, linux-kernel, Andrew Morton
On Tue, 2 Jan 2001, Linus Torvalds wrote:
> On Wed, 3 Jan 2001, Mike Galbraith wrote:
> >
> > No difference (except more context switching as expected)
>
> What about the current prerelease patch in testing? It doesn't switch to
> bdflush at all, but instead does the buffer cleaning by hand.
99% gone. The remaining 1% is refill_freelist(). If I use
flush_dirty_buffers() there instead of waiting, I have no more
semaphore timeouts (so far.. not thoroughly pounded upon). Without
that change, I still take hits. (in my tinker tree, I usually
make a 'small flush' mode for flush_dirty_buffers() to do that)
Feel is _vastly_ improved.
-Mike
* Re: scheduling problem?
2001-01-03 7:21 ` Mike Galbraith
@ 2001-01-03 11:30 ` Mike Galbraith
0 siblings, 0 replies; 22+ messages in thread
From: Mike Galbraith @ 2001-01-03 11:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Anton Blanchard, linux-kernel, Andrew Morton
On Wed, 3 Jan 2001, Mike Galbraith wrote:
> Feel is _vastly_ improved.
Except while beating on it, I found a way to turn it into a brick.
If I run Christoph Rohland's swptst proggy, interactive disappears
to the point that login while it is running is impossible. ~15 minutes
later I got 'login timed out after 30 seconds'. ~10 minutes after
that, the prompt came back.
-Mike
on other vt..
./swptst 1 48000000 4 12 100
Script started on Wed Jan 3 11:16:46 2001
[root]:# schedctl -R vmstat 1
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 17688 105040 396 15400 84 100 515 1117 315 1321 2 50 47
0 0 0 17688 105040 396 15400 0 0 0 0 102 10 0 95 5
0 0 0 17688 105040 396 15400 0 0 0 0 111 8 0 96 4
0 0 0 17688 105040 396 15400 0 0 0 0 106 10 0 94 6
0 0 0 17688 105028 396 15412 12 0 4 0 109 21 0 92 8
0 4 2 85240 1748 184 11996 4024 81876 2014 20512 3122 8444 0 35 65
0 5 3 99064 1432 188 25964 8856 5240 2472 1310 3645 5724 0 5 95
0 5 2 102404 1436 188 29072 4476 9896 1172 2474 880 1594 0 14 86
0 4 2 102404 1460 188 29032 5700 0 1425 0 440 656 1 9 91
0 4 2 114452 1432 188 40996 3224 4392 806 1098 344 523 0 19 81
1 3 3 114452 1432 188 40732 4080 11712 1020 2928 438 1061 0 26 74
1 4 2 189568 1732 184 115060 75948 31108 19811 7777 4840 7186 0 18 82
0 5 2 189568 1436 196 114560 5988 38224 1641 9556 1065 938 0 34 66
0 5 2 192564 1432 196 116960 48712 6416 12235 1604 4913 6910 0 8 92
0 4 1 192564 1432 196 116960 4136 0 1034 0 357 479 0 7 93
1 3 2 192528 1432 196 116968 3284 8 834 2 371 510 0 5 95
2 3 3 192512 1436 184 116972 17772 2452 4561 613 894 1110 0 13 87
4 0 1 14964 24988 188 10068 548 0 787 0 162 137 0 75 25
3 1 2 41356 1404 184 8180 108 35200 792 8800 818 5539 0 53 47
4 2 2 75304 1432 184 9536 204 30896 215 7724 570 1705 0 56 44
1 3 0 99440 1584 184 26304 8512 13972 2128 3493 622 858 0 28 72
0 5 2 114228 1432 184 40880 16708 16464 4206 4116 2633 4967 0 11 89
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 4 3 115288 1432 184 41732 3600 11000 900 2750 370 832 1 32 67
1 3 1 116576 1432 184 43020 3168 0 792 0 415 561 0 5 95
1 4 2 117940 1580 180 44092 8256 2644 2089 661 1196 1659 0 6 94
1 4 3 117912 1432 180 44236 1960 0 490 0 965 1216 0 9 91
0 5 2 118076 1884 184 43908 1108 1588 283 397 590 747 0 15 85
1 4 2 161668 2300 184 86164 155412 129228 40971 32307 9921 17331 0 23 77
4 2 2 15368 57384 196 10448 46812 3396 13012 849 2707 3139 0 20 80
0 7 2 134100 1460 180 60068 195944 223620 53567 55908 47036 84839 0 12 88
4 1 1 19824 40776 188 14752 195852 199904 52674 49976 45992 77956 0 11 89
0 5 2 142680 1656 184 68304 184740 184356 49300 46089 48109 82885 0 9 91
6 3 0 19384 1404 184 13268 879100 1027052 246649 256793 204487 547788 0 22 78
4 4 1 53712 50064 208 11264 484836 552776 137106 138209 125215 250077 0 18 82
2 6 3 108684 15180 196 52812 398128 462772 111135 115729 53739 351528 1 53 46
4 5 0 54808 45656 256 13984 12880 840 5412 242 1627 2070 3 97 0
0 5 0 22620 104664 220 17392 18216 50944 6685 12746 2394 11021 1 44 56
0 3 0 22528 103264 284 18476 692 0 666 0 254 390 5 18 77
1 1 0 22160 102096 296 19564 536 0 840 0 242 438 7 15 78
0 0 0 22120 102192 316 19932 60 0 320 8 171 333 3 50 48
0 0 0 22120 102192 316 19932 0 0 0 0 101 7 0 96 4
0 0 0 22120 102184 324 19932 0 0 7 0 108 33 0 84 16
0 0 0 22120 102184 324 19932 0 0 0 0 101 7 0 96 4
[root]:# exit
exit
Script done on Wed Jan 3 12:00:31 2001
* Re: scheduling problem?
2001-01-03 4:46 ` Mike Galbraith
@ 2001-01-03 14:20 ` Daniel Phillips
2001-01-03 15:02 ` Mike Galbraith
2001-01-03 14:51 ` Daniel Phillips
1 sibling, 1 reply; 22+ messages in thread
From: Daniel Phillips @ 2001-01-03 14:20 UTC (permalink / raw)
To: Mike Galbraith, linux-kernel
Mike Galbraith wrote:
>
> On Wed, 3 Jan 2001, Daniel Phillips wrote:
>
> > Could you try this patch just to see what happens? It uses semaphores
> > for the bdflush synchronization instead of banging directly on the task
> > wait queues. It's supposed to be a drop-in replacement for the bdflush
> > wakeup/waitfor mechanism, but who knows, it may have subtly different
> > behaviour in your case.
>
> Semaphore timed out during boot, leaving bdflush as zombie.
Hmm, how could that happen? I'm booted and running with that patch
right now and have beaten on it extensively - it sounds like something
else is broken. Or maybe we've already established that - let me read
the thread again.
Which semaphore timed out, bdflush_request or bdflush_waiter?
--
Daniel
* Re: scheduling problem?
2001-01-03 4:46 ` Mike Galbraith
2001-01-03 14:20 ` Daniel Phillips
@ 2001-01-03 14:51 ` Daniel Phillips
2001-01-03 15:39 ` Mike Galbraith
1 sibling, 1 reply; 22+ messages in thread
From: Daniel Phillips @ 2001-01-03 14:51 UTC (permalink / raw)
To: Mike Galbraith, linux-kernel
Mike Galbraith wrote:
> Semaphore timed out during boot, leaving bdflush as zombie.
Wait a sec, what do you mean by 'semaphore timed out'? These should
wait patiently forever.
--
Daniel
* Re: scheduling problem?
2001-01-03 14:20 ` Daniel Phillips
@ 2001-01-03 15:02 ` Mike Galbraith
0 siblings, 0 replies; 22+ messages in thread
From: Mike Galbraith @ 2001-01-03 15:02 UTC (permalink / raw)
To: Daniel Phillips; +Cc: linux-kernel
On Wed, 3 Jan 2001, Daniel Phillips wrote:
> Mike Galbraith wrote:
> >
> > On Wed, 3 Jan 2001, Daniel Phillips wrote:
> >
> > > Could you try this patch just to see what happens? It uses semaphores
> > > for the bdflush synchronization instead of banging directly on the task
> > > wait queues. It's supposed to be a drop-in replacement for the bdflush
> > > wakeup/waitfor mechanism, but who knows, it may have subtly different
> > > behaviour in your case.
> >
> > Semaphore timed out during boot, leaving bdflush as zombie.
>
> Hmm, how could that happen? I'm booted and running with that patch
> right now and have beaten on it extensively - it sounds like something
> else is broken. Or maybe we've already established that - let me read
> the thread again.
>
> Which semaphore timed out, bdflush_request or bdflush_waiter?
I didn't watch closely (running virgin prerelease). I can run it again
if you think it's important.
-Mike
* Re: scheduling problem?
2001-01-03 14:51 ` Daniel Phillips
@ 2001-01-03 15:39 ` Mike Galbraith
2001-01-03 15:59 ` Daniel Phillips
0 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2001-01-03 15:39 UTC (permalink / raw)
To: Daniel Phillips; +Cc: linux-kernel
On Wed, 3 Jan 2001, Daniel Phillips wrote:
> Mike Galbraith wrote:
> > Semaphore timed out during boot, leaving bdflush as zombie.
>
> Wait a sec, what do you mean by 'semaphore timed out'? These should
> wait patiently forever.
IKD has a semaphore deadlock detector. Any place you take a semaphore
and have to wait longer than 5 seconds (what I had it set to because
with trace buffer set to 3000000 entries, it can only cover ~8 seconds
of disk [slowest] load), it triggers and freezes the trace buffer for
later use. It firing under load may not be of interest. (but it firing
looks to be very closely coupled to observed stalls with virgin source.
Linus fixes big stall and deadlock detector mostly shuts up. I fix a
smaller stall and it shuts up entirely.. for this workload)
-Mike
* Re: scheduling problem?
2001-01-03 15:39 ` Mike Galbraith
@ 2001-01-03 15:59 ` Daniel Phillips
0 siblings, 0 replies; 22+ messages in thread
From: Daniel Phillips @ 2001-01-03 15:59 UTC (permalink / raw)
To: linux-kernel
Mike Galbraith wrote:
>
> On Wed, 3 Jan 2001, Daniel Phillips wrote:
>
> > Mike Galbraith wrote:
> > > Semaphore timed out during boot, leaving bdflush as zombie.
> >
> > Wait a sec, what do you mean by 'semaphore timed out'? These should
> > wait patiently forever.
>
> IKD has a semaphore deadlock detector.
That was my tentative conclusion.
> Any place you take a semaphore
> and have to wait longer than 5 seconds (what I had it set to because
> with trace buffer set to 3000000 entries, it can only cover ~8 seconds
> of disk [slowest] load), it triggers and freezes the trace buffer for
> later use. It firing under load may not be of interest. (but it firing
> looks to be very closely coupled to observed stalls with virgin source.
> Linus fixes big stall and deadlock detector mostly shuts up. I fix a
> smaller stall and it shuts up entirely.. for this workload)
But it's entirely legal for a semaphore to wait forever when used in the
way I've used them, a producer/consumer pattern. You should be able to
run happily (at least as happily as before) with the watchdog disabled.
This begs the question of what to do about the 99.99% of cases where the
watchdog is a good thing to have. Shouldn't the watchdog just log the
'suspicious' event and continue?
--
Daniel
end of thread, other threads:[~2001-01-03 16:33 UTC | newest]
Thread overview: 22+ messages
2001-01-02 8:27 scheduling problem? Mike Galbraith
2001-01-02 14:01 ` Anton Blanchard
2001-01-02 14:59 ` Mike Galbraith
2001-01-02 19:02 ` Linus Torvalds
2001-01-02 20:09 ` Andrea Arcangeli
2001-01-02 21:02 ` Linus Torvalds
2001-01-02 21:52 ` Andrea Arcangeli
2001-01-02 22:01 ` Linus Torvalds
2001-01-02 22:23 ` Linus Torvalds
2001-01-03 4:48 ` Mike Galbraith
2001-01-03 5:52 ` Linus Torvalds
2001-01-03 7:21 ` Mike Galbraith
2001-01-03 11:30 ` Mike Galbraith
2001-01-02 23:13 ` Daniel Phillips
2001-01-03 4:46 ` Mike Galbraith
2001-01-03 14:20 ` Daniel Phillips
2001-01-03 15:02 ` Mike Galbraith
2001-01-03 14:51 ` Daniel Phillips
2001-01-03 15:39 ` Mike Galbraith
2001-01-03 15:59 ` Daniel Phillips
2001-01-03 2:39 ` Roger Larsson
2001-01-03 5:17 ` Mike Galbraith