* Linux balloon driver stops accepting target_kb for a long time
From: Dan Magenheimer @ 2010-08-23 22:45 UTC
  To: xen-devel; +Cc: jeremy, Keir Fraser, JBeulich

Balloon experts --

I'm seeing a strange problem in either the balloon driver
or in the Xen code that provides the support for it... still
trying to narrow down which.

The problem appears when I am running in-kernel selfballooning
code, and even then only rarely... I'm not sure exactly what conditions
are required, but for a long period of time (>30 minutes), writing
to target_kb inside a PV guest has no effect at all on the
memory size of the VM (as viewed inside the guest with "free -k")!
Under most conditions, writing to target_kb "immediately" changes
the memory size, but once in this state, it has no effect at all.
At the end of this long period of time, suddenly everything
is back to normal... and there's no obvious trigger that
signals the return to normalcy.

Note that though the problem is observed with selfballooning,
changing target_kb manually fails as well, so I suspect the
problem exists regardless of selfballooning but only
selfballooning is exercising the balloon sizing enough to
encounter the bug.
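
One way to put a number on the stall described above is to write a target
and poll the guest's total memory until it moves. The following is only a
sketch: the sysfs path in TARGET_KB is an assumption about where the balloon
driver exposes target_kb on this kernel, MemTotal from /proc/meminfo stands
in for the "total" column of "free -k", and the helper names are made up for
illustration.

```python
import time

# Assumed locations; adjust for the kernel actually in use.
TARGET_KB = "/sys/devices/system/xen_memory/xen_memory0/target_kb"
MEMINFO = "/proc/meminfo"


def mem_total_kb(meminfo_text):
    """Return MemTotal in kB from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def wait_for_change(read_total, baseline_kb, timeout_s, interval_s=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll read_total() until it differs from baseline_kb.

    Returns the elapsed seconds, or None if timeout_s passes first.
    """
    start = clock()
    while True:
        if read_total() != baseline_kb:
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


def read_file(path):
    with open(path) as f:
        return f.read()


def set_target_and_time(new_target_kb, timeout_s=3600.0):
    """Write a new balloon target and time the reaction (needs root)."""
    baseline = mem_total_kb(read_file(MEMINFO))
    with open(TARGET_KB, "w") as f:
        f.write(str(new_target_kb))
    return wait_for_change(lambda: mem_total_kb(read_file(MEMINFO)),
                           baseline, timeout_s)
```

Calling set_target_and_time() with a target different from the current size
returns the number of seconds until MemTotal first changes, or None if
nothing changes within the timeout.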

Reviewing code, one thing caught my attention.  In balloon_process(),
the balloon_mutex is down'ed then, under certain conditions
schedule() is called with the balloon_mutex still held and without
another timer set.  Any chance this could be a problem, especially
if another kernel thread invokes balloon_set_new_target()?
If so, what might finally kick the scheduled-out thread after
30 minutes to reset the balloon_timer and up the mutex?

If this is wrong, any other ideas what might be causing
this weird problem?

Thanks,
Dan

P.S. This is the Linux 2.6.18-based balloon driver (with latest
patches from xen-unstable), but I may see if I can reproduce it
on an upstream balloon driver as well.

* Re: Linux balloon driver stops accepting target_kb for a long time
From: Jan Beulich @ 2010-08-24  7:45 UTC
  To: Dan Magenheimer; +Cc: jeremy, xen-devel, Keir Fraser

>>> On 24.08.10 at 00:45, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> Reviewing code, one thing caught my attention.  In balloon_process(),
> the balloon_mutex is down'ed then, under certain conditions
> schedule() is called with the balloon_mutex still held and without
> another timer set.  Any chance this could be a problem, especially
> if another kernel thread invokes balloon_set_new_target()?
> If so, what might finally kick the scheduled-out thread after
> 30 minutes to reset the balloon_timer and up the mutex?

How could this be a problem? Calling schedule() is a yield, not an
indefinite sleep, and hence the loop will resume as soon as there's
no higher priority runnable task anymore for a long enough time
(obviously very much less than 30 minutes, unless something
really odd is running on your box).
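
The distinction between yielding and sleeping while holding a lock can be
modeled in user space. This is an analogy only, not the driver code: Python
threads and threading.Lock stand in for kernel threads and balloon_mutex,
time.sleep(0) stands in for schedule(), and whether
balloon_set_new_target() takes the mutex at all is not established in this
thread. The names run_model, worker and setter are hypothetical.

```python
import threading
import time


def run_model(yields=1000):
    """Model a worker that yields repeatedly while holding a lock."""
    lock = threading.Lock()
    holding = threading.Event()
    events = []

    def worker():
        with lock:
            holding.set()          # tell the other thread the lock is taken
            for _ in range(yields):
                time.sleep(0)      # give up the CPU, lock still held
            events.append("worker releases lock")

    def setter():
        holding.wait()             # only contend once the worker holds the lock
        with lock:                 # blocks until the worker is done
            events.append("setter acquires lock")

    threads = [threading.Thread(target=worker),
               threading.Thread(target=setter)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

In this model the second thread always gets the lock, but only after the
holder has been scheduled again often enough to finish its loop. If the
holder is rarely chosen to run, anything waiting on the lock waits just as
long.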

Furthermore, besides the obvious option of inserting some debug
code, I think SysRq-t would also allow you to check whether
balloon_process() indeed doesn't exit over a period of minutes.
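
For a guest without a usable console, the same task dump can be requested by
writing "t" to /proc/sysrq-trigger and then searching the kernel log. A
minimal sketch, assuming root access, an enabled SysRq interface, and a
dmesg binary on the PATH; the helper names are made up, and the exact layout
of the dump differs between kernel versions, so the search only looks for
the symbol name.

```python
import subprocess

SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def trigger_task_dump(path=SYSRQ_TRIGGER):
    """Ask the kernel for a SysRq-t task dump (needs root)."""
    with open(path, "w") as f:
        f.write("t")


def lines_mentioning(log_text, symbol="balloon_process"):
    """Return the stripped log lines that contain the given symbol."""
    return [ln.strip() for ln in log_text.splitlines() if symbol in ln]


def balloon_in_dump():
    """Trigger a dump and return kernel log lines naming balloon_process."""
    trigger_task_dump()
    log = subprocess.run(["dmesg"], capture_output=True, text=True,
                         check=True).stdout
    return lines_mentioning(log)
```

With many tasks running, the dump may be larger than the kernel log buffer,
in which case older lines are lost and an empty result proves nothing.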

Jan

* RE: Linux balloon driver stops accepting target_kb for a long time
From: Dan Magenheimer @ 2010-08-24 22:38 UTC
  To: Jan Beulich; +Cc: jeremy, xen-devel, Keir Fraser

> From: Jan Beulich [mailto:JBeulich@novell.com]
> Subject: Re: Linux balloon driver stops accepting target_kb for a long
> time
> 
> >>> On 24.08.10 at 00:45, Dan Magenheimer <dan.magenheimer@oracle.com>
> wrote:
> > Reviewing code, one thing caught my attention.  In balloon_process(),
> > the balloon_mutex is down'ed then, under certain conditions
> > schedule() is called with the balloon_mutex still held and without
> > another timer set.  Any chance this could be a problem, especially
> > if another kernel thread invokes balloon_set_new_target()?
> > If so, what might finally kick the scheduled-out thread after
> > 30 minutes to reset the balloon_timer and up the mutex?
> 
> How could this be a problem? Calling schedule() is a yield, not an
> indefinite sleep, and hence the loop will resume as soon as there's
> no higher priority runnable task anymore for a long enough time
> (obviously very much less than 30 minutes, unless something
> really odd is running on your box).

Hi Jan --

Well the 1 vcpu system is very busy doing a "make -j64" and there's
a high amount of swap activity.  What priority does balloon_worker
(launched with schedule_work()) have relative to userland
threads and other kernel threads such as kswapd?  I.e. is
it possible that it gets locked out for 30 minutes?  It appears
that the new balloon target is applied only when system activity
goes way down (when the number of cc1's run from make starts
going down).

Is there any way to boost the priority of this thread?
Also, if it matters, the "make -j64" is launched from /etc/rc.local,
so might that boost the priority of the "userland" threads?
 
> Furthermore, besides the obvious option of inserting some debug
> code

Since it's hard to reproduce, I've been avoiding adding debug
code so I don't lose my test case.  I'm about to try some things
now, but I had hoped to narrow down the likely problem sources first.

> I think SysRq-t would also allow you to check whether
> balloon_process() indeed doesn't exit over a period of minutes

This was a good idea, but I haven't yet gotten a full SysRq-t
output because there are so many processes running and I think
the SysRq-t adds to the general chaos... When I use it, the
guest goes into 100% vcpu usage after the "make -j64" is
complete. :-(  However, I can ssh in and top shows the
thread "events/0" using nearly 100% of the cpu.

Assuming I get a good SysRq-t, would I simply be looking for
a process stack dump with balloon_process in the stack?
Would this kind of a yielded kernel thread even show up in
SysRq-t output?

* RE: Linux balloon driver stops accepting target_kb for a long time
From: Jan Beulich @ 2010-08-25  8:16 UTC
  To: Dan Magenheimer; +Cc: jeremy, xen-devel, Keir Fraser

>>> On 25.08.10 at 00:38, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> Well the 1 vcpu system is very busy doing a "make -j64" and there's
> a high amount of swap activity.  What priority does balloon_worker
> (launched with schedule_work()) have relative to userland
> threads and other kernel threads such as kswapd?  I.e. is
> it possible that it gets locked out for 30 minutes?  It appears
> that the new balloon target is applied only when system activity
> goes way down (when the number of cc1's run from make starts
> going down).

It should be running in one of the events/<number> worker threads,
which appear to get priority adjusted only in the RT case. It wouldn't
seem right for that thread to get starved for 30 min, but then again
running a "make -j64" on a 1-vCPU and too-little-memory system
seems questionable in the first place.
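
The scheduling parameters of the worker thread can be read from inside the
guest rather than guessed. A minimal sketch, assuming the thread is named
"events/0" as reported by top, and reading the priority and nice values from
fields 18 and 19 of /proc/<pid>/stat as documented in proc(5); the function
names are hypothetical.

```python
import os


def parse_stat(stat_text):
    """Return (comm, priority, nice) from the text of /proc/<pid>/stat."""
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")   # comm itself may contain ')'
    comm = stat_text[open_paren + 1:close_paren]
    rest = stat_text[close_paren + 1:].split()
    # rest[0] is field 3 (state), so fields 18 and 19 are rest[15], rest[16].
    return comm, int(rest[15]), int(rest[16])


def find_threads(name, proc="/proc"):
    """Return (pid, priority, nice) for every task whose comm equals name."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                comm, prio, nice = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue                      # task exited or unexpected format
        if comm == name:
            found.append((int(entry), prio, nice))
    return found
```

Comparing find_threads("events/0") with find_threads("cc1") during the build
would show whether the compiler processes and the worker thread are running
at the same nice level.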

> Is there any way to boost the priority of this thread?
> Also, if it matters, the "make -j64" is launched from /etc/rc.local,
> so might that boost the priority of the "userland" threads?

I don't think so, on either count.

>> I think SysRq-t would also allow you to check whether
>> balloon_process() indeed doesn't exit over a period of minutes
> 
> This was a good idea, but I haven't yet gotten a full SysRq-t
> output because there are so many processes running and I think
> the SysRq-t adds to the general chaos... When I use it, the
> guest goes into 100% vcpu usage after the "make -j64" is
> complete. :-(  However, I can ssh in and top shows the
> thread "events/0" using nearly 100% of the cpu.
>
> Assuming I get a good SysRq-t, would I simply be looking for
> a process stack dump with balloon_process in the stack?
> Would this kind of a yielded kernel thread even show up in
> SysRq-t output?

Hmm, since this is a single-vCPU VM, it would be unavoidable
for the SysRq-t dump to run in the same thread as the balloon worker.
As a result you wouldn't be able to see any trace of it...

Jan
