linux-kernel.vger.kernel.org archive mirror
* RE: Kernel deadlock using nbd over acenic driver.
@ 2002-05-14 16:07 chen, xiangping
  2002-05-14 16:32 ` Steven Whitehouse
                   ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: chen, xiangping @ 2002-05-14 16:07 UTC (permalink / raw)
  To: 'Jes Sorensen'; +Cc: 'Steve Whitehouse', linux-kernel

But how to avoid system hangs due to running out of memory?
Is there a safe guideline? Generally slow is tolerable, but
crash is not.

Thanks,

Xiangping

-----Original Message-----
From: Jes Sorensen [mailto:jes@wildopensource.com]
Sent: Tuesday, May 14, 2002 11:11 AM
To: chen, xiangping
Cc: 'Steve Whitehouse'; linux-kernel@vger.kernel.org
Subject: Re: Kernel deadlock using nbd over acenic driver.


>>>>> "Xiangping" == chen, xiangping <chen_xiangping@emc.com> writes:

Xiangping> But the acenic driver author suggested that sndbuf should
Xiangping> be at least 262144, and the sndbuf can not exceed
Xiangping> r/wmem_default. Is that correct?

Ehm, the acenic author is me ;-)

The default value is what all sockets are assigned on open; you can
adjust this per socket using SO_SNDBUF and SO_RCVBUF, but the values you
set cannot exceed the [rw]mem_max values. Basically, if you set the
default to 4MB, your telnet sockets will have a 4MB limit as well,
which may not be what you want (not saying they will use 4MB).

Thus, set the _max values and use SO_SNDBUF and SO_RCVBUF to set the
per-process values, but leave the _default values at their original
settings.
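
For illustration only, a minimal user-space sketch of that per-socket
tuning (the 256KB figure is just an example; the kernel clamps whatever
you request against the [rw]mem_max limits):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
        int s = socket(AF_INET, SOCK_STREAM, 0);
        int len = 262144;               /* request; capped by [rw]mem_max */
        socklen_t optlen = sizeof(len);

        if (s < 0) {
                perror("socket");
                return 1;
        }
        if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &len, sizeof(len)) < 0)
                perror("SO_SNDBUF");
        if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &len, sizeof(len)) < 0)
                perror("SO_RCVBUF");

        /* Read back what the kernel actually granted. */
        if (getsockopt(s, SOL_SOCKET, SO_SNDBUF, &len, &optlen) == 0)
                printf("effective SO_SNDBUF: %d\n", len);
        return 0;
}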

Xiangping> So for gigabit Ethernet driver, what is the optimal mem
Xiangping> configuration for performance and reliability?

It depends on your application, number of streams, general usage of
the connection etc. There's no perfect-for-all magic number.

Jes


* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-14 16:07 Kernel deadlock using nbd over acenic driver chen, xiangping
@ 2002-05-14 16:32 ` Steven Whitehouse
  2002-05-14 16:48 ` Alan Cox
  2002-05-15 22:31 ` Oliver Xymoron
  2 siblings, 0 replies; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-14 16:32 UTC (permalink / raw)
  To: chen, xiangping; +Cc: 'Jes Sorensen', linux-kernel

Hi,

The TCP stack should auto-tune the amount of memory that it uses, so
setting SO_SNDBUF or writing to /proc/sys/net/core/[rw]mem_default etc. is
not required. The only important settings for TCP sockets are
/proc/sys/net/ipv4/tcp_[rw]mem and tcp_mem, I think (at least if I've
understood the code correctly).
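
For reference, a small sketch that just dumps those settings.  The
tcp_rmem and tcp_wmem files hold a "min default max" triple in bytes per
socket; tcp_mem holds a "min pressure max" triple counted in pages for
the whole stack:

#include <stdio.h>

static void dump(const char *path)
{
        char buf[128];
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return;
        }
        if (fgets(buf, sizeof(buf), f))
                printf("%-32s %s", path, buf);
        fclose(f);
}

int main(void)
{
        dump("/proc/sys/net/ipv4/tcp_rmem");
        dump("/proc/sys/net/ipv4/tcp_wmem");
        dump("/proc/sys/net/ipv4/tcp_mem");
        return 0;
}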

Since I think we are talking about only a single nbd device, there should
only be a single socket that's doing lots of I/O in this case. Or is this
machine doing other heavy network tasks?
> 
> But how to avoid system hangs due to running out of memory?
> Is there a safe guideline? Generally slow is tolerable, but
> crash is not.
> 
I agree. I also think your earlier comments about the buffer flushing
correctly identify the most likely cause.

I don't think the system has "run out" exactly; rather, it has got itself
into a state where the code path writing out dirty blocks is blocked
due to lack of freeable memory at that moment, and where the process
freeing up memory has blocked waiting for the nbd device. It may well
be that there is freeable memory, just that for whatever reason no
process is trying to free it.

The LVM team has had a similar problem in dealing with I/O which needs
extra memory in order to complete, so I'll ask them for some ideas. Also
I'm going to try and come up with some patches to eliminate some of the
possible theories so that we can narrow down the options,

Steve



* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-14 16:07 Kernel deadlock using nbd over acenic driver chen, xiangping
  2002-05-14 16:32 ` Steven Whitehouse
@ 2002-05-14 16:48 ` Alan Cox
  2002-05-15 22:31 ` Oliver Xymoron
  2 siblings, 0 replies; 53+ messages in thread
From: Alan Cox @ 2002-05-14 16:48 UTC (permalink / raw)
  To: chen, xiangping
  Cc: 'Jes Sorensen', 'Steve Whitehouse', linux-kernel

> Xiangping> So for gigabit Ethernet driver, what is the optimal mem
> Xiangping> configuration for performance and reliability?
> 
> It depends on your application, number of streams, general usage of
> the connection etc. There's no perfect-for-all magic number.

The primary constraints are

	TCP max window size
	TCP congestion window size (cwnd)
	Latency

Most of the good discussion on this matter can be found in the IETF
archives from the window scaling options work, and in part in the RFCs
that the work led to.



* RE: Kernel deadlock using nbd over acenic driver.
  2002-05-14 16:07 Kernel deadlock using nbd over acenic driver chen, xiangping
  2002-05-14 16:32 ` Steven Whitehouse
  2002-05-14 16:48 ` Alan Cox
@ 2002-05-15 22:31 ` Oliver Xymoron
  2002-05-16  5:10   ` Peter T. Breuer
  2 siblings, 1 reply; 53+ messages in thread
From: Oliver Xymoron @ 2002-05-15 22:31 UTC (permalink / raw)
  To: chen, xiangping
  Cc: 'Jes Sorensen', 'Steve Whitehouse', linux-kernel

On Tue, 14 May 2002, chen, xiangping wrote:

> But how to avoid system hangs due to running out of memory?
> Is there a safe guideline? Generally slow is tolerable, but
> crash is not.

If the system runs out of memory, it may try to flush pages that are
queued to your NBD device. That will try to allocate more memory for
sending packets, which will fail, meaning the VM can never make progress
freeing pages. Now your box is dead.

The only way to deal with this is to have a scheme for per-socket memory
reservations in the network layer and have NBD reserve memory for sending
and acknowledging packets. NFS and iSCSI also need this, though it's a
bit harder to tickle for NFS. SCSI has DMA reserved memory for analogous
reasons.

-- 
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."



* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-15 22:31 ` Oliver Xymoron
@ 2002-05-16  5:10   ` Peter T. Breuer
  2002-05-16  5:19     ` Peter T. Breuer
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-16  5:10 UTC (permalink / raw)
  To: Oliver Xymoron
  Cc: chen, xiangping, 'Jes Sorensen',
	'Steve Whitehouse',
	linux-kernel

"A month of sundays ago Oliver Xymoron wrote:"
> On Tue, 14 May 2002, chen, xiangping wrote:
> 
> > But how to avoid system hangs due to running out of memory?
> > Is there a safe guideline? Generally slow is tolerable, but
> > crash is not.
> 
> If the system runs out of memory, it may try to flush pages that are
> queued to your NBD device. That will try to allocate more memory for
> sending packets, which will fail, meaning the VM can never make progress
> freeing pages. Now your box is dead.
> 
> The only way to deal with this is to have a scheme for per-socket memory
> reservations in the network layer and have NBD reserve memory for sending

I entirely agree. However, initial reports are that setting

  current->flags |= PF_MEMALLOC;

in the process about to do the networking (and unsetting it afterwards)
cures the apparent symptoms observed with swap over Enbd (see
freshmeat) in post 2.4.10 kernels.
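
For illustration only (this is not the actual enbd patch), the workaround
amounts to wrapping the send path roughly as below; nbd_send_the_request()
is a made-up placeholder for whatever does the network send:

#include <linux/sched.h>
#include <linux/net.h>
#include <linux/blkdev.h>

/* Let the sending thread dip into the emergency memory reserve for the
 * duration of the network send, then restore its old flags. */
static int nbd_xmit_with_reserve(struct socket *sock, struct request *req)
{
        unsigned long had_memalloc = current->flags & PF_MEMALLOC;
        int err;

        current->flags |= PF_MEMALLOC;
        err = nbd_send_the_request(sock, req);  /* hypothetical helper */
        if (!had_memalloc)
                current->flags &= ~PF_MEMALLOC;
        return err;
}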

I'll get back more reports later today.

> and acknowledging packets. NFS and iSCSI also need this, though it's a
> bit harder to tickle for NFS. SCSI has DMA reserved memory for analogous
> reasons.

Ah. I always wondered about NFS. If iSCSI does the reservation in a
controlled way, I will have to look at it.

Peter


* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-16  5:10   ` Peter T. Breuer
@ 2002-05-16  5:19     ` Peter T. Breuer
  2002-05-16 14:29       ` Oliver Xymoron
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-16  5:19 UTC (permalink / raw)
  To: ptb
  Cc: Oliver Xymoron, chen, xiangping, 'Jes Sorensen',
	'Steve Whitehouse',
	linux-kernel

"Oliver Xymoron wrote:"
> If the system runs out of memory, it may try to flush pages that are
> queued to your NBD device. That will try to allocate more memory for
> sending packets, which will fail, meaning the VM can never make progress
> freeing pages. Now your box is dead.

The system can avoid this by

 a) not flushing sync  (i.e. giving up on pages that won't flush immediately)
 b) being nondeterministic about it .. not always retrying the same
    thing again and again.

Can one achieve those characteristics? I suspect setting the vm
sync boundary to 100% should arrange for (a)?

Peter


* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-16  5:19     ` Peter T. Breuer
@ 2002-05-16 14:29       ` Oliver Xymoron
  2002-05-16 15:35         ` Peter T. Breuer
  0 siblings, 1 reply; 53+ messages in thread
From: Oliver Xymoron @ 2002-05-16 14:29 UTC (permalink / raw)
  To: Peter T. Breuer
  Cc: chen, xiangping, 'Jes Sorensen',
	'Steve Whitehouse',
	linux-kernel

On Thu, 16 May 2002, Peter T. Breuer wrote:

> "Oliver Xymoron wrote:"
> > If the system runs out of memory, it may try to flush pages that are
> > queued to your NBD device. That will try to allocate more memory for
> > sending packets, which will fail, meaning the VM can never make progress
> > freeing pages. Now your box is dead.
>
> The system can avoid this by
>
>  a) not flushing sync  (i.e. giving up on pages that won't flush immediately)
>  b) being nondeterministic about it .. not always retrying the same
>     thing again and again.

Helpful but insufficient. What is to stop getting into a situation where
_all_ memory that is pageable is only pageable via network? Even if you
have a big box, if you do large streaming writes to about 20 NBD devices,
you'll discover that each device queue can hold many megabytes of dirty
data.. Try pulling out your ethernet cable for a moment and watch the
thing strangle itself.

Harder to get into this situation with NFS, but still doable.

-- 
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."



* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-16 14:29       ` Oliver Xymoron
@ 2002-05-16 15:35         ` Peter T. Breuer
  2002-05-16 16:22           ` Oliver Xymoron
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-16 15:35 UTC (permalink / raw)
  To: Oliver Xymoron
  Cc: Peter T. Breuer, chen, xiangping, 'Jes Sorensen',
	'Steve Whitehouse',
	linux-kernel

"A month of sundays ago Oliver Xymoron wrote:"
> On Thu, 16 May 2002, Peter T. Breuer wrote:
> > "Oliver Xymoron wrote:"
> > > If the system runs out of memory, it may try to flush pages that are
> > > queued to your NBD device. That will try to allocate more memory for
> > > sending packets, which will fail, meaning the VM can never make progress
> > > freeing pages. Now your box is dead.
> > The system can avoid this by
> >
> >  a) not flushing sync  (i.e. giving up on pages that won't flush immediately)
> >  b) being nondeterministic about it .. not always retrying the same
> >     thing again and again.
> 
> Helpful but insufficient. What is to stop getting into a situation where
> _all_ memory that is pageable is only pageable via network? Even if you

OK, I agree. The socket (or the process using the socket) needs a
reserve of memory that it can call upon in order to complete each
individual network send, and that the rest of the system cannot touch.

> have a big box, if you do large streaming writes to about 20 NBD devices,
> you'll discover that each device queue can hold many megabytes of dirty
> data.. Try pulling out your ethernet cable for a moment and watch the
> thing strangle itself.

Any way of making sure that send_msg on the socket can always get the
(known a priori) buffers it needs?

> Harder to get into this situation with NFS, but still doable.

OTOH, if there is even a single other thread anywhere holding pages
that we can reclaim, then we can find them by using an async stochastic
algorithm inside the VM, instead of the current sync, deterministic one,
surely!

Peter


* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-16 15:35         ` Peter T. Breuer
@ 2002-05-16 16:22           ` Oliver Xymoron
  2002-05-16 16:45             ` Peter T. Breuer
  0 siblings, 1 reply; 53+ messages in thread
From: Oliver Xymoron @ 2002-05-16 16:22 UTC (permalink / raw)
  To: Peter T. Breuer
  Cc: chen, xiangping, 'Jes Sorensen',
	'Steve Whitehouse',
	linux-kernel

On Thu, 16 May 2002, Peter T. Breuer wrote:

> "A month of sundays ago Oliver Xymoron wrote:"
> > On Thu, 16 May 2002, Peter T. Breuer wrote:
> > > "Oliver Xymoron wrote:"
> > > > If the system runs out of memory, it may try to flush pages that are
> > > > queued to your NBD device. That will try to allocate more memory for
> > > > sending packets, which will fail, meaning the VM can never make progress
> > > > freeing pages. Now your box is dead.
> > > The system can avoid this by
> > >
> > >  a) not flushing sync  (i.e. giving up on pages that won't flush immediately)
> > >  b) being nondeterministic about it .. not always retrying the same
> > >     thing again and again.
> >
> > Helpful but insufficient. What is to stop getting into a situation where
> > _all_ memory that is pageable is only pageable via network? Even if you
>
> OK, I agree. The socket (or the process using the socket) needs a
> reserve of memory that it can call upon in order to complete each
> individual network send, and that the rest of the system cannot touch.
>
> > have a big box, if you do large streaming writes to about 20 NBD devices,
> > you'll discover that each device queue can hold many megabytes of dirty
> > data.. Try pulling out your ethernet cable for a moment and watch the
> > thing strangle itself.
>
> Any way of making sure that send_msg on the socket can always get the
> (known a priori) buffers it needs?

Not at present. Note that we also need reservations on the receive side
for ACK handling which is "interesting".
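
Purely as an illustration of the kind of reservation being asked for
(nothing like this exists in the stock kernel; the names and sizes below
are invented):

#include <linux/skbuff.h>
#include <linux/slab.h>
#include <linux/errno.h>

#define NBD_RESERVE_SKBS        8       /* invented figures */
#define NBD_RESERVE_SIZE        4096

static struct sk_buff_head nbd_reserve;

/* Fill a small private pool of sk_buffs at device setup time. */
static int nbd_reserve_init(void)
{
        int i;

        skb_queue_head_init(&nbd_reserve);
        for (i = 0; i < NBD_RESERVE_SKBS; i++) {
                struct sk_buff *skb = alloc_skb(NBD_RESERVE_SIZE, GFP_KERNEL);

                if (!skb)
                        return -ENOMEM;
                skb_queue_tail(&nbd_reserve, skb);
        }
        return 0;
}

/* Try a normal allocation first and fall back to the reserve, so the
 * write-out path always has something to send with.  A real version
 * would also need a way to refill the pool once memory frees up. */
static struct sk_buff *nbd_get_send_skb(void)
{
        struct sk_buff *skb = alloc_skb(NBD_RESERVE_SIZE, GFP_ATOMIC);

        return skb ? skb : skb_dequeue(&nbd_reserve);
}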

> OTOH, if there is even a single other thread anywhere holding pages
> that we can reclaim, then we can find them by using an async stochastic
> algorithm inside the VM, instead of the current sync, deterministic one,
> surely!

Yes - falling over at the first hard-to-push page is bad.

-- 
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."



* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-16 16:45             ` Peter T. Breuer
@ 2002-05-16 16:35               ` Steven Whitehouse
  2002-05-17  7:01                 ` Peter T. Breuer
  0 siblings, 1 reply; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-16 16:35 UTC (permalink / raw)
  To: ptb; +Cc: Oliver Xymoron, chen xiangping, 'Jes Sorensen', linux-kernel

Hi,

> 
> "Oliver Xymoron wrote:"
> > On Thu, 16 May 2002, Peter T. Breuer wrote:
> > > Any way of making sure that send_msg on the socket can always get the
> > > (known a priori) buffers it needs?
> > 
> > Not at present. Note that we also need reservations on the receive side
> > for ACK handling which is "interesting".
> 
> One thing at a time.  What if there is a zone "ceiling" that we keep
> lowered exactly until it is time for the process that does the send_msg
> to run, when we raise the ceiling.  (I don't know how this VM stuff
> works in detail inside - this is an invitation to list the objections).
> The scheduler could presumably be trained to muck with the ceilings
> according to flags on the process (task?) structs.
> 
> Peter
> 
That's effectively what PF_MEMALLOC does. The code in question is in
page_alloc.c:__alloc_pages just before and after the rebalance: label.
The z->pages_min gives a per zone minimum for "other processes" that are
not PF_MEMALLOC,
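
A loose paraphrase of that 2.4 logic, not the literal code (rmqueue() is
the real free-list helper; everything else is simplified):

#include <linux/mm.h>
#include <linux/sched.h>

/* Ordinary allocations must leave roughly z->pages_min free pages in a
 * zone; a PF_MEMALLOC task may ignore the watermark so the write-out
 * path can still make progress. */
static struct page *alloc_from_zone(zone_t *z, unsigned int order)
{
        unsigned long min = (1UL << order) + z->pages_min;

        if (z->free_pages > min || (current->flags & PF_MEMALLOC))
                return rmqueue(z, order);

        return NULL;    /* leave the last pages for PF_MEMALLOC callers */
}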

Steve.



* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-16 16:22           ` Oliver Xymoron
@ 2002-05-16 16:45             ` Peter T. Breuer
  2002-05-16 16:35               ` Steven Whitehouse
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-16 16:45 UTC (permalink / raw)
  To: Oliver Xymoron
  Cc: Peter T. Breuer, chen, xiangping, 'Jes Sorensen',
	'Steve Whitehouse',
	linux-kernel

"Oliver Xymoron wrote:"
> On Thu, 16 May 2002, Peter T. Breuer wrote:
> > Any way of making sure that send_msg on the socket can always get the
> > (known a priori) buffers it needs?
> 
> Not at present. Note that we also need reservations on the receive side
> for ACK handling which is "interesting".

One thing at a time.  What if there is a zone "ceiling" that we keep
lowered exactly until it is time for the process that does the send_msg
to run, when we raise the ceiling.  (I don't know how this VM stuff
works in detail inside - this is an invitation to list the objections).
The scheduler could presumably be trained to muck with the ceilings
according to flags on the process (task?) structs.

Peter


* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-16 16:35               ` Steven Whitehouse
@ 2002-05-17  7:01                 ` Peter T. Breuer
  2002-05-17  9:26                   ` Steven Whitehouse
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-17  7:01 UTC (permalink / raw)
  To: Steve Whitehouse
  Cc: ptb, Oliver Xymoron, chen xiangping, 'Jes Sorensen',
	linux-kernel

"Steven Whitehouse wrote:"
> That's effectively what PF_MEMALLOC does. The code in question is in
> page_alloc.c:__alloc_pages just before and after the rebalance: label.
> The z->pages_min gives a per zone minimum for "other processes" that are
> not PF_MEMALLOC,

A related question, then ... can one adjust the difference between the
ceiling for "normal" processes and PF_MEMALLOC processes, and if so,
how?

Peter


* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-17  7:01                 ` Peter T. Breuer
@ 2002-05-17  9:26                   ` Steven Whitehouse
  0 siblings, 0 replies; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-17  9:26 UTC (permalink / raw)
  To: ptb; +Cc: Oliver Xymoron, chen xiangping, 'Jes Sorensen', linux-kernel

Hi,

> 
> "Steven Whitehouse wrote:"
> > That's effectively what PF_MEMALLOC does. The code in question is in
> > page_alloc.c:__alloc_pages just before and after the rebalance: label.
> > The z->pages_min gives a per zone minimum for "other processes" that are
> > not PF_MEMALLOC,
> 
> A related question, then ... can one adjust the difference between the
> ceiling for "normal" processes and PF_MEMALLOC processes, and if so,
> how?
> 
> Peter
> 

In page_alloc.c:__alloc_pages() the minimum is calculated as
(1UL << order) plus z->pages_low for each memory zone scanned whilst
looking for memory. Various scans are done, but z->pages_min is the key
number from which the limits are calculated.

It appears that z->pages_min is initialized in
page_alloc.c:free_area_init_core() to the real size of the zone divided
by zone_balance_ratio[] (128 by default) for that zone, and clamped
between zone_balance_min[] (20 by default) and zone_balance_max[] (255 by
default).

It appears that the memfrac= command line option should allow tweaking
of zone_balance_ratio[], but if you've got more than approx 128M of memory
it would appear (if I've understood this correctly) that you'll have the
maximum of 255 pages for z->pages_min.
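
Restated as a sketch (simplified from the free_area_init_core() logic
described above, using the default values quoted there):

static int zone_balance_ratio[3] = { 128, 128, 128 };   /* memfrac= overrides */
static int zone_balance_min[3]   = { 20, 20, 20 };
static int zone_balance_max[3]   = { 255, 255, 255 };

/* pages_min = zone size / ratio, clamped between min and max.
 * Example: a 128MB zone is 32768 4K pages; 32768 / 128 = 256,
 * which is then clamped down to 255. */
static unsigned long zone_pages_min(unsigned long realsize, int idx)
{
        unsigned long mask = realsize / zone_balance_ratio[idx];

        if (mask < zone_balance_min[idx])
                mask = zone_balance_min[idx];
        else if (mask > zone_balance_max[idx])
                mask = zone_balance_max[idx];
        return mask;
}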

Steve.




* Re: Kernel deadlock using nbd over acenic driver
  2002-06-01 21:13       ` Peter T. Breuer
@ 2002-06-05  8:48         ` Steven Whitehouse
  2002-06-02  6:39           ` Pavel Machek
  0 siblings, 1 reply; 53+ messages in thread
From: Steven Whitehouse @ 2002-06-05  8:48 UTC (permalink / raw)
  To: ptb; +Cc: linux kernel

Hi,

> 
> "Steven Whitehouse wrote:"
> 
> (something about kernel nbd)
> 
> BTW, are you maintaining kernel nbd? If so, I'd like to propose
> some unifications that would make it possible to run either
> enbd or nbd daemons on the same driver, at least in a "compatibility
> mode".
> 
No. My interest is just to help ensure that it's working by sending
the occasional bug fix. Pavel Machek is officially in charge, so you'll
need to convince him of any changes.

> The starting point would be
> 
> 1) make the over-the-wire data formats the same, which means
>    enlarging kernel nbd's nbd_request and nbd_reply structs
>    to match enbd's, or some compromise.
> 
> 2) less important .. make the driver structs the same. enbd has more
>    fields there too, for accounting purposes. That's the nbd_device struct.
> 
> Later on one can add some cross-ioctls.
> 
> Peter
> 
I'm not so convinced that this is a good idea. I've always looked upon nbd
as the "as simple as possible" style of driver and its over the wire format
is good enough to cope with most things I think. Does enbd have a negotiation
sequence at start up like nbd ? Perhaps it would be possible to add some
code so a server could tell which type of client it was talking to ? I
think that would be a simpler code change and I'd be happier to see that kind
of change rather than any change to the over the wire format.

It would be nice to add a bit more accounting. We also need to dynamically
allocate the nbd driver structures because as they get larger it's less
efficient to allocate them statically as we currently do. The question is
then when to free them. I think that probably the disconnect ioctl() could
provide a suitable hook for that,
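
A purely hypothetical sketch of that idea (the names and the structure
layout are invented, not the real nbd code):

#include <linux/slab.h>
#include <linux/string.h>
#include <linux/net.h>

#define NBD_MAX_DEVS    128             /* illustrative device count */

struct nbd_dev_state {                  /* invented stand-in for nbd_device */
        struct socket *sock;
        /* ...plus whatever extra accounting fields get added... */
};

static struct nbd_dev_state *nbd_devs[NBD_MAX_DEVS];

/* Allocate the per-device structure lazily, on first open. */
static struct nbd_dev_state *nbd_get_dev(int minor)
{
        if (!nbd_devs[minor]) {
                nbd_devs[minor] = kmalloc(sizeof(*nbd_devs[minor]), GFP_KERNEL);
                if (nbd_devs[minor])
                        memset(nbd_devs[minor], 0, sizeof(*nbd_devs[minor]));
        }
        return nbd_devs[minor];
}

/* Free it again from the disconnect ioctl, once the socket is gone. */
static void nbd_put_dev(int minor)
{
        if (nbd_devs[minor]) {
                kfree(nbd_devs[minor]);
                nbd_devs[minor] = NULL;
        }
}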

Steve.



* Re: Kernel deadlock using nbd over acenic driver
  2002-06-05  8:48         ` Steven Whitehouse
@ 2002-06-02  6:39           ` Pavel Machek
  0 siblings, 0 replies; 53+ messages in thread
From: Pavel Machek @ 2002-06-02  6:39 UTC (permalink / raw)
  To: Steve Whitehouse; +Cc: ptb, linux kernel

Hi!

> > (something about kernel nbd)
> > 
> > BTW, are you maintaining kernel nbd? If so, I'd like to propose
> > some unifications that would make it possible to run either
> > enbd or nbd daemons on the same driver, at least in a "compatibility
> > mode".
> > 
> No. My interest is just to help ensure that it's working by sending
> the occasional bug fix. Pavel Machek is officially in charge, so you'll
> need to convince him of any changes.

...and thanx a lot for your work...

> > The starting point would be
> > 
> > 1) make the over-the-wire data formats the same, which means
> >    enlarging kernel nbd's nbd_request and nbd_reply structs
> >    to match enbd's, or some compromise.
> > 
> > 2) less important .. make the driver structs the same. enbd has more
> >    fields there too, for accounting purposes. That's the nbd_device struct.
> > 
> > Later on one can add some cross-ioctls.
> > 
> I'm not so convinced that this is a good idea. I've always looked upon nbd
> as the "as simple as possible" style of driver and its over the wire format
> is good enough to cope with most things I think. Does enbd have a negotiation
> sequence at start up like nbd ? Perhaps it would be possible to add some
> code so a server could tell which type of client it was talking to ? I
> think that would be a simpler code change and I'd be happier to see that kind
> of change rather than any change to the over the wire format.

Agreed. If you want to integrate enbd, go ahead, and put it into 
drivers/block/enbd.c.

> It would be nice to add a bit more accounting. We also need to dynamically
> allocate the nbd driver structures because as they get larger it's less
> efficient to allocate them statically as we currently do. The question is
> then when to free them. I think that probably the disconnect ioctl() could
> provide a suitable hook for that,

Disconnect is actually a bit of a problem, and yes it would be nice to get it
solved.
									Pavel
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.



* Re: Kernel deadlock using nbd over acenic driver
  2002-05-24 10:11     ` Steven Whitehouse
  2002-05-24 11:43       ` Peter T. Breuer
@ 2002-06-01 21:13       ` Peter T. Breuer
  2002-06-05  8:48         ` Steven Whitehouse
  1 sibling, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-06-01 21:13 UTC (permalink / raw)
  To: Steve Whitehouse; +Cc: linux kernel

"Steven Whitehouse wrote:"

(something about kernel nbd)

BTW, are you maintaining kernel nbd? If so, I'd like to propose
some unifications that would make it possible to run either
enbd or nbd daemons on the same driver, at least in a "compatibility
mode".

The starting point would be

1) make the over-the-wire data formats the same, which means
   enlarging kernel nbd's nbd_request and nbd_reply structs
   to match enbd's, or some compromise.

2) less important .. make the driver structs the same. enbd has more
   fields there too, for accounting purposes. That's the nbd_device struct.

Later on one can add some cross-ioctls.
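
For reference, the 2.4-era kernel nbd wire structures that point 1) refers
to look roughly like this (recalled from include/linux/nbd.h, so treat the
exact layout as approximate; the fields travel in network byte order):

#include <linux/types.h>

struct nbd_request {
        __u32   magic;          /* NBD_REQUEST_MAGIC */
        __u32   type;           /* read or write */
        char    handle[8];      /* echoed back in the reply */
        __u64   from;           /* byte offset on the device */
        __u32   len;            /* transfer length in bytes */
} __attribute__ ((packed));

struct nbd_reply {
        __u32   magic;          /* NBD_REPLY_MAGIC */
        __u32   error;          /* 0 on success */
        char    handle[8];      /* handle from the matching request */
};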

Peter


* Re: Kernel deadlock using nbd over acenic driver
  2002-05-29 12:10               ` Peter T. Breuer
@ 2002-05-29 13:24                 ` Jens Axboe
  0 siblings, 0 replies; 53+ messages in thread
From: Jens Axboe @ 2002-05-29 13:24 UTC (permalink / raw)
  To: Peter T. Breuer
  Cc: Pavel Machek, Steve Whitehouse, linux kernel, alan, chen_xiangping

On Wed, May 29 2002, Peter T. Breuer wrote:
> "A month of sundays ago Pavel Machek wrote:"
> > > > Init routine is called from insmod context or at kernel bootup (from pid==1).
> > > 
> > > That's nitpicking!  
> > 
> > I did not want to be nitpicking. init() really is considered process
> 
> Well, OK.
> 
> > context, and it looks to me like unplug is a *blocking* operation so it
> > really needs process context.
> 
> unplug unsets the plugged flag and calls the request function. The
> question is whether the request function is allowed to block. I argue
> that it is not, on several grounds:
> 
>     1) it's also - and principally - been called from various task
>     queues, which aren't really associated with a process context, and
>     certainly not with the process context that set the task

It's called from tq_disk only, which is in process context. So on that
ground let's say that it is at least not technically illegal to block.

>     2) blocking is really bad news depending on how we got to the
>     request function, which is not a really predictable thing, since
>       i) it can change with every kernel version
>       ii) it depends on what somebody else does

I don't agree with that. You get there from an unplug, which happens
from process context as already established. If you get there from other
places, it means that you are calling your request_fn from elsewhere in
your driver (typically recalling request_fn from isr or bottom half to
queue more I/O), and in that case it's your own responsibility.

>    3) if we block against memory for buffers, in particular, the
>    system is now very likely to be dead, since the VM just went
>    synchronous.

Of course that is a tricky area. You shouldn't be doing memory
allocations inside the request_fn, that's just bad design, period.

The one reason why blocking inside the request_fn is bad is that it
prevents the following queues on the tq_disk list from being run. And
subsequent tq_disk runs will not unplug them, since run_task_queue()
clears the list prior to starting.
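
A paraphrase of that run_task_queue() behaviour (not the literal 2.4
code), showing why one blocking routine stalls everything that was
already taken off the list:

#include <linux/tqueue.h>
#include <linux/list.h>

static void run_task_queue_sketch(task_queue *list)
{
        struct list_head head, *next;

        /* Splice the pending entries onto a private list and empty the
         * real queue, so later unplug attempts find nothing to run. */
        list_add(&head, list);
        list_del_init(list);

        next = head.next;
        while (next != &head) {
                struct tq_struct *p = list_entry(next, struct tq_struct, list);

                next = next->next;      /* grab the successor first */
                if (p->routine)
                        p->routine(p->data);    /* if this blocks, the rest wait */
        }
}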

-- 
Jens Axboe



* Re: Kernel deadlock using nbd over acenic driver
  2002-05-29 11:21             ` Pavel Machek
@ 2002-05-29 12:10               ` Peter T. Breuer
  2002-05-29 13:24                 ` Jens Axboe
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-29 12:10 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Peter T. Breuer, Steve Whitehouse, linux kernel, alan, chen_xiangping

"A month of sundays ago Pavel Machek wrote:"
> > > Init routine is called from insmod context or at kernel bootup (from pid==1).
> > 
> > That's nitpicking!  
> 
> I did not want to be nitpicking. init() really is considered process

Well, OK.

> context, and it looks to me like unplug is a *blocking* operation so it
> really needs process context.

unplug unsets the plugged flag and calls the request function. The
question is whether the request function is allowed to block. I argue
that it is not, on several grounds:

    1) it's also - and principally - been called from various task
    queues, which aren't really associated with a process context, and
    certainly not with the process context that set the task

    2) blocking is really bad news depending on how we got to the
    request function, which is not a really predictable thing, since
      i) it can change with every kernel version
      ii) it depends on what somebody else does

   3) if we block against memory for buffers, in particular, the
   system is now very likely to be dead, since the VM just went
   synchronous.

and you probably know lots of better arguments!


Peter


* Re: Kernel deadlock using nbd over acenic driver
  2002-05-29 10:51           ` Peter T. Breuer
@ 2002-05-29 11:21             ` Pavel Machek
  2002-05-29 12:10               ` Peter T. Breuer
  0 siblings, 1 reply; 53+ messages in thread
From: Pavel Machek @ 2002-05-29 11:21 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Steve Whitehouse, linux kernel, alan, chen_xiangping

Hi!

> > > Look in some of the block drivers, floppy.c or loop.c.  These do call
> > > the task queue, even though that's only as an aid to the rest of the
> > > kernel, because they know they can help at that point, and it's not at
> > > all clear what context they're in.  Perhaps it's best to look in
> > > floppy.c, which runs the task queue in its init routine!  I mean to say
> > 
> > Init routine is called from insmod context or at kernel bootup (from pid==1).
> 
> That's nitpicking!  

I did not want to be nitpicking. init() really is considered process
context, and it looks to me like unplug is a *blocking* operation so it
really needs process context.
									Pavel
-- 
(about SSSCA) "I don't say this lightly.  However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa


* Re: Kernel deadlock using nbd over acenic driver
  2002-05-27 13:44         ` Pavel Machek
@ 2002-05-29 10:51           ` Peter T. Breuer
  2002-05-29 11:21             ` Pavel Machek
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-29 10:51 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Peter T. Breuer, Steve Whitehouse, linux kernel, alan, chen_xiangping

"A month of sundays ago Pavel Machek wrote:"
> Hi!
> 
> > Look in some of the block drivers, floppy.c or loop.c.  These do call
> > the task queue, even though that's only as an aid to the rest of the
> > kernel, because they know they can help at that point, and it's not at
> > all clear what context they're in.  Perhaps it's best to look in
> > floppy.c, which runs the task queue in its init routine!  I mean to say
> 
> Init routine is called from insmod context or at kernel bootup (from pid==1).

That's nitpicking!  That the kernel init routine runs after a process is
started is an accident (and I'm not sure it's true, but what the heck
..).  Where does it say that one can rely on this?

My point is that the disk task queue "just happens".  Maybe some days in
the life of the kernel development it happens in a process context,
and maybe some days it doesn't.  Conceptually, I don't see any necessity
that it should or ought to, and even if it does, I don't see why it
should be expected to run in the context of a particular process, let
alone the one you think it should run in, on, by, from ... 

> Both look like process context to me.

And both of the oranges in front of me look orange. Nevertheless, I have
eaten some red oranges.

Peter


* Re: Kernel deadlock using nbd over acenic driver
  2002-05-27 13:04             ` Steven Whitehouse
@ 2002-05-27 19:51               ` Peter T. Breuer
  0 siblings, 0 replies; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-27 19:51 UTC (permalink / raw)
  To: Steve Whitehouse; +Cc: ptb, linux kernel, alan, chen_xiangping

"Steven Whitehouse wrote:"
> > That is again what I do do in ENBD.
> > 
> Ok. Then I think nbd will shortly start to look more like enbd then :-)

I did send Linus a kernel patch for ENBD aimed at 2.5.7. The same patch
works fine in 2.5.12. It's in the distribution package, which should be
easy to find on freshmeat, if anyone is interested.

ENBD has many more "features" than kernel NBD. For one thing, it does
remote ioctls.

> > You end up with some latency, and that's all.
> > 
> Yes, and with the new scheduling code that Ingo wrote recently, that
> should be even less than it used to be I guess.

I was very enthusiastic about the new per-device io locks, but I
haven't noticed much practical benefit. OTOH I'm not doing heavy
testing under 2.5. I intend to live.

> > People are doing large studies of performance over ENBD over
> > various media, with various kernels. I should be able to tell you more
> > one day soon! At the moment I don't see any obvious way-to-go
> > indicators in the results.
> > 
> That sounds very interesting. I'm certainly keen to see the results when
> they are available.

At the moment they show large effects from the VMS on the server side.
It seems that writing to the server drives the server side into a
regime where the disk i/o and the network compete, and they start
pulsing on-off at periods of about 7s. The people at Heidelberg
are getting very good studies out.  The older VMS does not seem
to have this effect, possibly because it uses predictive control
(using the rate of use of buffers as a datum)?

> > I'm not sure that I see the difficulty. Yes, to answer that question,
> > ENBD does use non-blocking sockets, and does run select on them to
> > detect what happens. But that's more or less just so it can detect
> > errors. I don't think there'd be any significant harm accruing from
> > using blocking sockets.
> > 
> I should have really explained that a bit more. I was thinking about another
> bug which I fixed in nbd before which was related to the network buffer
> queue sizes and certain workloads. It was possible to get into a state

That sounds interesting.

> where both client and server were waiting for each other to process
> requests (and hence reduce the outstanding queue length) before they
> would continue.

I have long suspected that if the socket were to wait for fragments
to build up into a packet size, then it could deadlock. So I try 
and use "write at once". But it looks to me as though there are
probably some very subtle deadlock opportunities left.

> If you use blocking sockets in a loop along the lines of:
> 
>  while(1) {
> 	if (request_waiting)
> 		write_a_request(); /* may block until whole request sent */

I think that's the important bit .. you want to send as send can, bit
by bit.

> 	if (receive_queue_size >= sizeof(an nbd header))
> 		read_a_request(); /* may block until whole request read */

Yes, I also am worried about something here, though I'm not sure quite
what. This kind of deadlock is in theory cured by ENBD's "async mode",
which is a mode in which it acks the kernel before receiving an ack
from the net. It's for situations where you trust the net completely.

There's no point in running a separate receive thread here, because the other
side won't reply until it's got everything we sent.

> 	if (nothing_is_happening)
> 		wait_for_something_to_happen();
>  }
> 
> then I think the same problem applies (assuming that the server uses a similar
> loop to the client and that the relative speeds of the client and server are
> such that it allows getting in such a state).

It's not completely clear what the deadlock is over. Buffers, I suppose?
We need buffers to send, but can't free any before receiving.

> After reading the run_task_queue() source, I agree that we shouldn't block
> in the request function. I'm not completely convinced that it's not ok under
> any circumstances - it might be fine when we know it will only be for a
> very short period, but it does seem that there is one reason that we must
> never block in the request function which is to wait for memory.

Somebody like Jens Axboe should be able to say for certain.


Peter


* Re: Kernel deadlock using nbd over acenic driver
  2002-05-24 11:43       ` Peter T. Breuer
  2002-05-24 13:28         ` Steven Whitehouse
@ 2002-05-27 13:44         ` Pavel Machek
  2002-05-29 10:51           ` Peter T. Breuer
  1 sibling, 1 reply; 53+ messages in thread
From: Pavel Machek @ 2002-05-27 13:44 UTC (permalink / raw)
  To: Peter T. Breuer; +Cc: Steve Whitehouse, linux kernel, alan, chen_xiangping

Hi!

> Look in some of the block drivers, floppy.c or loop.c.  These do call
> the task queue, even though that's only as an aid to the rest of the
> kernel, because they know they can help at that point, and it's not at
> all clear what context they're in.  Perhaps it's best to look in
> floppy.c, which runs the task queue in its init routine!  I mean to say

Init routine is called from insmod context or at kernel bootup (from pid==1).

Both look like process context to me.
								Pavel

-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.



* Re: Kernel deadlock using nbd over acenic driver
  2002-05-24 15:54           ` Peter T. Breuer
@ 2002-05-27 13:04             ` Steven Whitehouse
  2002-05-27 19:51               ` Peter T. Breuer
  0 siblings, 1 reply; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-27 13:04 UTC (permalink / raw)
  To: ptb; +Cc: linux kernel, alan, chen_xiangping

Hi,

> 
> "Steven Whitehouse wrote:"
> > (and ptb wrote, at ever greater quoting distance)
> > > > so I'm still hopeful that it can be solved in a reasonably simple way,
> > > If you can manage to reserve memory for a socket, that should be it.
> > 
> I'd like to do that if I can; on the other hand, that's going to be rather
> > a tricky one. We are not going to know which process called run_task_queue()
> 
> We should be able to change the function which registers tasks to also
> record the originating process id, if that's any help.  But it might
> have to be inherited through a lot of places, and it's not therefore a
> point change in the code. Discard that idea.
> 
Yes. Agreed.

> > so that we can't use the local pages list as a hackish way of doing this
> > easily (I looked into that). If we did all sends from one process though
> > it becomes more of a possibility, there are problems with that though....
> 
> I do do that in ENBD.
> 
> > I've wondered before about changing the request function so that it just
> > puts the request on the local queue and wakes up the thread for sending
> 
> That is again what I do do in ENBD.
> 
Ok. Then I think nbd will shortly start to look more like enbd then :-)

> > data. That would solve the blocking problem, but also mean that we had to
> > schedule for every request that comes in which I'd rather avoid for
> > performance reasons (not that I've actually done the experiment to work out
> what we'd lose, but I suspect it would make a difference).
> 
> There's no difference for NBD, essentially, if the kernel is merging
> requests (and it is, by default, in the make request function). Then
> nearly every read and write through the device ends up as several KB in
> size, maybe up to about 32KB (or even 128KB, or 256KB), and that is
> significant time over a network link, so that whatever else happens 
> in the kernel is dominated by the transmission speed over the medium.
> 
> You end up with some latency, and that's all.
> 
Yes, and with the new scheduling code that Ingo wrote recently, that
should be even less than it used to be I guess.

> People are doing large studies of performance over ENBD over
> various media, with various kernels. I should be able to tell you more
> one day soon! At the moment I don't see any obvious way-to-go
> indicators in the results.
> 
That sounds very interesting. I'm certainly keen to see the results when
they are available.

[snip]
> 
> > Also there is the possibility of combining the sending thread with the
> > receiving thread. This has complications because we'd have to poll()
> 
> I do use the same thread for sending and receiving in ENBD. This is
> because the cycle is invariant .. you send a question, and then you
> receive an answer back. Or you send a command, and you get an ack back.
> Both times it's write then read. Separate threads would be possible,
> but are kind of a luxury item.
> 
I think that's what I'd like to do for nbd, having thought things through
a bit more.

> > the socket and be prepared to do non-blocking read or write on it as
> > required. Obviously by no means impossible, but certainly more complicated
> > than the current code.
> 
> I'm not sure that I see the difficulty. Yes, to answer that question,
> ENBD does use non-blocking sockets, and does run select on them to
> detect what happens. But that's more or less just so it can detect
> errors. I don't think there'd be any significant harm accruing from
> using blocking sockets.
> 
I should have really explained that a bit more. I was thinking about another
bug which I fixed in nbd before which was related to the network buffer
queue sizes and certain workloads. It was possible to get into a state
where both client and server were waiting for each other to process
requests (and hence reduce the outstanding queue length) before they
would continue.

If you use blocking sockets in a loop along the lines of:

 while(1) {
	if (request_waiting)
		write_a_request(); /* may block until whole request sent */
	if (receive_queue_size >= sizeof(an nbd header))
		read_a_request(); /* may block until whole request read */
	if (nothing_is_happening)
		wait_for_something_to_happen();
 }

then I think the same problem applies (assuming that the server uses a similar
loop to the client and that the relative speeds of the client and server are
such that it allows getting in such a state).

> > This would prevent blocking in the request function, but I still don't know
> > how we can ensure that there is enough memory available. In some ways I
> 
> This is the key.
> 
Agreed. Though now you've convinced me of the problems involved in blocking
in the request function, I'm going to deal with that first and come back
to the memory management question later.

After reading the run_task_queue() source, I agree that we shouldn't block
in the request function. I'm not completely convinced that it's not ok under
any circumstances - it might be fine when we know it will only be for a
very short period, but it does seem that there is one reason that we must
never block in the request function which is to wait for memory.

> > feel that we ought to be able to make use of the local pages list for the
> > process to (ab)use for this, but if the net stack frees any memory from
> > interrupt context that was allocated in process context, that idea won't
> > work,
> 
> What was my idea .. oh yes, that the tcp stack should get its memory
> from a specific place for each socket, if there is a specific place
> defined. This involves looking at the tcp stack, which I hate ...
> 
> 
> Peter
> 

I'm not so worried about looking at this code. I studied it a great deal
when I was looking for inspiration for the DECnet stack. It's changed since
then, but not so much that I'd have to start from scratch,

Steve.



* Re: Kernel deadlock using nbd over acenic driver
  2002-05-24 13:28         ` Steven Whitehouse
@ 2002-05-24 15:54           ` Peter T. Breuer
  2002-05-27 13:04             ` Steven Whitehouse
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-24 15:54 UTC (permalink / raw)
  To: Steve Whitehouse; +Cc: ptb, linux kernel, alan, chen_xiangping

"Steven Whitehouse wrote:"
> (and ptb wrote, at ever greater quoting distance)
> > > so I'm still hopeful that it can be solved in a reasonably simple way,
> > If you can manage to reserve memory for a socket, that should be it.
> 
> I'd like to do that if I can; on the other hand, that's going to be rather
> a tricky one. We are not going to know which process called run_task_queue()

We should be able to change the function which registers tasks to also
record the originating process id, if that's any help.  But it might
have to be inherited through a lot of places, and it's not therefore a
point change in the code. Discard that idea.

> so that we can't use the local pages list as a hackish way of doing this
> easily (I looked into that). If we did all sends from one process though
> it becomes more of a possibility, there are problems with that though....

I do do that in ENBD.

> I've wondered before about changing the request function so that it just
> puts the request on the local queue and wakes up the thread for sending

That is again what I do do in ENBD.

> data. That would solve the blocking problem, but also mean that we had to
> schedule for every request that comes in which I'd rather avoid for
> performance reasons (not that I've actually done the experiment to work out
> what we'd lose, but I suspect it would make a difference).

There's no difference for NBD, essentially, if the kernel is merging
requests (and it is, by default, in the make request function). Then
nearly every read and write through the device ends up as several KB in
size, maybe up to about 32KB (or even 128KB, or 256KB), and that is
significant time over a network link, so that whatever else happens 
in the kernel is dominated by the transmission speed over the medium.

You end up with some latency, and that's all.

People are doing large studies of performance over ENBD over
various media, with various kernels. I should be able to tell you more
one day soon! At the moment I don't see any obvious way-to-go
indicators in the results.

Some of the latency problems evaporate because people interested in
this sort of thing are usually using SMP machines, and signalling
when there is work to do amounts to letting the kernel work on 
one processor, taking requests off the kernel queue and putting them on
the local queue, and leaving a process blocked on the second cpu,
waiting for the kernel on the other cpu to tell it to go go go.

The ENBD design also has multiple processes doing the networking, so
they tend to pipeline. Obviously, the more CPUs the better. Hey, the
more network cards, the better. Also the 2.5 kernel's abandonment of a
single kernel i/o lock should have helped significantly, but it hasn't
shown up in measurements.

> Also there is the possibility of combining the sending thread with the
> receiving thread. This has complications because we'd have to poll()

I do use the same thread for sending and receiving in ENBD. This is
because the cycle is invariant .. you send a question, and then you
receive an answer back. Or you send a command, and you get an ack back.
Both times it's write then read. Separate threads would be possible,
but are kind of a luxury item.

> the socket and be prepared to do non-blocking read or write on it as
> required. Obviously by no means impossible, but certainly more complicated
> than the current code.

I'm not sure that I see the difficulty. Yes, to answer that question,
ENBD does use non-blocking sockets, and does run select on them to
detect what happens. But that's more or less just so it can detect
errors. I don't think there'd be any significant harm accruing from
using blocking sockets.

> This would prevent blocking in the request function, but I still don't know
> how we can ensure that there is enough memory available. In some ways I

This is the key.

> feel that we ought to be able to make use of the local pages list for the
> process to (ab)use for this, but if the net stack frees any memory from
> interrupt context that was allocated in process context, that idea won't
> work,

What was my idea .. oh yes, that the tcp stack should get its memory
from a specific place for each socket, if there is a specific place
defined. This involves looking at the tcp stack, which I hate ...


Peter


* Re: Kernel deadlock using nbd over acenic driver
  2002-05-24 11:43       ` Peter T. Breuer
@ 2002-05-24 13:28         ` Steven Whitehouse
  2002-05-24 15:54           ` Peter T. Breuer
  2002-05-27 13:44         ` Pavel Machek
  1 sibling, 1 reply; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-24 13:28 UTC (permalink / raw)
  To: ptb; +Cc: linux kernel, alan, chen_xiangping

Hi,

> 
> "Steven Whitehouse wrote:"
> (and ptb wrote, at a bit of quoting distance)
> > > > Assuming that we are still talking kernel nbd here and not enbd, I think
> > > 
> > > Yes.
> > > 
> > > > you've got that backwards. nbd_send_req() is called from do_nbd_request()
> > > > which is the block device request function and can therefore be called
> > > > from any thread running the disk task queue, which I think would normally
> > > 
> > > The disk task queue is called in the i/o context.
> > > 
> > I'm not sure what you mean by that. It runs in the context of whatever called
> > run_task_queue(&tq_disk); which is always some process or other waiting on
> > I/O so far as I can tell.
> 
> I mean that it is called with the i/o spinlock held, and with interrupts
> disabled.  Plus that run_task_queue (on the tq_disk) is NOT generally called
> from a process context.  It's called when the scheduler and his
> brother fred feels it ought to be called.
> 
Ok. I see now.

[snip]
> > > It maybe is or it maybe is not, but it's not in a process context in
> > > any sense that I recognise at that point! The request function for any
> > > block device driver is normally called when the device is unplugged,
> > > which happens exactly when the unplug task comes to be scheduled.
> > > Normally, user requests land somewhere in make request, which will
> > > discover a plugged queue and add the new requests somewhere to the
> > > queue, and go away happy. The driver request function will run sometime
> > > later (when the queue decides to unplug itself).
> > > 
> > By process context, I mean "not in an interrupt and not in a bottom 
> > half/tasklet/softirq context". In other words, it's a context where it's
> > legal to block under the normal rules for blocking processes.
> 
> You may not block while running the device request function.  The kernel
> expects you to return immediately.  The io lock is held, so if you
> block, everything dies.  If you schedule with the lock, everything else
> dies even more horribly, because nobody else is planning on releasing
> the lock, and you're sleeping.
> 
Agreed, that's just the standard rule about not scheduling whilst holding
a spinlock.

> 
> > > Devices are plugged as soon as the queue is drained, and the unplug
> > > task is scheduled for later at that point.
> > > 
> > > > The loop that the ioctl runs only does network receives and thus doesn't
> > > 
> > > The ioctl does both sends and receives. It runs a loop waiting for
> > No it doesn't. The ioctl() only deals with receives from the network. Sends 
> > are all done by the thread which calls the request function for nbd.
> 
> If so, this is a recent change, and it is heap ungood. One should not
It's not that recent. It's been like that since before I started working
on nbd.

> attempt anything that may block in the request function.  If pavel has
> it, it is potential death, death, death, ...  ah, I see, he does have
> it, and is worried:
> 
>                 blkdev_dequeue_request(req);
>                 spin_unlock_irq(&io_request_lock);
> 
>                 down (&lo->queue_lock);
>                 list_add(&req->queue, &lo->queue_head);
>                 nbd_send_req(lo->sock, req);    /* Why does this block?  */
>                 up (&lo->queue_lock);
> 
>                 spin_lock_irq(&io_request_lock);
>                 continue;
> 
> so we see him briefly releasing the io lock and trying to send. Uh
> uh. No goodeee. The kernel is running through various drivers
> at this point, as a result of unplugging them (and calling the request
> function). Blocking one of them will block the rest from even running.
> 
> I apologise. You are right. Nowadays kernel nbd does do the send from
> the request function, instead of from the ioctl. In my opinion that is
> dangerous.
> 
Well I think I've been kind of assuming that it was ok since the lock was
dropped there. In the light of your comments I'm going to go back and
check the assumptions about whether blocking is allowed here to
convince myself of the reasons or otherwise for doing this.

[snip]
> 
> > If we get to that state, we are well and truely stuck so we need to avoid it
> > at all costs.
> 
> There is no avoiding it except by reserving memory for the socket.
> 
Agreed. Or alternatively limiting the amount of non-freeable memory assigned to
other uses (page cache etc.).

> > > > server (we can tell from the socket queue lengths) and we know that we
> > > > can still ping clients which are otherwise dead due to the deadlock. I
> > > > don't think that at the moment there is any problem on the receive side.
> > > 
> > > If so, it's because the implementation uses the receive buffer as a
> > stack :-). Nothing else would account for it.
> > > 
> > I'm not sure quite what you are getting at here ...
> 
> That the tcp receive buffer used to accumulate and order fragments is exactly
> the final destination.
> 
Ok. I see now.

[snip]
> 
> > I think it would tend to hide the problem for some people and I'd rather not
> > do that until we have a general solution.
> > 
> > I'm going to try and come up with some more ideas to test in the next few days
> > so I'm still hopeful that it can be solved in a reasonably simple way,
> 
> If you can manage to reserve memory for a socket, that should be it.
> 
> 
> Peter
> 

I'd like to do that if I can; on the other hand, that's going to be rather
a tricky one. We are not going to know which process called run_task_queue()
so that we can't use the local pages list as a hackish way of doing this
easily (I looked into that). If we did all sends from one process though
it becomes more of a possibility, there are problems with that though....

I've wondered before about changing the request function so that it just
puts the request on the local queue and wakes up the thread for sending
data. That would solve the blocking problem, but also mean that we had to
schedule for every request that comes in which I'd rather avoid for
performance reasons (not that I've actually done the experiment to work out
what we'd lose, but I suspect it would make a difference).

Also there is the possibility of combining the sending thread with the
receiving thread. This has complications because we'd have to poll()
the socket and be prepared to do non-blocking read or write on it as
required. Obviously by no means impossible, but certainly more complicated
than the current code.

This would prevent blocking in the request function, but I still don't know
how we can ensure that there is enough memory available. In some ways I
feel that we ought to be able to make use of the local pages list for the
process to (ab)use for this, but if the net stack frees any memory from
interrupt context that was allocated in process context, that idea won't
work,

Steve.



* Re: Kernel deadlock using nbd over acenic driver
  2002-05-24 10:11     ` Steven Whitehouse
@ 2002-05-24 11:43       ` Peter T. Breuer
  2002-05-24 13:28         ` Steven Whitehouse
  2002-05-27 13:44         ` Pavel Machek
  2002-06-01 21:13       ` Peter T. Breuer
  1 sibling, 2 replies; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-24 11:43 UTC (permalink / raw)
  To: Steve Whitehouse; +Cc: ptb, linux kernel, alan, chen_xiangping

"Steven Whitehouse wrote:"
(and ptb wrote, at a bit of quoting distance)
> > > Assuming that we are still talking kernel nbd here and not enbd, I think
> > 
> > Yes.
> > 
> > > you've got that backwards. nbd_send_req() is called from do_nbd_request()
> > > which is the block device request function and can therefore be called
> > > from any thread running the disk task queue, which I think would normally
> > 
> > The disk task queue is called in the i/o context.
> > 
> I'm not sure what you mean by that. It runs in the context of whatever called
> run_task_queue(&tq_disk); which is always some process or other waiting on
> I/O so far as I can tell.

I mean that it is called with the i/o spinlock held, and with interrupts
disabled.  Plus that run_task_queue (on the tq_disk) is NOT generally called
from a process context.  It's called when the scheduler and his
brother fred feels it ought to be called.

Look in some of the block drivers, floppy.c or loop.c.  These do call
the task queue, even though that's only as an aid to the rest of the
kernel, because they know they can help at that point, and it's not at
all clear what context they're in.  Perhaps it's best to look in
floppy.c, which runs the task queue in its init routine!  I mean to say
that conceptually, there is no process associated with running tq_disk.
It happens when it happens.

> > > mean that its a thread waiting for I/O as in buffer.c:__wait_on_buffer()

Wait_on_buffer is merely responsible for scheduling itself out so that
somebody else can free up a few buffers if we get stuck without any in
make request. It makes some minor attempts to help, but it might as
well just wait as anything else (ok, it would be slower).

> > It maybe is or it maybe is not, but it's not in a process context in
> > any sense that I recognise at that point! The request function for any
> > block device driver is normally called when the device is unplugged,
> > which happens exactly when the unplug task comes to be scheduled.
> > Normally, user requests land somewhere in make request, which will
> > discover a plugged queue and add the new requests somewhere to the
> > queue, and go away happy. The driver request function will run sometime
> > later (when the queue decides to unplug itself).
> > 
> By process context, I mean "not in an interrupt and not in a bottom 
> half/tasklet/softirq context". In other words, it's a context where it's
> legal to block under the normal rules for blocking processes.

You may not block while running the device request function.  The kernel
expects you to return immediately.  The io lock is held, so if you
block, everything dies.  If you schedule with the lock, everything else
dies even more horribly, because nobody else is planning on releasing
the lock, and you're sleeping.


> > Devices are plugged as soon as the queue is drained, and the unplug
> > task is scheduled for later at that point.
> > 
> > > The loop that the ioctl runs only does network receives and thus doesn't
> > 
> > The ioctl does both sends and receives. It runs a loop waiting for
> No it doesn't. The ioctl() only deals with receives from the network. Sends 
> are all done by the thread which calls the request function for nbd.

If so, this is a recent change, and it is heap ungood. One should not
attempt anything that may block in the request function.  If pavel has
it, it is potential death, death, death, ...  ah, I see, he does have
it, and is worried:

                blkdev_dequeue_request(req);
                spin_unlock_irq(&io_request_lock);

                down (&lo->queue_lock);
                list_add(&req->queue, &lo->queue_head);
                nbd_send_req(lo->sock, req);    /* Why does this block?  */
                up (&lo->queue_lock);

                spin_lock_irq(&io_request_lock);
                continue;

so we see him briefly releasing the io lock and trying to send. Uh
uh. No goodeee. The kernel is running through various drivers
at this point, as a result of unplugging them (and calling the request
function). Blocking one of them will block the rest from even running.

I apologise. You are right. Nowadays kernel nbd does do the send from
the request function, instead of from the ioctl. In my opinion that is
dangerous.

> > > do any allocations of any kind itself. The only worry on the receive side
> > > is that buffers are not available in the network device driver, but this
> > > doesn't seem to be a problem. There are no backed up replies in the
> > 
> > Well, I imagine it is a fact. Yes, we can starve tcp of receive buffers
> > too. But that doesn't matter, does it? Nobody will die from not being
> > able to read stuff for a little while? Userspace will block a bit ...
> > and yes, maybe the net could time out.
> > 
> > I think we could starve the net to death while nbd is trying to read
> > from it, if tcp can't get buffers because of i/o competition.
> > 
> Since I wrote this I've gained more evidence. Reports suggest that if we mark
> the send thread PF_MEMALLOC during the writeout, we land up completely
> starving everything else of memory in low memory situations. If we run out

Well, that's fine. We need it more than anybody else.

> of buffers for network receives we are equally stuck because no acknowledgements
> get through resulting in no buffers ever being marked as finished with I/O.

Yes.

> If we get to that state, we are well and truly stuck so we need to avoid it
> at all costs.

There is no avoiding it except by reserving memory for the socket.

> > > server (we can tell from the socket queue lengths) and we know that we
> > > can still ping clients which are otherwise dead due to the deadlock. I
> > > don't think that at the moment there is any problem on the receive side.
> > 
> > If so, it's because the implementation uses the receive buffer as a
> > stack :-). Nothing else would account for it.
> > 
> I'm not sure quite what you are getting at here ...

That the tcp receive buffer used to accumulate and order fragments is exactly
the final destination.

> > > > So I think that your PF_MEMALLOC idea does revert the inversion.
> > > > 
> > > > Would it also be good to prevent other processes running? or is it too
> > > > late. Yes, I think it is too late to do any good, by the time we feel
> > > > this pressure.
> > 
> > > The mechanism works fine for block devices which do not need to allocate
> > > memory in their write out paths. Since we know there is a maximum amount
> > > of memory required by nbd and bounded by the maximum request size plus the
> > > small header per request, it would seem reasonable that to avoid deadlock
> > > we simply need to raise the amount of memory reserved for low memory
> > > situations until we've provided what nbd needs,
> > 
> > That is precisely max_sectors.
> > 
> Well there is extra overhead in allocating the skbs etc., but broadly yes, it
> is bounded by the amount of I/O that we want to do.
> 
> I do still have my zerocopy patch which should help greatly in the writeout
> path for those with suitable hardware, but I've been keeping it back because

Should do indeed.

> I think it would tend to hide the problem for some people and I'd rather not
> do that until we have a general solution.
> 
> I'm going to try and come up with some more ideas to test in the next few days
> so I'm still hopeful that it can be solved in a reasonably simple way,

If you can manage to reserve memory for a socket, that should be it.


Peter

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
  2002-05-23 13:21   ` Peter T. Breuer
@ 2002-05-24 10:11     ` Steven Whitehouse
  2002-05-24 11:43       ` Peter T. Breuer
  2002-06-01 21:13       ` Peter T. Breuer
  0 siblings, 2 replies; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-24 10:11 UTC (permalink / raw)
  To: ptb; +Cc: linux kernel, alan, chen_xiangping

Hi,

> 
> Sorry .. I didn't see this earlier. I was on a trip. Just to clear up a
> couple of things ...
> 
That's ok. I'm only working on this problem part time anyway between various
other tasks.

> "A month of sundays ago Steven Whitehouse wrote:"
> > > Sorry I didn't pick this up earlier ..
> > > "Steven Whitehouse wrote:"
> > > > we don't want to alter that. The "priority inversion" that I mentioned occurs
> > > > when you get processes without PF_MEMALLOC set calling nbd_send_req() as when
> > > 
> > > There aren't any processes that call nbd_send_req except the unique
> > > nbd client process stuck in the protocol loop in the kernel ioctl
> > > that it entered at startup.
> > > 
> > Assuming that we are still talking kernel nbd here and not enbd, I think
> 
> Yes.
> 
> > you've got that backwards. nbd_send_req() is called from do_nbd_request()
> > which is the block device request function and can therefore be called
> > from any thread running the disk task queue, which I think would normally
> 
> The disk task queue is called in the i/o context.
> 
I'm not sure what you mean by that. It runs in the context of whatever called
run_task_queue(&tq_disk); which is always some process or other waiting on
I/O so far as I can tell.

> > mean that its a thread waiting for I/O as in buffer.c:__wait_on_buffer()
> 
> It maybe is or it maybe is not, but it's not in a process context in
> any sense that I recognise at that point! The request function for any
> block device driver is normally called when the device is unplugged,
> which happens exactly when the unplug task comes to be scheduled.
> Normally, user requests land somewhere in make request, which will
> discover a plugged queue and add the new requests somewhere to the
> queue, and go away happy. The driver request function will run sometime
> later (when the queue decides to unplug itself).
> 
By process context, I mean "not in an interrupt and not in a bottom 
half/tasklet/softirq context". In other words, it's a context where it's
legal to block under the normal rules for blocking processes.

> Devices are plugged as soon as the queue is drained, and the unplug
> task is scheduled for later at that point.
> 
> > The loop that the ioctl runs only does network receives and thus doesn't
> 
> The ioctl does both sends and receives. It runs a loop waiting for
No it doesn't. The ioctl() only deals with receives from the network. Sends 
are all done by the thread which calls the request function for nbd.

[snip]
> 
> > do any allocations of any kind itself. The only worry on the receive side
> > is that buffers are not available in the network device driver, but this
> > doesn't seem to be a problem. There are no backed up replies in the
> 
> Well, I imagine it is a fact. Yes, we can starve tcp of receive buffers
> too. But that doesn't matter, does it? Nobody will die from not being
> able to read stuff for a little while? Userspace will block a bit ...
> and yes, maybe the net could time out.
> 
> I think we could starve the net to death while nbd is trying to read
> from it, if tcp can't get buffers because of i/o competition.
> 
Since I wrote this I've gained more evidence. Reports suggest that if we mark
the send thread PF_MEMALLOC during the writeout, we land up completely
starving everything else of memory in low memory situations. If we run out
of buffers for network receives we are equally stuck because no acknowledgements
get through resulting in no buffers ever being marked as finished with I/O.
If we get to that state, we are well and truly stuck so we need to avoid it
at all costs.

> > server (we can tell from the socket queue lengths) and we know that we
> > can still ping clients which are otherwise dead due to the deadlock. I
> > don't think that at the moment there is any problem on the receive side.
> 
> If so, it's because the implementation uses the receive buffer as a
> stack :-). Nothing else would account for it.
> 
I'm not sure quite what you are getting at here ...

> > > So I think that your PF_MEMALLOC idea does revert the inversion.
> > > 
> > > Would it also be good to prevent other processes running? or is it too
> > > late. Yes, I think it is too late to do any good, by the time we feel
> > > this pressure.
> 
> > The mechanism works fine for block devices which do not need to allocate
> > memory in their write out paths. Since we know there is a maximum amount
> > of memory required by nbd and bounded by the maximum request size plus the
> > small header per request, it would seem reasonable that to avoid deadlock
> > we simply need to raise the amount of memory reserved for low memory
> > situations until we've provided what nbd needs,
> 
> That is precisely max_sectors.
> 
> Peter
> 
Well there is extra overhead in allocating the skbs etc., but broadly yes, it
is bounded by the amount of I/O that we want to do.

I do still have my zerocopy patch which should help greatly in the writeout
path for those with suitable hardware, but I've been keeping it back because
I think it would tend to hide the problem for some people and I'd rather not
do that until we have a general solution.

I'm going to try and come up with some more ideas to test in the next few days
so I'm still hopeful that it can be solved in a reasonably simple way,

Steve.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
  2002-05-17  8:44 ` Steven Whitehouse
@ 2002-05-23 13:21   ` Peter T. Breuer
  2002-05-24 10:11     ` Steven Whitehouse
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-23 13:21 UTC (permalink / raw)
  To: Steve Whitehouse; +Cc: ptb, linux kernel, alan, chen_xiangping

Sorry .. I didn't see this earlier. I was on a trip. Just to clear up a
couple of things ...

"A month of sundays ago Steven Whitehouse wrote:"
> > Sorry I didn't pick this up earlier ..
> > "Steven Whitehouse wrote:"
> > > we don't want to alter that. The "priority inversion" that I mentioned occurs
> > > when you get processes without PF_MEMALLOC set calling nbd_send_req() as when
> > 
> > There aren't any processes that call nbd_send_req except the unique
> > nbd client process stuck in the protocol loop in the kernel ioctl
> > that it entered at startup.
> > 
> Assuming that we are still talking kernel nbd here and not enbd, I think

Yes.

> you've got that backwards. nbd_send_req() is called from do_nbd_request()
> which is the block device request function and can therefore be called
> from any thread running the disk task queue, which I think would normally

The disk task queue is called in the i/o context.

> mean that its a thread waiting for I/O as in buffer.c:__wait_on_buffer()

It maybe is or it maybe is not, but it's not in a process context in
any sense that I recognise at that point! The request function for any
block device driver is normally called when the device is unplugged,
which happens exactly when the unplug task comes to be scheduled.
Normally, user requests land somewhere in make request, which will
discover a plugged queue and add the new requests somewhere to the
queue, and go away happy. The driver request function will run sometime
later (when the queue decides to unplug itself).

Devices are plugged as soon as the queue is drained, and the unplug
task is scheduled for later at that point.

> The loop that the ioctl runs only does network receives and thus doesn't

The ioctl does both sends and receives. It runs a loop waiting for
requests to appear on the internal device queue (the request function
puts them there, after taking them off the kernel queue). When
one appears it sends it over the net. If it's a write, it also sends
data. Then it waits for the ack from the net. If the request is a read,
the ack will also be followed by data. That's the cycle.
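
In outline (just a sketch of that cycle, with made-up helper names, not
the actual driver code):

	for (;;) {
		req = wait_for_request(lo);	/* taken off the internal queue */
		send_header(sock, req);
		if (rq_data_dir(req) == WRITE)
			send_data(sock, req);		/* write: header then data */
		ack = receive_ack(sock);		/* wait for the ack from the net */
		if (rq_data_dir(req) == READ)
			receive_data(sock, req);	/* read: ack followed by data */
		nbd_end_request(req);
	}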

> do any allocations of any kind itself. The only worry on the receive side
> is that buffers are not available in the network device driver, but this
> doesn't seem to be a problem. There are no backed up replies in the

Well, I imagine it is a fact. Yes, we can starve tcp of receive buffers
too. But that doesn't matter, does it? Nobody will die from not being
able to read stuff for a little while? Userspace will block a bit ...
and yes, maybe the net could time out.

I think we could starve the net to death while nbd is trying to read
from it, if tcp can't get buffers because of i/o competition.

> server (we can tell from the socket queue lengths) and we know that we
> can still ping clients which are otherwise dead due to the deadlock. I
> don't think that at the moment there is any problem on the receive side.

If so, it's because the implementation uses the receive buffer as a
stack :-). Nothing else would account for it.

> > So I think that your PF_MEMALLOC idea does revert the inversion.
> > 
> > Would it also be good to prevent other processes running? or is it too
> > late. Yes, I think it is too late to do any good, by the time we feel
> > this pressure.

> The mechanism works fine for block devices which do not need to allocate
> memory in their write out paths. Since we know there is a maximum amount
> of memory required by nbd and bounded by the maximum request size plus the
> small header per request, it would seem reasonable that to avoid deadlock
> we simply need to raise the amount of memory reserved for low memory
> situations until we've provided what nbd needs,

That is precisely max_sectors.

Peter

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
  2002-05-16 22:54 Peter T. Breuer
@ 2002-05-17  8:44 ` Steven Whitehouse
  2002-05-23 13:21   ` Peter T. Breuer
  0 siblings, 1 reply; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-17  8:44 UTC (permalink / raw)
  To: ptb; +Cc: linux kernel, alan, chen_xiangping

Hi,

> 
> Sorry I didn't pick this up earlier ..
> 
> "Steven Whitehouse wrote:"
> > we don't want to alter that. The "priority inversion" that I mentioned occurs
> > when you get processes without PF_MEMALLOC set calling nbd_send_req() as when
> 
> There aren't any processes that call nbd_send_req except the unique
> nbd client process stuck in the protocol loop in the kernel ioctl
> that it entered at startup.
> 
Assuming that we are still talking kernel nbd here and not enbd, I think
you've got that backwards. nbd_send_req() is called from do_nbd_request()
which is the block device request function and can therefore be called
from any thread running the disk task queue, which I think would normally
mean that it's a thread waiting for I/O as in buffer.c:__wait_on_buffer()

The loop that the ioctl runs only does network receives and thus doesn't
do any allocations of any kind itself. The only worry on the receive side
is that buffers are not available in the network device driver, but this
doesn't seem to be a problem. There are no backed up replies in the
server (we can tell from the socket queue lengths) and we know that we
can still ping clients which are otherwise dead due to the deadlock. I
don't think that at the moment there is any problem on the receive side.

> > they call through to page_alloc.c:__alloc_pages() they won't use any memory
> > once the free pages hits the min mark even though there is memory available
> > (see the code just before and after the rebalance label).
> 
> So I think the exact inversion you envisage cannot happen, but ...
> 
> I think that the problem is that the nbd-client process doesn't have
> high memory priority, and high priority processes can scream and holler
> all they like and will claim more memory, but won't make anythung better
> because the nbd process can't run (can't get tcp buffers), and so
> can't release the memory pressure.
> 
> So I think that your PF_MEMALLOC idea does revert the inversion.
> 
> Would it also be good to prevent other processes running? or is it too
> late. Yes, I think it is too late to do any good, by the time we feel
> this pressure.
> 
> Peter
> 
The mechanism works fine for block devices which do not need to allocate
memory in their write out paths. Since we know the maximum amount
of memory required by nbd is bounded by the maximum request size plus the
small header per request, it would seem reasonable that to avoid deadlock
we simply need to raise the amount of memory reserved for low memory
situations until we've provided what nbd needs,

Steve.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
@ 2002-05-16 22:54 Peter T. Breuer
  2002-05-17  8:44 ` Steven Whitehouse
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-16 22:54 UTC (permalink / raw)
  To: linux kernel; +Cc: Steve Whitehouse, alan, chen_xiangping

Sorry I didn't pick this up earlier ..

"Steven Whitehouse wrote:"
> we don't want to alter that. The "priority inversion" that I mentioned occurs
> when you get processes without PF_MEMALLOC set calling nbd_send_req() as when

There aren't any processes that call nbd_send_req except the unique
nbd client process stuck in the protocol loop in the kernel ioctl
that it entered at startup.

> they call through to page_alloc.c:__alloc_pages() they won't use any memory
> once the free pages hits the min mark even though there is memory available
> (see the code just before and after the rebalance label).

So I think the exact inversion you envisage cannot happen, but ...

I think that the problem is that the nbd-client process doesn't have
high memory priority, and high priority processes can scream and holler
all they like and will claim more memory, but won't make anything better
because the nbd process can't run (can't get tcp buffers), and so
can't release the memory pressure.

So I think that your PF_MEMALLOC idea does revert the inversion.

Would it also be good to prevent other processes running? or is it too
late. Yes, I think it is too late to do any good, by the time we feel
this pressure.

Peter

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
       [not found] <3CE40A77.22C74DC1@zip.com.au>
@ 2002-05-16 20:28 ` Peter T. Breuer
  0 siblings, 0 replies; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-16 20:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Steve Whitehouse, linux kernel

"A month of sundays ago Andrew Morton wrote:"
> Steven Whitehouse wrote:
> > It would be nice to have a per device "max dirty pages" limit. Also useful
> > would be a per queue priority so that if the max dirty pages limit is reached
> > for that device, then the driver gets higher priority on memory allocations
> > until the number of dirty pages has dropped below an acceptable level. I
> > don't know how easy or desirable it would be to implement such a scheme
> > generally though.

I think that is one possible mechanism, yes.  What we need is for the VM
system to "act more intelligently".  I've given up on trying to get VM
info and throttling the nbd device using it, because the lockup doesn't
involve nbd, and would be made worse by slowing it down.  The lockup is
purely a VM phenonemen - It tries to flush buffers aimed at nbd.  but
won't give the nbd process any tcp buffers to do the flushing with, and
thus blocks itself.

> I'd expect a lot of these problems would be lessened by tweaking
> the fractional settings in /proc/sys/vm/bdflush.  Get it to write
> data earlier.  nfract, ndirty, nfract_sync and nfract_stop_bdflush.

This is part of the standard setup in Enbd, at least - the client
daemons twiddle these settings to at least 15/85% on startup.  It
doesn't help the reported tcp/vm deadlock, though it makes the occasions
on which it happens more "abnormal" than "usual".  Unfortunately those
abnormal conditions are reached under memory pressure while nbd is
running - one simply has to get tcp competing for buffers with other
heavy i/o.  If the i/o is directed at nbd itself, you have a deadlock.

Setting PF_MEMALLOC on the networking process seems to help a lot.
Obviously it can't help if we are competing against "nothing" instead of
another process.  I.e.  when we are doing i/o exclusively to nbd.  (e.g.
swap over nbd).

> Also, test 2.4.18 and 2.4.19-pre8.  There were significant
> bdflush changes in -pre6.

I'm willing to look, but you don't make it sound intrinsically
likely.

And setting vm/bdflush affects everything, and presumably unoptimizes
the settings for everything else on the system. This is core
kernel hackers territory .. somebody must be able to do something
here!

Peter

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Kernel deadlock using nbd over acenic driver.
@ 2002-05-16 13:18 chen, xiangping
  0 siblings, 0 replies; 53+ messages in thread
From: chen, xiangping @ 2002-05-16 13:18 UTC (permalink / raw)
  To: 'Oliver Xymoron'
  Cc: 'Jes Sorensen', 'Steve Whitehouse',
	linux-kernel, 'Alan Cox'

I can imagine iSCSI projects having similar problems. But how can
NBD reserve memory for packets, when the sk_buffs are allocated in
the network layer? How can NBD pass reserved memory to the network
layer? Unless there is some kind of zero-copy networking scheme
allowing data in the buffer cache to be used directly as network buffers,
which would probably relieve the network layer of the pain of allocating big sk_buffs.

xiangping

-----Original Message-----
From: Oliver Xymoron [mailto:oxymoron@waste.org]
Sent: Wednesday, May 15, 2002 6:32 PM
To: chen, xiangping
Cc: 'Jes Sorensen'; 'Steve Whitehouse'; linux-kernel@vger.kernel.org
Subject: RE: Kernel deadlock using nbd over acenic driver.


On Tue, 14 May 2002, chen, xiangping wrote:

> But how to avoid system hangs due to running out of memory?
> Is there a safe guide line? Generally slow is tolerable, but
> crash is not.

If the system runs out of memory, it may try to flush pages that are
queued to your NBD device. That will try to allocate more memory for
sending packets, which will fail, meaning the VM can never make progress
freeing pages. Now your box is dead.

The only way to deal with this is to have a scheme for per-socket memory
reservations in the network layer and have NBD reserve memory for sending
and acknowledging packets. NFS and iSCSI also need this, though it's a
bit harder to tickle for NFS. SCSI has DMA reserved memory for analogous
reasons.

-- 
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
  2002-05-16  8:04     ` Steven Whitehouse
@ 2002-05-16  8:49       ` Peter T. Breuer
  0 siblings, 0 replies; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-16  8:49 UTC (permalink / raw)
  To: Steve Whitehouse; +Cc: ptb, alan, chen_xiangping, linux kernel

"A month of sundays ago Steven Whitehouse wrote:"
> > I don't see any reason to introduce a second flag to say when a flag
> > has been set .. Initial reports are that symptoms go away when
> > 
> >     current->flags |= PF_MEMALLOC;
> > 
> > is set in the process about to do networking (and unset afterwards).
> 
> The reason for adding the second flag is that I suspect that nbd_send_req()
> can be called by processes which already have PF_MEMALLOC set, in which case

If we are talking about kernel nbd, then the send_req is called by
the unique process which entered the kernel ages ago, and has been stuck
in the kernel doing its protocol loop ever since.  At least, if things
are still the way they were the last time I looked!  Pavel originally
used the client daemons to get a socket, then pass it down into the
kernel, and then they stayed there pumping data into and out of the
socket. They get stuck there in the first place when they call a special
ioctl. No user process in its right mind would call that ioctl, and 
couldn't, unless it were root.

(It's one of the reasons why I wrote ENBD (a long time ago!). Those
processes are now just specialized kernel threads, in effect, and can't
be talked into doing anything different .. like dying, for example.)

> we don't want to alter that. The "priority inversion" that I mentioned occurs
> when you get processes without PF_MEMALLOC set calling nbd_send_req() as when
> they call through to page_alloc.c:__alloc_pages() they won't use any memory
> once the free pages hits the min mark even though there is memory available
> (see the code just before and after the rebalance label).

There is a possible inversion, but I'm not sure it's that. I assume
the problem is that bdflush has high priority, and thus preempts the
nbd processes either from running or from gaining tcp buffers, yet
bdflush needs nbd to run and get buffers in order to flush the requests
through the net.

> Once one process has started sleeping waiting for memory in nbd_send_req()
> that's it, since tx_lock prevents any further writeouts until the sleeping
> process has completed. Unfortunately this has to be the case in order to
> ensure that nbd's requests are sent atomically.

I'm not following you closely here, but in Enbd I don't attempt to send
atomically and I don't believe it's necessary.  The process in Enbd is
an ordinary user process doing ordinary user networking.  It shovels
data to and fro from the kernel via ioctls.  Yes I believe it _does_
stick in networking, when it sticks.  I do have the socket set to "write
at once" instead of accumulating data to make up whole packets, but
that's all.

What's more, in Enbd, the user process can be set to run "async", and it
still locks when swapping over nbd.  This means that it acks the kernel
_before_ sending to the net, so if it's blocked, it's not blocked
against its own request!  It must be blocked against lack of buffers in
general.  It's locked in memory so it can't lack for pages.

> So rather than reserve memory specifically for sockets, in effect the
> min free pages for each zone place a limit on what "normal" allocations
> may use as a maximum. This is fine provided allocations in the write out
> path are not "normal" as well, but able to use whatever they need. At first
> I thought "we only need to set PF_MEMALLOC if we are writing" but in fact
> we have to set it for reads too so that reads don't block writes I think.

I set it for both reads and writes, and the one report I have back said
"that cures it".

However, I don't believe it necessary for reading. I haven't thought
about it. It's just silly!

> There is a difference though between preventing the deadlock and adjusting
> the system so that we get the maximum performance, so it will be interesting
> to see whether we ought to adjust the min free pages figure in order to
> get higher performance, or whether its ok as it is.

Tell me how to, and I'll do it. I can tell you exactly how much min is
required for Enbd, per socket. But how can I make sure that only the
nbd process with the socket can claim that memory? Actually .. that
sounds a lot easier than "how can a socket reserve memory".

> I'm not sure yet that the PF_MEMALLOC change I described actually fixes the
> problem either, although it should make things a lot better. Thats

Yes. I agree that it probably does not avoid the situation where
buffers have completely disappeared, and we need to send via tcp in
order to free some of them.

> something else for further investigation.


Peter

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
  2002-05-15 21:43 Peter T. Breuer
@ 2002-05-16  8:33 ` Steven Whitehouse
  0 siblings, 0 replies; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-16  8:33 UTC (permalink / raw)
  To: ptb; +Cc: alan, chen_xiangping, kumbera, linux kernel

Hi,

> 
> (Addresses got munged locally, but as I'm postmaster, I get the mail
> after 26 bounces, so no hassle ...)
> 
Ok. I was wondering after the bounce message that I got :-)

> Let's see if I follow ...
> 
> > thanks for the info. I'm starting to form some ideas of what the problem
> > with nbd might be. Here is my initial idea of what might be going on:
> > 
> >  1. Something happens which starts lots of I/O (e.g. the ext3/xfs journal
> >     flush that Xiangping says usually triggers the problem)
> 
> Is this any kind of i/o? Such as swapping? You mean something which
> takes the i/o lock, or which generally exercises the i/o system .. And
> are there any particular characteristics to the "a lot" that you have
> in mind, such as maybe running us out of requests on that device (no), or 
> running us low on available free buffers (yes?).
> 
Probably swapping would trigger it too, though that's the "difficult" case
so I've been ignoring that one up till now :-)

> >  2. One of the threads doing the writes blocks running the device I/O
> 
> If a thread blocks running its own device i/o queue, that would be 
> a fatal error for most of the kernel. The i/o lock is held. And -
> correct me on this - interrupts are disabled?
> 
> So I assume you are talking about "a bug in something, somewhere".
> 
No. The kernel nbd drops the io request lock before it does anything
which might block, so it's ok from that point of view. I suspect that
we'll find that it's not a bug in one particular bit of code but two
subsystems which are making assumptions about how the other works, which
whilst being perfectly reasonable on their own conflict in a way which
causes the deadlock we see.

[snip]
> 
> >     only need to have each memory zones free pages just below pages_min
> >     at the right time to trigger this.
> 
> I don't understand the specific allusion, but I gather you are talking
> about low free pages. Yes, being run out of memory matches the reports.
> Particularly the people who are swapping over nbd under memory pressure
> are in that situation.
> 
> So - is that situation handled differently in the old VM?
> 
I'm not enough of an expert on the changes that have gone on to answer
that one, the VM isn't really my area of the kernel.

I think I've answered the other points that you make in my other reply
which I sent a few moments ago, let me know if I missed something.

It would be nice to have a per device "max dirty pages" limit. Also useful 
would be a per queue priority so that if the max dirty pages limit is reached 
for that device, then the driver gets higher priority on memory allocations 
until the number of dirty pages has dropped below an acceptable level. I
don't know how easy or desirable it would be to implement such a scheme
generally though.

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
  2002-05-16  5:15   ` Peter T. Breuer
@ 2002-05-16  8:04     ` Steven Whitehouse
  2002-05-16  8:49       ` Peter T. Breuer
  0 siblings, 1 reply; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-16  8:04 UTC (permalink / raw)
  To: ptb; +Cc: alan, chen_xiangping, linux kernel

Hi,

[snip]
> 
> I don't see any reason to introduce a second flag to say when a flag
> has been set .. Initial reports are that symptoms go away when
> 
>     current->flags |= PF_MEMALLOC;
> 
> is set in the process about to do networking (and unset afterwards).
> 
> There will be more news later today. I believe that this will remove
> deadlock against VM for tcp buffers, but I don't believe it will 
> stop deadlocks against "nothing", when we simply are out of buffers.
> The only thing that can do that is reserved memory for the socket.
> Any pointers?
> 
> Peter
> 

The reason for adding the second flag is that I suspect that nbd_send_req()
can be called by processes which already have PF_MEMALLOC set, in which case
we don't want to alter that. The "priority inversion" that I mentioned occurs
when you get processes without PF_MEMALLOC set calling nbd_send_req() as when
they call through to page_alloc.c:__alloc_pages() they won't use any memory
once the free pages hits the min mark even though there is memory available
(see the code just before and after the rebalance label).
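
Roughly, the logic around that label looks like this (paraphrased from
memory as a fragment, not verbatim 2.4 source):

rebalance:
	if (current->flags & (PF_MEMALLOC | PF_MEMDIE)) {
		/* only these tasks may dig below the pages_min watermark */
		zone = zonelist->zones;
		for (;;) {
			zone_t *z = *(zone++);
			if (!z)
				break;
			page = rmqueue(z, order);
			if (page)
				return page;
		}
		return NULL;
	}
	/* everyone else has to wait for memory to be freed ... */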

Once one process has started sleeping waiting for memory in nbd_send_req()
that's it, since tx_lock prevents any further writeouts until the sleeping
process has completed. Unfortunately this has to be the case in order to
ensure that nbd's requests are sent atomically.

So rather than reserve memory specifically for sockets, in effect the
min free pages for each zone place a limit on what "normal" allocations
may use as a maximum. This is fine provided allocations in the write out
path are not "normal" as well, but able to use whatever they need. At first
I thought "we only need to set PF_MEMALLOC if we are writing" but in fact
we have to set it for reads too so that reads don't block writes I think.

There is a difference though between preventing the deadlock and adjusting
the system so that we get the maximum performance, so it will be interesting
to see whether we ought to adjust the min free pages figure in order to
get higher performance, or whether its ok as it is.

I'm not sure yet that the PF_MEMALLOC change I described actually fixes the
problem either, although it should make things a lot better. Thats
something else for further investigation.

Steve.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
  2002-05-15 19:43 ` Steven Whitehouse
@ 2002-05-16  5:15   ` Peter T. Breuer
  2002-05-16  8:04     ` Steven Whitehouse
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-16  5:15 UTC (permalink / raw)
  To: Steve Whitehouse; +Cc: alan, chen_xiangping, linux kernel

"A month of sundays ago Steven Whitehouse wrote:"
> So something to try is this, in nbd_send_req() add the lines:
> 
> 	if ((current->flags & PF_MEMALLOC) == 0) {
> 		current->flags |= PF_MEMALLOC;
> 		we_set_memalloc = 1;
> 	}
> 
> before the first nbd_xmit() call and
> 
> 	if (we_set_memalloc)
> 		current->flags &= ~PF_MEMALLOC;
> 
> at the end just before the return; remembering to declare the variable:

I don't see any reason to introduce a second flag to say when a flag
has been set .. Initial reports are that symptoms go away when

    current->flags |= PF_MEMALLOC;

is set in the process about to do networking (and unset afterwards).

There will be more news later today. I believe that this will remove
deadlock against VM for tcp buffers, but I don't believe it will 
stop deadlocks against "nothing", when we simply are out of buffers.
The only thing that can do that is reserved memory for the socket.
Any pointers?

Peter

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
@ 2002-05-15 21:43 Peter T. Breuer
  2002-05-16  8:33 ` Steven Whitehouse
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-15 21:43 UTC (permalink / raw)
  To: steve; +Cc: alan, chen_xiangping, kumbera, linux kernel

(Addresses got munged locally, but as I'm postmaster, I get the mail
after 26 bounces, so no hassle ...)

Let's see if I follow ...

> thanks for the info. I'm starting to form some ideas of what the problem
> with nbd might be. Here is my initial idea of what might be going on:
> 
>  1. Something happens which starts lots of I/O (e.g. the ext3/xfs journal
>     flush that Xiangping says usually triggers the problem)

Is this any kind of i/o? Such as swapping? You mean something which
takes the i/o lock, or which generally exercises the i/o system .. And
are there any particular characteristics to the "a lot" that you have
in mind, such as maybe running us out of requests on that device (no), or 
running us low on available free buffers (yes?).

>  2. One of the threads doing the writes blocks running the device I/O

If a thread blocks running its own device i/o queue, that would be 
a fatal error for most of the kernel. The i/o lock is held. And -
correct me on this - interrupts are disabled?

So I assume you are talking about "a bug in something, somewhere".


>     queue and causing nbd_send_req(), nbd_xmit() to block in the 
>     sendmsg() call (trying to allocate memory GFP_NOIO). I think we

Well,  I do the networking in userspace in ENBD, but it is still
going to cause a sendmsg() to happen. If that sendmsg is blocked, then
the client daemon is blocked, and the kernel will time out the daemon,
and roll back the requests it let it have .. and that IS what is
observed.

Yes - a blocked userspace daemon is what I believe to be observed.
Blocked in networking matches what I have heard.

>     only need to have each memory zones free pages just below pages_min
>     at the right time to trigger this.

I don't understand the specific allusion, but I gather you are talking
about low free pages. Yes, being run out of memory matches the reports.
Particularly the people who are swapping over nbd under memory pressure
are in that situation.

So - is that situation handled differently in the old VM?

>  3. Since bdflush will most likely be running it waits for the dirty
>     blocks its submitted to finish being written back to the
>     nbd device to finish.

Umm ... well, that's bdflush for you! As far as I recall bdflush
tries in various ways to get rid of dirty stuff?
Let's suppose it's in sync mode, and see where we get ..

> So something to try is this, in nbd_send_req() add the lines:
> 
> 	if ((current->flags & PF_MEMALLOC) == 0) {
> 		current->flags |= PF_MEMALLOC;
> 		we_set_memalloc = 1;
> 	}

Uh, I can't try that directly, because my networking is done from
userspace in ENBD.  But I abstract from that that you want us to
notice that PF_MEMALLOC is not set, and set it, and remember when
we set it. Kernel nbd does this in the context of the process that is
stuck in kernel forever, but I can do it whenever the process
enters the kernel to pick up a request.

> before the first nbd_xmit() call and
> 
> 	if (we_set_memalloc)
> 		current->flags &= ~PF_MEMALLOC;
> 
> at the end just before the return; remembering to declare the variable:

You want to invert the change after having sent the networking data out.

Well, I think this raises the priority for getting memory of the process
doing the networking. Yep. I can do that. It's a real userspace process
for me, but it's in a tight loop doing the protocol, and nothing else.

> int we_set_memalloc = 0;

Why don't we do it always? Surely what you are saying is that a process
doing networking needs priority for memory? If so, the right place to
do this is in send_msg, not in nbd, as it's a generic problem, just
one that happens to be exposed by nbd.

> at the top of the function. We know that since the box stays responsive to
> pings that there must be some free memory, so I suspect some kind of
> "priority inversion" is at work here.

OK ... that is a high priority process waiting on a low priority
process, and thereby keeping it locked out.

> Another interesting idea... if we changed the icmp receive function so that
> it leaked all echo request packets rather than recycling them we could
> measure the free memory after the box has hung by seeing how many times
> we can ping it before it stops replying. An interesting way to
> measure free memory, but probably quite effective :-)

That is not a bad test. I've asked if anybody can get mem stats out via
magic key combos, but never had any comeback on that.

> It looks like adding the line:
> 
> 	if (icmph->type == ICMP_ECHO) return;
> 
> just after (icmp_pointers[icmph->type].handler)(skb); in icmp.c:icmp_rcv()
> should do the trick in this case.

I think it would be easier to get stats out in a sysreq sort of way!

> Those are my current thoughts. Let me know if you are sure that some of
> what I've said is right/wrong, otherwise I'll have a go tomorrow at

Back in 2.2 times, there was a network memory priority problem "solved"
in 2.2.17. I was never convinced by the solution. It addressed the
network competing for memory against other claimants, and arranged that
the network won. But I did not see that that stopped one running out
of memory, and then the network blocking against _nothing_.

The only strategy that will work is to reserve some memory for the net,
and this process's networking in particular! How much? Well, as big
as a max_sectors for the device, at any rate.


> trying to prove some of it on my test system here,

Peter

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
  2002-05-15 17:43 Peter T. Breuer
@ 2002-05-15 19:43 ` Steven Whitehouse
  2002-05-16  5:15   ` Peter T. Breuer
  0 siblings, 1 reply; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-15 19:43 UTC (permalink / raw)
  To: guk.ukuu.org.uk; +Cc: alan, chen_xiangping, kumbera, linux kernel

Hi,

thanks for the info. I'm starting to form some ideas of what the problem
with nbd might be. Here is my initial idea of what might be going on:

 1. Something happens which starts lots of I/O (e.g. the ext3/xfs journal
    flush that Xiangping says usually triggers the problem)
 2. One of the threads doing the writes blocks running the device I/O
    queue and causing nbd_send_req(), nbd_xmit() to block in the 
    sendmsg() call (trying to allocate memory GFP_NOIO). I think we
    only need to have each memory zones free pages just below pages_min
    at the right time to trigger this.
 3. Since bdflush will most likely be running it waits for the dirty
    blocks its submitted to finish being written back to the
    nbd device to finish.

So something to try is this, in nbd_send_req() add the lines:

	if ((current->flags & PF_MEMALLOC) == 0) {
		current->flags |= PF_MEMALLOC;
		we_set_memalloc = 1;
	}

before the first nbd_xmit() call and

	if (we_set_memalloc)
		current->flags &= ~PF_MEMALLOC;

at the end just before the return; remembering to declare the variable:

int we_set_memalloc = 0;

at the top of the function. We know that since the box stays responsive to
pings that there must be some free memory, so I suspect some kind of
"priority inversion" is at work here.

Another interesting idea... if we changed the icmp receive function so that
it leaked all echo request packets rather than recycling them we could
measure the free memory after the box has hung by seeing how many times
we can ping it before it stops replying. An interesting way to
measure free memory, but probably quite effective :-)

It looks like adding the line:

	if (icmph->type == ICMP_ECHO) return;

just after (icmp_pointers[icmph->type].handler)(skb); in icmp.c:icmp_rcv()
should do the trick in this case.

Those are my current thoughts. Let me know if you are sure that some of
what I've said is right/wrong, otherwise I'll have a go tomorrow at
trying to prove some of it on my test system here,

Steve.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver
@ 2002-05-15 17:43 Peter T. Breuer
  2002-05-15 19:43 ` Steven Whitehouse
  0 siblings, 1 reply; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-15 17:43 UTC (permalink / raw)
  To: steve, alan, chen_xiangping, kumbera
  Cc: linux kernel, steve, alan, chen_xiangping, kumbera

"A month of sundays ago ptb wrote:"
> There are also several studies being made from collaborators in
> Heidelberg which show qualitative differences between new VM and old VM
> behaviour on the _server_ side.  Basically, put an old VM on the server,
> and push data to it with VM, and you get something like a steady
> 16.5MB/s.  Put a new VM in and you get pulsed behaviour.  Maybe 18.5MB/s
> tops, dropping to nothing, then picking up again, at maybe 7s intervals.

I'll just let you know (in secret) one of the graphs that Arne sent me
today ...

  http://web.kip.uni-heidelberg.de/~wiebalck/results/plots/disk_net_GbE.pdf

I don't know the details of this experiment, but it's a push of data
over the Giga-net through ENBD to a server with a modern VM. I think
the timing on the server graph is slightly shifted .. I believe it's
been shown in other experiments that the server disk i/o blocks first,
and then the network backs up afterwards.



Peter

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Kernel deadlock using nbd over acenic driver
@ 2002-05-15 16:01 Peter T. Breuer
  0 siblings, 0 replies; 53+ messages in thread
From: Peter T. Breuer @ 2002-05-15 16:01 UTC (permalink / raw)
  To: linux kernel; +Cc: steve, alan, chen_xiangping, kumbera


Steven Whitehouse wrote:
> I don't think the system has "run out" exactly, more just got itself into
> a state where the code path writing out dirty blocks has been blocked
> due to lack of freeable memory at that moment and where the process
> freeing up memory has blocked waiting for the nbd device. It may well
> be that there is freeable memory, just that for whatever reason no
> process is trying to free it.

I'll provide some datapoints vis-a-vis the "Enhanced" NBD (ENBD, see
freshmeat). There have been sporadic reports of lock up with ENBD
too. Lately I have had a good report in which the reporter was
swapping over ENBD (it _is_ possible) and where the kernel locked
almost immediately under kernel 2.4.18, and the thing ran perfectly
under kernel 2.4.9. That coincides with my general feeling that a
lockup of VM against tcp is possible with the current VM.


The precise report is:

(Michael Kumbera)
) I'm trying to get swap working with nbd on a 2.4.18
) kernel (redhat 7.3) and my machine hangs when I start
) to swap.
) 
) This works great if a boot to a 2.4.9 kernel so I 
) assume it's an interaction with the "new" VM system. 
) Does anyone know if I can tune /proc/sys/vm/* to make
) it more stable?
) 
) I have tried to set the bdflush_nfract_sync to 100%
) and the bdflush_nfract to 0% and still hang almost
) instantly.
) 
) Any ideas? Does anyone have nbd swap working on a
) 2.4.10 (or greater) kernel?

and

) > You ARE using "-s", yes?
) 
) Yes, I'm using the "-s" options.  I just tried the -s -a and no change.
) As soon as the machine starts to really swap it hangs.
) The system uses the network heavily for about 3-5 seconds and then
) nothing.

and

) My client machine is a Dual CPU Athlon 1.6GHz and I'm
) running on a 100M connection to a P4 server. 

His test is:

) > The program malloc's out 1G of memory 64M at a time.  After each
) > malloc
) > I memset the allocated block to zero. I picked 1G since that is my
) > physical memory size. (I create a 512M swap partition with enbd)
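
So the test program is presumably something along these lines (my
reconstruction of what he describes, not his actual code):

	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	#define CHUNK (64UL << 20)		/* 64M at a time */
	#define TOTAL (1024UL << 20)		/* 1G, the physical memory size */

	int main(void)
	{
		unsigned long done;

		for (done = 0; done < TOTAL; done += CHUNK) {
			char *p = malloc(CHUNK);
			if (!p)
				break;
			memset(p, 0, CHUNK);	/* dirty every page, forcing swap */
		}
		pause();			/* hold the memory while the box swaps */
		return 0;
	}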

Now, the primary difference (in this area) between ENBD and NBD is that
ENBD does its networking in userspace, so that it can use other
transport media, and do more sophisticated reconnections.  The client
daemons call mlockall() to lock themselves in memory when the -s flag is
given.  They transfer data to and from the kernel via ioctls on the nbd
device. When -a is given, they "trust the net" and ack the kernel
before getting an ack back from the net, so a kernel request may
be released before the corresponding network packets have been made,
sent out, and received back, which avoids one potential deadlock if
tcp can deadlock.

The report says that neither of these strategies (together) is
effective in kernel 2.4.18, and it works fine in kernel 2.4.9.

I have also found that many potential problems in 2.4.18 are solved if
one lowers the VM async buffers trigger point so that buffers start to
go to the net immediately, and if one increases the VM sync buffer point
so that the system never goes synchronous.  I guess that when the kernel
hits the sync point, that all buffer i/o goes synchronous, and we cannot
use the net to send out the requests and get back the ack that would
free the nbd requests that are creating the pressure.  Anecdotal
evidence suggests that if you supply more real ram, this kind of
deadlock ceases to be even a possibility.

Can one change the VM parameters on a per-device basis? Or at least
monitor them?

There are also several studies being made from collaborators in
Heidelberg which show qualitative differences between new VM and old VM
behaviour on the _server_ side.  Basically, put an old VM on the server,
and push data to it with VM, and you get something like a steady
16.5MB/s.  Put a new VM in and you get pulsed behaviour.  Maybe 18.5MB/s
tops, dropping to nothing, then picking up again, at maybe 7s intervals.

To my control-theoretic trained eyes, that looks like two control
systems locked together, one of them at least doing stop-start control,
with hysteresis.  The old VM, by contrast, looks as though it's using
predictive control (i.e.  the equations contain a differential term,
which would be "acceleration" in this setting) and thus damping the
resonance.  The new VMs look to be resonating when linked by nbd.

I suspect that the new VM has got rid of some slow-throttling mechanisms
that previously avoided the possibility of deadlocking against tcp.
But I'm at a loss to explain why swap is showing up as being
particularly sensitive. Maybe it's simply treated differently by the
new VM.

Anyway, if you have any comments, please let me see them!

Should Andrea or Rik be in on this?





Peter


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-14 17:36 chen, xiangping
@ 2002-05-14 18:02 ` Alan Cox
  0 siblings, 0 replies; 53+ messages in thread
From: Alan Cox @ 2002-05-14 18:02 UTC (permalink / raw)
  To: chen, xiangping; +Cc: 'Alan Cox', jes, Steve, linux-kernel

> But ... It seems that there is no direct way to adjust the tcp max 
> window size in Linux kernel.

setsockopt SO_SNDBUF and SO_RCVBUF - same as all Unix and unixlike boxes
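
e.g. (a sketch, error checking omitted):

	int bufsz = 262144;

	setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz));
	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz));

capped by the /proc/sys/net/core/[rw]mem_max values.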

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Kernel deadlock using nbd over acenic driver.
@ 2002-05-14 17:42 chen, xiangping
  0 siblings, 0 replies; 53+ messages in thread
From: chen, xiangping @ 2002-05-14 17:42 UTC (permalink / raw)
  To: 'Steve Whitehouse'; +Cc: jes, linux-kernel

Yes. I am testing a single nbd device, thus a single socket in this case.
There are no other heavy networking tasks on the testing machine.


-----Original Message-----
From: Steven Whitehouse [mailto:steve@gw.chygwyn.com]
Sent: Tuesday, May 14, 2002 12:32 PM
To: chen, xiangping
Cc: jes@wildopensource.com; linux-kernel@vger.kernel.org
Subject: Re: Kernel deadlock using nbd over acenic driver.


Hi,

The TCP stack should auto-tune the amount of memory that it uses, so that
SO_SNDBUF, cat >/proc/sys/net/core/[rw]mem_default etc. is not required. The
important settings for TCP sockets are only /proc/sys/net/ipv4/tcp_[rw]mem
and tcp_mem I think (at least if I've understood the code correctly).

Since I think we are talking about only a single nbd device, there should
only be a single socket thats doing lots of I/O in this case, or is this
machine doing other heavy network tasks ?
> 
> But how to avoid system hangs due to running out of memory?
> Is there a safe guide line? Generally slow is tolerable, but
> crash is not.
> 
I agree. I also think your earlier comments about the buffer flushing are
correct as being the most likely cause.

I don't think the system has "run out" exactly, more just got itself into
a state where the code path writing out dirty blocks has been blocked
due to lack of freeable memory at that moment and where the process
freeing up memory has blocked waiting for the nbd device. It may well
be that there is freeable memory, just that for whatever reason no
process is trying to free it.

The LVM team has had a similar problem in dealing with I/O which needs
extra memory in order to complete, so I'll ask them for some ideas. Also
I'm going to try and come up with some patches to eliminate some of the
possible theories so that we can narrow down the options,

Steve

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Kernel deadlock using nbd over acenic driver.
@ 2002-05-14 17:36 chen, xiangping
  2002-05-14 18:02 ` Alan Cox
  0 siblings, 1 reply; 53+ messages in thread
From: chen, xiangping @ 2002-05-14 17:36 UTC (permalink / raw)
  To: 'Alan Cox'; +Cc: jes, Steve, linux-kernel

But ... It seems that there is no direct way to adjust the tcp max 
window size in Linux kernel.

-----Original Message-----
From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk]
Sent: Tuesday, May 14, 2002 12:48 PM
To: chen, xiangping
Cc: jes@wildopensource.com; Steve@ChyGwyn.com;
linux-kernel@vger.kernel.org
Subject: Re: Kernel deadlock using nbd over acenic driver.


> Xiangping> So for gigabit Ethernet driver, what is the optimal mem
> Xiangping> configuration for performance and reliability?
> 
> It depends on your application, number of streams, general usage of
> the connection etc. There's no perfect-for-all magic number.

The primary constraints are

	TCP max window size
	TCP congestion window size (cwnd)
	Latency

Most of the good discussion on this matter can be found in the IETF
archives from the window scaling options work, and in part in the RFCs
that led to
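
(An editorial sketch, not part of the mail: the usual back-of-the-envelope
check ties the first and third constraints together. The window must cover
at least the bandwidth-delay product or the link cannot be kept full. The
link speed and latency below are assumptions chosen only for illustration.)

#include <stdio.h>

int main(void)
{
	double bandwidth_bits = 1e9;	/* assume a 1 Gbit/s link          */
	double rtt_seconds    = 0.001;	/* assume 1 ms round-trip latency  */

	/* Bytes in flight needed to keep the pipe full. */
	double bdp_bytes = bandwidth_bits / 8.0 * rtt_seconds;

	printf("bandwidth-delay product: %.0f bytes\n", bdp_bytes);	/* ~125000 */
	return 0;
}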

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-14 15:05 chen, xiangping
@ 2002-05-14 15:11 ` Jes Sorensen
  0 siblings, 0 replies; 53+ messages in thread
From: Jes Sorensen @ 2002-05-14 15:11 UTC (permalink / raw)
  To: chen, xiangping; +Cc: 'Steve Whitehouse', linux-kernel

>>>>> "Xiangping" == chen, xiangping <chen_xiangping@emc.com> writes:

Xiangping> But the acenic driver author suggested that sndbuf should
Xiangping> be at least 262144, and the sndbuf can not exceed
Xiangping> r/wmem_default. Is that correct?

Ehm, the acenic author is me ;-)

The default value is what all sockets are assigned on open, you can
adjust this using SO_SNDBUF and SO_RCVBUF, however the values you set
cannot exceed the [rw]mem_max values. Basically if you set the default
to 4MB, your telnet sockets will have a 4MB default limit as well
which may not be what you want (not saying it will use 4MB).

Thus, set the _max values and use SO_SNDBUF and SO_RCVBUF to set the
per process values. But leave the _default values to their original
setting.

Xiangping> So for gigabit Ethernet driver, what is the optimal mem
Xiangping> configuration for performance and reliability?

It depends on your application, number of streams, general usage of
the connection etc. There's no perfect-for-all magic number.

Jes

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Kernel deadlock using nbd over acenic driver.
@ 2002-05-14 15:05 chen, xiangping
  2002-05-14 15:11 ` Jes Sorensen
  0 siblings, 1 reply; 53+ messages in thread
From: chen, xiangping @ 2002-05-14 15:05 UTC (permalink / raw)
  To: 'Jes Sorensen'; +Cc: 'Steve Whitehouse', linux-kernel

But the acenic driver author suggested that sndbuf should be at least
262144, and the sndbuf can not exceed r/wmem_default. Is that correct?

So for gigabit Ethernet driver, what is the optimal mem configuration
for performance and reliability?

Thanks,

Xiangping

-----Original Message-----
From: Jes Sorensen [mailto:jes@wildopensource.com]
Sent: Tuesday, May 14, 2002 10:59 AM
To: chen, xiangping
Cc: 'Steve Whitehouse'; linux-kernel@vger.kernel.org
Subject: Re: Kernel deadlock using nbd over acenic driver.


>>>>> "xiangping" == chen, xiangping <chen_xiangping@emc.com> writes:

xiangping> Hi. When the system gets stuck, I cannot get any response
xiangping> from the console or terminal.  But basically the only
xiangping> network connections on both machines are the nbd connection
xiangping> and a couple of telnet sessions. That is what shows up in
xiangping> "netstat -t".

xiangping> /proc/sys/net/ipv4/tcp_[rw]mem are "4096 262144 4096000",
xiangping> /proc/sys/net/core/*mem_default are 4096000,
xiangping> /proc/sys/net/core/*mem_max are 8192000, I did not change
xiangping> /proc/sys/net/ipv4/tcp_mem.

Don't do this; setting the [rw]mem_default values that high is just
insane. Do it in the applications that need it and nowhere else.

xiangping> The system was low on memory; I started up 20 to 40 threads
xiangping> to do block writes simultaneously.

If you have a lot of outstanding connections and active threads, it's
not unlikely you run out of memory if each socket eats 4MB.

Jes

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-10 15:02 chen, xiangping
  2002-05-10 15:11 ` Steven Whitehouse
@ 2002-05-14 14:58 ` Jes Sorensen
  1 sibling, 0 replies; 53+ messages in thread
From: Jes Sorensen @ 2002-05-14 14:58 UTC (permalink / raw)
  To: chen, xiangping; +Cc: 'Steve Whitehouse', linux-kernel

>>>>> "xiangping" == chen, xiangping <chen_xiangping@emc.com> writes:

xiangping> Hi. When the system gets stuck, I cannot get any response
xiangping> from the console or terminal.  But basically the only
xiangping> network connections on both machines are the nbd connection
xiangping> and a couple of telnet sessions. That is what shows up in
xiangping> "netstat -t".

xiangping> /proc/sys/net/ipv4/tcp_[rw]mem are "4096 262144 4096000",
xiangping> /proc/sys/net/core/*mem_default are 4096000,
xiangping> /proc/sys/net/core/*mem_max are 8192000, I did not change
xiangping> /proc/sys/net/ipv4/tcp_mem.

Don't do this; setting the [rw]mem_default values that high is just
insane. Do it in the applications that need it and nowhere else.

xiangping> The system was low on memory; I started up 20 to 40 threads
xiangping> to do block writes simultaneously.

If you have a lot of outstanding connections and active threads, it's
not unlikely you run out of memory if each socket eats 4MB.

Jes
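
(An editorial aside, not from the archived mail: the arithmetic behind that
warning. The socket count and buffer sizes below are assumptions matching
the scenario described in the thread, not measured values.)

#include <stdio.h>

int main(void)
{
	long sockets      = 40;			/* assume one socket per writer thread */
	long per_sock_buf = 4L * 1024 * 1024;	/* 4 MB _default on each socket        */

	/* Worst case: send and receive buffers both filled on every socket. */
	long worst_case = sockets * per_sock_buf * 2;

	printf("worst-case socket buffer memory: %ld MB\n",
	       worst_case / (1024 * 1024));	/* ~320 MB */
	return 0;
}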

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Kernel deadlock using nbd over acenic driver.
@ 2002-05-10 15:39 chen, xiangping
  0 siblings, 0 replies; 53+ messages in thread
From: chen, xiangping @ 2002-05-10 15:39 UTC (permalink / raw)
  To: 'Steve Whitehouse'; +Cc: linux-kernel

The hang happens quickly after I start the test when using EXT3 or XFS;
it rarely happens when I use an EXT2 filesystem. So I guess the behavior
is related to the filesystem's buffer-flush pattern.


xiangping

-----Original Message-----
From: Steven Whitehouse [mailto:steve@gw.chygwyn.com]
Sent: Friday, May 10, 2002 11:11 AM
To: chen, xiangping
Cc: linux-kernel@vger.kernel.org
Subject: Re: Kernel deadlock using nbd over acenic driver.


Hi,

> 
> Hi,
> 
[deadlock conditions snipped]
> 
> The nbd_client gets stuck in sock_recvmsg, and one other process gets
> stuck in do_nbd_request (sock_sendmsg). I will try to use kdb to give you
> more of a footprint.
> 
Anything extra you can send me like that will be very helpful.

> The system was low on memory; I started up 20 to 40 threads to do block
> writes simultaneously.
> 
Ok. I'll have to try and set something similar up because I've not seen
any hangs with the latest nbd in 2.4 at all. Do you find that the hangs
happen relatively quickly after you start the I/O or is it something
which takes some time ?

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-10 15:02 chen, xiangping
@ 2002-05-10 15:11 ` Steven Whitehouse
  2002-05-14 14:58 ` Jes Sorensen
  1 sibling, 0 replies; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-10 15:11 UTC (permalink / raw)
  To: chen, xiangping; +Cc: linux-kernel

Hi,

> 
> Hi,
> 
[deadlock conditions snipped]
> 
> The nbd_client gets stuck in sock_recvmsg, and one other process gets
> stuck in do_nbd_request (sock_sendmsg). I will try to use kdb to give you
> more of a footprint.
> 
Anything extra you can send me like that will be very helpful.

> The system was low on memory; I started up 20 to 40 threads to do block
> writes simultaneously.
> 
Ok. I'll have to try and set something similar up because I've not seen
any hangs with the latest nbd in 2.4 at all. Do you find that the hangs
happen relatively quickly after you start the I/O or is it something
which takes some time ?

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Kernel deadlock using nbd over acenic driver.
@ 2002-05-10 15:02 chen, xiangping
  2002-05-10 15:11 ` Steven Whitehouse
  2002-05-14 14:58 ` Jes Sorensen
  0 siblings, 2 replies; 53+ messages in thread
From: chen, xiangping @ 2002-05-10 15:02 UTC (permalink / raw)
  To: 'Steve Whitehouse'; +Cc: linux-kernel

Hi,

When the system gets stuck, I cannot get any response from the console or
terminal. But basically the only network connections on both machines are
the nbd connection and a couple of telnet sessions. That is what shows up
in "netstat -t".

/proc/sys/net/ipv4/tcp_[rw]mem are "4096  262144 4096000",
/proc/sys/net/core/*mem_default are 4096000, 
/proc/sys/net/core/*mem_max   are 8192000,
I did not change /proc/sys/net/ipv4/tcp_mem.

The nbd_client gets stuck in sock_recvmsg, and one other process gets stuck
in do_nbd_request (sock_sendmsg). I will try to use kdb to give you
more of a footprint.

The system was low on memory; I started up 20 to 40 threads to do block
writes simultaneously.

The nbd device was not used as swap device.

Thanks,

Xiangping

-----Original Message-----
From: Steven Whitehouse [mailto:steve@gw.chygwyn.com]
Sent: Tuesday, May 07, 2002 4:16 AM
To: chen, xiangping
Cc: linux-kernel@vger.kernel.org
Subject: Re: Kernel deadlock using nbd over acenic driver.


Hi,

I suggest trying 2.4.19-pre8 first. This has the fix for the deadlock that
I'm
aware of in it. If that still doesn't work, then try and send me as much
information as the system will let you extract. What I'm most interested
in is:

 o State of the sockets (netstat -t on both client and server)
 o Values of /proc/sys/net/ipv4/tcp_[rw]mem and tcp_mem
 o Does the nbd client get stuck in the D state before or after any other
   processes doing I/O through nbd? This is useful as it tells me whether
   the problem is on a transmit or receive.
 o Was your system low on memory at the time ?
 o Were you trying to use nbd as a swap device ?

Steve.
 
> 
> Hi,
> 
> I am using 2.4.16 with the XFS patch from SGI. It may not be an acenic
> driver problem; I can reproduce the deadlock on a 100Base-T network
> using the eepro100 driver. Closing the server did not release the deadlock.
> What else can I try?
> 
> 
[original messages cut here]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-06 15:05 chen, xiangping
@ 2002-05-07  8:15 ` Steven Whitehouse
  0 siblings, 0 replies; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-07  8:15 UTC (permalink / raw)
  To: chen, xiangping; +Cc: linux-kernel

Hi,

I suggest trying 2.4.19-pre8 first. This has the fix for the deadlock that I'm
aware of in it. If that still doesn't work, then try and send me as much
information as the system will let you extract. What I'm most interested
in is:

 o State of the sockets (netstat -t on both client and server)
 o Values of /proc/sys/net/ipv4/tcp_[rw]mem and tcp_mem
 o Does the nbd client get stuck in the D state before or after any other
   processes doing I/O through nbd? This is useful as it tells me whether
   the problem is on a transmit or receive.
 o Was your system low on memory at the time ?
 o Were you trying to use nbd as a swap device ?

Steve.
 
> 
> Hi,
> 
> I am using 2.4.16 with the XFS patch from SGI. It may not be an acenic
> driver problem; I can reproduce the deadlock on a 100Base-T network
> using the eepro100 driver. Closing the server did not release the deadlock.
> What else can I try?
> 
> 
[original messages cut here]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Kernel deadlock using nbd over acenic driver.
@ 2002-05-06 15:05 chen, xiangping
  2002-05-07  8:15 ` Steven Whitehouse
  0 siblings, 1 reply; 53+ messages in thread
From: chen, xiangping @ 2002-05-06 15:05 UTC (permalink / raw)
  To: 'Steve Whitehouse'; +Cc: linux-kernel

Hi,

I am using 2.4.16 with the XFS patch from SGI. It may not be an acenic
driver problem; I can reproduce the deadlock on a 100Base-T network
using the eepro100 driver. Closing the server did not release the deadlock.
What else can I try?


-----Original Message-----
From: Steven Whitehouse [mailto:steve@gw.chygwyn.com]
Sent: Monday, May 06, 2002 4:46 AM
To: chen, xiangping
Cc: linux-kernel@vger.kernel.org
Subject: Re: Kernel deadlock using nbd over acenic driver.


Hi,

What kernel version are you using ? I suspect that its not the ethernet
driver causing this deadlock. Am I right in thinking that if you kill the
nbd server process that the hanging process is released ?

Steve.

> 
> Hi,
> 
> I encounter a deadlock situation when using an nbd device over gigabit
> ethernet. The network card is a 3Com 3C985 gigabit card using the acenic
> driver. When the network has some significant background traffic, even
> making an ext2 filesystem cannot succeed. When the deadlock happens, the
> nbd client daemon is just stuck in tcp_recvmsg() without receiving any
> data, and the sender threads continue to send out requests until the
> whole system hangs. Even though I set SNDTIMEO on the nbd client daemon's
> socket, it could not exit from tcp_recvmsg().
> 
> Is there any known problem with the acenic driver? How can I tell whether
> it is a problem in the NIC driver or somewhere else?
> 
> Thanks for help!
> 
> 
> Xiangping Chen 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Kernel deadlock using nbd over acenic driver.
  2002-05-06  2:26 chen, xiangping
@ 2002-05-06  8:45 ` Steven Whitehouse
  0 siblings, 0 replies; 53+ messages in thread
From: Steven Whitehouse @ 2002-05-06  8:45 UTC (permalink / raw)
  To: chen, xiangping; +Cc: 'linux-kernel@vger.kernel.org'

Hi,

What kernel version are you using ? I suspect that its not the ethernet
driver causing this deadlock. Am I right in thinking that if you kill the
nbd server process that the hanging process is released ?

Steve.

> 
> Hi,
> 
> I encounter a deadlock situation when using an nbd device over gigabit
> ethernet. The network card is a 3Com 3C985 gigabit card using the acenic
> driver. When the network has some significant background traffic, even
> making an ext2 filesystem cannot succeed. When the deadlock happens, the
> nbd client daemon is just stuck in tcp_recvmsg() without receiving any
> data, and the sender threads continue to send out requests until the
> whole system hangs. Even though I set SNDTIMEO on the nbd client daemon's
> socket, it could not exit from tcp_recvmsg().
> 
> Is there any known problem with the acenic driver? How can I tell whether
> it is a problem in the NIC driver or somewhere else?
> 
> Thanks for help!
> 
> 
> Xiangping Chen 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Kernel deadlock using nbd over acenic driver.
@ 2002-05-06  2:26 chen, xiangping
  2002-05-06  8:45 ` Steven Whitehouse
  0 siblings, 1 reply; 53+ messages in thread
From: chen, xiangping @ 2002-05-06  2:26 UTC (permalink / raw)
  To: 'linux-kernel@vger.kernel.org'

Hi,

I encounter a deadlock situation when using an nbd device over gigabit
ethernet. The network card is a 3Com 3C985 gigabit card using the acenic
driver. When the network has some significant background traffic, even
making an ext2 filesystem cannot succeed. When the deadlock happens, the
nbd client daemon is just stuck in tcp_recvmsg() without receiving any
data, and the sender threads continue to send out requests until the
whole system hangs. Even though I set SNDTIMEO on the nbd client daemon's
socket, it could not exit from tcp_recvmsg().

Is there any known problem with the acenic driver? How can I tell whether
it is a problem in the NIC driver or somewhere else?

Thanks for help!


Xiangping Chen 
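
(An editorial sketch, not from the archived mail: how SO_SNDTIMEO and
SO_RCVTIMEO are normally set from user space. Whether the in-kernel nbd
paths honour a timeout set this way is exactly what is in question in this
thread, so treat it as background only; the helper name is made up.)

#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>

static int set_socket_timeouts(int fd, long seconds)
{
	struct timeval tv;

	tv.tv_sec  = seconds;
	tv.tv_usec = 0;

	if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0 ||
	    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
		perror("setsockopt timeout");
		return -1;
	}
	return 0;
}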

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2002-06-06 18:28 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-05-14 16:07 Kernel deadlock using nbd over acenic driver chen, xiangping
2002-05-14 16:32 ` Steven Whitehouse
2002-05-14 16:48 ` Alan Cox
2002-05-15 22:31 ` Oliver Xymoron
2002-05-16  5:10   ` Peter T. Breuer
2002-05-16  5:19     ` Peter T. Breuer
2002-05-16 14:29       ` Oliver Xymoron
2002-05-16 15:35         ` Peter T. Breuer
2002-05-16 16:22           ` Oliver Xymoron
2002-05-16 16:45             ` Peter T. Breuer
2002-05-16 16:35               ` Steven Whitehouse
2002-05-17  7:01                 ` Peter T. Breuer
2002-05-17  9:26                   ` Steven Whitehouse
  -- strict thread matches above, loose matches on Subject: below --
2002-05-16 22:54 Peter T. Breuer
2002-05-17  8:44 ` Steven Whitehouse
2002-05-23 13:21   ` Peter T. Breuer
2002-05-24 10:11     ` Steven Whitehouse
2002-05-24 11:43       ` Peter T. Breuer
2002-05-24 13:28         ` Steven Whitehouse
2002-05-24 15:54           ` Peter T. Breuer
2002-05-27 13:04             ` Steven Whitehouse
2002-05-27 19:51               ` Peter T. Breuer
2002-05-27 13:44         ` Pavel Machek
2002-05-29 10:51           ` Peter T. Breuer
2002-05-29 11:21             ` Pavel Machek
2002-05-29 12:10               ` Peter T. Breuer
2002-05-29 13:24                 ` Jens Axboe
2002-06-01 21:13       ` Peter T. Breuer
2002-06-05  8:48         ` Steven Whitehouse
2002-06-02  6:39           ` Pavel Machek
     [not found] <3CE40A77.22C74DC1@zip.com.au>
2002-05-16 20:28 ` Peter T. Breuer
2002-05-16 13:18 chen, xiangping
2002-05-15 21:43 Peter T. Breuer
2002-05-16  8:33 ` Steven Whitehouse
2002-05-15 17:43 Peter T. Breuer
2002-05-15 19:43 ` Steven Whitehouse
2002-05-16  5:15   ` Peter T. Breuer
2002-05-16  8:04     ` Steven Whitehouse
2002-05-16  8:49       ` Peter T. Breuer
2002-05-15 16:01 Peter T. Breuer
2002-05-14 17:42 chen, xiangping
2002-05-14 17:36 chen, xiangping
2002-05-14 18:02 ` Alan Cox
2002-05-14 15:05 chen, xiangping
2002-05-14 15:11 ` Jes Sorensen
2002-05-10 15:39 chen, xiangping
2002-05-10 15:02 chen, xiangping
2002-05-10 15:11 ` Steven Whitehouse
2002-05-14 14:58 ` Jes Sorensen
2002-05-06 15:05 chen, xiangping
2002-05-07  8:15 ` Steven Whitehouse
2002-05-06  2:26 chen, xiangping
2002-05-06  8:45 ` Steven Whitehouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).