* Re: Linux networking and disk IO issues
       [not found] <3B1BB85B.360CE0F6@northforknet.com.suse.lists.linux.kernel>
@ 2001-06-13 10:36 ` Andi Kleen
  0 siblings, 0 replies; 3+ messages in thread
From: Andi Kleen @ 2001-06-13 10:36 UTC (permalink / raw)
  To: Mark Hayden; +Cc: linux-kernel


[this time with l-k cc]

Mark Hayden <mark@northforknet.com> writes:

> * The Linux networking stack requires all skbuff buffers to be
>   contiguous.  As far as I can tell, this makes it impossible to
>   write high-bandwidth UDP applications on Linux.  For instance, the
>   kernel will drop a fragmented 8KB message if it cannot allocate 8KB
>   of contiguous memory to reassemble it into.  I have found that it
>   is relatively easy to enter regimes where this can cause massive
>   packet loss.

2.4.4+ supports fragmented packets and packet lists.

You're probably seeing the 8K allocation problem for incoming packets, which need to be
allocated by the driver at interrupt time with GFP_ATOMIC. GFP_ATOMIC memory is limited.
The 2.4 VM unfortunately has no way to keep more GFP_ATOMIC memory free ATM or to tune for heavy
interrupt load (2.2 allowed this by increasing the freepages sysctl). Hopefully this VM bug
will be fixed in the not too distant future.

A workaround in the driver would be to use the 2.4.4 fragmented buffers 
(of course you'll still run into GFP_ATOMIC limits without manual tuning)
or allocate RX memory from a thread with GFP_KERNEL.



-Andi
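
The last workaround above (allocating RX memory from a thread rather than at
interrupt time) can be sketched as a user-space analogy. This is not driver
code: malloc() stands in for a GFP_KERNEL allocation, and the names rx_pool,
rx_pool_refill, rx_pool_take, POOL_SLOTS and RX_BUF_SIZE are hypothetical. The
sketch is single-threaded; a real driver would need locking between the refill
thread and the interrupt handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SLOTS  16      /* hypothetical pool depth */
#define RX_BUF_SIZE 8192    /* 8KB, the message size discussed above */

struct rx_pool {
    void  *slot[POOL_SLOTS];
    size_t count;           /* buffers currently available */
};

/* Refill path: runs where blocking is acceptable (the analogue of a
 * thread allocating with GFP_KERNEL). Returns the number of buffers added. */
static size_t rx_pool_refill(struct rx_pool *p)
{
    size_t added = 0;

    assert(p != NULL);
    while (p->count < POOL_SLOTS) {
        void *buf = malloc(RX_BUF_SIZE);
        if (buf == NULL)
            break;          /* try again on the next refill pass */
        p->slot[p->count++] = buf;
        added++;
    }
    return added;
}

/* Fast path: the analogue of the interrupt handler. It never allocates;
 * an empty pool yields NULL, which corresponds to dropping the packet. */
static void *rx_pool_take(struct rx_pool *p)
{
    assert(p != NULL);
    if (p->count == 0)
        return NULL;
    return p->slot[--p->count];
}

/* Release every buffer still held by the pool. */
static void rx_pool_destroy(struct rx_pool *p)
{
    assert(p != NULL);
    while (p->count > 0)
        free(p->slot[--p->count]);
}
```

The point of the split is that the only allocation that can fail under
pressure happens off the fast path, so the pool depth, rather than the amount
of atomically allocatable memory, bounds how large a burst can be absorbed.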


* Re: Linux networking and disk IO issues
  2001-06-04 16:33 Mark Hayden
@ 2001-06-04 20:02 ` Alan Cox
  0 siblings, 0 replies; 3+ messages in thread
From: Alan Cox @ 2001-06-04 20:02 UTC (permalink / raw)
  To: Mark Hayden; +Cc: linux-kernel

> * The Linux networking stack requires all skbuff buffers to be
>   contiguous.  As far as I can tell, this makes it impossible to
>   write high-bandwidth UDP applications on Linux.  For instance, the
>   kernel will drop a fragmented 8KB message if it cannot allocate 8KB
>   of contiguous memory to reassemble it into.  I have found that it
>   is relatively easy to enter regimes where this can cause massive
>   packet loss.

If you are fragmenting messages then you want to optimise the protocol a bit
more. IP fragmentation increases processing overheads and reduces performance
badly in the presence of link congestion and error.

Most modern file sharing protocols are TCP based for good reason.

> * readv()/writev().  Linux serializes scatter/gather IO operations
>   into an operation for each iovec entry.  This is the relevant code
>   from a 2.4-series kernel:

Not on a socket. On a file it makes very little difference. Socket readv/writev
behaviour varies by protocol family.

>   * For writes, it forces read-modify-write when the individual
>     iovecs are not block-aligned.

From cache, of data live in the L1 cache of the CPU.

> * There is no preadv(), pwritev().  (The pread/pwrite() system calls
>   combine a llseek with a read/write system call.)  This means that

True. The Single UNIX Specification does not include preadv(). Really you want
to take it up with the Open Group. That said, Linux does sometimes add syscalls
that are not in SUS.
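
In the absence of preadv(), a positioned scatter read can be approximated in
user space with one pread() per iovec entry. The following is a minimal
sketch; emulated_preadv is a hypothetical name, not an existing call. It
avoids the shared file offset (so threads need neither separate open() calls
nor llseek()), but it still issues one request per iovec, so it does not
address the serialization complaint.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Read into each iovec in turn, starting at 'offset', without touching
 * the file descriptor's own offset. Returns the total number of bytes
 * read (short on EOF), or -1 if the very first read fails. */
static ssize_t emulated_preadv(int fd, const struct iovec *iov,
                               int iovcnt, off_t offset)
{
    ssize_t total = 0;

    assert(iov != NULL || iovcnt == 0);
    for (int i = 0; i < iovcnt; i++) {
        size_t done = 0;

        while (done < iov[i].iov_len) {
            ssize_t n = pread(fd, (char *)iov[i].iov_base + done,
                              iov[i].iov_len - done, offset + total);
            if (n < 0)
                return total > 0 ? total : -1;  /* report partial progress */
            if (n == 0)
                return total;                   /* end of file */
            done  += (size_t)n;
            total += n;
        }
    }
    return total;
}
```

Unlike a real system call, this emulation is not atomic with respect to
concurrent writers to the same file region.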

> * The requirement that everything about operations to raw character
>   device files (length, offset in file, *and* address in memory) has
>   to be 512-byte aligned is a real hassle.

Welcome to PC hardware. A large amount of PC hardware genuinely has limitations
of this nature. Most disk controllers can only write whole sectors on a sector
alignment. Many network controllers can only handle burst or 32-bit alignment
policies.
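
For applications that have to live with the sector constraint, the usual
approach is to widen each request to sector boundaries and to allocate the
buffer on a sector boundary as well. A minimal sketch, assuming a 512-byte
sector; SECTOR_SIZE, sector_span and alloc_sector_buffer are illustrative
names, not an existing API:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SECTOR_SIZE 512u    /* assumed sector size */

struct aligned_span {
    uint64_t offset;        /* sector-aligned start */
    uint64_t length;        /* multiple of SECTOR_SIZE */
};

static uint64_t sector_round_down(uint64_t x)
{
    return x & ~(uint64_t)(SECTOR_SIZE - 1);
}

static uint64_t sector_round_up(uint64_t x)
{
    return sector_round_down(x + SECTOR_SIZE - 1);
}

/* Smallest sector-aligned span that covers [offset, offset + length). */
static struct aligned_span sector_span(uint64_t offset, uint64_t length)
{
    struct aligned_span s;

    s.offset = sector_round_down(offset);
    s.length = sector_round_up(offset + length) - s.offset;
    return s;
}

/* Allocate a buffer whose address is a multiple of SECTOR_SIZE.
 * Returns NULL on failure; release with free(). */
static void *alloc_sector_buffer(size_t length)
{
    void *buf = NULL;

    if (posix_memalign(&buf, SECTOR_SIZE, length) != 0)
        return NULL;
    return buf;
}
```

For example, a request for 1000 bytes at offset 700 widens to 1536 bytes at
offset 512, and the caller copies the wanted 1000 bytes out of the middle of
the aligned buffer.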



* Linux networking and disk IO issues
@ 2001-06-04 16:33 Mark Hayden
  2001-06-04 20:02 ` Alan Cox
  0 siblings, 1 reply; 3+ messages in thread
From: Mark Hayden @ 2001-06-04 16:33 UTC (permalink / raw)
  To: linux-kernel

I recently released a clustered storage system for Linux (the software
in binary form and manual can be downloaded from
www.northforknet.com).  With this software, you can create a highly
available storage cluster out of standard PC hardware.

During this work, we encountered a number of problems with the Linux
kernel.  I believe these all apply to the current kernels (though I'm
working with the 2.4.2 kernel).  If you respond, please CC me
directly, since I follow Linux kernel development through weekly
summaries in Linux Weekly News.

regards, Mark Hayden
mark@northforknet.com

* The Linux networking stack requires all skbuff buffers to be
  contiguous.  As far as I can tell, this makes it impossible to
  write high-bandwidth UDP applications on Linux.  For instance, the
  kernel will drop a fragmented 8KB message if it cannot allocate 8KB
  of contiguous memory to reassemble it into.  I have found that it
  is relatively easy to enter regimes where this can cause massive
  packet loss.

* readv()/writev().  Linux serializes scatter/gather IO operations
  into an operation for each iovec entry.  This is the relevant code
  from a 2.4-series kernel:

	/* VERIFY_WRITE actually means a read, as we write to user space */
	fn = (type == VERIFY_WRITE ? file->f_op->read :
	      (io_fn_t) file->f_op->write);

	ret = 0;
	vector = iov;
	while (count > 0) {
		void * base;
		size_t len;
		ssize_t nr;

		base = vector->iov_base;
		len = vector->iov_len;
		vector++;
		count--;

		nr = fn(file, base, len, &file->f_pos);

		if (nr < 0) {
			if (!ret) ret = nr;
			break;
		}
		ret += nr;
		if (nr != len)
			break;
	}

  This causes several problems:

  * For writes, it forces read-modify-write when the individual
    iovecs are not block-aligned.

  * For reads, it prevents all the read requests from being presented
    at the same time to the IO system.  This is a problem for raw IO
    without read-ahead.

* There is no preadv(), pwritev().  (The pread/pwrite() system calls
  combine a llseek with a read/write system call.)  This means that
  if you want to have multiple threads in a process write random
  blocks using scatter-gather, you need to open() a device file
  multiple times and make the extra llseek() calls.

* The requirement that everything about operations to raw character
  device files (length, offset in file, *and* address in memory) has
  to be 512-byte aligned is a real hassle.

* There are several assumptions in the kernel that make it very
  difficult to write virtual block devices that convert IO operations
  into networked RPC requests.  For instance, if you run the normal
  NBD device where the server is on the same machine as the client,
  you will likely deadlock your system.  Our software distribution
  includes a patch to the 2.4.2 kernel that prevents these deadlock
  scenarios with NBD, but it is something of a hack.  (I want to thank
  Stephen Tweedie for his help in developing this work-around, though
  of course the hack is my responsibility.)  I don't know what could
  be done to fix these problems correctly without major changes to
  block IO in the kernel.
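
One user-space way around the per-iovec write problem described in the
readv()/writev() item is to gather the iovecs into a single aligned bounce
buffer and issue one positioned write. This is only a sketch under stated
assumptions: gather_pwrite and BOUNCE_ALIGN are hypothetical names, the
512-byte alignment is taken from the raw-device item, and the extra copy has
a cost that may or may not be acceptable.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define BOUNCE_ALIGN 512u   /* assumed alignment for the bounce buffer */

/* Copy all iovecs into one contiguous aligned buffer and write it with
 * a single pwrite(), so the IO layer sees one request, not one per
 * iovec. Returns bytes written, or -1 with errno set. */
static ssize_t gather_pwrite(int fd, const struct iovec *iov,
                             int iovcnt, off_t offset)
{
    size_t total = 0, pos = 0;
    void *bounce = NULL;
    ssize_t n;
    int rc, saved_errno;

    assert(iov != NULL || iovcnt == 0);
    for (int i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;
    if (total == 0)
        return 0;

    rc = posix_memalign(&bounce, BOUNCE_ALIGN, total);
    if (rc != 0) {
        errno = rc;         /* posix_memalign returns the error code */
        return -1;
    }
    for (int i = 0; i < iovcnt; i++) {
        memcpy((char *)bounce + pos, iov[i].iov_base, iov[i].iov_len);
        pos += iov[i].iov_len;
    }

    n = pwrite(fd, bounce, total, offset);
    saved_errno = errno;    /* keep pwrite's errno across free() */
    free(bounce);
    errno = saved_errno;
    return n;
}
```

On a raw device the total length and the offset would still have to be
multiples of the sector size; the sketch only removes the requirement that
each individual iovec be block-aligned.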

