linux-kernel.vger.kernel.org archive mirror
* sendfile() with 100 simultaneous 100MB files
@ 2006-01-20 21:53 Jon Smirl
  2006-01-21  2:22 ` Matti Aarnio
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Jon Smirl @ 2006-01-20 21:53 UTC (permalink / raw)
  To: lkml

I was reading this blog post about the lighttpd web server.
http://blog.lighttpd.net/articles/2005/11/11/optimizing-lighty-for-high-concurrent-large-file-downloads
It describes problems they are having downloading 100 simultaneous 100MB files.

In this post they complain about sendfile() getting into seek storms and
ending up in 72% IO wait. As a result they built a user space
mechanism to work around the problems.

I tried looking at how the kernel implements sendfile(). I have only a
minimal understanding of how the fs code works, but it looks to me like
sendfile() works one page at a time. I was looking for code that
does something like this...

1) Compute an adaptive window size and read ahead the appropriate
number of pages.  A larger window would minimize disk seeks.

2) Something along the lines of: as soon as a page is sent, age it
down into the middle of the page ages. That would still allow for files
that are sent repeatedly, but would reduce thrashing from files that are
not sent frequently and shouldn't stay in the page cache.

Any other ideas why sendfile() would get into a seek storm?
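
For reference, the usage in question is just the plain sendfile() loop
below -- a minimal sketch with made-up names and trimmed error handling,
not lighttpd's actual code:

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Push one whole file down an already-connected socket.  On these
 * kernels out_fd must be a socket and in_fd a regular file. */
static int send_whole_file(int sock_fd, const char *path)
{
        struct stat st;
        off_t offset = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }
        while (offset < st.st_size) {
                /* The kernel feeds the socket straight from the page
                 * cache and advances offset by what it sent. */
                ssize_t n = sendfile(sock_fd, fd, &offset, st.st_size - offset);
                if (n <= 0)
                        break;  /* error (e.g. EAGAIN) or nothing left */
        }
        close(fd);
        return offset == st.st_size ? 0 : -1;
}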

--
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: sendfile() with 100 simultaneous 100MB files
  2006-01-20 21:53 sendfile() with 100 simultaneous 100MB files Jon Smirl
@ 2006-01-21  2:22 ` Matti Aarnio
  2006-01-21  3:43   ` Jon Smirl
  2006-01-21  3:52 ` Phillip Susi
  2006-01-22 14:24 ` Jim Nance
  2 siblings, 1 reply; 10+ messages in thread
From: Matti Aarnio @ 2006-01-21  2:22 UTC (permalink / raw)
  To: Jon Smirl; +Cc: lkml

On Fri, Jan 20, 2006 at 04:53:44PM -0500, Jon Smirl wrote:
> I was reading this blog post about the lighttpd web server.
> http://blog.lighttpd.net/articles/2005/11/11/optimizing-lighty-for-high-concurrent-large-file-downloads
> It describes problems they are having downloading 100 simultaneous 100MB files.

    "more than 100 files of each more than 100 MB"

> In this post they complain about sendfile() getting into seek storms and
> ending up in 72% IO wait. As a result they built a user space
> mechanism to work around the problems.
> 
> I tried looking at how the kernel implements sendfile(). I have only a
> minimal understanding of how the fs code works, but it looks to me like
> sendfile() works one page at a time. I was looking for code that
> does something like this...
> 
> 1) Compute an adaptive window size and read ahead the appropriate
> number of pages.  A larger window would minimize disk seeks.

Or maybe not..   larger main memory would help more.  But there is
another issue...

> 2) Something along the lines of: as soon as a page is sent, age it
> down into the middle of the page ages. That would still allow for files
> that are sent repeatedly, but would reduce thrashing from files that are
> not sent frequently and shouldn't stay in the page cache.
> 
> Any other ideas why sendfile() would get into a seek storm?


Deep inside do_generic_mapping_read() there is a loop that
reads the source file with read-ahead processing, processes it
one page at a time, calls the actor (which sends the page) and
releases the page cache reference for that page -- with more convoluted
handling when the page isn't in the page cache, etc.


                /*
                 * Ok, we have the page, and it's up-to-date, so
                 * now we can copy it to user space...
                 *
                 * The actor routine returns how many bytes were actually used..
                 * NOTE! This may not be the same as how much of a user buffer
                 * we filled up (we may be padding etc), so we can only update
                 * "pos" here (the actor routine has to update the user buffer
                 * pointers and the remaining count).
                 */
                ret = actor(desc, page, offset, nr);
                offset += ret;
                index += offset >> PAGE_CACHE_SHIFT;
                offset &= ~PAGE_CACHE_MASK;

                page_cache_release(page);
                if (ret == nr && desc->count)
                        continue;


That is, if machine memory is so limited (file pages + network
tcp buffers!) that source file pages get constantly purged,
there is not much that one can do.

The described workaround is essentially to read the file into server
process memory through a half-MB sliding window, and then writev()
from there to the socket.  Most importantly, it does the reading in
_large_ chunks.
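
Schematically the workaround amounts to something like this -- only a
sketch of the idea, not the actual lighttpd code:

#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define WINDOW (512 * 1024)     /* the half-MB sliding window */

/* Read the file in large chunks into ordinary process memory and
 * writev() each chunk to the socket. */
static int stream_file(int sock_fd, int file_fd)
{
        char *buf = malloc(WINDOW);
        ssize_t got;
        int rc = -1;

        if (!buf)
                return -1;

        while ((got = read(file_fd, buf, WINDOW)) > 0) {
                struct iovec iov = { .iov_base = buf, .iov_len = (size_t)got };

                /* If the socket stalls, the chunk sits in non-discardable
                 * process memory; the page cache copy may be dropped and
                 * the disk never has to seek back for it. */
                while (iov.iov_len > 0) {
                        ssize_t sent = writev(sock_fd, &iov, 1);
                        if (sent < 0)
                                goto out;
                        iov.iov_base = (char *)iov.iov_base + sent;
                        iov.iov_len -= (size_t)sent;
                }
        }
        rc = (got == 0) ? 0 : -1;
out:
        free(buf);
        return rc;
}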

The read-ahead in sendfile is done by  page_cache_readahead(), and
via fairly complicated circumstances it ends up using 

        bdi = mapping->backing_dev_info;

        switch (advice) {
        case POSIX_FADV_NORMAL:
                file->f_ra.ra_pages = bdi->ra_pages;
                break;
        case POSIX_FADV_RANDOM:
                file->f_ra.ra_pages = 0;
                break;
        case POSIX_FADV_SEQUENTIAL:
                file->f_ra.ra_pages = bdi->ra_pages * 2;
                break;
	....


The default value for ra_pages is the equivalent of 128 kB, which
should be enough...

Why does it go into seek thrashing?  Because the read-ahead buffer
memory is processed in very small fragments, and the sendpage-to-socket
writing logic pauses frequently; during those pauses the read-ahead
buffers get recycled...

In the writev() solution, pauses on the socket-sending side do not show
up as heavily on the file-reading side, because the data is buffered in
the non-discardable memory of the userspace process.
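
Given the POSIX_FADV_SEQUENTIAL case shown above, one thing the
application side could try is an explicit hint before its sendfile()
loop.  Just a sketch -- whether the doubled window actually avoids the
thrashing here is untested:

#define _XOPEN_SOURCE 600
#include <fcntl.h>

/* Ask for sequential read-ahead on the whole file (len 0 = to EOF);
 * per the switch above this doubles file->f_ra.ra_pages. */
static void hint_sequential(int fd)
{
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}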

> --
> Jon Smirl
> jonsmirl@gmail.com

/Matti Aarnio

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: sendfile() with 100 simultaneous 100MB files
  2006-01-21  2:22 ` Matti Aarnio
@ 2006-01-21  3:43   ` Jon Smirl
  2006-01-22  3:46     ` Benjamin LaHaise
  0 siblings, 1 reply; 10+ messages in thread
From: Jon Smirl @ 2006-01-21  3:43 UTC (permalink / raw)
  To: Matti Aarnio; +Cc: lkml

On 1/20/06, Matti Aarnio <matti.aarnio@zmailer.org> wrote:
> On Fri, Jan 20, 2006 at 04:53:44PM -0500, Jon Smirl wrote:
> > I was reading this blog post about the lighttpd web server.
> > http://blog.lighttpd.net/articles/2005/11/11/optimizing-lighty-for-high-concurrent-large-file-downloads
> > It describes problems they are having downloading 100 simultaneous 100MB files.
>
>     "more than 100 files of each more than 100 MB"
>
> > In this post they complain about sendfile() getting into seek storms and
> > ending up in 72% IO wait. As a result they built a user space
> > mechanism to work around the problems.
> >
> > I tried looking at how the kernel implements sendfile(). I have only a
> > minimal understanding of how the fs code works, but it looks to me like
> > sendfile() works one page at a time. I was looking for code that
> > does something like this...
> >
> > 1) Compute an adaptive window size and read ahead the appropriate
> > number of pages.  A larger window would minimize disk seeks.
>
> Or maybe not..   larger main memory would help more.  But there is
> another issue...
>
> > 2) Something along the lines of: as soon as a page is sent, age it
> > down into the middle of the page ages. That would still allow for files
> > that are sent repeatedly, but would reduce thrashing from files that are
> > not sent frequently and shouldn't stay in the page cache.
> >
> > Any other ideas why sendfile() would get into a seek storm?
>
>

Thanks for pointing me in the right direction in the source.
Is there a write up anywhere on how sendfile() works?


> Deep inside do_generic_mapping_read() there is a loop that
> reads the source file with read-ahead processing, processes it
> one page at a time, calls the actor (which sends the page) and
> releases the page cache reference for that page -- with more convoluted
> handling when the page isn't in the page cache, etc.
>
>
>                 /*
>                  * Ok, we have the page, and it's up-to-date, so
>                  * now we can copy it to user space...
>                  *
>                  * The actor routine returns how many bytes were actually used..
>                  * NOTE! This may not be the same as how much of a user buffer
>                  * we filled up (we may be padding etc), so we can only update
>                  * "pos" here (the actor routine has to update the user buffer
>                  * pointers and the remaining count).
>                  */
>                 ret = actor(desc, page, offset, nr);
>                 offset += ret;
>                 index += offset >> PAGE_CACHE_SHIFT;
>                 offset &= ~PAGE_CACHE_MASK;
>
>                 page_cache_release(page);
>                 if (ret == nr && desc->count)
>                         continue;
>
>
> That is, if machine memory is so limited (file pages + network
> tcp buffers!) that source file pages get constantly purged,
> there is not much that one can do.
>
> The described workaround is essentially to read the file into server
> process memory through a half-MB sliding window, and then writev()
> from there to the socket.  Most importantly, it does the reading in
> _large_ chunks.

100 users at 500K each is 50MB of read-ahead; that's not a huge amount of memory.

>
> The read-ahead in sendfile is done by  page_cache_readahead(), and
> via fairly complicated circumstances it ends up using
>
>         bdi = mapping->backing_dev_info;
>
>         switch (advice) {
>         case POSIX_FADV_NORMAL:
>                 file->f_ra.ra_pages = bdi->ra_pages;
>                 break;
>         case POSIX_FADV_RANDOM:
>                 file->f_ra.ra_pages = 0;
>                 break;
>         case POSIX_FADV_SEQUENTIAL:
>                 file->f_ra.ra_pages = bdi->ra_pages * 2;
>                 break;
>         ....
>
>
> The default value for ra_pages is the equivalent of 128 kB, which
> should be enough...

Does using sendfile() set MADV_SEQUENTIAL and MADV_DONTNEED implicitly?
If not, would setting these help?

> Why does it go into seek thrashing?  Because the read-ahead buffer
> memory is processed in very small fragments, and the sendpage-to-socket
> writing logic pauses frequently; during those pauses the read-ahead
> buffers get recycled...

I was following you until this part. I thought sendfile() worked
using mmap'd files and that readahead was done into the global page
cache.

But this makes me think that readahead is instead going into another
pool. How large is this pool? The user space scheme is using 50MB of
readahead cache; will the kernel do that much readahead if needed?

> In the writev() solution, pauses on the socket-sending side do not show
> up as heavily on the file-reading side, because the data is buffered in
> the non-discardable memory of the userspace process.

Does this scenario illustrate a problem with the current sendfile()
implementation? I thought the goal of sendfile() was to always be the
best way to send complete files. This is a case where user space is
clearly beating sendfile().

--
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: sendfile() with 100 simultaneous 100MB files
  2006-01-20 21:53 sendfile() with 100 simultaneous 100MB files Jon Smirl
  2006-01-21  2:22 ` Matti Aarnio
@ 2006-01-21  3:52 ` Phillip Susi
  2006-01-22 14:24 ` Jim Nance
  2 siblings, 0 replies; 10+ messages in thread
From: Phillip Susi @ 2006-01-21  3:52 UTC (permalink / raw)
  To: Jon Smirl; +Cc: lkml

I took a look at that article, and well, it looks a bit off to me.  I 
looked at the code it referred to, and it mmap's the file and optionally 
copies from the map to a private buffer before writing to the socket.

The double buffering that is enabled by LOCAL_BUFFERING is a complete 
and total waste of both cpu and ram.  There is no reason to allocate 
more ram and waste more cpu cycles making a second copy of the data 
before passing it to the network layer.  The mmap and madvise, though, are 
a good idea, and I imagine they are causing the kernel to perform large 
block readahead.
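
The useful half of their approach boils down to something like this --
a sketch based on my reading, not their actual implementation:

#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the file, tell the kernel the access is sequential, and write()
 * straight from the mapping -- no second userspace copy. */
static int send_mapped(int sock_fd, int file_fd)
{
        struct stat st;
        char *map;
        off_t off = 0;

        if (fstat(file_fd, &st) < 0 || st.st_size == 0)
                return -1;

        map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, file_fd, 0);
        if (map == MAP_FAILED)
                return -1;

        /* Hint: read ahead aggressively, and used pages may be freed
         * soon after they are touched. */
        madvise(map, st.st_size, MADV_SEQUENTIAL);

        while (off < st.st_size) {
                ssize_t n = write(sock_fd, map + off, st.st_size - off);
                if (n <= 0)
                        break;
                off += n;
        }
        munmap(map, st.st_size);
        return off == st.st_size ? 0 : -1;
}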

If you really want to be able to simultaneously push hundreds of 
streams efficiently, though, you want to use zero-copy aio, which can 
have tremendous benefits in throughput and cpu usage.  Unfortunately, I 
believe the current kernel does not support O_DIRECT on sockets.

I last looked at the kernel implementation of sendfile about 6 years 
ago, but I remember it not looking very good.  I believe it WAS only 
transferring a single page at a time, and it was still making a copy from 
the fs cache to socket buffers, so it wasn't really doing zero-copy IO ( 
though it was one less copy than doing a read and write ).

About that time I was writing an ftp server on the NT kernel and 
discovered zero copy async IO.  I ended up using a small thread pool and 
an IO completion port to service the async IO requests.  The files were 
mmaped in 64 KB chunks, three at a time, and queued asynchronously to 
the socket which was set to use no kernel buffering.  This allowed a 
PII-233 machine to push 11,820 KB/s ( that's real KB, not salesman's ) 
over a single session on a 100Base-T network, and saturate dual network 
interfaces with multiple connections, all using less than 1% of the cpu, 
because the NICs were able to directly perform scatter/gather DMA on the 
filesystem cache pages.

I'm hopeful that the Linux kernel will be able to do this soon as well, 
when the network stack supports O_DIRECT on sockets.

Jon Smirl wrote:
> I was reading this blog post about the lighttpd web server.
> http://blog.lighttpd.net/articles/2005/11/11/optimizing-lighty-for-high-concurrent-large-file-downloads
> It describes problems they are having downloading 100 simultaneous 100MB files.
> 
> In this post they complain about sendfile() getting into seek storms and
> ending up in 72% IO wait. As a result they built a user space
> mechanism to work around the problems.
> 
> I tried looking at how the kernel implements sendfile(). I have only a
> minimal understanding of how the fs code works, but it looks to me like
> sendfile() works one page at a time. I was looking for code that
> does something like this...
> 
> 1) Compute an adaptive window size and read ahead the appropriate
> number of pages.  A larger window would minimize disk seeks.
> 
> 2) Something along the lines of: as soon as a page is sent, age it
> down into the middle of the page ages. That would still allow for files
> that are sent repeatedly, but would reduce thrashing from files that are
> not sent frequently and shouldn't stay in the page cache.
> 
> Any other ideas why sendfile() would get into a seek storm?
> 
> --
> Jon Smirl
> jonsmirl@gmail.com


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: sendfile() with 100 simultaneous 100MB files
  2006-01-21  3:43   ` Jon Smirl
@ 2006-01-22  3:46     ` Benjamin LaHaise
  0 siblings, 0 replies; 10+ messages in thread
From: Benjamin LaHaise @ 2006-01-22  3:46 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Matti Aarnio, lkml

On Fri, Jan 20, 2006 at 10:43:44PM -0500, Jon Smirl wrote:
> 100 users at 500K each is 50MB of read-ahead; that's not a huge amount of
> memory.

The system might be overrunning the number of requests the disk elevator 
has, which would result in the sort of disk seek storm you're seeing.  
Also, what filesystem is being used?  XFS would likely do substantially 
better than ext3 because of its use of extents vs indirect blocks.

> Does using sendfile() set MADV_SEQUENTIAL and MADV_DONTNEED implicitly?
> If not, would setting these help?

No.  Readahead should be doing the right thing.  Rik van Riel did some 
work on drop behind for exactly this sort of case.

> I was following you until this part. I thought sendfile() worked
> using mmap'd files and that readahead was done into the global page
> cache.

sendfile() uses the page cache directly, so it's like an mmap(), but it 
does not carry the overhead associated with tlb manipulation.

> But this makes me think that readahead is instead going into another
> pool. How large is this pool? The user space scheme is using 50MB of
> readahead cache; will the kernel do that much readahead if needed?

The kernel performs readahead using the system memory pool, which means 
the VM gets involved and performs page reclaim to free up previously 
cached pages.

> Does this scenario illustrate a problem with the current sendfile()
> implementation? I thought the goal of sendfile() was to always be the
> best way to send complete files. This is a case where user space is
> clearly beating sendfile().

Yes, this would be called a bug. =-)

		-ben
-- 
"You know, I've seen some crystals do some pretty trippy shit, man."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: sendfile() with 100 simultaneous 100MB files
  2006-01-20 21:53 sendfile() with 100 simultaneous 100MB files Jon Smirl
  2006-01-21  2:22 ` Matti Aarnio
  2006-01-21  3:52 ` Phillip Susi
@ 2006-01-22 14:24 ` Jim Nance
  2006-01-22 17:31   ` Jon Smirl
  2006-01-23 16:50   ` jerome lacoste
  2 siblings, 2 replies; 10+ messages in thread
From: Jim Nance @ 2006-01-22 14:24 UTC (permalink / raw)
  To: Jon Smirl; +Cc: lkml

On Fri, Jan 20, 2006 at 04:53:44PM -0500, Jon Smirl wrote:

> Any other ideas why sendfile() would get into a seek storm?

I can't really comment on the quality of the linux sendfile() implementation,
I've never looked at the code.  However, a couple of general observations.

The seek storm happens because linux is trying to be "fair," where fair
means no one process gets to starve another for I/O bandwidth.

The fastest way to transfer 100 100M files would be to send them one at a
time.  The 99th person in line would of course perceive this as a very poor
implementation.  The current sendfile implementation seems to live at the
other end of the extreme.

It is possible to come up with a compromise behavior by limiting the
number of concurrent sendfiles running, and the maximum size they are
allowed to send in one squirt.
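
On the application side that could look roughly like the sketch below;
the 2 MB figure is made up, and the cap on concurrent senders would be
the other half of it:

#include <sys/sendfile.h>

#define SQUIRT (2 * 1024 * 1024)        /* max bytes per turn (arbitrary) */

/* Called from the server's event loop when sock_fd becomes writable.
 * Capping the per-turn size lets a few streams at a time issue large,
 * mostly sequential reads instead of 100 streams interleaving tiny ones.
 * sendfile() advances *offset by however much it actually sent. */
static ssize_t send_one_turn(int sock_fd, int file_fd,
                             off_t *offset, off_t remaining)
{
        size_t chunk = remaining < SQUIRT ? (size_t)remaining : SQUIRT;

        return sendfile(sock_fd, file_fd, offset, chunk);
}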

Thanks,

Jim

-- 
jlnance@sdf.lonestar.org
SDF Public Access UNIX System - http://sdf.lonestar.org

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: sendfile() with 100 simultaneous 100MB files
  2006-01-22 14:24 ` Jim Nance
@ 2006-01-22 17:31   ` Jon Smirl
  2006-01-23 15:22     ` Jon Smirl
  2006-01-23 16:50   ` jerome lacoste
  1 sibling, 1 reply; 10+ messages in thread
From: Jon Smirl @ 2006-01-22 17:31 UTC (permalink / raw)
  To: Jim Nance; +Cc: lkml

On 1/22/06, Jim Nance <jlnance@sdf.lonestar.org> wrote:
> On Fri, Jan 20, 2006 at 04:53:44PM -0500, Jon Smirl wrote:
>
> > Any other ideas why sendfile() would get into a seek storm?
>
> I can't really comment on the quality of the linux sendfile() implementation,
> I've never looked at the code.  However, a couple of general observations.
>
> The seek storm happens because linux is trying to be "fair," where fair
> means no one process gets to starve another for I/O bandwidth.

I think there is something more going on. The user space processes
submitted requests for the same IO in 500K chunks and didn't get into
a seek storm. If it were a disk fairness problem, the user space
implementation would have gotten into trouble too.

There seems to be some difference in the way sendfile() submits the
requests to the disk system and how the 500K requests from user space
are handled. I believe both tests were using the same disk scheduler
algorithm so the data points to differences in how the requests are
submitted to the disk system. The sendfile() submission pattern
triggers a storm and the user space one doesn't.

I've asked the lighttpd people for more data but I haven't gotten
anything back yet. Things like RAM, network speed, disk scheduler
algorithm, etc.

>
> The fastest way to transfer 100 100M files would be to send them one at a
> time.  The 99th person in line would of course perceive this as a very poor
> implementation.  The current sendfile implementation seems to live at the
> other end of the extreme.

One at a time may not be the fastest. When the network transmission
window is full you will stop transmitting on that socket but you can
probably still transmit on the others. Packet loss is another reason
for sockets blocking.

>
> It is possible to come up with a compromise behavior by limiting the
> number of concurrent sendfiles running, and the maximum size they are
> allowed to send in one squirt.
>
> Thanks,
>
> Jim
>
> --
> jlnance@sdf.lonestar.org
> SDF Public Access UNIX System - http://sdf.lonestar.org
>


--
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: sendfile() with 100 simultaneous 100MB files
  2006-01-22 17:31   ` Jon Smirl
@ 2006-01-23 15:22     ` Jon Smirl
  2006-01-24 16:30       ` Jon Smirl
  0 siblings, 1 reply; 10+ messages in thread
From: Jon Smirl @ 2006-01-23 15:22 UTC (permalink / raw)
  To: lkml

On 1/22/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> I've asked the lighttpd people for more data but I haven't gotten
> anything back yet. Things like RAM, network speed, disk scheduler
> algorithm, etc.

The developer is using this hardware:
82541GI/PI Gigabit  ethernet
1.3Ghz Duron
7200RPM IDE disk
768MB RAM

Kernel:
2.6.13-1.1526_FC4
CFQ disk scheduler

The customer is seeing the same problem on high-end hardware.

--
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: sendfile() with 100 simultaneous 100MB files
  2006-01-22 14:24 ` Jim Nance
  2006-01-22 17:31   ` Jon Smirl
@ 2006-01-23 16:50   ` jerome lacoste
  1 sibling, 0 replies; 10+ messages in thread
From: jerome lacoste @ 2006-01-23 16:50 UTC (permalink / raw)
  To: Jim Nance; +Cc: Jon Smirl, lkml

On 1/22/06, Jim Nance <jlnance@sdf.lonestar.org> wrote:
> On Fri, Jan 20, 2006 at 04:53:44PM -0500, Jon Smirl wrote:
[...]
> The fastest way to transfer 100 100M files would be to send them one at a
> time.

... assuming the bottleneck is not the network bandwidth out to the end
user, which, in the case of a big network file server with many clients
over the Internet, is almost never the case.

J

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: sendfile() with 100 simultaneous 100MB files
  2006-01-23 15:22     ` Jon Smirl
@ 2006-01-24 16:30       ` Jon Smirl
  0 siblings, 0 replies; 10+ messages in thread
From: Jon Smirl @ 2006-01-24 16:30 UTC (permalink / raw)
  To: lkml

I've filed a kernel bug summarizing the issue:
http://bugzilla.kernel.org/show_bug.cgi?id=5949

The lighttpd author is willing to provide more info if anyone is interested.

--
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread

Thread overview: 10+ messages
2006-01-20 21:53 sendfile() with 100 simultaneous 100MB files Jon Smirl
2006-01-21  2:22 ` Matti Aarnio
2006-01-21  3:43   ` Jon Smirl
2006-01-22  3:46     ` Benjamin LaHaise
2006-01-21  3:52 ` Phillip Susi
2006-01-22 14:24 ` Jim Nance
2006-01-22 17:31   ` Jon Smirl
2006-01-23 15:22     ` Jon Smirl
2006-01-24 16:30       ` Jon Smirl
2006-01-23 16:50   ` jerome lacoste
