* Unexpected splice "always copy" behavior observed
@ 2010-05-18 15:34 ` Mathieu Desnoyers
  0 siblings, 0 replies; 71+ messages in thread
From: Mathieu Desnoyers @ 2010-05-18 15:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe, Linus Torvalds, Nick Piggin

Hi,

I'm currently digging into the splice code to figure out why it's always in copy
mode even though I specified the SPLICE_F_MOVE flag and released the page
references from the LTTng ring buffer. I'm splicing to a pipe and then from the
pipe to an ext3 filesystem (2.6.33.4 kernel). I've got the feeling I'm missing
something and I don't like that.
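
For readers unfamiliar with the userspace side, the two-step transfer
described above might look roughly like the sketch below. It is an
illustration only: it assumes an ordinary file descriptor as the source
rather than the LTTng ring buffer, and the helper name splice_copy() is made
up for this example. SPLICE_F_MOVE is requested on both hops, but it is only
a hint, so the kernel may still copy.

```c
#define _GNU_SOURCE             /* exposes splice() and SPLICE_F_MOVE */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: move up to len bytes from in_fd to out_fd by way of
 * a pipe, requesting page moves with SPLICE_F_MOVE on both hops.
 * Returns the number of bytes written to out_fd, or -1 on error.
 */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
        int pfd[2];
        ssize_t total = 0;

        if (pipe(pfd) < 0)
                return -1;

        while (len > 0) {
                /* Hop 1: source -> pipe. A return of 0 means end of input. */
                ssize_t in = splice(in_fd, NULL, pfd[1], NULL, len,
                                    SPLICE_F_MOVE);
                if (in < 0) {
                        total = -1;
                        break;
                }
                if (in == 0)
                        break;
                len -= (size_t)in;

                /* Hop 2: pipe -> file. Drain everything queued by hop 1. */
                while (in > 0) {
                        ssize_t out = splice(pfd[0], NULL, out_fd, NULL,
                                             (size_t)in, SPLICE_F_MOVE);
                        if (out <= 0) {
                                total = -1;
                                goto done;
                        }
                        assert(out <= in);  /* never more than requested */
                        in -= out;
                        total += out;
                }
        }
done:
        close(pfd[0]);
        close(pfd[1]);
        return total;
}
```

Called as splice_copy(in_fd, out_fd, nbytes), it returns the number of bytes
that reached out_fd, or -1 on error.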

My simple test case is to add a printk around the splice copy:

fs/splice.c: pipe_to_file()
       if (buf->page != page) {
                /*
                 * Careful, ->map() uses KM_USER0!
                 */
                char *src = buf->ops->map(pipe, buf, 1);
                char *dst = kmap_atomic(page, KM_USER1);

                printk(KERN_WARNING "SPLICE COPY!!!\n");
                memcpy(dst + offset, src + buf->offset, this_len);
                flush_dcache_page(page);
                kunmap_atomic(dst, KM_USER1);
                buf->ops->unmap(pipe, buf, src);
        }

I'll start with a disclaimer that I only recently improved my splice
understanding, so AFAIU:

* pipe_to_file() allocates a struct page *page on its stack.

* It is passed, uninitialized, to

        ret = pagecache_write_begin(file, mapping, sd->pos, this_len,
                                AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);

    that looks already odd to me, as I would expect pipe_to_file to populate
    this page pointer with buf->page initially if the proper conditions are met.

* Looking at the ext2 and ext3 write_begin code, neither reads the incoming
  value of the pagep parameter; both only overwrite it:

  ext2:

static int
ext2_write_begin(struct file *file, struct address_space *mapping,
                loff_t pos, unsigned len, unsigned flags,
                struct page **pagep, void **fsdata)
{
        *pagep = NULL;
        return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
}


  ext3:

static int ext3_write_begin(struct file *file, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned flags,
                                struct page **pagep, void **fsdata)
{
        struct page *page;
        ....

retry:
        page = grab_cache_page_write_begin(mapping, index, flags);
        if (!page)
                return -ENOMEM;
        *pagep = page;

* So, considering the test to check if the page content must be copied:

       if (buf->page != page) {

  how is it ever possible that buf->page == page ?
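
To make the question concrete, the control flow described in the bullets
above can be reduced to the toy userspace model below. Nothing in it is
kernel code: toy_page, toy_pipe_buf, toy_write_begin() and
toy_pipe_to_file() are invented stand-ins, and the only behavior carried
over is the assumption that write_begin overwrites *pagep with a page of its
own choosing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy stand-ins; NOT the kernel's struct page or struct pipe_buffer. */
struct toy_page {
        char data[TOY_PAGE_SIZE];
};

struct toy_pipe_buf {
        struct toy_page *page;  /* page handed over by the producer */
        size_t offset;
        size_t len;
};

/*
 * Models a write_begin that never looks at the incoming *pagep and always
 * returns its own page-cache page, as the ext2/ext3 excerpts do.
 */
static int toy_write_begin(struct toy_page *cache_page,
                           struct toy_page **pagep)
{
        *pagep = cache_page;
        return 0;
}

/*
 * Models the tail of pipe_to_file().
 * Returns 1 if the data had to be copied, 0 if the pages were identical,
 * and a negative value if write_begin failed.
 */
int toy_pipe_to_file(const struct toy_pipe_buf *buf,
                     struct toy_page *cache_page, size_t offset)
{
        struct toy_page *page;  /* set only by toy_write_begin() */
        int ret;

        assert(offset + buf->len <= TOY_PAGE_SIZE);
        assert(buf->offset + buf->len <= TOY_PAGE_SIZE);

        ret = toy_write_begin(cache_page, &page);
        if (ret)
                return ret;

        if (buf->page != page) {
                memcpy(page->data + offset,
                       buf->page->data + buf->offset, buf->len);
                return 1;
        }
        return 0;
}
```

In this model the copy is skipped only when the producer's page already is
the page-cache page, which is consistent with the printk firing on every
splice.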

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-18 15:34 ` Mathieu Desnoyers
@ 2010-05-18 15:51   ` Nick Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nick Piggin @ 2010-05-18 15:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Steven Rostedt, Frederic Weisbecker,
	Pierre Tardy, Ingo Molnar, Arnaldo Carvalho de Melo, Tom Zanussi,
	Paul Mackerras, linux-kernel, arjan, ziga.mahkovec, davem,
	linux-mm, Andrew Morton, KOSAKI Motohiro, Christoph Lameter,
	Tejun Heo, Jens Axboe, Linus Torvalds

Hi,

The basic problem is that the filesystem APIs were never designed with
this usage in mind, so we had to disable the SPLICE_F_MOVE support by
default.

So the short answer is that this is expected.

What would be needed is to have filesystem maintainers go through and
enable it on a case by case basis. It's trivial for tmpfs/ramfs type
filesystems and I have a patch for those, but I haven't posted it yet.
Even basic buffer head filesystems IIRC get a little more complex --
but we may get some mileage just out of invalidating the existing
pagecache rather than getting fancy and trying to move buffers over
to the new page.

Nick

On Tue, May 18, 2010 at 11:34:40AM -0400, Mathieu Desnoyers wrote:
> Hi,
> 
> I'm currently digging into the splice code to figure out why it's always in copy
> mode even though I specified the SPLICE_F_MOVE flag and released the page
> references from the LTTng ring buffer. I'm splicing to a pipe and then from the
> pipe to an ext3 filesystem (2.6.33.4 kernel). I've got the feeling I'm missing
> something and I don't like that.
> 
> My simple test case is to add a printk around the splice copy:
> 
> fs/splice.c: pipe_to_file()
>        if (buf->page != page) {
>                 /*
>                  * Careful, ->map() uses KM_USER0!
>                  */
>                 char *src = buf->ops->map(pipe, buf, 1);
>                 char *dst = kmap_atomic(page, KM_USER1);
> 
>                 printk(KERN_WARNING "SPLICE COPY!!!\n");
>                 memcpy(dst + offset, src + buf->offset, this_len);
>                 flush_dcache_page(page);
>                 kunmap_atomic(dst, KM_USER1);
>                 buf->ops->unmap(pipe, buf, src);
>         }
> 
> I'll start with a disclaimer that I only recently improved my splice
> understanding, so AFAIU:
> 
> * pipe_to_file() allocates a struct page *page on its stack.
> 
> * It is passed, uninitialized, to
> 
>         ret = pagecache_write_begin(file, mapping, sd->pos, this_len,
>                                 AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
> 
>     that looks already odd to me, as I would expect pipe_to_file to populate
>     this page pointer with buf->page initially if the proper conditions are met.
> 
> * Looking at the ext2 and ext3 write_begin code, neither are using the pagep
>   parameter:
> 
>   ext2:
> 
> static int
> ext2_write_begin(struct file *file, struct address_space *mapping,
>                 loff_t pos, unsigned len, unsigned flags,
>                 struct page **pagep, void **fsdata)
> {
>         *pagep = NULL;
>         return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
> }
> 
> 
>   ext3:
> 
> static int ext3_write_begin(struct file *file, struct address_space *mapping,
>                                 loff_t pos, unsigned len, unsigned flags,
>                                 struct page **pagep, void **fsdata)
> {
>         struct page *page;
>         ....
> 
> retry:
>         page = grab_cache_page_write_begin(mapping, index, flags);
>         if (!page)
>                 return -ENOMEM;
>         *pagep = page;
> 
> * So, considering the test to check if the page content must be copied:
> 
>        if (buf->page != page) {
> 
>   how is it ever possible that buf->page == page ?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-18 15:34 ` Mathieu Desnoyers
@ 2010-05-18 15:53   ` Steven Rostedt
  -1 siblings, 0 replies; 71+ messages in thread
From: Steven Rostedt @ 2010-05-18 15:53 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe, Linus Torvalds, Nick Piggin

Hehe, I just noticed this this morning too, while investigating.


On Tue, 2010-05-18 at 11:34 -0400, Mathieu Desnoyers wrote:
> Hi,
> 
> I'm currently digging into the splice code to figure out why it's always in copy
> mode even though I specified the SPLICE_F_MOVE flag and released the page
> references from the LTTng ring buffer. I'm splicing to a pipe and then from the
> pipe to an ext3 filesystem (2.6.33.4 kernel). I've got the feeling I'm missing
> something and I don't like that.
> 
> My simple test case is to add a printk around the splice copy:
> 
> fs/splice.c: pipe_to_file()
>        if (buf->page != page) {
>                 /*
>                  * Careful, ->map() uses KM_USER0!
>                  */
>                 char *src = buf->ops->map(pipe, buf, 1);
>                 char *dst = kmap_atomic(page, KM_USER1);
> 
>                 printk(KERN_WARNING "SPLICE COPY!!!\n");
>                 memcpy(dst + offset, src + buf->offset, this_len);
>                 flush_dcache_page(page);
>                 kunmap_atomic(dst, KM_USER1);
>                 buf->ops->unmap(pipe, buf, src);

I used trace_printk() since it is not as invasive.

>         }
> 
> I'll start with a disclaimer that I only recently improved my splice
> understanding, so AFAIU:

Same here ;-)

> 
> * pipe_to_file() allocates a struct page *page on its stack.
> 
> * It is passed, uninitialized, to
> 
>         ret = pagecache_write_begin(file, mapping, sd->pos, this_len,
>                                 AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
> 
>     that looks already odd to me, as I would expect pipe_to_file to populate
>     this page pointer with buf->page initially if the proper conditions are met.
> 
> * Looking at the ext2 and ext3 write_begin code, neither are using the pagep
>   parameter:
> 
>   ext2:
> 
> static int
> ext2_write_begin(struct file *file, struct address_space *mapping,
>                 loff_t pos, unsigned len, unsigned flags,
>                 struct page **pagep, void **fsdata)
> {
>         *pagep = NULL;
>         return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
> }
> 
> 
>   ext3:
> 
> static int ext3_write_begin(struct file *file, struct address_space *mapping,
>                                 loff_t pos, unsigned len, unsigned flags,
>                                 struct page **pagep, void **fsdata)
> {
>         struct page *page;
>         ....
> 
> retry:
>         page = grab_cache_page_write_begin(mapping, index, flags);
>         if (!page)
>                 return -ENOMEM;
>         *pagep = page;
> 
> * So, considering the test to check if the page content must be copied:
> 
>        if (buf->page != page) {
> 
>   how is it ever possible that buf->page == page ?

I'm currently looking at the network code to see if it is better.

-- Steve



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-18 15:51   ` Nick Piggin
@ 2010-05-18 15:56     ` Christoph Lameter
  -1 siblings, 0 replies; 71+ messages in thread
From: Christoph Lameter @ 2010-05-18 15:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mathieu Desnoyers, Peter Zijlstra, Steven Rostedt,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Tejun Heo, Jens Axboe,
	Linus Torvalds

On Wed, 19 May 2010, Nick Piggin wrote:

> What would be needed is to have filesystem maintainers go through and
> enable it on a case by case basis. It's trivial for tmpfs/ramfs type
> filesystems and I have a patch for those, but I haven't posted it yet.
> Even basic buffer head filesystems IIRC get a little more complex --
> but we may get some mileage just out of invalidating the existing
> pagecache rather than getting fancy and trying to move buffers over
> to the new page.

There is a "migration" address space operation for moving pages. Page
migration requires that in order to be able to move dirty pages. Can
splice use that?



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-18 15:56     ` Christoph Lameter
@ 2010-05-18 16:00       ` Nick Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nick Piggin @ 2010-05-18 16:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Peter Zijlstra, Steven Rostedt,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Tejun Heo, Jens Axboe,
	Linus Torvalds

On Tue, May 18, 2010 at 10:56:24AM -0500, Christoph Lameter wrote:
> On Wed, 19 May 2010, Nick Piggin wrote:
> 
> > What would be needed is to have filesystem maintainers go through and
> > enable it on a case by case basis. It's trivial for tmpfs/ramfs type
> > filesystems and I have a patch for those, but I haven't posted it yet.
> > Even basic buffer head filesystems IIRC get a little more complex --
> > but we may get some mileage just out of invalidating the existing
> > pagecache rather than getting fancy and trying to move buffers over
> > to the new page.
> 
> There is a "migration" address space operation for moving pages. Page
> migration requires that in order to be able to move dirty pages. Can
> splice use that?

Hmm yes I didn't think of that, it probably could.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-18 15:53   ` Steven Rostedt
@ 2010-05-18 16:10     ` Steven Rostedt
  -1 siblings, 0 replies; 71+ messages in thread
From: Steven Rostedt @ 2010-05-18 16:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe, Linus Torvalds, Nick Piggin

On Tue, 2010-05-18 at 11:53 -0400, Steven Rostedt wrote:

> I'm currently looking at the network code to see if it is better.

The network code seems to do the right thing. It sends the actual page
directly to the network.

Hopefully we can find a way to avoid the copy to file. But the splice
code was created to avoid the copy to and from userspace; it did not
guarantee no copy within the kernel itself.

-- Steve



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-18 16:00       ` Nick Piggin
@ 2010-05-18 16:13         ` Nick Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nick Piggin @ 2010-05-18 16:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mathieu Desnoyers, Peter Zijlstra, Steven Rostedt,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Tejun Heo, Jens Axboe,
	Linus Torvalds

On Wed, May 19, 2010 at 02:00:51AM +1000, Nick Piggin wrote:
> On Tue, May 18, 2010 at 10:56:24AM -0500, Christoph Lameter wrote:
> > On Wed, 19 May 2010, Nick Piggin wrote:
> > 
> > > What would be needed is to have filesystem maintainers go through and
> > > enable it on a case by case basis. It's trivial for tmpfs/ramfs type
> > > filesystems and I have a patch for those, but I haven't posted it yet.
> > > Even basic buffer head filesystems IIRC get a little more complex --
> > > but we may get some mileage just out of invalidating the existing
> > > pagecache rather than getting fancy and trying to move buffers over
> > > to the new page.
> > 
> > There is a "migration" address space operation for moving pages. Page
> > migration requires that in order to be able to move dirty pages. Can
> > splice use that?
> 
> Hmm yes I didn't think of that, it probably could.

It's not the only requirement, of course, just that it could
potentially reuse some of the code.

The big difference is that the source page is already dirty, and
the destination page might not exist, might exist and be partially
uptodate, not have blocks allocated, might be past i_size, fully
uptodate, etc.

So it's more than a matter of just a simple copy to another page
and taking over exactly the same filesystem state as the old page.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-18 16:10     ` Steven Rostedt
@ 2010-05-18 16:25       ` Linus Torvalds
  -1 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2010-05-18 16:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Peter Zijlstra, Frederic Weisbecker,
	Pierre Tardy, Ingo Molnar, Arnaldo Carvalho de Melo, Tom Zanussi,
	Paul Mackerras, linux-kernel, arjan, ziga.mahkovec, davem,
	linux-mm, Andrew Morton, KOSAKI Motohiro, Christoph Lameter,
	Tejun Heo, Jens Axboe, Nick Piggin



On Tue, 18 May 2010, Steven Rostedt wrote:
> 
> Hopefully we can find a way to avoid the copy to file. But the splice
> code was created to avoid the copy to and from userspace, it did not
> guarantee no copy within the kernel itself.

Well, we always _wanted_ to splice directly to a file, but it's just not 
been done properly. It's not entirely trivial, since you need to worry 
about preexisting pages and generally just do the right thing wrt the 
filesystem.

And no, it should NOT use migration code. I suspect you could do something 
fairly simple like:

 - get the inode semaphore.
 - check if the splice is a pure "extend size" operation for that page
 - if so, just create the page cache entry and mark it dirty
 - otherwise, fall back to copying.

because the "extend file" case is the easiest one, and is likely the only 
one that matters in practice (if you are overwriting an existing file, 
things get _way_ hairier, and why the hell would anybody expect that to be 
fast anyway?)
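
The four steps above can be sketched as a small decision function. This is a
user-space model, not kernel code: the 4 KiB page size, the enum names and
choose_splice_action() are assumptions made for illustration, and the caller
is assumed to already hold the inode lock so that i_size cannot change.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* Hypothetical outcome names, not kernel identifiers. */
enum splice_action { SPLICE_MOVE_PAGE, SPLICE_COPY_PAGE };

/*
 * A write at file offset `pos` is a pure "extend size" operation for its
 * page when that page lies entirely beyond the current end of file, so no
 * existing page cache page or on-disk data can overlap it.
 */
static bool is_pure_extend(uint64_t i_size, uint64_t pos)
{
    /* Index of the first page that holds no file data yet. */
    uint64_t first_new_page = (i_size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;

    return pos / MODEL_PAGE_SIZE >= first_new_page;
}

/* Move the page only in the easy case; otherwise fall back to copying. */
static enum splice_action choose_splice_action(uint64_t i_size, uint64_t pos)
{
    return is_pure_extend(i_size, pos) ? SPLICE_MOVE_PAGE : SPLICE_COPY_PAGE;
}
```

With a 5000-byte file, a write at offset 4096 lands on a page that already
holds file data and falls back to the copy, while a write at offset 8192
touches a page that does not exist yet and qualifies for the move.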

But somebody needs to write the code..

		Linus

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-18 16:25       ` Linus Torvalds
@ 2010-05-19  6:31         ` Nick Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nick Piggin @ 2010-05-19  6:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Mathieu Desnoyers, Peter Zijlstra,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe

On Tue, May 18, 2010 at 09:25:05AM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 18 May 2010, Steven Rostedt wrote:
> > 
> > Hopefully we can find a way to avoid the copy to file. But the splice
> > code was created to avoid the copy to and from userspace, it did not
> > guarantee no copy within the kernel itself.
> 
> Well, we always _wanted_ to splice directly to a file, but it's just not 
> been done properly. It's not entirely trivial, since you need to worry 
> about preexisting pages and generally just do the right thing wrt the 
> filesystem.
> 
> And no, it should NOT use migration code. I suspect you could do something 
> fairly simple like:

I was thinking it could possibly reuse some of the migration code for
swapping filesystem state to the new page. But I agree it gets hairy and
is probably better to just insert new pages.

> 
>  - get the inode semaphore.
>  - check if the splice is a pure "extend size" operation for that page
>  - if so, just create the page cache entry and mark it dirty
>  - otherwise, fall back to copying.
> 
> because the "extend file" case is the easiest one, and is likely the only 
> one that matters in practice (if you are overwriting an existing file, 
> things get _way_ hairier, and why the hell would anybody expect that to be 
> fast anyway?)
> 
> But somebody needs to write the code..

We can possibly do an attempt to invalidate existing pagecache and
then try to install the new page. The filesystem still needs a look
over to ensure error handling will work properly, and that it does
not make incorrect assumptions about the contents of the page being
passed in.

This still isn't ideal because we drop the filesystem state (e.g. buffer
heads) on a page which, by definition, will need to be written out soon.
But something smarter could be added if it turns out to be important.

Big if, because I don't like adding complex code without having a
really good reason. I do like having the splice flag there, though.
The more the app can tell the kernel the better. Hopefully people use
it and we can get a better idea of whether these fancy optimisations
will be worth it.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19  6:31         ` Nick Piggin
@ 2010-05-19 14:39           ` Linus Torvalds
  -1 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2010-05-19 14:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Steven Rostedt, Mathieu Desnoyers, Peter Zijlstra,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe



On Wed, 19 May 2010, Nick Piggin wrote:
> 
> We can possibly do an attempt to invalidate existing pagecache and
> then try to install the new page.

Yes, but that's going to be rather hairier. We need to make sure that the 
filesystem doesn't have some kind of dirty pointers to the old page etc. 
Although I guess that should always show up in the page counters, so I 
guess we can always handle the case of page_count() being 1 (only page 
cache) and the page being unlocked.

So I'd much rather just handle the "append to the end".

The real limitation is likely always going to be the fact that it has to 
be page-aligned and a full page. For a lot of splice inputs, that simply 
won't be the case, and you'll end up copying for alignment reasons anyway.
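
To see how much of a given splice range could even be a candidate for moving,
one can split it into a partial head, whole aligned pages, and a partial tail.
A minimal sketch, assuming 4 KiB pages; split_range() and struct range_split
are hypothetical names, and the sketch only looks at the file offset (the data
would also have to start at offset 0 of its source page).

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* How a byte range [pos, pos + len) breaks down against page boundaries. */
struct range_split {
    uint64_t head_bytes;   /* bytes before the first page boundary */
    uint64_t full_pages;   /* whole, page-aligned pages */
    uint64_t tail_bytes;   /* bytes after the last whole page */
};

static struct range_split split_range(uint64_t pos, uint64_t len)
{
    struct range_split s = { 0, 0, 0 };
    uint64_t end = pos + len;
    /* First page boundary at or after pos. */
    uint64_t first_aligned =
        (pos + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE * MODEL_PAGE_SIZE;

    if (first_aligned >= end) {
        /* The range ends before reaching a page boundary: all partial. */
        s.head_bytes = len;
        return s;
    }

    s.head_bytes = first_aligned - pos;
    s.full_pages = (end - first_aligned) / MODEL_PAGE_SIZE;
    s.tail_bytes = (end - first_aligned) % MODEL_PAGE_SIZE;
    return s;
}
```

An 8192-byte range starting at offset 0 is two whole pages, but the same
length starting at offset 100 contains only one whole page; the other 4096
bytes sit in a partial head and tail that would have to be copied.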

		Linus

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 14:39           ` Linus Torvalds
@ 2010-05-19 14:56             ` Steven Rostedt
  -1 siblings, 0 replies; 71+ messages in thread
From: Steven Rostedt @ 2010-05-19 14:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Mathieu Desnoyers, Peter Zijlstra,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe

On Wed, 2010-05-19 at 07:39 -0700, Linus Torvalds wrote:

> The real limitation is likely always going to be the fact that it has to 
> be page-aligned and a full page. For a lot of splice inputs, that simply 
> won't be the case, and you'll end up copying for alignment reasons anyway.

That's understandable. For the use cases of splice I use, I work to make
it page aligned and full pages. Anyone else using splice for
optimizations should do the same. It only makes sense.

The end of buffer may not be a full page, but then it's the end anyway,
and I'm not as interested in the speed.
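
A user-space sketch of that approach, draining a pipe into a file one page
at a time. The helper name splice_pipe_to_file() is hypothetical; the sketch
assumes the destination was opened without O_APPEND (splice() rejects such
targets with EINVAL) and that the file offset already sits where the data
should go. SPLICE_F_MOVE is only a hint, so the kernel may still copy.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move up to `total` bytes from the read end of a pipe into `file_fd`,
 * asking for at most one page per splice() call.
 * Returns the number of bytes moved, or -1 on error.
 */
static ssize_t splice_pipe_to_file(int pipe_rd, int file_fd, size_t total)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t done = 0;

    if (page <= 0)
        return -1;

    while (done < total) {
        size_t want = total - done;
        ssize_t n;

        if (want > (size_t)page)
            want = (size_t)page;

        /* NULL offsets: use the pipe as-is and the file's own offset. */
        n = splice(pipe_rd, NULL, file_fd, NULL, want, SPLICE_F_MOVE);
        if (n < 0)
            return -1;
        if (n == 0)
            break;          /* pipe is empty and the write end is closed */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Only the final call can be shorter than a page when the producer fills the
pipe in whole pages, which matches the "end of buffer" case described above.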

-- Steve



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 14:56             ` Steven Rostedt
@ 2010-05-19 14:59               ` Linus Torvalds
  -1 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2010-05-19 14:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Nick Piggin, Mathieu Desnoyers, Peter Zijlstra,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe



On Wed, 19 May 2010, Steven Rostedt wrote:

> On Wed, 2010-05-19 at 07:39 -0700, Linus Torvalds wrote:
> 
> > The real limitation is likely always going to be the fact that it has to 
> > be page-aligned and a full page. For a lot of splice inputs, that simply 
> > won't be the case, and you'll end up copying for alignment reasons anyway.
> 
> That's understandable. For the use cases of splice I use, I work to make
> it page aligned and full pages. Anyone else using splice for
> optimizations should do the same. It only makes sense.
> 
> The end of buffer may not be a full page, but then it's the end anyway,
> and I'm not as interested in the speed.

Btw, since you apparently have a real case - is the "splice to file" 
always just an append? IOW, if I'm not right in assuming that the only 
sane thing people would reasonably care about is "append to a file", then 
holler now.

		Linus

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 14:59               ` Linus Torvalds
@ 2010-05-19 15:12                 ` Steven Rostedt
  -1 siblings, 0 replies; 71+ messages in thread
From: Steven Rostedt @ 2010-05-19 15:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Mathieu Desnoyers, Peter Zijlstra,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe

On Wed, 2010-05-19 at 07:59 -0700, Linus Torvalds wrote:
> 

> Btw, since you apparently have a real case - is the "splice to file" 
> always just an append? IOW, if I'm not right in assuming that the only 
> sane thing people would reasonably care about is "append to a file", then 
> holler now.

My use case is just to move the data from the ring buffer into a file
(or network) as fast as possible. It creates a new file and all
additions are "append to a file".

I believe Mathieu does the same.

With me, you are correct.

-- Steve



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 14:39           ` Linus Torvalds
@ 2010-05-19 15:17             ` Nick Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nick Piggin @ 2010-05-19 15:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Mathieu Desnoyers, Peter Zijlstra,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe

On Wed, May 19, 2010 at 07:39:11AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 19 May 2010, Nick Piggin wrote:
> > 
> > We can possibly do an attempt to invalidate existing pagecache and
> > then try to install the new page.
> 
> Yes, but that's going to be rather hairier. We need to make sure that the 
> filesystem doesn't have some kind of dirty pointers to the old page etc. 
> Although I guess that should always show up in the page counters, so I 
> guess we can always handle the case of page_count() being 1 (only page 
> cache) and the page being unlocked.

Well I mean a full invalidate -- invalidate_mapping_pages -- so there is
literally no pagecache there at all.

Then we just need to ensure that the filesystem doesn't do anything
funny with the page in write_begin (I don't know, such as zero out holes
or something strange). I don't think any do except maybe for something
obscure like jffs2, but obviously it needs to be looked at.

Error handling may need to be looked at too, but shouldn't be much of
an issue, I'd think.
 
Even so, it's all going to add branches and complexity to an important
fast path, so we'd want to see numbers.


> So I'd much rather just handle the "append to the end".
> 
> The real limitation is likely always going to be the fact that it has to 
> be page-aligned and a full page. For a lot of splice inputs, that simply 
> won't be the case, and you'll end up copying for alignment reasons anyway.

That's true.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 14:39           ` Linus Torvalds
@ 2010-05-19 15:28             ` Miklos Szeredi
  -1 siblings, 0 replies; 71+ messages in thread
From: Miklos Szeredi @ 2010-05-19 15:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: npiggin, rostedt, mathieu.desnoyers, peterz, fweisbec, tardyp,
	mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe

On Wed, 19 May 2010, Linus Torvalds wrote:
> The real limitation is likely always going to be the fact that it has to 
> be page-aligned and a full page. For a lot of splice inputs, that simply 
> won't be the case, and you'll end up copying for alignment reasons anyway.

Another limitation I found while splicing from one file to another is
that stealing from the source file's page cache does not always
succeed.  This turned out to be because of a reference from the lru
cache for freshly read pages.  I'm not sure how this could be fixed.
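
The effect can be illustrated with a toy reference-count model. This is not
the kernel's steal path: struct model_page, MODEL_STEAL_REFS and
model_try_steal() are made up for the sketch, which assumes a steal only
succeeds when the page is unlocked and the only references are the page
cache's and the stealer's own.

```c
#include <stdbool.h>

/* Toy stand-in for a page; the fields are illustrative only. */
struct model_page {
    int  refcount;
    bool locked;
};

/* Assumed reference count at steal time: page cache + the stealer. */
#define MODEL_STEAL_REFS 2

/*
 * Try to take the page out of the (modelled) page cache.
 * Any extra reference, such as one held by an LRU batch for a freshly
 * read page, makes the attempt fail and forces a copy instead.
 */
static bool model_try_steal(struct model_page *pg)
{
    if (pg->locked || pg->refcount != MODEL_STEAL_REFS)
        return false;

    pg->refcount = 1;   /* page cache reference dropped, stealer's kept */
    return true;
}
```

A page with exactly the two expected references is stolen; the same page
with one additional reference is left untouched.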

Miklos

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:17             ` Nick Piggin
@ 2010-05-19 15:30               ` Linus Torvalds
  -1 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2010-05-19 15:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Steven Rostedt, Mathieu Desnoyers, Peter Zijlstra,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe



On Thu, 20 May 2010, Nick Piggin wrote:
> 
> Well I mean a full invalidate -- invalidate_mapping_pages -- so there is
> literally no pagecache there at all.

Umm. That won't work. Think mapped pages. You can't handle them 
atomically, so somebody will page-fault them in.

So you'd have to have an "invalidate_and_replace()" to do it atomically 
while holding the mapping spinlock or something. 

And WHAT IS THE POINT? That will be about a million times slower than 
just doing the effing copy in the first place!

Memory copies are _not_ slow. Not compared to taking locks and doing TLB 
invalidates.

		Linus

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
@ 2010-05-19 15:30               ` Linus Torvalds
  0 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2010-05-19 15:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Steven Rostedt, Mathieu Desnoyers, Peter Zijlstra,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe



On Thu, 20 May 2010, Nick Piggin wrote:
> 
> Well I mean a full invalidate -- invalidate_mapping_pages -- so there is
> literally no pagecache there at all.

Umm. That won't work. Think mapped pages. You can't handle them 
atomically, so somebody will page-fault them in.

So you'd have to have a "invalidate_and_replace()" to do it atomically 
while holding the mapping spinlock or something. 

And WHAT IS THE POINT? That will be about a million times slower than 
just doing the effing copy in the first place!

Memory copies are _not_ slow. Not compared to taking locks and doing TLB 
invalidates.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:28             ` Miklos Szeredi
@ 2010-05-19 15:32               ` Linus Torvalds
  -1 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2010-05-19 15:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: npiggin, rostedt, mathieu.desnoyers, peterz, fweisbec, tardyp,
	mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe



On Wed, 19 May 2010, Miklos Szeredi wrote:
> 
> Another limitation I found while splicing from one file to another is
> that stealing from the source file's page cache does not always
> succeed.  This turned out to be because of a reference from the lru
> cache for freshly read pages.  I'm not sure how this could be fixed.

It should be fixed by saying "you can't always just move the page".

Copying is not evil. Complexity to avoid copies is evil.

		Linus

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 14:59               ` Linus Torvalds
@ 2010-05-19 15:33                 ` Miklos Szeredi
  -1 siblings, 0 replies; 71+ messages in thread
From: Miklos Szeredi @ 2010-05-19 15:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: rostedt, npiggin, mathieu.desnoyers, peterz, fweisbec, tardyp,
	mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe

On Wed, 19 May 2010, Linus Torvalds wrote:
> Btw, since you apparently have a real case - is the "splice to file" 
> always just an append? IOW, if I'm not right in assuming that the only 
> sane thing people would reasonable care about is "append to a file", then 
> holler now.

Virtual machines might reasonably need this for splicing to a disk
image.

Miklos

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:30               ` Linus Torvalds
@ 2010-05-19 15:44                 ` Nick Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nick Piggin @ 2010-05-19 15:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Mathieu Desnoyers, Peter Zijlstra,
	Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
	Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
	linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
	Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
	Jens Axboe

On Wed, May 19, 2010 at 08:30:10AM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 20 May 2010, Nick Piggin wrote:
> > 
> > Well I mean a full invalidate -- invalidate_mapping_pages -- so there is
> > literally no pagecache there at all.
> 
> Umm. That won't work. Think mapped pages. You can't handle them 
> atomically, so somebody will page-fault them in.
> 
> So you'd have to have an "invalidate_and_replace()" to do it atomically
> while holding the mapping spinlock or something. 
> 
> And WHAT IS THE POINT? That will be about a million times slower than 
> just doing the effing copy in the first place!
> 
> Memory copies are _not_ slow. Not compared to taking locks and doing TLB 
> invalidates.

No I never thought it would be a good idea to try to avoid all races
or anything. Obviously some cases *cannot* be easily invalidated, if
there is a ref on the page or whatever, so the fallback code has to
be there anyway.

So you would just invalidate and try to insert your page. 99.something%
of the time it will work fine. If the insert fails, fall back to
copying.

And hey you *may* even want a heuristic that avoids trying to invalidate
if the page is mapped, due to cost of TLB flushing and faulting etc.
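
The policy described above (try the move, fall back to a copy when the
page is busy, and skip the attempt for mapped pages) can be illustrated
with a toy user-space model. Everything in it is hypothetical: struct
toy_page, struct toy_slot and toy_splice_page() are invented for this
sketch and are not kernel interfaces, and the conditions are checked up
front rather than discovered through a failed insert.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy stand-in for a page; NOT the kernel's struct page. */
struct toy_page {
	int refcount;	/* references held on this page */
	bool mapped;	/* true if some process has it mmap()ed */
	unsigned char data[TOY_PAGE_SIZE];
};

/* One slot of a toy page cache, holding one reference on its page. */
struct toy_slot {
	struct toy_page *page;
};

enum toy_result { TOY_MOVED, TOY_COPIED };

/*
 * Put the contents of src at this cache slot. A move is only possible
 * when the caller (think: the pipe buffer) holds the sole reference on src.
 */
enum toy_result toy_splice_page(struct toy_slot *slot, struct toy_page *src)
{
	struct toy_page *old = slot->page;

	/*
	 * Fallback: the cached page is mapped (replacing it would cost TLB
	 * flushes and faults), or someone else holds a reference on either
	 * page. Copy the data into the page that is already cached.
	 */
	if (old->mapped || old->refcount > 1 || src->refcount > 1) {
		memcpy(old->data, src->data, TOY_PAGE_SIZE);
		return TOY_COPIED;
	}

	/* Cheap case: drop the slot's reference on the old page ... */
	old->refcount--;
	/* ... and hand the caller's reference on src over to the slot. */
	slot->page = src;
	return TOY_MOVED;
}
```

In this model the copy path is always available, so the move is purely
an optimisation that may or may not happen for a given page.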

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:33                 ` Miklos Szeredi
@ 2010-05-19 15:45                   ` Steven Rostedt
  -1 siblings, 0 replies; 71+ messages in thread
From: Steven Rostedt @ 2010-05-19 15:45 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, npiggin, mathieu.desnoyers, peterz, fweisbec,
	tardyp, mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe

On Wed, 2010-05-19 at 17:33 +0200, Miklos Szeredi wrote:
> On Wed, 19 May 2010, Linus Torvalds wrote:
> > Btw, since you apparently have a real case - is the "splice to file" 
> > always just an append? IOW, if I'm not right in assuming that the only 
> > sane thing people would reasonable care about is "append to a file", then 
> > holler now.
> 
> Virtual machines might reasonably need this for splicing to a disk
> image.

This comes down to balancing speed and complexity. Perhaps a copy is
fine in this case.

I'm concerned about high speed tracing, where we are always just taking
pages from the trace ring buffer and appending them to a file or sending
them off to the network. The slower this is, the more likely you will
lose events.

If the "move only on append to file" is easy to implement, I would
really like to see that happen. The speed of splicing a disk image for a
virtual machine only impacts the patience of the user. The speed of
splicing tracing output, impacts how much you can trace without losing
events.
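
To make the link between drain speed and lost events concrete, here is
a toy model of a bounded buffer that discards new events when full. It
is not taken from either tracer; toy_lost_events() and its parameters
are hypothetical and only illustrate the arithmetic.

```c
#include <stddef.h>

/*
 * Count events dropped over a number of ticks when a producer adds
 * produced_per_tick events and a consumer (e.g. the splice path)
 * removes up to drained_per_tick events from a buffer of fixed capacity.
 */
size_t toy_lost_events(size_t capacity, size_t produced_per_tick,
		       size_t drained_per_tick, size_t ticks)
{
	size_t used = 0, lost = 0;

	for (size_t t = 0; t < ticks; t++) {
		size_t free_slots = capacity - used;

		/* Producer: anything that does not fit is lost. */
		if (produced_per_tick > free_slots) {
			lost += produced_per_tick - free_slots;
			used = capacity;
		} else {
			used += produced_per_tick;
		}

		/* Consumer: drain as much as it can this tick. */
		used -= (drained_per_tick < used) ? drained_per_tick : used;
	}
	return lost;
}
```

With a consumer that keeps up (drain rate equal to the production rate)
the model never loses anything; once the drain rate falls below the
production rate, the buffer fills and every further tick drops events.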

-- Steve



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:12                 ` Steven Rostedt
@ 2010-05-19 15:51                   ` Mathieu Desnoyers
  -1 siblings, 0 replies; 71+ messages in thread
From: Mathieu Desnoyers @ 2010-05-19 15:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Nick Piggin, Peter Zijlstra, Frederic Weisbecker,
	Pierre Tardy, Ingo Molnar, Arnaldo Carvalho de Melo, Tom Zanussi,
	Paul Mackerras, linux-kernel, arjan, ziga.mahkovec, davem,
	linux-mm, Andrew Morton, KOSAKI Motohiro, Christoph Lameter,
	Tejun Heo, Jens Axboe

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Wed, 2010-05-19 at 07:59 -0700, Linus Torvalds wrote:
> > 
> 
> > Btw, since you apparently have a real case - is the "splice to file" 
> > always just an append? IOW, if I'm not right in assuming that the only 
> > sane thing people would reasonable care about is "append to a file", then 
> > holler now.
> 
> My use case is just to move the data from the ring buffer into a file
> (or network) as fast as possible. It creates a new file and all
> additions are "append to a file".
> 
> I believe Mathieu does the same.
> 
> With me, you are correct.

Same here. My ring buffer only ever uses splice() to append at the end of a file
or to the network, and always outputs data in multiples of the page size.
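
From user space, the "append whole pages to a file" pattern looks
roughly like the sketch below. It is only an illustration of the
splice(2) call, not the LTTng or ftrace code: splice_append_pages() is
a hypothetical helper, and SPLICE_F_MOVE is merely a hint that the
kernel is free to ignore, which is the behaviour this thread is about.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for splice() and SPLICE_F_MOVE */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Append len bytes from the read end of a pipe to the end of file_fd.
 * len must be a multiple of the page size. Returns the number of bytes
 * spliced, or -1 with errno set.
 */
ssize_t splice_append_pages(int pipe_rd, int file_fd, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	off_t end;
	loff_t off;
	size_t done = 0;

	if (page <= 0 || len % (size_t)page != 0) {
		errno = EINVAL;
		return -1;
	}

	/* splice() rejects O_APPEND files, so pass the end offset explicitly. */
	end = lseek(file_fd, 0, SEEK_END);
	if (end < 0)
		return -1;
	off = end;

	while (done < len) {
		/* SPLICE_F_MOVE is only a hint; the kernel may still copy. */
		ssize_t n = splice(pipe_rd, NULL, file_fd, &off,
				   len - done, SPLICE_F_MOVE);
		if (n < 0)
			return -1;
		if (n == 0)	/* write end closed and pipe drained */
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

The helper blocks if the pipe holds less than len bytes while its write
end is still open, so the caller is expected to queue whole pages first.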

Thanks,

Mathieu

> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:45                   ` Steven Rostedt
@ 2010-05-19 15:55                     ` Nick Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nick Piggin @ 2010-05-19 15:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Miklos Szeredi, Linus Torvalds, mathieu.desnoyers, peterz,
	fweisbec, tardyp, mingo, acme, tzanussi, paulus, linux-kernel,
	arjan, ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl,
	tj, jens.axboe

On Wed, May 19, 2010 at 11:45:42AM -0400, Steven Rostedt wrote:
> On Wed, 2010-05-19 at 17:33 +0200, Miklos Szeredi wrote:
> > On Wed, 19 May 2010, Linus Torvalds wrote:
> > > Btw, since you apparently have a real case - is the "splice to file" 
> > > always just an append? IOW, if I'm not right in assuming that the only 
> > > sane thing people would reasonable care about is "append to a file", then 
> > > holler now.
> > 
> > Virtual machines might reasonably need this for splicing to a disk
> > image.
> 
> This comes down to balancing speed and complexity. Perhaps a copy is
> fine in this case.
> 
> I'm concerned about high speed tracing, where we are always just taking
> pages from the trace ring buffer and appending them to a file or sending
> them off to the network. The slower this is, the more likely you will
> lose events.
> 
> If the "move only on append to file" is easy to implement, I would
> really like to see that happen. The speed of splicing a disk image for a
> virtual machine only impacts the patience of the user. The speed of
> splicing tracing output, impacts how much you can trace without losing
> events.

It's not "easy" to implement :) What's your ring buffer look like?
Is it a normal user address which the kernel does copy_to_user()ish
things into? Or a mmapped special driver?

If the latter, it gets even harder again. But either way if the
source pages just have to be regenerated anyway (eg. via page fault
on next access), then it might not even be worthwhile to do the
splice move.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:32               ` Linus Torvalds
@ 2010-05-19 15:56                 ` Miklos Szeredi
  -1 siblings, 0 replies; 71+ messages in thread
From: Miklos Szeredi @ 2010-05-19 15:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: miklos, npiggin, rostedt, mathieu.desnoyers, peterz, fweisbec,
	tardyp, mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe

On Wed, 19 May 2010, Linus Torvalds wrote:
> On Wed, 19 May 2010, Miklos Szeredi wrote:
> > 
> > Another limitation I found while splicing from one file to another is
> > that stealing from the source file's page cache does not always
> > succeed.  This turned out to be because of a reference from the lru
> > cache for freshly read pages.  I'm not sure how this could be fixed.
> 
> It should be fixed by saying "you can't always just move the page".
> 
> Copying is not evil. Complexity to avoid copies is evil.

And predictability is good.  The thing I don't like about the above is
that it makes it totally unpredictable which pages will get moved, if
any.

Another related thing: if splicing from a file knowing that it will
need to be stolen, then it makes zero sense to first insert the pages
into the page cache then remove them shortly to be inserted into
another file's cache.  So we could have a flag saying "don't cache
newly read pages, just put them in the pipe buffer", which would solve
the above problem as well as speeding up the operation.

Miklos

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:45                   ` Steven Rostedt
@ 2010-05-19 15:57                     ` Mathieu Desnoyers
  -1 siblings, 0 replies; 71+ messages in thread
From: Mathieu Desnoyers @ 2010-05-19 15:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Miklos Szeredi, Linus Torvalds, npiggin, peterz, fweisbec,
	tardyp, mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Wed, 2010-05-19 at 17:33 +0200, Miklos Szeredi wrote:
> > On Wed, 19 May 2010, Linus Torvalds wrote:
> > > Btw, since you apparently have a real case - is the "splice to file" 
> > > always just an append? IOW, if I'm not right in assuming that the only 
> > > sane thing people would reasonable care about is "append to a file", then 
> > > holler now.
> > 
> > Virtual machines might reasonably need this for splicing to a disk
> > image.
> 
> This comes down to balancing speed and complexity. Perhaps a copy is
> fine in this case.
> 
> I'm concerned about high speed tracing, where we are always just taking
> pages from the trace ring buffer and appending them to a file or sending
> them off to the network. The slower this is, the more likely you will
> lose events.
> 
> If the "move only on append to file" is easy to implement, I would
> really like to see that happen. The speed of splicing a disk image for a
> virtual machine only impacts the patience of the user. The speed of
> splicing tracing output, impacts how much you can trace without losing
> events.

I'm with Steven here. I only care about appending full pages at the end of a
file. If possible, I'd also like to steal back the pages after waiting for the
writeback I/O to complete so we can put them back in the ring buffer without
stressing the page cache and the page allocator needlessly.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:56                 ` Miklos Szeredi
@ 2010-05-19 16:01                   ` Linus Torvalds
  -1 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2010-05-19 16:01 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: npiggin, rostedt, mathieu.desnoyers, peterz, fweisbec, tardyp,
	mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe



On Wed, 19 May 2010, Miklos Szeredi wrote:
> 
> And predictability is good.  The thing I don't like about the above is
> that it makes it totally unpredictable which pages will get moved, if
> any.

Tough. 

Think of it this way: it is predictable. They get predictably moved when 
moving is cheap and easy. It's about _performance_. 

Do you know when TLB misses happen? They are unpredictable. Do you know 
when the OS sends IPI's around? Do you know when scheduling happens? 

No you don't. So stop whining.

		Linus

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:55                     ` Nick Piggin
@ 2010-05-19 16:01                       ` Mathieu Desnoyers
  -1 siblings, 0 replies; 71+ messages in thread
From: Mathieu Desnoyers @ 2010-05-19 16:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Steven Rostedt, Miklos Szeredi, Linus Torvalds, peterz, fweisbec,
	tardyp, mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe

* Nick Piggin (npiggin@suse.de) wrote:
> On Wed, May 19, 2010 at 11:45:42AM -0400, Steven Rostedt wrote:
> > On Wed, 2010-05-19 at 17:33 +0200, Miklos Szeredi wrote:
> > > On Wed, 19 May 2010, Linus Torvalds wrote:
> > > > Btw, since you apparently have a real case - is the "splice to file" 
> > > > always just an append? IOW, if I'm not right in assuming that the only 
> > > > sane thing people would reasonable care about is "append to a file", then 
> > > > holler now.
> > > 
> > > Virtual machines might reasonably need this for splicing to a disk
> > > image.
> > 
> > This comes down to balancing speed and complexity. Perhaps a copy is
> > fine in this case.
> > 
> > I'm concerned about high speed tracing, where we are always just taking
> > pages from the trace ring buffer and appending them to a file or sending
> > them off to the network. The slower this is, the more likely you will
> > lose events.
> > 
> > If the "move only on append to file" is easy to implement, I would
> > really like to see that happen. The speed of splicing a disk image for a
> > virtual machine only impacts the patience of the user. The speed of
> > splicing tracing output, impacts how much you can trace without losing
> > events.
> 
> It's not "easy" to implement :) What's your ring buffer look like?
> Is it a normal user address which the kernel does copy_to_user()ish
> things into? Or a mmapped special driver?
> 
> If the latter, it gets even harder again. But either way if the
> source pages just have to be regenerated anyway (eg. via page fault
> on next access), then it might not even be worthwhile to do the
> splice move.

Steven and I use pages to which we write directly by using the page address from
the linear memory mapping returned by page_address(). These pages have no other
mapping. They are moved to the pipe, and then from the pipe to a file (or to the
network). It's possibly the simplest scenario you could think of for splice().

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:57                     ` Mathieu Desnoyers
@ 2010-05-19 16:27                       ` Nick Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nick Piggin @ 2010-05-19 16:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Miklos Szeredi, Linus Torvalds, peterz, fweisbec,
	tardyp, mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe

On Wed, May 19, 2010 at 11:57:32AM -0400, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > On Wed, 2010-05-19 at 17:33 +0200, Miklos Szeredi wrote:
> > > On Wed, 19 May 2010, Linus Torvalds wrote:
> > > > Btw, since you apparently have a real case - is the "splice to file" 
> > > > always just an append? IOW, if I'm not right in assuming that the only 
> > > > sane thing people would reasonably care about is "append to a file", then
> > > > holler now.
> > > 
> > > Virtual machines might reasonably need this for splicing to a disk
> > > image.
> > 
> > This comes down to balancing speed and complexity. Perhaps a copy is
> > fine in this case.
> > 
> > I'm concerned about high speed tracing, where we are always just taking
> > pages from the trace ring buffer and appending them to a file or sending
> > them off to the network. The slower this is, the more likely you will
> > lose events.
> > 
> > If the "move only on append to file" is easy to implement, I would
> > really like to see that happen. The speed of splicing a disk image for a
> > virtual machine only impacts the patience of the user. The speed of
> > splicing tracing output, impacts how much you can trace without losing
> > events.
> 
> I'm with Steven here. I only care about appending full pages at the end of a
> file. If possible, I'd also like to steal back the pages after waiting for the
> writeback I/O to complete so we can put them back in the ring buffer without
> stressing the page cache and the page allocator needlessly.

Got to think about complexity and how much it is really worth trying to
speed up strange cases. The page allocator is the generic "pipe" in
the kernel to move pages between subsystems when they become unused :)

The page cache can be directed to be written out and discarded with
fadvise and such.

You might also consider using direct IO.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 15:55                     ` Nick Piggin
@ 2010-05-19 16:36                       ` Steven Rostedt
  -1 siblings, 0 replies; 71+ messages in thread
From: Steven Rostedt @ 2010-05-19 16:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miklos Szeredi, Linus Torvalds, mathieu.desnoyers, peterz,
	fweisbec, tardyp, mingo, acme, tzanussi, paulus, linux-kernel,
	arjan, ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl,
	tj, jens.axboe

On Thu, 2010-05-20 at 01:55 +1000, Nick Piggin wrote:
> On Wed, May 19, 2010 at 11:45:42AM -0400, Steven Rostedt wrote:

> > If the "move only on append to file" is easy to implement, I would
> > really like to see that happen. The speed of splicing a disk image for a
> > virtual machine only impacts the patience of the user. The speed of
> > splicing tracing output, impacts how much you can trace without losing
> > events.
> 
> It's not "easy" to implement :) What does your ring buffer look like?
> Is it a normal user address which the kernel does copy_to_user()ish
> things into? Or a mmapped special driver?

Neither ;-)

> 
> If the latter, it gets even harder again. But either way if the
> source pages just have to be regenerated anyway (eg. via page fault
> on next access), then it might not even be worthwhile to do the
> splice move.

The ring buffer is written to by kernel events. To read it, the user can
either do a sys_read() and that is copied, or use splice. I do not
support mmap(), and if we were to do that, it would then not support
splice(). We have been talking about implementing both but with flags on
allocation of the ring buffer. You can either support mmap() or splice()
but not both with one instance of the ring buffer.
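
To make the two-stage path discussed in this thread concrete (source to pipe,
then pipe to file), here is a minimal userspace sketch. It is not the ftrace or
LTTng reader: splice_through_pipe() and its parameters are hypothetical names,
in_fd stands for any descriptor that supports splice reads, and out_fd is
assumed to be a regular file that was not opened with O_APPEND (splice()
rejects such targets with EINVAL). SPLICE_F_MOVE is only a hint, so the kernel
may still copy the data.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for splice() */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move up to `total` bytes from in_fd to out_fd through a pipe, at most
 * `chunk` bytes per round. Both descriptors are used at their current
 * file position. Returns the number of bytes delivered to out_fd, or -1
 * on error.
 */
static ssize_t splice_through_pipe(int in_fd, int out_fd, size_t total, size_t chunk)
{
	int pfd[2];
	ssize_t done = 0;

	if (pipe(pfd) < 0)
		return -1;

	while ((size_t)done < total) {
		size_t want = total - (size_t)done;
		ssize_t queued;

		if (want > chunk)
			want = chunk;

		/* Stage 1: source -> pipe. */
		queued = splice(in_fd, NULL, pfd[1], NULL, want, SPLICE_F_MOVE);
		if (queued < 0) {
			done = -1;
			break;
		}
		if (queued == 0)
			break;		/* end of input */

		/* Stage 2: pipe -> destination, until the pipe is drained. */
		while (queued > 0) {
			ssize_t moved = splice(pfd[0], NULL, out_fd, NULL,
					       (size_t)queued, SPLICE_F_MOVE);

			if (moved <= 0) {
				done = -1;
				goto out;
			}
			queued -= moved;
			done += moved;
		}
	}
out:
	close(pfd[0]);
	close(pfd[1]);
	return done;
}
```

Keeping `chunk` at or below the pipe capacity (64 KiB by default on Linux)
means each round queues whole pages and drains them before the next round.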

-- Steve






^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 16:27                       ` Nick Piggin
@ 2010-05-19 19:14                         ` Mathieu Desnoyers
  -1 siblings, 0 replies; 71+ messages in thread
From: Mathieu Desnoyers @ 2010-05-19 19:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Steven Rostedt, Miklos Szeredi, Linus Torvalds, peterz, fweisbec,
	tardyp, mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe

* Nick Piggin (npiggin@suse.de) wrote:
> On Wed, May 19, 2010 at 11:57:32AM -0400, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > On Wed, 2010-05-19 at 17:33 +0200, Miklos Szeredi wrote:
> > > > On Wed, 19 May 2010, Linus Torvalds wrote:
> > > > > Btw, since you apparently have a real case - is the "splice to file" 
> > > > > always just an append? IOW, if I'm not right in assuming that the only 
> > > > > sane thing people would reasonably care about is "append to a file", then
> > > > > holler now.
> > > > 
> > > > Virtual machines might reasonably need this for splicing to a disk
> > > > image.
> > > 
> > > This comes down to balancing speed and complexity. Perhaps a copy is
> > > fine in this case.
> > > 
> > > I'm concerned about high speed tracing, where we are always just taking
> > > pages from the trace ring buffer and appending them to a file or sending
> > > them off to the network. The slower this is, the more likely you will
> > > lose events.
> > > 
> > > If the "move only on append to file" is easy to implement, I would
> > > really like to see that happen. The speed of splicing a disk image for a
> > > virtual machine only impacts the patience of the user. The speed of
> > > splicing tracing output, impacts how much you can trace without losing
> > > events.
> > 
> > I'm with Steven here. I only care about appending full pages at the end of a
> > file. If possible, I'd also like to steal back the pages after waiting for the
> > writeback I/O to complete so we can put them back in the ring buffer without
> > stressing the page cache and the page allocator needlessly.
> 
> Got to think about complexity and how much it is really worth trying to
> speed up strange cases. The page allocator is the generic "pipe" in
> the kernel to move pages between subsystems when they become unused :)
> 
> The page cache can be directed to be written out and discarded with
> fadvise and such.

Good point. This discard flag might do the trick and let us keep things simple.
The major concern here is to keep the page cache disturbance relatively low.
Which of new page allocation or stealing back the page has the lowest overhead
would have to be determined with benchmarks.

So I would tend to simply use this discard fadvise with new page allocation for
now.

> 
> You might also consider using direct IO.

Maybe. I'm unsure about what it implies in the splice() context though.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 19:14                         ` Mathieu Desnoyers
@ 2010-05-19 19:31                           ` Linus Torvalds
  -1 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2010-05-19 19:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Nick Piggin, Steven Rostedt, Miklos Szeredi, peterz, fweisbec,
	tardyp, mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe



On Wed, 19 May 2010, Mathieu Desnoyers wrote:
> 
> Good point. This discard flag might do the trick and let us keep things simple.
> The major concern here is to keep the page cache disturbance relatively low.
> Which of new page allocation or stealing back the page has the lowest overhead
> would have to be determined with benchmarks.

We could probably make it easier somehow to do the writeback and discard 
thing, but I have had _very_ good experiences with even a rather trivial 
file writer that basically used (iirc) 8MB windows, and the logic was very 
trivial:

 - before writing a new 8M window, do "start writeback" 
   (SYNC_FILE_RANGE_WRITE) on the previous window, and do 
   a wait (SYNC_FILE_RANGE_WAIT_AFTER) on the window before that.

in fact, in its simplest form, you can do it like this (this is from my 
"overwrite disk images" program that I use on old disks):

	for (index = 0; index < max_index ;index++) {
		if (write(fd, buffer, BUFSIZE) != BUFSIZE)
			break;
		/* This won't block, but will start writeout asynchronously */
		sync_file_range(fd, index*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WRITE);
		/* This does a blocking write-and-wait on any old ranges */
		if (index)
			sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
	}

and even if you don't actually do a discard (maybe we should add a 
SYNC_FILE_RANGE_DISCARD bit, right now you'd need to do a separate 
fadvise(FADV_DONTNEED) to throw it out) the system behavior is pretty 
nice, because the heavy writer gets good IO performance _and_ leaves only 
easy-to-free pages around after itself.
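
The loop above can be wrapped into a small self-contained function. This is a
sketch under stated assumptions, not the actual disk-overwrite program:
write_windows() and its parameters are hypothetical names, fd is assumed to be
a regular file positioned at offset 0, and the separate
posix_fadvise(POSIX_FADV_DONTNEED) call mentioned above is added for each
window once it has been waited on. Offsets are computed in off_t so that the
window index times the window size cannot overflow an int. The writeback and
fadvise calls are treated as hints here and their return values are ignored;
real code may want to check them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for sync_file_range() */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `count` windows of `win` bytes each from `buf` to `fd`, which is
 * assumed to be positioned at offset 0. Returns the number of windows
 * that were written completely.
 */
static size_t write_windows(int fd, const char *buf, size_t win, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		off_t cur = (off_t)i * (off_t)win;

		if (write(fd, buf, win) != (ssize_t)win)
			break;

		/* Start asynchronous writeback of the window just written. */
		(void)sync_file_range(fd, cur, (off_t)win, SYNC_FILE_RANGE_WRITE);

		if (i > 0) {
			off_t prev = cur - (off_t)win;

			/* Blocking write-and-wait on the previous window. */
			(void)sync_file_range(fd, prev, (off_t)win,
					      SYNC_FILE_RANGE_WAIT_BEFORE |
					      SYNC_FILE_RANGE_WRITE |
					      SYNC_FILE_RANGE_WAIT_AFTER);
			/* Hint that the previous window's pages can be dropped. */
			(void)posix_fadvise(fd, prev, (off_t)win, POSIX_FADV_DONTNEED);
		}
	}
	return i;
}
```

A caller would open the target file, fill a buffer of the chosen window size
(8MB in the description above) and call write_windows(fd, buffer, BUFSIZE,
max_index). The last window is only started, not waited on, so a final
fsync() or fdatasync() is still needed if durability matters.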

		Linus

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 14:59               ` Linus Torvalds
@ 2010-05-19 20:59                 ` Rick Sherm
  -1 siblings, 0 replies; 71+ messages in thread
From: Rick Sherm @ 2010-05-19 20:59 UTC (permalink / raw)
  To: Steven Rostedt, Linus Torvalds; +Cc: linux-kernel, arjan, linux-mm

Sorry for deleting CC'd addresses. yahoo was whining...

--- On Wed, 5/19/10, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> From: Linus Torvalds <torvalds@linux-foundation.org>
> Subject: Re: Unexpected splice "always copy" behavior observed
> To: "Steven Rostedt" <rostedt@goodmis.org>
> Cc: "Nick Piggin" <npiggin@suse.de>, "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>, "Peter Zijlstra" <peterz@infradead.org>, "Frederic Weisbecker" <fweisbec@gmail.com>, "Pierre Tardy" <tardyp@gmail.com>, "Ingo Molnar" <mingo@elte.hu>, "Arnaldo Carvalho de Melo" <acme@redhat.com>, "Tom Zanussi" <tzanussi@gmail.com>, "Paul Mackerras" <paulus@samba.org>, linux-kernel@vger.kernel.org, arjan@infradead.org, ziga.mahkovec@gmail.com, "davem" <davem@davemloft.net>, linux-mm@kvack.org, "Andrew Morton" <akpm@linux-foundation.org>, "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>, "Christoph Lameter" <cl@linux-foundation.org>, "Tejun Heo" <tj@kernel.org>, "Jens Axboe" <jens.axboe@oracle.com>
> Date: Wednesday, May 19, 2010, 2:59 PM
> 
> 
> On Wed, 19 May 2010, Steven Rostedt wrote:
> 
> > On Wed, 2010-05-19 at 07:39 -0700, Linus Torvalds wrote:
> > 
> > > The real limitation is likely always going to be the fact that it has to
> > > be page-aligned and a full page. For a lot of splice inputs, that simply
> > > won't be the case, and you'll end up copying for alignment reasons anyway.
> > 
> > That's understandable. For the use cases of splice I use, I work to make
> > it page aligned and full pages. Anyone else using splice for
> > optimizations, should do the same. It only makes sense.
> > 
> > The end of buffer may not be a full page, but then it's the end anyway,
> > and I'm not as interested in the speed.
> 
> Btw, since you apparently have a real case - is the "splice to file"
> always just an append? IOW, if I'm not right in assuming that the only
> sane thing people would reasonably care about is "append to a file", then
> holler now.
> 

I've a similar 'append' use case:
http://marc.info/?l=linux-kernel&m=127143736527459&w=4

My mmapped buffers are pinned down.


      


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 19:31                           ` Linus Torvalds
@ 2010-05-19 21:49                             ` Mathieu Desnoyers
  -1 siblings, 0 replies; 71+ messages in thread
From: Mathieu Desnoyers @ 2010-05-19 21:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Steven Rostedt, Miklos Szeredi, peterz, fweisbec,
	tardyp, mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Wed, 19 May 2010, Mathieu Desnoyers wrote:
> > 
> > Good point. This discard flag might do the trick and let us keep things simple.
> > The major concern here is to keep the page cache disturbance relatively low.
> > Which of new page allocation or stealing back the page has the lowest overhead
> > would have to be determined with benchmarks.
> 
> We could probably make it easier somehow to do the writeback and discard 
> thing, but I have had _very_ good experiences with even a rather trivial 
> file writer that basically used (iirc) 8MB windows, and the logic was very 
> trivial:
> 
>  - before writing a new 8M window, do "start writeback" 
>    (SYNC_FILE_RANGE_WRITE) on the previous window, and do 
>    a wait (SYNC_FILE_RANGE_WAIT_AFTER) on the window before that.
> 
> in fact, in its simplest form, you can do it like this (this is from my 
> "overwrite disk images" program that I use on old disks):
> 
> 	for (index = 0; index < max_index ;index++) {
> 		if (write(fd, buffer, BUFSIZE) != BUFSIZE)
> 			break;
> 		/* This won't block, but will start writeout asynchronously */
> 		sync_file_range(fd, index*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WRITE);
> 		/* This does a blocking write-and-wait on any old ranges */
> 		if (index)
> 			sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
> 	}
> 
> and even if you don't actually do a discard (maybe we should add a 
> SYNC_FILE_RANGE_DISCARD bit, right now you'd need to do a separate 
> fadvise(FADV_DONTNEED) to throw it out) the system behavior is pretty 
> nice, because the heavy writer gets good IO performance _and_ leaves only 
> easy-to-free pages around after itself.

Great! I just implemented it in LTTng and it works very well!

I ran into one small counter-intuitive fadvise behavior, though.

  posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

only seems to affect the parts of a file that already exist. So after each
splice() that appends to the file, I have to call fadvise again. I would have
expected the "0" len parameter to tell the kernel to apply the hint to the whole
file, even parts that will be added in the future. I expect we have this
behavior because fadvise() was initially made with read behavior in mind rather
than write.

For the record, I do an fadvise + async range write after each splice(). Also,
after each subbuffer write, I do a blocking write-and-wait on all pages that are
in the subbuffer prior to the one that has just been written, instead of using
the fixed 8MB window.
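
The per-subbuffer sequence described above can be sketched roughly as follows.
This is not the LTTng code: struct trace_out, append_subbuffer() and their
fields are hypothetical names, the sub-buffer is assumed to be fully queued in
the pipe already, and the output file is assumed to be positioned at its end
and not opened with O_APPEND (splice() rejects append-mode targets). The
whole-file posix_fadvise() call is repeated on every append because, as noted
above, it only affects pages that exist at the time of the call. The hint
calls' return values are ignored in this sketch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for splice() and sync_file_range() */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

struct trace_out {
	int fd;			/* output file, not opened with O_APPEND */
	off_t pos;		/* bytes appended so far (== file position) */
	off_t prev_len;		/* size of the previously appended sub-buffer */
};

/*
 * Append one sub-buffer of `len` bytes from the pipe to the output file,
 * then apply the writeback and page cache hints. Returns 0 on success,
 * -1 if splice() fails or the pipe runs dry.
 */
static int append_subbuffer(struct trace_out *out, int pipe_rd, size_t len)
{
	off_t start = out->pos;
	size_t left = len;

	while (left > 0) {
		ssize_t n = splice(pipe_rd, NULL, out->fd, NULL, left, SPLICE_F_MOVE);

		if (n <= 0)
			return -1;
		left -= (size_t)n;
		out->pos += n;
	}

	/* Start asynchronous writeback of the sub-buffer just appended. */
	(void)sync_file_range(out->fd, start, (off_t)len, SYNC_FILE_RANGE_WRITE);

	/* Blocking write-and-wait on the sub-buffer before this one. */
	if (out->prev_len > 0)
		(void)sync_file_range(out->fd, start - out->prev_len, out->prev_len,
				      SYNC_FILE_RANGE_WAIT_BEFORE |
				      SYNC_FILE_RANGE_WRITE |
				      SYNC_FILE_RANGE_WAIT_AFTER);

	/* len == 0 means "to end of file", but only covers pages present now. */
	(void)posix_fadvise(out->fd, 0, 0, POSIX_FADV_DONTNEED);

	out->prev_len = (off_t)len;
	return 0;
}
```

Issuing the fadvise after the wait is a choice made in this sketch: pages of
the previous sub-buffer are clean by then, so the hint has something it can
actually drop.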

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-19 21:49                             ` Mathieu Desnoyers
@ 2010-05-20  0:04                               ` Linus Torvalds
  -1 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2010-05-20  0:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Nick Piggin, Steven Rostedt, Miklos Szeredi, peterz, fweisbec,
	tardyp, mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe



On Wed, 19 May 2010, Mathieu Desnoyers wrote:
> 
> I faced a small counter-intuitive fadvise behavior though.
> 
>   posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
> 
> only seems to affect the parts of a file that already exist.

POSIX_FADV_DONTNEED does not have _any_ long-term behavior. So when you do 
a 

	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

it only affects the pages that are there right now, it has no effect on 
any future actions.

> So after each splice() that appends to the file, I have to call fadvise 
> again. I would have expected the "0" len parameter to tell the kernel to 
> apply the hint to the whole file, even parts that will be added in the 
> future.

It's not a hint about future at all. It's a "throw current pages away".

I would also suggest against doing that kind of thing in a streaming write 
situation. The behavior for dirty page writeback is _not_ well-defined, and
if you do POSIX_FADV_DONTNEED, I would suggest you do it as part of that
writeback logic, i.e. you do it only on ranges that you have just waited on.

IOW, in my example, you'd couple the

	sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);

with a

	posix_fadvise(fd, (index-1)*BUFSIZE, BUFSIZE, POSIX_FADV_DONTNEED);

afterwards to throw out the pages that you just waited for.
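
Put together, the whole loop might look like the sketch below. It is only an
illustration under stated assumptions, not the actual program: the helper name
stream_write_windows(), its signature and its error handling are hypothetical,
and it assumes fd is a regular file whose offset is 0 when the helper is called.

```c
#define _GNU_SOURCE		/* for sync_file_range() */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `count` windows of `bufsize` bytes from
 * `buffer` to `fd`, which is assumed to be positioned at offset 0.
 * Returns the number of windows fully written (a short write stops the
 * loop early), or -1 if sync_file_range() or posix_fadvise() fails.
 */
static long stream_write_windows(int fd, const void *buffer,
				 size_t bufsize, long count)
{
	long index;

	assert(bufsize > 0);

	for (index = 0; index < count; index++) {
		off_t cur = (off_t)index * (off_t)bufsize;

		if (write(fd, buffer, bufsize) != (ssize_t)bufsize)
			break;

		/* This won't block, but starts writeout asynchronously. */
		if (sync_file_range(fd, cur, bufsize,
				    SYNC_FILE_RANGE_WRITE) != 0)
			return -1;

		if (index > 0) {
			off_t prev = cur - (off_t)bufsize;

			/* Blocking write-and-wait on the previous window. */
			if (sync_file_range(fd, prev, bufsize,
					    SYNC_FILE_RANGE_WAIT_BEFORE |
					    SYNC_FILE_RANGE_WRITE |
					    SYNC_FILE_RANGE_WAIT_AFTER) != 0)
				return -1;

			/*
			 * Only the range that was just waited on is dropped.
			 * posix_fadvise() returns an error number, not -1.
			 */
			if (posix_fadvise(fd, prev, bufsize,
					  POSIX_FADV_DONTNEED) != 0)
				return -1;
		}
	}

	return index;
}
```

Note that the last window is only started, never waited on or dropped, so a
caller that cares would still need a final sync (and fadvise) after the loop.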

		Linus

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-20  0:04                               ` Linus Torvalds
@ 2010-05-20  1:56                                 ` Mathieu Desnoyers
  -1 siblings, 0 replies; 71+ messages in thread
From: Mathieu Desnoyers @ 2010-05-20  1:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Steven Rostedt, Miklos Szeredi, peterz, fweisbec,
	tardyp, mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe, Michael Kerrisk, linux-man

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Wed, 19 May 2010, Mathieu Desnoyers wrote:
> > 
> > I faced a small counter-intuitive fadvise behavior though.
> > 
> >   posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
> > 
> > only seems to affect the parts of a file that already exist.
> 
> POSIX_FADV_DONTNEED does not have _any_ long-term behavior. So when you do 
> a 
> 
> 	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
> 
> it only affects the pages that are there right now, it has no effect on 
> any future actions.

Hrm, someone should tell the author of posix_fadvise(2) about the benefit of
some clarifications (I'm CCing the man page maintainer).

Quoting man posix_fadvise, annotated:


       Programs  can  use  posix_fadvise()  to announce an intention to access
       file data in a specific pattern in the future, thus allowing the kernel
       to perform appropriate optimizations.

This only talks about future accesses, not past. From what I understand, you are
saying that in the writeback case it's better to think of posix_fadvise() as
applying to pages that have been written in the past too.


       The  advice  applies to a (not necessarily existent) region starting at
       offset and extending for len bytes (or until the end of the file if len
       is 0) within the file referred to by fd.  The advice is not binding; it
       merely constitutes an expectation on behalf of the application.
 
This could be enhanced by saying that it applies up to the current file size if
0 is specified, and does not extend as the file grows. The formulation as it is
currently stated is a bit misleading.
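
One way to observe this from user space is to count resident page cache pages
with mincore(2) before and after the fadvise call. The sketch below is only an
illustration: the helper names resident_pages() and drop_whole_file() are
hypothetical, the fsync() before the fadvise is an assumption made so that the
pages are clean when they are dropped, and len is assumed not to exceed the
current file size.

```c
#define _DEFAULT_SOURCE		/* for mincore() */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages backing the first `len`
 * bytes of `fd` are currently resident in memory. Returns -1 on error.
 * Mapping the file does not fault any page in, so the count reflects
 * what was cached before the call.
 */
static long resident_pages(int fd, size_t len)
{
	size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages, i;
	unsigned char *vec;
	void *map;
	long resident = 0;

	assert(len > 0);

	npages = (len + pagesz - 1) / pagesz;
	vec = malloc(npages);
	if (!vec)
		return -1;

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		free(vec);
		return -1;
	}

	if (mincore(map, len, vec) != 0) {
		resident = -1;
	} else {
		/* Bit 0 of each vector entry is set for a resident page. */
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;
	}

	munmap(map, len);
	free(vec);
	return resident;
}

/*
 * Hypothetical helper: write back, then drop, whatever is cached for
 * `fd` right now. Returns 0 on success, -1 if fsync() fails, or the
 * error number returned by posix_fadvise().
 */
static int drop_whole_file(int fd)
{
	if (fsync(fd) != 0)
		return -1;
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

Calling resident_pages() after a batch of writes, then drop_whole_file(), then
appending more data and counting again should, on a typical disk-backed
filesystem, show the appended pages cached again. The exact counts depend on
the filesystem and on memory pressure.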

> > So after each splice() that appends to the file, I have to call fadvise 
> > again. I would have expected the "0" len parameter to tell the kernel to 
> > apply the hint to the whole file, even parts that will be added in the 
> > future.
> 
> It's not a hint about future at all. It's a "throw current pages away".
> 
> I would also suggest against doing that kind of thing in a streaming write 
> situation. The behavior for dirty page writeback is _not_ well-defined, and
> if you do POSIX_FADV_DONTNEED, I would suggest you do it as part of that
> writeback logic, i.e. you do it only on ranges that you have just waited on.
> 
> IOW, in my example, you'd couple the
> 
> 	sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
> 
> with a
> 
> 	posix_fadvise(fd, (index-1)*BUFSIZE, BUFSIZE, POSIX_FADV_DONTNEED);
> 
> afterwards to throw out the pages that you just waited for.

OK, so it's better to do the writeback as part of sync_file_range rather than
relying on the dirty page writeback to do it for us. I guess the I/O scheduler
will have more room to ensure that writes are contiguous.

Thanks for the feedback,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Unexpected splice "always copy" behavior observed
  2010-05-20  1:56                                 ` Mathieu Desnoyers
  (?)
@ 2010-05-20 14:18                                   ` Linus Torvalds
  -1 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2010-05-20 14:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Nick Piggin, Steven Rostedt, Miklos Szeredi, peterz, fweisbec,
	tardyp, mingo, acme, tzanussi, paulus, linux-kernel, arjan,
	ziga.mahkovec, davem, linux-mm, akpm, kosaki.motohiro, cl, tj,
	jens.axboe, Michael Kerrisk, linux-man



On Wed, 19 May 2010, Mathieu Desnoyers wrote:
> 
>        Programs  can  use  posix_fadvise()  to announce an intention to access
>        file data in a specific pattern in the future, thus allowing the kernel
>        to perform appropriate optimizations.

It's true for some of them. The random-vs-linear behavior is a flag for 
the future, for example (relevant for prefetching).

In fact, it's technically true even for DONTNEED. It's true that we won't 
need the pages in the future! So we throw the pages away. But that means 
that we throw the _current_ pages away.

If we actually touch pages later, then that obviously invalidates the fact
that we said 'DONTNEED' - we clearly needed them.

		Linus

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2010-05-20 14:22 UTC | newest]

Thread overview: 71+ messages
2010-05-18 15:34 Unexpected splice "always copy" behavior observed Mathieu Desnoyers
2010-05-18 15:34 ` Mathieu Desnoyers
2010-05-18 15:51 ` Nick Piggin
2010-05-18 15:51   ` Nick Piggin
2010-05-18 15:56   ` Christoph Lameter
2010-05-18 15:56     ` Christoph Lameter
2010-05-18 16:00     ` Nick Piggin
2010-05-18 16:00       ` Nick Piggin
2010-05-18 16:13       ` Nick Piggin
2010-05-18 16:13         ` Nick Piggin
2010-05-18 15:53 ` Steven Rostedt
2010-05-18 15:53   ` Steven Rostedt
2010-05-18 16:10   ` Steven Rostedt
2010-05-18 16:10     ` Steven Rostedt
2010-05-18 16:25     ` Linus Torvalds
2010-05-18 16:25       ` Linus Torvalds
2010-05-19  6:31       ` Nick Piggin
2010-05-19  6:31         ` Nick Piggin
2010-05-19 14:39         ` Linus Torvalds
2010-05-19 14:39           ` Linus Torvalds
2010-05-19 14:56           ` Steven Rostedt
2010-05-19 14:56             ` Steven Rostedt
2010-05-19 14:59             ` Linus Torvalds
2010-05-19 14:59               ` Linus Torvalds
2010-05-19 15:12               ` Steven Rostedt
2010-05-19 15:12                 ` Steven Rostedt
2010-05-19 15:51                 ` Mathieu Desnoyers
2010-05-19 15:51                   ` Mathieu Desnoyers
2010-05-19 15:33               ` Miklos Szeredi
2010-05-19 15:33                 ` Miklos Szeredi
2010-05-19 15:45                 ` Steven Rostedt
2010-05-19 15:45                   ` Steven Rostedt
2010-05-19 15:55                   ` Nick Piggin
2010-05-19 15:55                     ` Nick Piggin
2010-05-19 16:01                     ` Mathieu Desnoyers
2010-05-19 16:01                       ` Mathieu Desnoyers
2010-05-19 16:36                     ` Steven Rostedt
2010-05-19 16:36                       ` Steven Rostedt
2010-05-19 15:57                   ` Mathieu Desnoyers
2010-05-19 15:57                     ` Mathieu Desnoyers
2010-05-19 16:27                     ` Nick Piggin
2010-05-19 16:27                       ` Nick Piggin
2010-05-19 19:14                       ` Mathieu Desnoyers
2010-05-19 19:14                         ` Mathieu Desnoyers
2010-05-19 19:31                         ` Linus Torvalds
2010-05-19 19:31                           ` Linus Torvalds
2010-05-19 21:49                           ` Mathieu Desnoyers
2010-05-19 21:49                             ` Mathieu Desnoyers
2010-05-20  0:04                             ` Linus Torvalds
2010-05-20  0:04                               ` Linus Torvalds
2010-05-20  1:56                               ` Mathieu Desnoyers
2010-05-20  1:56                                 ` Mathieu Desnoyers
2010-05-20 14:18                                 ` Linus Torvalds
2010-05-20 14:18                                   ` Linus Torvalds
2010-05-20 14:18                                   ` Linus Torvalds
2010-05-19 20:59               ` Rick Sherm
2010-05-19 20:59                 ` Rick Sherm
2010-05-19 15:17           ` Nick Piggin
2010-05-19 15:17             ` Nick Piggin
2010-05-19 15:30             ` Linus Torvalds
2010-05-19 15:30               ` Linus Torvalds
2010-05-19 15:44               ` Nick Piggin
2010-05-19 15:44                 ` Nick Piggin
2010-05-19 15:28           ` Miklos Szeredi
2010-05-19 15:28             ` Miklos Szeredi
2010-05-19 15:32             ` Linus Torvalds
2010-05-19 15:32               ` Linus Torvalds
2010-05-19 15:56               ` Miklos Szeredi
2010-05-19 15:56                 ` Miklos Szeredi
2010-05-19 16:01                 ` Linus Torvalds
2010-05-19 16:01                   ` Linus Torvalds
