linux-fsdevel.vger.kernel.org archive mirror
* problem with orangefs readpage...
@ 2020-12-31 21:51 Mike Marshall
  2021-01-01  4:08 ` Matthew Wilcox
  0 siblings, 1 reply; 4+ messages in thread
From: Mike Marshall @ 2020-12-31 21:51 UTC (permalink / raw)
  To: linux-fsdevel, Mike Marshall, Mike Marshall

Greetings...

I hope some of you will suffer through reading this long message :-) ...

Orangefs isn't built to do small IO. Reading a
big file in page cache sized chunks is slow and painful.
I tried to write orangefs_readpage so that it would do a reasonable
sized hard IO, fill the page that was being called for, and then
go ahead and fill a whole bunch of the following pages into the
page cache with the extra data in the IO buffer.

Anywho... I thought this code was working pretty much like I
designed it to work, but on closer inspection I see that it is
not, and I thought I'd ask for some help or suggestions.

Here's the core of the loop in orangefs_readpage that tries to fill extra
pages, and what follows is a description of how it is not working
the way I designed it to work:

        while there's still data in the IO buffer
        {
            index++;
            slot_index++;
            next_page = find_get_page(inode->i_mapping, index);
            if (next_page) {
                gossip_debug(GOSSIP_FILE_DEBUG,
                    "%s: found next page, quitting\n",
                    __func__);
                put_page(next_page);
                goto out;
            }
            next_page = find_or_create_page(inode->i_mapping,
                            index,
                            GFP_KERNEL);
            /*
             * I've never hit this, leave it as a printk for
             * now so it will be obvious.
             */
            if (!next_page) {
                printk("%s: can't create next page, quitting\n",
                    __func__);
                goto out;
            }
            kaddr = kmap_atomic(next_page);
            orangefs_bufmap_page_fill(kaddr,
                        buffer_index,
                        slot_index);
            kunmap_atomic(kaddr);
            SetPageUptodate(next_page);
            unlock_page(next_page);
            put_page(next_page);
        }

So... my design was that orangefs_readpage would get called, the
needed page would be supplied and a bunch of the following pages
would get filled as well. That way if more pages were needed,
they would be in the page cache already.

My plan "kind of" works when a file is read a page at a time:

  /pvfsmnt/nine is nine pages long.
  -rwxr-xr-x. 1 root root 36864 Dec 29 11:09 /pvfsmnt/nine

  dd if=/pvfsmnt/nine of=/tmp/nine bs=4096 count=9

orangefs_readpage gets called for the first four pages and then my
prefill kicks in and fills the next pages and the right data ends
up in /tmp/nine. I, of course, wished and planned for orangefs_readpage
to only get called once; I don't understand why it gets called four
times, which results in three extraneous expensive hard IOs.

A nine-page file is just an example; in general, when files are read
a page at a time, orangefs_readpage gets called four times and the
rest of the pages (up to the design limit) are pre-filled.

When a file gets read all at once, though, my design
fails in a different way...

  dd if=/pvfsmnt/nine of=/tmp/nine bs=36864 count=1

In the above, orangefs_readpage gets called nine times, with
eight extraneous expensive hard IOs. Further investigation into
larger and larger block sizes shows a pattern.

I hope it is apparent to some of you why my page-at-a-time
reads don't start pre-filling until after four calls to orangefs_readpage.
Below are some more examples that show what happens with larger and larger
block sizes; hopefully the pattern there will be suggestive as well.

/pvfsmnt/N is a file exactly N pages long.

Key: orangefs_readpage->X times foo, bar, baz, ..., qux

X = number of calls to orangefs_readpage.
foo = number of bytes fetched from Orangefs on the first read.
bar = number of bytes fetched from Orangefs on the extraneous 2nd read.
baz = number of bytes fetched from Orangefs on the extraneous 3rd read.
qux = number of bytes fetched from Orangefs on the extraneous last read.


  dd if=/pvfsmnt/32 of=/tmp/32 bs=131072 count=1
       orangefs_readpage->32 times 131072, 126976, 122880, ..., 4096
       orangefs_bufmap_page_fill->0 times

  dd if=/pvfsmnt/33 of=/tmp/33 bs=135168 count=1
       orangefs_readpage->32 times 135168, 131072, 126976, ..., 8192
       orangefs_bufmap_page_fill->1 time

  dd if=/pvfsmnt/34 of=/tmp/34 bs=139264 count=1
       orangefs_readpage->32 times 139264, 135168, 131072, ..., 12288
       orangefs_bufmap_page_fill->2 times

  dd if=/pvfsmnt/35 of=/tmp/35 bs=143360 count=1
       orangefs_readpage->32 times 143360, 139264, 135168, ..., 16384
       orangefs_bufmap_page_fill->3 times

  dd if=/pvfsmnt/36 of=/tmp/36 bs=147456 count=1
       orangefs_readpage->32 times 147456, 143360, 139264, ..., 20480
       orangefs_bufmap_page_fill->4 times

  dd if=/pvfsmnt/37 of=/tmp/37 bs=151552 count=1
       orangefs_readpage->32 times 151552, 147456, 143360, ..., 24576
       orangefs_bufmap_page_fill->5 times

  dd if=/pvfsmnt/38 of=/tmp/38 bs=155648 count=1
       orangefs_readpage->32 times 155648, 151552, 147456, ..., 28672
       orangefs_bufmap_page_fill->6 times

  dd if=/pvfsmnt/39 of=/tmp/39 bs=159744 count=1
       orangefs_readpage->32 times 159744, 155648, 151552, ..., 32768
       orangefs_bufmap_page_fill->7 times

  dd if=/pvfsmnt/40 of=/tmp/40 bs=163840 count=1
       orangefs_readpage->32 times 163840, 159744, 155648, ..., 36864
       orangefs_bufmap_page_fill->8 times

  dd if=/pvfsmnt/41 of=/tmp/41 bs=167936 count=1
       orangefs_readpage->32 times 167936, 163840, 159744, ..., 40960
       orangefs_bufmap_page_fill->9 times

                     .
                     .
                     .

  dd if=/pvfsmnt/47 of=/tmp/47 bs=192512 count=1
       orangefs_readpage->32 times 192512, 188416, 184320, ..., 65536
       orangefs_bufmap_page_fill->15 times

  dd if=/pvfsmnt/48 of=/tmp/48 bs=196608 count=1
       orangefs_readpage->32 times 196608, 192512, 188416, ..., 69632
       orangefs_bufmap_page_fill->16 times

  dd if=/pvfsmnt/49 of=/tmp/49 bs=200704 count=1
       orangefs_readpage->32 times 200704, 196608, 192512, ..., 73728
       orangefs_bufmap_page_fill->17 times

                     .
                     .
                     .

  dd if=/pvfsmnt/63 of=/tmp/63 bs=258048 count=1
       orangefs_readpage->32 times 258048, 253952, 249856, ..., 131072
       orangefs_bufmap_page_fill->31 times

  dd if=/pvfsmnt/64 of=/tmp/64 bs=262144 count=1
       orangefs_readpage->32 times 262144, 258048, 253952, ..., 135168
       orangefs_bufmap_page_fill->32 times

  dd if=/pvfsmnt/65 of=/tmp/65 bs=266240 count=1
       orangefs_readpage->32 times 266240, 262144, 258048, ..., 139264
       orangefs_bufmap_page_fill->33 times

                     .
                     .
                     .

  dd if=/pvfsmnt/127 of=/tmp/127 bs=520192 count=1
       orangefs_readpage->32 times 520192, 516096, 512000, ..., 393216
       orangefs_bufmap_page_fill->95 times

  dd if=/pvfsmnt/128 of=/tmp/128 bs=524288 count=1
       orangefs_readpage->32 times 524288, 520192, 516096, ..., 397312
       orangefs_bufmap_page_fill->96 times

It kind of starts over here, since the hard IOs are all 524288 bytes.
# grep 524288 fs/orangefs/inode.c
    read_size = 524288;
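
(524288 bytes is 128 pages of 4096 bytes, which is presumably why the
pattern kind of starts over once the file reaches 128 pages.)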

Thanks for any help y'all can give; I'll of course keep on trying
to understand what is going on.

-Mike

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: problem with orangefs readpage...
  2020-12-31 21:51 problem with orangefs readpage Mike Marshall
@ 2021-01-01  4:08 ` Matthew Wilcox
       [not found]   ` <CAAoczXbw9A+kqMemEsJax+CaPkQsJzZNw6Y7XFhTsBqDnGD6hw@mail.gmail.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2021-01-01  4:08 UTC (permalink / raw)
  To: Mike Marshall; +Cc: linux-fsdevel, Mike Marshall

On Thu, Dec 31, 2020 at 04:51:53PM -0500, Mike Marshall wrote:
> Greetings...
> 
> I hope some of you will suffer through reading this long message :-) ...

Hi Mike!  Happy New Year!

> Orangefs isn't built to do small IO. Reading a
> big file in page cache sized chunks is slow and painful.
> I tried to write orangefs_readpage so that it would do a reasonable
> sized hard IO, fill the page that was being called for, and then
> go ahead and fill a whole bunch of the following pages into the
> page cache with the extra data in the IO buffer.

This is some new version of orangefs_readpage(), right?  I don't see
anything resembling this in the current codebase.  Did you disable
orangefs_readpages() as part of this work?  Because the behaviour you're
describing sounds very much like what the readahead code might do to a
filesystem which implements readpage and neither readahead nor readpages.

> orangefs_readpage gets called for the first four pages and then my
> prefill kicks in and fills the next pages and the right data ends
> up in /tmp/nine. I, of course, wished and planned for orangefs_readpage
> to only get called once; I don't understand why it gets called four
> times, which results in three extraneous expensive hard IOs.

I might suggest some judicious calling of dump_stack() to understand
exactly what's calling you.  My suspicion is that it's this loop in
read_pages():

                while ((page = readahead_page(rac))) {
                        aops->readpage(rac->file, page);
                        put_page(page);
                }

which doesn't test for PageUptodate before calling you.
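
Something like this at the top of orangefs_readpage() should show you
exactly which path it is.  Just a rough sketch, assuming the usual
(struct file *, struct page *) readpage arguments:

        gossip_debug(GOSSIP_FILE_DEBUG, "%s: called for index %lu\n",
                     __func__, page->index);
        /* print the call chain that got us here */
        dump_stack();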

It'd probably be best if you implemented ->readahead, which has its own
ideas about which pages would be the right ones to read.  It's not always
correct, but generally better to have that logic in the VFS than in each
filesystem.

You probably want to have a look at Dave Howells' work to allow
the filesystem to expand the ractl:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter

specifically this patch:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache-iter&id=f582790b32d5d1d8b937df95a8b2b5fdb8380e46

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Fwd: problem with orangefs readpage...
       [not found]   ` <CAAoczXbw9A+kqMemEsJax+CaPkQsJzZNw6Y7XFhTsBqDnGD6hw@mail.gmail.com>
@ 2021-01-01 22:23     ` Mike Marshall
  2021-01-02  2:07     ` Matthew Wilcox
  1 sibling, 0 replies; 4+ messages in thread
From: Mike Marshall @ 2021-01-01 22:23 UTC (permalink / raw)
  To: linux-fsdevel

Oops... my non-work email doesn't default to text only, so my reply
got bounced; forwarding it to the list...

---------- Forwarded message ---------
From: Mike Marshall <hubcapsc@gmail.com>
Date: Fri, Jan 1, 2021 at 5:15 PM
Subject: Re: problem with orangefs readpage...
To: Matthew Wilcox <willy@infradead.org>
Cc: Mike Marshall <hubcap@omnibond.com>, linux-fsdevel
<linux-fsdevel@vger.kernel.org>



Hi Matthew... Thanks so much for the suggestions!

> This is some new version of orangefs_readpage(), right?

No, that code has been upstream for a while... that readahead_control
thing looks very interesting :-) ...

-Mike

On Thu, Dec 31, 2020 at 11:08 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Dec 31, 2020 at 04:51:53PM -0500, Mike Marshall wrote:
> > Greetings...
> >
> > I hope some of you will suffer through reading this long message :-) ...
>
> Hi Mike!  Happy New Year!
>
> > Orangefs isn't built to do small IO. Reading a
> > big file in page cache sized chunks is slow and painful.
> > I tried to write orangefs_readpage so that it would do a reasonable
> > sized hard IO, fill the page that was being called for, and then
> > go ahead and fill a whole bunch of the following pages into the
> > page cache with the extra data in the IO buffer.
>
> This is some new version of orangefs_readpage(), right?  I don't see
> anything resembling this in the current codebase.  Did you disable
> orangefs_readpages() as part of this work?  Because the behaviour you're
> describing sounds very much like what the readahead code might do to a
> filesystem which implements readpage and neither readahead nor readpages.
>
> > orangefs_readpage gets called for the first four pages and then my
> > prefill kicks in and fills the next pages and the right data ends
> > up in /tmp/nine. I, of course, wished and planned for orangefs_readpage
> > to only get called once; I don't understand why it gets called four
> > times, which results in three extraneous expensive hard IOs.
>
> I might suggest some judicious calling of dump_stack() to understand
> exactly what's calling you.  My suspicion is that it's this loop in
> read_pages():
>
>                 while ((page = readahead_page(rac))) {
>                         aops->readpage(rac->file, page);
>                         put_page(page);
>                 }
>
> which doesn't test for PageUptodate before calling you.
>
> It'd probably be best if you implemented ->readahead, which has its own
> ideas about which pages would be the right ones to read.  It's not always
> correct, but generally better to have that logic in the VFS than in each
> filesystem.
>
> You probably want to have a look at Dave Howells' work to allow
> the filesystem to expand the ractl:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter
>
> specifically this patch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache-iter&id=f582790b32d5d1d8b937df95a8b2b5fdb8380e46

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: problem with orangefs readpage...
       [not found]   ` <CAAoczXbw9A+kqMemEsJax+CaPkQsJzZNw6Y7XFhTsBqDnGD6hw@mail.gmail.com>
  2021-01-01 22:23     ` Fwd: " Mike Marshall
@ 2021-01-02  2:07     ` Matthew Wilcox
  1 sibling, 0 replies; 4+ messages in thread
From: Matthew Wilcox @ 2021-01-02  2:07 UTC (permalink / raw)
  To: Mike Marshall; +Cc: Mike Marshall, linux-fsdevel

On Fri, Jan 01, 2021 at 05:15:32PM -0500, Mike Marshall wrote:
> Hi Matthew... Thanks so much for the suggestions!
> > This is some new version of orangefs_readpage(), right?
> No, that code has been upstream for a while... that readahead_control
> thing looks very interesting :-) ...

Oh, my, I was looking at a tree from before 2018 that still had
orangefs_readpages.  So, yes, I think what's happening is that
orangefs_readpage() is being called from the readahead code.
You'll hit this path:

                        next_page = find_get_page(inode->i_mapping, index);
                        if (next_page) {
                                gossip_debug(GOSSIP_FILE_DEBUG,
                                        "%s: found next page, quitting\n",
                                        __func__);
                                put_page(next_page);
                                goto out;

because readahead already allocated those pages for you and is trying
to fill them one-at-a-time.

Implementing ->readahead, even without dhowells' patch to expand
the ractl, will definitely help you!
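
Just to sketch the shape of it (untested, and assuming the big hard IO
has already landed in a bufmap slot that orangefs_bufmap_page_fill()
can copy out of, the way your readpage loop does), something like:

static void orangefs_readahead(struct readahead_control *rac)
{
        struct page *page;
        int buffer_index = 0;   /* assumed: the slot the hard IO landed in */
        int slot_index = 0;
        void *kaddr;

        /*
         * Issue one large hard IO here, covering readahead_length(rac)
         * bytes starting at readahead_pos(rac), into slot buffer_index.
         */

        /* then hand each page of the result back to the page cache */
        while ((page = readahead_page(rac))) {
                kaddr = kmap_atomic(page);
                orangefs_bufmap_page_fill(kaddr, buffer_index, slot_index++);
                kunmap_atomic(kaddr);
                SetPageUptodate(page);
                unlock_page(page);
                put_page(page);
        }
}

The pages come to you already locked and already in the page cache, so
none of the find_get_page()/find_or_create_page() dance in your
readpage loop is needed here.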

> -Mike
> 
> On Thu, Dec 31, 2020 at 11:08 PM Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Thu, Dec 31, 2020 at 04:51:53PM -0500, Mike Marshall wrote:
> > > Greetings...
> > >
> > > I hope some of you will suffer through reading this long message :-) ...
> >
> > Hi Mike!  Happy New Year!
> >
> > > Orangefs isn't built to do small IO. Reading a
> > > big file in page cache sized chunks is slow and painful.
> > > I tried to write orangefs_readpage so that it would do a reasonable
> > > sized hard IO, fill the page that was being called for, and then
> > > go ahead and fill a whole bunch of the following pages into the
> > > page cache with the extra data in the IO buffer.
> >
> > This is some new version of orangefs_readpage(), right?  I don't see
> > anything resembling this in the current codebase.  Did you disable
> > orangefs_readpages() as part of this work?  Because the behaviour you're
> > describing sounds very much like what the readahead code might do to a
> > filesystem which implements readpage and neither readahead nor readpages.
> >
> > > orangefs_readpage gets called for the first four pages and then my
> > > prefill kicks in and fills the next pages and the right data ends
> > > up in /tmp/nine. I, of course, wished and planned for orangefs_readpage
> > > to only get called once; I don't understand why it gets called four
> > > times, which results in three extraneous expensive hard IOs.
> >
> > I might suggest some judicious calling of dump_stack() to understand
> > exactly what's calling you.  My suspicion is that it's this loop in
> > read_pages():
> >
> >                 while ((page = readahead_page(rac))) {
> >                         aops->readpage(rac->file, page);
> >                         put_page(page);
> >                 }
> >
> > which doesn't test for PageUptodate before calling you.
> >
> > It'd probably be best if you implemented ->readahead, which has its own
> > ideas about which pages would be the right ones to read.  It's not always
> > correct, but generally better to have that logic in the VFS than in each
> > filesystem.
> >
> > You probably want to have a look at Dave Howells' work to allow
> > the filesystem to expand the ractl:
> >
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter
> >
> > specifically this patch:
> >
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/commit/?h=fscache-iter&id=f582790b32d5d1d8b937df95a8b2b5fdb8380e46
> >

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-01-02  2:08 UTC | newest]

Thread overview: 4+ messages
-- links below jump to the message on this page --
2020-12-31 21:51 problem with orangefs readpage Mike Marshall
2021-01-01  4:08 ` Matthew Wilcox
     [not found]   ` <CAAoczXbw9A+kqMemEsJax+CaPkQsJzZNw6Y7XFhTsBqDnGD6hw@mail.gmail.com>
2021-01-01 22:23     ` Fwd: " Mike Marshall
2021-01-02  2:07     ` Matthew Wilcox
