All of lore.kernel.org
* Readahead regressed with c1f6925e1091("mm: put readahead pages in cache earlier") on multicore arm64 platforms
@ 2021-10-06  9:25 Hsin-Yi Wang
  2021-10-06 11:20 ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Hsin-Yi Wang @ 2021-10-06  9:25 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Andrew Morton, William Kucharski, Christoph Hellwig, linux-mm

Hi Matthew,

We found that readahead performance has regressed on multicore
arm64 platforms running the 5.10 kernel.
- The platform we used: an 8-core arm64 platform (4x A53 (small), 4x A73 (big))
- The command we used: ureadahead $FILE ($FILE is a 1MB+ pack file;
note that if the file is small, the regression is not obvious)

After reverting commit c1f6925e1091 ("mm: put readahead pages in
cache earlier"), readahead performance recovers:
- time ureadahead $FILE:
  - 5.10: 1m23.124s
  - with c1f6925e1091 reverted: 0m3.323s
  - other LTS kernels (e.g. 5.4): 0m3.066s

The slowest part is aops->readpage() in read_pages(), reached from the
third read_pages(ractl, &page_pool, false) call in
page_cache_ra_unbounded():

static void read_pages(struct readahead_control *rac, struct list_head *pages,
        bool skip_page)
{
    ...
    if (aops->readahead) {
        ...
    } else if (aops->readpages) {
        ...
    } else {
        while ((page = readahead_page(rac))) {
            /* most of the time is spent on this line: */
            aops->readpage(rac->file, page);
            put_page(page);
        }
    }
    ...
}

We also collected the following relevant measurements:
- time ureadahead $FILE:
  - 5.10
      - taskset ureadahead to a small core: 0m7.411s
      - taskset ureadahead to a big core: 0m5.982s
  compared to the original 1m23s, pinning the ureadahead task to a
single core also closes the gap.

Do you have any idea why adding pages to the page cache earlier and
reading them later causes such a difference?

Thanks,

Hsin-Yi


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Readahead regressed with c1f6925e1091("mm: put readahead pages in cache earlier") on multicore arm64 platforms
  2021-10-06  9:25 Readahead regressed with c1f6925e1091("mm: put readahead pages in cache earlier") on multicore arm64 platforms Hsin-Yi Wang
@ 2021-10-06 11:20 ` Matthew Wilcox
  2021-10-06 13:07   ` Hsin-Yi Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2021-10-06 11:20 UTC (permalink / raw)
  To: Hsin-Yi Wang
  Cc: Andrew Morton, William Kucharski, Christoph Hellwig, linux-mm

On Wed, Oct 06, 2021 at 05:25:23PM +0800, Hsin-Yi Wang wrote:
> Hi Matthew,
> 
> We found that readahead performance has regressed on multicore
> arm64 platforms running the 5.10 kernel.
> - The platform we used: an 8-core arm64 platform (4x A53 (small), 4x A73 (big))
> - The command we used: ureadahead $FILE ($FILE is a 1MB+ pack file;
> note that if the file is small, the regression is not obvious)
> 
> After reverting commit c1f6925e1091 ("mm: put readahead pages in
> cache earlier"), readahead performance recovers:
> - time ureadahead $FILE:
>   - 5.10: 1m23.124s
>   - with c1f6925e1091 reverted: 0m3.323s
>   - other LTS kernels (e.g. 5.4): 0m3.066s
> 
> The slowest part is aops->readpage() in read_pages(), reached from the
> third read_pages(ractl, &page_pool, false) call in
> page_cache_ra_unbounded().

What filesystem are you using?

> static void read_pages(struct readahead_control *rac, struct list_head *pages,
>         bool skip_page)
> {
>     ...
>     if (aops->readahead) {
>         ...
>     } else if (aops->readpages) {
>         ...
>     } else {
>         while ((page = readahead_page(rac))) {
>             /* most of the time is spent on this line: */
>             aops->readpage(rac->file, page);
>             put_page(page);
>         }
>     }
>     ...
> }
> 
> We also collected the following relevant measurements:
> - time ureadahead $FILE:
>   - 5.10
>       - taskset ureadahead to a small core: 0m7.411s
>       - taskset ureadahead to a big core: 0m5.982s
>   compared to the original 1m23s, pinning the ureadahead task to a
> single core also closes the gap.
> 
> Do you have any idea why adding pages to the page cache earlier and
> reading them later causes such a difference?
> 
> Thanks,
> 
> Hsin-Yi



* Re: Readahead regressed with c1f6925e1091("mm: put readahead pages in cache earlier") on multicore arm64 platforms
  2021-10-06 11:20 ` Matthew Wilcox
@ 2021-10-06 13:07   ` Hsin-Yi Wang
  2021-10-06 13:12     ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Hsin-Yi Wang @ 2021-10-06 13:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, William Kucharski, Christoph Hellwig, linux-mm

On Wed, Oct 6, 2021 at 7:21 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 06, 2021 at 05:25:23PM +0800, Hsin-Yi Wang wrote:
> > Hi Matthew,
> >
> > We found that readahead performance has regressed on multicore
> > arm64 platforms running the 5.10 kernel.
> > - The platform we used: an 8-core arm64 platform (4x A53 (small), 4x A73 (big))
> > - The command we used: ureadahead $FILE ($FILE is a 1MB+ pack file;
> > note that if the file is small, the regression is not obvious)
> >
> > After reverting commit c1f6925e1091 ("mm: put readahead pages in
> > cache earlier"), readahead performance recovers:
> > - time ureadahead $FILE:
> >   - 5.10: 1m23.124s
> >   - with c1f6925e1091 reverted: 0m3.323s
> >   - other LTS kernels (e.g. 5.4): 0m3.066s
> >
> > The slowest part is aops->readpage() in read_pages(), reached from the
> > third read_pages(ractl, &page_pool, false) call in
> > page_cache_ra_unbounded().
>
> What filesystem are you using?
>
ext4, block size 4096



* Re: Readahead regressed with c1f6925e1091("mm: put readahead pages in cache earlier") on multicore arm64 platforms
  2021-10-06 13:07   ` Hsin-Yi Wang
@ 2021-10-06 13:12     ` Matthew Wilcox
  2021-10-07  4:08       ` Hsin-Yi Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2021-10-06 13:12 UTC (permalink / raw)
  To: Hsin-Yi Wang
  Cc: Andrew Morton, William Kucharski, Christoph Hellwig, linux-mm

On Wed, Oct 06, 2021 at 09:07:56PM +0800, Hsin-Yi Wang wrote:
> On Wed, Oct 6, 2021 at 7:21 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Oct 06, 2021 at 05:25:23PM +0800, Hsin-Yi Wang wrote:
> > > Hi Matthew,
> > >
> > > We found that readahead performance has regressed on multicore
> > > arm64 platforms running the 5.10 kernel.
> > > - The platform we used: an 8-core arm64 platform (4x A53 (small), 4x A73 (big))
> > > - The command we used: ureadahead $FILE ($FILE is a 1MB+ pack file;
> > > note that if the file is small, the regression is not obvious)
> > >
> > > After reverting commit c1f6925e1091 ("mm: put readahead pages in
> > > cache earlier"), readahead performance recovers:
> > > - time ureadahead $FILE:
> > >   - 5.10: 1m23.124s
> > >   - with c1f6925e1091 reverted: 0m3.323s
> > >   - other LTS kernels (e.g. 5.4): 0m3.066s
> > >
> > > The slowest part is aops->readpage() in read_pages(), reached from the
> > > third read_pages(ractl, &page_pool, false) call in
> > > page_cache_ra_unbounded().
> >
> > What filesystem are you using?
> >
> ext4, block size 4096

That's confusing.  ext4 shouldn't hit that path; it has a ->readahead
address space operation.



* Re: Readahead regressed with c1f6925e1091("mm: put readahead pages in cache earlier") on multicore arm64 platforms
  2021-10-06 13:12     ` Matthew Wilcox
@ 2021-10-07  4:08       ` Hsin-Yi Wang
  2021-10-07  7:08         ` Hsin-Yi Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Hsin-Yi Wang @ 2021-10-07  4:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, William Kucharski, Christoph Hellwig, linux-mm

On Wed, Oct 6, 2021 at 9:12 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 06, 2021 at 09:07:56PM +0800, Hsin-Yi Wang wrote:
> > On Wed, Oct 6, 2021 at 7:21 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Oct 06, 2021 at 05:25:23PM +0800, Hsin-Yi Wang wrote:
> > > > Hi Matthew,
> > > >
> > > > We found that readahead performance has regressed on multicore
> > > > arm64 platforms running the 5.10 kernel.
> > > > - The platform we used: an 8-core arm64 platform (4x A53 (small), 4x A73 (big))
> > > > - The command we used: ureadahead $FILE ($FILE is a 1MB+ pack file;
> > > > note that if the file is small, the regression is not obvious)
> > > >
> > > > After reverting commit c1f6925e1091 ("mm: put readahead pages in
> > > > cache earlier"), readahead performance recovers:
> > > > - time ureadahead $FILE:
> > > >   - 5.10: 1m23.124s
> > > >   - with c1f6925e1091 reverted: 0m3.323s
> > > >   - other LTS kernels (e.g. 5.4): 0m3.066s
> > > >
> > > > The slowest part is aops->readpage() in read_pages(), reached from the
> > > > third read_pages(ractl, &page_pool, false) call in
> > > > page_cache_ra_unbounded().
> > >
> > > What filesystem are you using?
> > >
> > ext4, block size 4096
>
> That's confusing.  ext4 shouldn't hit that path; it has a ->readahead
> address space operation.

Sorry for the confusion: both ->readahead and ->readpage are called.
->readpage is reached via vfs_fadvise.
(Full call path, innermost frame first)
read_pages
page_cache_ra_unbounded
do_page_cache_ra
force_page_cache_ra
generic_fadvise
vfs_fadvise
ksys_readahead
__arm64_compat_sys_aarch32_readahead
el0_svc_common
do_el0_svc_compat
el0_svc_compat
el0_sync_compat_handler
el0_sync_compat

->readahead is reached via ext4_file_read_iter, but this part is fast.
(Full call path, innermost frame first)
read_pages
page_cache_ra_unbounded
do_page_cache_ra
ondemand_readahead
page_cache_sync_ra
generic_file_buffered_read
generic_file_read_iter
ext4_file_read_iter
do_iter_readv_writev
do_iter_read
vfs_iter_read
loop_queue_work
kthread_worker_fn
loop_kthread_worker_fn
kthread
ret_from_fork



* Re: Readahead regressed with c1f6925e1091("mm: put readahead pages in cache earlier") on multicore arm64 platforms
  2021-10-07  4:08       ` Hsin-Yi Wang
@ 2021-10-07  7:08         ` Hsin-Yi Wang
  2021-10-07 13:45           ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Hsin-Yi Wang @ 2021-10-07  7:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, William Kucharski, Christoph Hellwig, linux-mm

On Thu, Oct 7, 2021 at 12:08 PM Hsin-Yi Wang <hsinyi@chromium.org> wrote:
>
> On Wed, Oct 6, 2021 at 9:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Oct 06, 2021 at 09:07:56PM +0800, Hsin-Yi Wang wrote:
> > > On Wed, Oct 6, 2021 at 7:21 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Wed, Oct 06, 2021 at 05:25:23PM +0800, Hsin-Yi Wang wrote:
> > > > > Hi Matthew,
> > > > >
> > > > > We found that readahead performance has regressed on multicore
> > > > > arm64 platforms running the 5.10 kernel.
> > > > > - The platform we used: an 8-core arm64 platform (4x A53 (small), 4x A73 (big))
> > > > > - The command we used: ureadahead $FILE ($FILE is a 1MB+ pack file;
> > > > > note that if the file is small, the regression is not obvious)
> > > > >
> > > > > After reverting commit c1f6925e1091 ("mm: put readahead pages in
> > > > > cache earlier"), readahead performance recovers:
> > > > > - time ureadahead $FILE:
> > > > >   - 5.10: 1m23.124s
> > > > >   - with c1f6925e1091 reverted: 0m3.323s
> > > > >   - other LTS kernels (e.g. 5.4): 0m3.066s
> > > > >
> > > > > The slowest part is aops->readpage() in read_pages(), reached from the
> > > > > third read_pages(ractl, &page_pool, false) call in
> > > > > page_cache_ra_unbounded().
> > > >
> > > > What filesystem are you using?
> > > >
> > > ext4, block size 4096
> >
> > That's confusing.  ext4 shouldn't hit that path; it has a ->readahead
> > address space operation.
>
> Sorry for the confusion: both ->readahead and ->readpage are called.
> ->readpage is reached via vfs_fadvise.
> (Full call path, innermost frame first)
> read_pages

This calls into squashfs_readpage().
The data pasted above was measured with SQUASHFS_DECOMP_SINGLE.
With the SQUASHFS_DECOMP_MULTI_PERCPU config instead:
- 5.10:
  1. real 0m1.692s, sys 0m4.188s
  2. real 0m1.655s, sys 0m4.175s
- 5.10 with c1f6925e1091 reverted:
  1. real 0m1.549s, sys 0m3.616s
  2. real 0m1.603s, sys 0m3.638s
which is slightly better, but the difference is not as large as with
SQUASHFS_DECOMP_SINGLE.

> page_cache_ra_unbounded
> do_page_cache_ra
> force_page_cache_ra
> generic_fadvise
> vfs_fadvise
> ksys_readahead
> __arm64_compat_sys_aarch32_readahead
> el0_svc_common
> do_el0_svc_compat
> el0_svc_compat
> el0_sync_compat_handler
> el0_sync_compat
>
> ->readahead is reached via ext4_file_read_iter, but this part is fast.
> (Full call path, innermost frame first)
> read_pages

This calls into ext4_readahead().

> page_cache_ra_unbounded
> do_page_cache_ra
> ondemand_readahead
> page_cache_sync_ra
> generic_file_buffered_read
> generic_file_read_iter
> ext4_file_read_iter
> do_iter_readv_writev
> do_iter_read
> vfs_iter_read
> loop_queue_work
> kthread_worker_fn
> loop_kthread_worker_fn
> kthread
> ret_from_fork



* Re: Readahead regressed with c1f6925e1091("mm: put readahead pages in cache earlier") on multicore arm64 platforms
  2021-10-07  7:08         ` Hsin-Yi Wang
@ 2021-10-07 13:45           ` Matthew Wilcox
  2021-10-08  4:11             ` Hsin-Yi Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2021-10-07 13:45 UTC (permalink / raw)
  To: Hsin-Yi Wang
  Cc: Andrew Morton, William Kucharski, Christoph Hellwig, linux-mm,
	linux-fsdevel, Phillip Lougher

On Thu, Oct 07, 2021 at 03:08:38PM +0800, Hsin-Yi Wang wrote:
> This calls into squashfs_readpage().

Aha!  I hadn't looked at squashfs before, and now that I do, I can
see why this commit causes problems for squashfs.  (It would be
helpful if your report included more detail about which paths inside
squashfs were taken, but I think I can guess):

squashfs_readpage()
  squashfs_readpage_block()
    squashfs_copy_cache()
      grab_cache_page_nowait()

Before this patch, readahead of 1MB would allocate 256x4kB pages,
then add each one to the page cache and call ->readpage on it:

        for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
                struct page *page = lru_to_page(pages);
                list_del(&page->lru);
                if (!add_to_page_cache_lru(page, rac->mapping, page->index,
                               gfp))
                        aops->readpage(rac->file, page);

When Squashfs sees it has more than 4kB of data, it calls
grab_cache_page_nowait(), which allocates more memory (ignoring the
other 255 pages which have been allocated, because they're not in the
page cache yet).  Then this loop frees the pages that readahead
allocated.

After this patch, the pages are already in the page cache when
->readpage is called the first time.  So the call to
grab_cache_page_nowait() fails and squashfs redoes the decompression for
each page.

Neither of these approaches is efficient.  Squashfs needs to implement
->readahead.  Working on it now ...
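The cost Matthew describes can be sketched with a toy model (pure
illustration in Python, not kernel code; the 128kB-block/32-page
geometry and the function names are assumptions for the sketch): with
per-page ->readpage over an already-populated page cache, every page
redoes the whole block decompression, while a batched ->readahead
would decompress each block once.

```python
# Toy cost model: count block decompressions for a readahead window.
# A squashfs block is assumed to decompress into 32 pages at once
# (128kB block / 4kB pages). Names and numbers here are illustrative.

PAGES_PER_BLOCK = 32

def decompressions_readpage_per_page(npages):
    """After c1f6925e1091: pages are already in the page cache, so
    grab_cache_page_nowait() for sibling pages fails and each
    ->readpage call redoes the decompression of its whole block."""
    return npages  # one full-block decompression per page

def decompressions_readahead_batched(npages):
    """A ->readahead implementation can decompress each block once
    and copy the data into all pages the block covers."""
    return (npages + PAGES_PER_BLOCK - 1) // PAGES_PER_BLOCK

npages = 256  # a 1MB readahead window of 4kB pages
print(decompressions_readpage_per_page(npages))   # 256
print(decompressions_readahead_batched(npages))   # 8
```

Under this model a 1MB window costs 256 block decompressions via
per-page ->readpage but only 8 via a batched ->readahead, which is
roughly consistent with the reported 1m23s vs ~3s gap being dominated
by redundant decompression.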



* Re: Readahead regressed with c1f6925e1091("mm: put readahead pages in cache earlier") on multicore arm64 platforms
  2021-10-07 13:45           ` Matthew Wilcox
@ 2021-10-08  4:11             ` Hsin-Yi Wang
  0 siblings, 0 replies; 8+ messages in thread
From: Hsin-Yi Wang @ 2021-10-08  4:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, William Kucharski, Christoph Hellwig, linux-mm,
	linux-fsdevel, Phillip Lougher

On Thu, Oct 7, 2021 at 9:46 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Oct 07, 2021 at 03:08:38PM +0800, Hsin-Yi Wang wrote:
> > This calls into squashfs_readpage().
>
> Aha!  I hadn't looked at squashfs before, and now that I do, I can
> see why this commit causes problems for squashfs.  (It would be
> helpful if your report included more detail about which paths inside
> squashfs were taken, but I think I can guess):
>
> squashfs_readpage()
>   squashfs_readpage_block()
>     squashfs_copy_cache()
>       grab_cache_page_nowait()
>
Right: before the patch push_page wasn't NULL, but after the patch
grab_cache_page_nowait() fails.


> Before this patch, readahead of 1MB would allocate 256x4kB pages,
> then add each one to the page cache and call ->readpage on it:
>
>         for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
>                 struct page *page = lru_to_page(pages);
>                 list_del(&page->lru);
>                 if (!add_to_page_cache_lru(page, rac->mapping, page->index,
>                                gfp))
>                         aops->readpage(rac->file, page);
>
> When Squashfs sees it has more than 4kB of data, it calls
> grab_cache_page_nowait(), which allocates more memory (ignoring the
> other 255 pages which have been allocated, because they're not in the
> page cache yet).  Then this loop frees the pages that readahead
> allocated.
>
> After this patch, the pages are already in the page cache when
> ->readpage is called the first time.  So the call to
> grab_cache_page_nowait() fails and squashfs redoes the decompression for
> each page.
>
> Neither of these approaches is efficient.  Squashfs needs to implement
> ->readahead.  Working on it now ...
>


Thread overview: 8+ messages
2021-10-06  9:25 Readahead regressed with c1f6925e1091("mm: put readahead pages in cache earlier") on multicore arm64 platforms Hsin-Yi Wang
2021-10-06 11:20 ` Matthew Wilcox
2021-10-06 13:07   ` Hsin-Yi Wang
2021-10-06 13:12     ` Matthew Wilcox
2021-10-07  4:08       ` Hsin-Yi Wang
2021-10-07  7:08         ` Hsin-Yi Wang
2021-10-07 13:45           ` Matthew Wilcox
2021-10-08  4:11             ` Hsin-Yi Wang
