linux-kernel.vger.kernel.org archive mirror
* Questions about folio allocation.
@ 2022-04-24 11:35 Guo Xuenan
  2022-04-24 11:37 ` Matthew Wilcox
  0 siblings, 1 reply; 4+ messages in thread
From: Guo Xuenan @ 2022-04-24 11:35 UTC (permalink / raw)
  To: willy; +Cc: linux-mm, linux-kernel, houtao1, fangwei1, guoxuenan

Hi Matthew,

You have done a lot of work on folios; many folio-related patches have been
incorporated into the mainline. I'm very interested in your excellent work
and did some sequential read tests (using a fixed read length on a
10G file), and found a few things.
1. Different read lengths may affect the folio order.
   Using a 100KB read length during sequential reads, the readahead folio
   order may always be 0, so folios are only allocated with order 0 or 2.
2. The folio order cannot reach MAX_PAGECACHE_ORDER when the read length
   is small (eg, less than 32KB).

As you have mentioned here[1],
"The heuristic for choosing which folio sizes will surely need some tuning"
I wonder: (1) Why does the folio order need to align with the page index?
Is this necessary, or are there certain restrictions?
(2) For the page cache, using large folios saves loops when allocating
pages. I also did some tests on dropping caches, and they cost less time:
there is a twenty-fold performance improvement when dropping the 10G file's
cache. So can I conclude that the page cache should tend to use large-order
folios?

[1] https://lore.kernel.org/linux-mm/20220204195852.1751729-72-willy@infradead.org/

Thanks,
Guo Xuenan

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Questions about folio allocation.
  2022-04-24 11:35 Questions about folio allocation Guo Xuenan
@ 2022-04-24 11:37 ` Matthew Wilcox
       [not found]   ` <77b76283-cec5-94a8-9bfe-34ea24c55b82@huawei.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2022-04-24 11:37 UTC (permalink / raw)
  To: Guo Xuenan; +Cc: linux-mm, linux-kernel, houtao1, fangwei1

On Sun, Apr 24, 2022 at 07:35:43PM +0800, Guo Xuenan wrote:
> Hi Matthew,
> 
> You have done a lot of work on folios; many folio-related patches have been
> incorporated into the mainline. I'm very interested in your excellent work
> and did some sequential read tests (using a fixed read length on a
> 10G file), and found a few things.
> 1. Different read lengths may affect the folio order.
>    Using a 100KB read length during sequential reads, the readahead folio
>    order may always be 0, so folios are only allocated with order 0 or 2.

Hmm.  Looks like we're foiling readahead somehow.  I'll look into this.

root@pepe-kvm:~# mkfs.xfs /dev/sdb
root@pepe-kvm:~# mount /dev/sdb /mnt/
root@pepe-kvm:~# truncate -s 10G /mnt/bigfile
root@pepe-kvm:~# echo 1 >/sys/kernel/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
root@pepe-kvm:~# dd if=/mnt/bigfile of=/dev/null bs=100K count=4 
root@pepe-kvm:~# cat /sys/kernel/tracing/trace
[...]
              dd-286     [000] .....   175.495258: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b0c ofs=0 order=2
              dd-286     [000] .....   175.495266: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b10 ofs=16384 order=2
              dd-286     [000] .....   175.495267: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b14 ofs=32768 order=2
              dd-286     [000] .....   175.495268: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b18 ofs=49152 order=2
              dd-286     [000] .....   175.495269: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b1c ofs=65536 order=2
              dd-286     [000] .....   175.495270: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b20 ofs=81920 order=2
              dd-286     [000] .....   175.495271: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b24 ofs=98304 order=2
              dd-286     [000] .....   175.495272: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b28 ofs=114688 order=2
              dd-286     [000] .....   175.495485: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x1048a3 ofs=135168 order=0
              dd-286     [000] .....   175.495486: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x1036eb ofs=139264 order=0
              dd-286     [000] .....   175.495486: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103f4a ofs=143360 order=0
              dd-286     [000] .....   175.495487: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b2c ofs=147456 order=2
              dd-286     [000] .....   175.495490: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b48 ofs=163840 order=3
              dd-286     [000] .....   175.495491: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b30 ofs=196608 order=2
              dd-286     [000] .....   175.495492: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103f76 ofs=212992 order=0
              dd-286     [000] .....   175.495666: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103f79 ofs=131072 order=0
              dd-286     [000] .....   175.495669: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103f5b ofs=217088 order=0
              dd-286     [000] .....   175.495669: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103c99 ofs=221184 order=0
              dd-286     [000] .....   175.495670: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x1037a0 ofs=225280 order=0
              dd-286     [000] .....   175.495673: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103f45 ofs=229376 order=0
              dd-286     [000] .....   175.495674: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103f44 ofs=233472 order=0
              dd-286     [000] .....   175.495675: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x10378c ofs=237568 order=0
              dd-286     [000] .....   175.495675: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103fde ofs=241664 order=0
              dd-286     [000] .....   175.495676: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103fdd ofs=245760 order=0
              dd-286     [000] .....   175.495677: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103fe1 ofs=249856 order=0
              dd-286     [000] .....   175.495677: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103fe2 ofs=253952 order=0
              dd-286     [000] .....   175.495678: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x103fa7 ofs=258048 order=0
              dd-286     [000] .....   175.495687: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b34 ofs=262144 order=2
              dd-286     [000] .....   175.495690: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b38 ofs=278528 order=2
              dd-286     [000] .....   175.495691: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b3c ofs=294912 order=2
              dd-286     [000] .....   175.495692: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b40 ofs=311296 order=2
              dd-286     [000] .....   175.495693: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b44 ofs=327680 order=2
              dd-286     [000] .....   175.495701: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b80 ofs=344064 order=2
              dd-286     [000] .....   175.495703: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b84 ofs=360448 order=2
              dd-286     [000] .....   175.495704: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106b88 ofs=376832 order=2
              dd-286     [000] .....   175.495894: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106bc0 ofs=393216 order=4
              dd-286     [000] .....   175.495896: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106bd0 ofs=458752 order=4
              dd-286     [000] .....   175.496096: mm_filemap_add_to_page_cache: dev 8:16 ino 83 pfn=0x106be0 ofs=524288 order=5

We do eventually get up to an order=5 allocation (128kB), but we should
get there far sooner.

> 2. The folio order cannot reach MAX_PAGECACHE_ORDER when the read length
>    is small (eg, less than 32KB).

I'm less concerned about that.  It's not necessarily a good idea to go
all the way to an order-9 page; 2MB pages are pretty big.

> As you have mentioned here[1],
> "The heuristic for choosing which folio sizes will surely need some tuning"
> I wonder: (1) Why does the folio order need to align with the page index?
> Is this necessary, or are there certain restrictions?

That's partly because of the limitations of the radix tree used by
the page cache.  With an aligned folio, an order-6 folio will take
up a single entry; if it were unaligned, we'd need two nodes.  Worse,
we'd have to constantly be walking around the tree in order to find
all the entries associated with an unaligned folio.

Partly, it's to try to use CPU resources more effectively.  For the
pages which are mapped to userspace, we set up the alignments so we can
use things like PMD-sized TLB entries and ARM's 64KiB TLB entries.

> (2) For the page cache, using large folios saves loops when allocating
> pages. I also did some tests on dropping caches, and they cost less time:
> there is a twenty-fold performance improvement when dropping the 10G file's
> cache. So can I conclude that the page cache should tend to use large-order
> folios?

Well, dropping the pagecache isn't a performance path ;-)  But as a
proxy for doing page reclaim under memory pressure, yes, that kind of
performance gain is what I'd expect and was one of the major motivators
for this work (shortening the LRU list and keeping memory unfragmented).


* Re: Questions about folio allocation.
       [not found]   ` <77b76283-cec5-94a8-9bfe-34ea24c55b82@huawei.com>
@ 2022-04-27 21:15     ` Matthew Wilcox
  2022-04-28  9:09       ` Guo Xuenan
  0 siblings, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2022-04-27 21:15 UTC (permalink / raw)
  To: Guo Xuenan; +Cc: linux-mm, linux-kernel, houtao1, fangwei1

On Sun, Apr 24, 2022 at 09:30:26PM +0800, Guo Xuenan wrote:
> Hmm... sorry, my wording was not rigorous enough, but I think you partly
> got it. I read the whole file, not only 100K * 4; in most cases the page
> order is 2, which means that with this read pattern, the order of the
> folio with the readahead flag is 0 in most cases.
> 
> [root@localhost ]# echo 4096 > /sys/block/vdb/queue/read_ahead_kb
> [root@localhost ]# echo 4096 > /sys/block/vdb/queue/max_sectors_kb
> [root@localhost ]# bpftrace bpf.bt  > 100K
> [root@localhost ]# cat 100K | awk '{print $11}' | sort | uniq -c
>     884 0
>   55945 2
>       1 3
>      14 4
>       2 5
>       5 6
> 
> According to the readahead code, the initial order is taken from the
> current folio with the readahead flag. Might it be better to base the
> initial order on the size of the readahead window?
> (eg: if ra->size is big enough, set the order, considering index
> alignment?)

Try this patch; it should fix the problem you're seeing.  At least, it
does in my tests.


From 89539907eb14b0723d457e77a18cc5af5e13db8f Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Wed, 27 Apr 2022 17:01:28 -0400
Subject: [PATCH] mm/readahead: Fix readahead with large folios

Reading 100KB chunks from a big file (eg dd bs=100K) leads to poor
readahead behaviour.  Studying the traces in detail, I noticed two
problems.

The first is that we were setting the readahead flag on the folio which
contains the last byte read from the block.  This is wrong because we
will trigger readahead at the end of the read without waiting to see
if a subsequent read is going to use the pages we just read.  Instead,
we need to set the readahead flag on the first folio _after_ the one
which contains the last byte that we're reading.

The second is that we were looking for the index of the folio with the
readahead flag set to exactly match the start + size - async_size.
If we've rounded this, either down (as previously) or up (as now),
we'll think we hit a folio marked as readahead by a different read,
and try to read the wrong pages.  So round the expected index to the
order of the folio we hit.

Reported-by: Guo Xuenan <guoxuenan@huawei.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 8e3775829513..4a60cdb64262 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -474,7 +474,8 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 
 	if (!folio)
 		return -ENOMEM;
-	if (mark - index < (1UL << order))
+	mark = round_up(mark, 1UL << order);
+	if (index == mark)
 		folio_set_readahead(folio);
 	err = filemap_add_folio(ractl->mapping, folio, index, gfp);
 	if (err)
@@ -555,8 +556,9 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	struct file_ra_state *ra = ractl->ra;
 	unsigned long max_pages = ra->ra_pages;
 	unsigned long add_pages;
-	unsigned long index = readahead_index(ractl);
-	pgoff_t prev_index;
+	pgoff_t index = readahead_index(ractl);
+	pgoff_t expected, prev_index;
+	unsigned int order = folio ? folio_order(folio) : 0;
 
 	/*
 	 * If the request exceeds the readahead window, allow the read to
@@ -575,8 +577,9 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	 * It's the expected callback index, assume sequential access.
 	 * Ramp up sizes, and push forward the readahead window.
 	 */
-	if ((index == (ra->start + ra->size - ra->async_size) ||
-	     index == (ra->start + ra->size))) {
+	expected = round_up(ra->start + ra->size - ra->async_size,
+			1UL << order);
+	if (index == expected || index == (ra->start + ra->size)) {
 		ra->start += ra->size;
 		ra->size = get_next_ra_size(ra, max_pages);
 		ra->async_size = ra->size;
@@ -662,7 +665,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	}
 
 	ractl->_index = ra->start;
-	page_cache_ra_order(ractl, ra, folio ? folio_order(folio) : 0);
+	page_cache_ra_order(ractl, ra, order);
 }
 
 void page_cache_sync_ra(struct readahead_control *ractl,
-- 
2.34.1



* Re: Questions about folio allocation.
  2022-04-27 21:15     ` Matthew Wilcox
@ 2022-04-28  9:09       ` Guo Xuenan
  0 siblings, 0 replies; 4+ messages in thread
From: Guo Xuenan @ 2022-04-28  9:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-kernel, houtao1, fangwei1

Hi Matthew,
On 2022/4/28 5:15, Matthew Wilcox wrote:
> On Sun, Apr 24, 2022 at 09:30:26PM +0800, Guo Xuenan wrote:
>> Hmm... sorry, my wording was not rigorous enough, but I think you partly
>> got it. I read the whole file, not only 100K * 4; in most cases the page
>> order is 2, which means that with this read pattern, the order of the
>> folio with the readahead flag is 0 in most cases.
>>
>> [root@localhost ]# echo 4096 > /sys/block/vdb/queue/read_ahead_kb
>> [root@localhost ]# echo 4096 > /sys/block/vdb/queue/max_sectors_kb
>> [root@localhost ]# bpftrace bpf.bt  > 100K
>> [root@localhost ]# cat 100K | awk '{print $11}' | sort | uniq -c
>>     884 0
>>   55945 2
>>       1 3
>>      14 4
>>       2 5
>>       5 6
>>
>> According to the readahead code, the initial order is taken from the
>> current folio with the readahead flag. Might it be better to base the
>> initial order on the size of the readahead window?
>> (eg: if ra->size is big enough, set the order, considering index
>> alignment?)
> Try this patch; it should fix the problem you're seeing.  At least, it
> does in my tests.
I have tried it, and it looks much better now :). It seems that the
"folio order cannot reach MAX_PAGECACHE_ORDER" issue is also solved by
your new policy for setting the readahead flag.

I tried several different seq-read steps and haven't seen any problems yet.
The 100K-step result after applying your patch:
[root@localhost ]# cat 100K-fix | awk '{print $11}' | sort | uniq -c
   8759 2
   2192 4
    552 6
      5 7
    140 8
   4787 9
> From 89539907eb14b0723d457e77a18cc5af5e13db8f Mon Sep 17 00:00:00 2001
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Date: Wed, 27 Apr 2022 17:01:28 -0400
> Subject: [PATCH] mm/readahead: Fix readahead with large folios
>
> Reading 100KB chunks from a big file (eg dd bs=100K) leads to poor
> readahead behaviour.  Studying the traces in detail, I noticed two
> problems.
>
> The first is that we were setting the readahead flag on the folio which
> contains the last byte read from the block.  This is wrong because we
> will trigger readahead at the end of the read without waiting to see
> if a subsequent read is going to use the pages we just read.  Instead,
> we need to set the readahead flag on the first folio _after_ the one
> which contains the last byte that we're reading.
>
> The second is that we were looking for the index of the folio with the
> readahead flag set to exactly match the start + size - async_size.
> If we've rounded this, either down (as previously) or up (as now),
> we'll think we hit a folio marked as readahead by a different read,
> and try to read the wrong pages.  So round the expected index to the
> order of the folio we hit.
>
> Reported-by: Guo Xuenan <guoxuenan@huawei.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>   mm/readahead.c | 15 +++++++++------
>   1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 8e3775829513..4a60cdb64262 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -474,7 +474,8 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
>   
>   	if (!folio)
>   		return -ENOMEM;
> -	if (mark - index < (1UL << order))
> +	mark = round_up(mark, 1UL << order);
> +	if (index == mark)
>   		folio_set_readahead(folio);
>   	err = filemap_add_folio(ractl->mapping, folio, index, gfp);
>   	if (err)
> @@ -555,8 +556,9 @@ static void ondemand_readahead(struct readahead_control *ractl,
>   	struct file_ra_state *ra = ractl->ra;
>   	unsigned long max_pages = ra->ra_pages;
>   	unsigned long add_pages;
> -	unsigned long index = readahead_index(ractl);
> -	pgoff_t prev_index;
> +	pgoff_t index = readahead_index(ractl);
> +	pgoff_t expected, prev_index;
> +	unsigned int order = folio ? folio_order(folio) : 0;
>   
>   	/*
>   	 * If the request exceeds the readahead window, allow the read to
> @@ -575,8 +577,9 @@ static void ondemand_readahead(struct readahead_control *ractl,
>   	 * It's the expected callback index, assume sequential access.
>   	 * Ramp up sizes, and push forward the readahead window.
>   	 */
> -	if ((index == (ra->start + ra->size - ra->async_size) ||
> -	     index == (ra->start + ra->size))) {
> +	expected = round_up(ra->start + ra->size - ra->async_size,
> +			1UL << order);
> +	if (index == expected || index == (ra->start + ra->size)) {
>   		ra->start += ra->size;
>   		ra->size = get_next_ra_size(ra, max_pages);
>   		ra->async_size = ra->size;
> @@ -662,7 +665,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
>   	}
>   
>   	ractl->_index = ra->start;
> -	page_cache_ra_order(ractl, ra, folio ? folio_order(folio) : 0);
> +	page_cache_ra_order(ractl, ra, order);
>   }
>   
>   void page_cache_sync_ra(struct readahead_control *ractl,
Thanks.
Guo Xuenan

