Linux-BTRFS Archive on lore.kernel.org
* [RFC PATCH] btrfs: Remove __extent_readpages
@ 2018-12-03 10:25 Nikolay Borisov
  2018-12-03 14:25 ` Nikolay Borisov
  2018-12-05 16:58 ` Josef Bacik
  0 siblings, 2 replies; 4+ messages in thread
From: Nikolay Borisov @ 2018-12-03 10:25 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

When extent_readpages is called from the generic readahead code it first
builds a batch of 16 pages (which might or might not be consecutive,
depending on whether add_to_page_cache_lru failed) and submits them to
__extent_readpages. The latter ensures that the range of pages (in the
batch of 16) that is passed to __do_contiguous_readpages is consecutive.

If add_to_page_cache_lru doesn't fail then __extent_readpages will call
__do_contiguous_readpages only once with the whole batch of 16.
Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an
example) then the contiguous page read code will be called twice.
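The grouping behavior described here can be modeled outside the kernel; the
following is an illustrative Python sketch (hypothetical helper name, not
kernel code) of partitioning a batch of page indices into maximal contiguous
runs, which is the job the old __extent_readpages performed:

```python
def contiguous_runs(page_indices):
    """Split a sorted batch of page indices into maximal contiguous runs,
    mirroring how __extent_readpages grouped pages before submitting each
    run to __do_contiguous_readpages."""
    runs = []
    run = []
    for idx in page_indices:
        if run and idx != run[-1] + 1:
            runs.append(run)   # gap found: close the run built so far
            run = []
        run.append(idx)
    if run:
        runs.append(run)
    return runs

# A full batch of 16 is one run; a gap at the 8th page yields two runs.
print(len(contiguous_runs(list(range(16)))))                    # 1
print(len(contiguous_runs([i for i in range(16) if i != 7])))   # 2
```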

All of this can be simplified by exploiting the fact that all pages passed
to extent_readpages are consecutive, thus when the batch is built in
that function it is already consecutive (barring add_to_page_cache_lru
failures) and is ready to be submitted directly to __do_contiguous_readpages.
Also simplify the name of the function to contiguous_readpages.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---

So this patch looks like a very nice cleanup; however, when doing performance
measurements with fio I was shocked to see that it is actually detrimental to
performance. Here are the results:

The command line used for fio: 
fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
 --numjobs=1 --size=1G --runtime=600  --group_reporting --loop 10

This was tested on a VM with 4G of RAM, so the size of the test is smaller than
memory and pages should have been nicely read ahead.

PATCHED: 

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=519MiB/s][r=133k IOPS][eta 00m:00s] 
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=3722: Mon Dec  3 09:57:17 2018
  read: IOPS=78.4k, BW=306MiB/s (321MB/s)(10.0GiB/33444msec)
    clat (nsec): min=1703, max=9042.7k, avg=5463.97, stdev=121068.28
     lat (usec): min=2, max=9043, avg= 6.00, stdev=121.07
    clat percentiles (nsec):
     |  1.00th=[   1848],  5.00th=[   1896], 10.00th=[   1912],
     | 20.00th=[   1960], 30.00th=[   2024], 40.00th=[   2160],
     | 50.00th=[   2384], 60.00th=[   2576], 70.00th=[   2800],
     | 80.00th=[   3120], 90.00th=[   3824], 95.00th=[   4768],
     | 99.00th=[   7968], 99.50th=[  14912], 99.90th=[  50944],
     | 99.95th=[ 667648], 99.99th=[5931008]
   bw (  KiB/s): min= 2768, max=544542, per=100.00%, avg=409912.68, stdev=162333.72, samples=50
   iops        : min=  692, max=136135, avg=102478.08, stdev=40583.47, samples=50
  lat (usec)   : 2=25.93%, 4=65.58%, 10=7.69%, 20=0.57%, 50=0.13%
  lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%
  cpu          : usr=7.20%, sys=92.55%, ctx=396, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=306MiB/s (321MB/s), 306MiB/s-306MiB/s (321MB/s-321MB/s), io=10.0GiB (10.7GB), run=33444-33444msec


UNPATCHED:

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=568MiB/s][r=145k IOPS][eta 00m:00s] 
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=640: Mon Dec  3 10:07:38 2018
  read: IOPS=90.4k, BW=353MiB/s (370MB/s)(10.0GiB/29008msec)
    clat (nsec): min=1418, max=12374k, avg=4816.38, stdev=109448.00
     lat (nsec): min=1836, max=12374k, avg=5284.46, stdev=109451.36
    clat percentiles (nsec):
     |  1.00th=[   1576],  5.00th=[   1608], 10.00th=[   1640],
     | 20.00th=[   1672], 30.00th=[   1720], 40.00th=[   1832],
     | 50.00th=[   2096], 60.00th=[   2288], 70.00th=[   2480],
     | 80.00th=[   2736], 90.00th=[   3248], 95.00th=[   3952],
     | 99.00th=[   6368], 99.50th=[  12736], 99.90th=[  43776],
     | 99.95th=[ 798720], 99.99th=[5341184]
   bw (  KiB/s): min=34144, max=606208, per=100.00%, avg=465737.56, stdev=177637.57, samples=45
   iops        : min= 8536, max=151552, avg=116434.33, stdev=44409.46, samples=45
  lat (usec)   : 2=45.74%, 4=49.50%, 10=4.13%, 20=0.45%, 50=0.08%
  lat (usec)   : 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.01%
  cpu          : usr=7.14%, sys=92.39%, ctx=1059, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=353MiB/s (370MB/s), 353MiB/s-353MiB/s (370MB/s-370MB/s), io=10.0GiB (10.7GB), run=29008-29008msec

Clearly both bandwidth and IOPS are worse. However, I'm puzzled as to why this
is the case, given that I don't see how this patch affects the submission of
readahead IO.
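As a sanity check on the fio summaries above, the reported bandwidth follows
directly from total io and runtime (a quick back-of-the-envelope, not part of
fio itself):

```python
def bw_mib_s(io_gib, run_msec):
    """Bandwidth in MiB/s from fio's reported total io and runtime."""
    return io_gib * 1024 / (run_msec / 1000.0)

print(round(bw_mib_s(10.0, 33444)))  # patched run:   306 MiB/s
print(round(bw_mib_s(10.0, 29008)))  # unpatched run: 353 MiB/s
```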

 fs/btrfs/extent_io.c | 63 +++++++++++++-------------------------------
 1 file changed, 19 insertions(+), 44 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 233f835dd6f8..c097eec1b73d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3059,7 +3059,7 @@ static int __do_readpage(struct extent_io_tree *tree,
 	return ret;
 }
 
-static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
+static inline void contiguous_readpages(struct extent_io_tree *tree,
 					     struct page *pages[], int nr_pages,
 					     u64 start, u64 end,
 					     struct extent_map **em_cached,
@@ -3090,46 +3090,6 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
 	}
 }
 
-static void __extent_readpages(struct extent_io_tree *tree,
-			       struct page *pages[],
-			       int nr_pages,
-			       struct extent_map **em_cached,
-			       struct bio **bio, unsigned long *bio_flags,
-			       u64 *prev_em_start)
-{
-	u64 start = 0;
-	u64 end = 0;
-	u64 page_start;
-	int index;
-	int first_index = 0;
-
-	for (index = 0; index < nr_pages; index++) {
-		page_start = page_offset(pages[index]);
-		if (!end) {
-			start = page_start;
-			end = start + PAGE_SIZE - 1;
-			first_index = index;
-		} else if (end + 1 == page_start) {
-			end += PAGE_SIZE;
-		} else {
-			__do_contiguous_readpages(tree, &pages[first_index],
-						  index - first_index, start,
-						  end, em_cached,
-						  bio, bio_flags,
-						  prev_em_start);
-			start = page_start;
-			end = start + PAGE_SIZE - 1;
-			first_index = index;
-		}
-	}
-
-	if (end)
-		__do_contiguous_readpages(tree, &pages[first_index],
-					  index - first_index, start,
-					  end, em_cached, bio,
-					  bio_flags, prev_em_start);
-}
-
 static int __extent_read_full_page(struct extent_io_tree *tree,
 				   struct page *page,
 				   get_extent_t *get_extent,
@@ -4104,6 +4064,13 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
 	u64 prev_em_start = (u64)-1;
 
 	while (!list_empty(pages)) {
+		u64 contig_end = 0;
+
+		/*
+		 * Produces a batch of up to 16 contiguous pages, assumes
+		 * that pages are consecutive in pages list (guaranteed by the
+		 * generic code)
+		 **/
 		for (nr = 0; nr < BTRFS_PAGES_BATCH && !list_empty(pages);) {
 			struct page *page = lru_to_page(pages);
 
@@ -4112,14 +4079,22 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
 			if (add_to_page_cache_lru(page, mapping, page->index,
 						readahead_gfp_mask(mapping))) {
 				put_page(page);
-				continue;
+				break;
 			}
 
 			pagepool[nr++] = page;
+			contig_end = page_offset(page) + PAGE_SIZE - 1;
 		}
 
-		__extent_readpages(tree, pagepool, nr, &em_cached, &bio,
-				   &bio_flags, &prev_em_start);
+		if (nr) {
+			u64 contig_start = page_offset(pagepool[0]);
+
+			ASSERT(contig_start + (nr*PAGE_SIZE) - 1 == contig_end);
+
+			contiguous_readpages(tree, pagepool, nr, contig_start,
+				     contig_end, &em_cached, &bio, &bio_flags,
+				     &prev_em_start);
+		}
 	}
 
 	if (em_cached)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC PATCH] btrfs: Remove __extent_readpages
  2018-12-03 10:25 [RFC PATCH] btrfs: Remove __extent_readpages Nikolay Borisov
@ 2018-12-03 14:25 ` Nikolay Borisov
  2018-12-05 16:58 ` Josef Bacik
  1 sibling, 0 replies; 4+ messages in thread
From: Nikolay Borisov @ 2018-12-03 14:25 UTC (permalink / raw)
  To: linux-btrfs



On 3.12.18 at 12:25, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
> 
> If add_to_page_cache_lru doesn't fail then __extent_readpages will call
> __do_contiguous_readpages only once with the whole batch of 16.
> Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an
> example) then the contiguous page read code will be called twice.
> 
> All of this can be simplified by exploiting the fact that all pages passed
> to extent_readpages are consecutive, thus when the batch is built in
> that function it is already consecutive (barring add_to_page_cache_lru
> failures) and is ready to be submitted directly to __do_contiguous_readpages.
> Also simplify the name of the function to contiguous_readpages.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
> 
> So this patch looks like a very nice cleanup, however when doing performance 
> measurements with fio I was shocked to see that it actually is detrimental to 
> performance. Here are the results: 
> 
> The command line used for fio: 
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
>  --numjobs=1 --size=1G --runtime=600  --group_reporting --loop 10
> 
> This was tested on a vm with 4g of ram so the size of the test is smaller than 
> the memory, so pages should have been nicely readahead. 
> 
> PATCHED: 
> 
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=519MiB/s][r=133k IOPS][eta 00m:00s] 
> /media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=3722: Mon Dec  3 09:57:17 2018
>   read: IOPS=78.4k, BW=306MiB/s (321MB/s)(10.0GiB/33444msec)
>     clat (nsec): min=1703, max=9042.7k, avg=5463.97, stdev=121068.28
>      lat (usec): min=2, max=9043, avg= 6.00, stdev=121.07
>     clat percentiles (nsec):
>      |  1.00th=[   1848],  5.00th=[   1896], 10.00th=[   1912],
>      | 20.00th=[   1960], 30.00th=[   2024], 40.00th=[   2160],
>      | 50.00th=[   2384], 60.00th=[   2576], 70.00th=[   2800],
>      | 80.00th=[   3120], 90.00th=[   3824], 95.00th=[   4768],
>      | 99.00th=[   7968], 99.50th=[  14912], 99.90th=[  50944],
>      | 99.95th=[ 667648], 99.99th=[5931008]
>    bw (  KiB/s): min= 2768, max=544542, per=100.00%, avg=409912.68, stdev=162333.72, samples=50
>    iops        : min=  692, max=136135, avg=102478.08, stdev=40583.47, samples=50
>   lat (usec)   : 2=25.93%, 4=65.58%, 10=7.69%, 20=0.57%, 50=0.13%
>   lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%
>   cpu          : usr=7.20%, sys=92.55%, ctx=396, majf=0, minf=9
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    READ: bw=306MiB/s (321MB/s), 306MiB/s-306MiB/s (321MB/s-321MB/s), io=10.0GiB (10.7GB), run=33444-33444msec
> 
> 
> UNPATCHED:
> 
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=568MiB/s][r=145k IOPS][eta 00m:00s] 
> /media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=640: Mon Dec  3 10:07:38 2018
>   read: IOPS=90.4k, BW=353MiB/s (370MB/s)(10.0GiB/29008msec)
>     clat (nsec): min=1418, max=12374k, avg=4816.38, stdev=109448.00
>      lat (nsec): min=1836, max=12374k, avg=5284.46, stdev=109451.36
>     clat percentiles (nsec):
>      |  1.00th=[   1576],  5.00th=[   1608], 10.00th=[   1640],
>      | 20.00th=[   1672], 30.00th=[   1720], 40.00th=[   1832],
>      | 50.00th=[   2096], 60.00th=[   2288], 70.00th=[   2480],
>      | 80.00th=[   2736], 90.00th=[   3248], 95.00th=[   3952],
>      | 99.00th=[   6368], 99.50th=[  12736], 99.90th=[  43776],
>      | 99.95th=[ 798720], 99.99th=[5341184]
>    bw (  KiB/s): min=34144, max=606208, per=100.00%, avg=465737.56, stdev=177637.57, samples=45
>    iops        : min= 8536, max=151552, avg=116434.33, stdev=44409.46, samples=45
>   lat (usec)   : 2=45.74%, 4=49.50%, 10=4.13%, 20=0.45%, 50=0.08%
>   lat (usec)   : 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.01%
>   cpu          : usr=7.14%, sys=92.39%, ctx=1059, majf=0, minf=9
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    READ: bw=353MiB/s (370MB/s), 353MiB/s-353MiB/s (370MB/s-370MB/s), io=10.0GiB (10.7GB), run=29008-29008msec
> 
> Clearly both bandwidth and iops are worse. However I'm puzzled as to why this is 
> the case, given that I don't see how this patch affects the submission of 
> readahead io. 
> 

I took slightly different measurements, namely: the call count of
contiguous_readpages/__do_contiguous_readpages, a latency histogram of calls
to extent_readpages, as well as the output from cachetop (from the bcc tools).
I also switched qemu's cache setting to 'none' for the virtio device which fio
is writing to. The VM still has 4G of RAM and this time the fio command line is:

fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k --numjobs=1 --size=2G --runtime=600  --group_reporting --loop 10 (I've modified the file to be 2G rather than 1G as in the original patch)

So here are the results: 

UNPATCHED CASE: 

/media/scratch/seqread: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=470MiB/s][r=120k IOPS][eta 00m:00s] 
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=643: Mon Dec  3 13:55:09 2018
  read: IOPS=82.5k, BW=322MiB/s (338MB/s)(20.0GiB/63559msec)
    clat (nsec): min=1427, max=38694k, avg=5613.13, stdev=126477.18
     lat (nsec): min=1819, max=38695k, avg=6061.95, stdev=126479.46
    clat percentiles (nsec):
     |  1.00th=[   1480],  5.00th=[   1512], 10.00th=[   1528],
     | 20.00th=[   1560], 30.00th=[   1576], 40.00th=[   1592],
     | 50.00th=[   1624], 60.00th=[   1656], 70.00th=[   1720],
     | 80.00th=[   1896], 90.00th=[   2640], 95.00th=[   3024],
     | 99.00th=[   4768], 99.50th=[  12352], 99.90th=[1003520],
     | 99.95th=[3817472], 99.99th=[4685824]
   bw (  KiB/s): min=45728, max=549616, per=100.00%, avg=460140.08, stdev=144985.90, samples=91
   iops        : min=11432, max=137404, avg=115035.09, stdev=36246.50, samples=91
  lat (usec)   : 2=82.15%, 4=16.25%, 10=1.01%, 20=0.34%, 50=0.10%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.02%
  lat (msec)   : 2=0.02%, 4=0.04%, 10=0.04%, 20=0.01%, 50=0.01%
  cpu          : usr=5.81%, sys=81.11%, ctx=5682, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

FUNC                                    COUNT
__do_contiguous_readpages              327935

Here is a histogram of latencies in microseconds

         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 10       |                                        |
        64 -> 127        : 20       |                                        |
       128 -> 255        : 6        |                                        |
       256 -> 511        : 13       |                                        |
       512 -> 1023       : 11       |                                        |
      1024 -> 2047       : 10       |                                        |
      2048 -> 4095       : 2516     |****************************************|
      4096 -> 8191       : 34       |                                        |

Output from cachetop for the duration of the test: 

   TOTAL   MISSES     HITS  DIRTIES   BUFFERS_MB  CACHED_MB
     354        1      353        1           14       1063
  543360   524288    19072        0           14       2232
  379788   369092    10696        0           14       1956
  329056   316960    12096        0           14       1147
  377836   362181    15655        0           14        766
  543360   524288    19072        0           14       2036
  437191   423764    13427        0           14       2169
  327289   317440     9849        0           14       1361
  322175   307376    14799        0           14        513
  543352   524284    19068        0           14       1680
  481780   466852    14928        0           14       2338
  343440   331528    11912        0           14       1585
  344403   333824    10579        0           14        841
  564139   440492   123647        0           14       2160

total hits: 294800
total misses: 5242370

PATCHED CASE: 

/media/scratch/seqread: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=529MiB/s][r=135k IOPS][eta 00m:00s] 
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=620: Mon Dec  3 11:56:49 2018
  read: IOPS=81.2k, BW=317MiB/s (332MB/s)(10.0GiB/32302msec)
    clat (nsec): min=1428, max=152945k, avg=5927.07, stdev=246422.91
     lat (nsec): min=1837, max=152946k, avg=6427.46, stdev=246426.89
    clat percentiles (nsec):
     |  1.00th=[   1496],  5.00th=[   1528], 10.00th=[   1560],
     | 20.00th=[   1608], 30.00th=[   1672], 40.00th=[   1736],
     | 50.00th=[   1880], 60.00th=[   2192], 70.00th=[   2448],
     | 80.00th=[   2736], 90.00th=[   3280], 95.00th=[   4080],
     | 99.00th=[   7328], 99.50th=[  14528], 99.90th=[ 242688],
     | 99.95th=[3719168], 99.99th=[5013504]
   bw (  KiB/s): min= 1432, max=549808, per=100.00%, avg=396769.62, stdev=175814.08, samples=52
   iops        : min=  358, max=137452, avg=99192.48, stdev=43953.67, samples=52
  lat (usec)   : 2=53.85%, 4=40.84%, 10=4.54%, 20=0.47%, 50=0.15%
  lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.02%, 10=0.04%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=6.53%, sys=83.13%, ctx=2746, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=317MiB/s (332MB/s), 317MiB/s-317MiB/s (332MB/s-332MB/s), io=10.0GiB (10.7GB), run=32302-32302msec

FUNC                                    COUNT
contiguous_readpages                   327935

So the total number of times we call the low-level submission function is
identical across the two versions. This implies that the patch doesn't result
in worse "merge" behavior, i.e. submitting more but smaller contiguous chunks
via contiguous_readpages.

Cachetop: 
   TOTAL   MISSES     HITS  DIRTIES   BUFFERS_MB  CACHED_MB
  475016   459690    15326        0           15       2313                     
  316483   309138     7345        2           15       1472                     
  326726   315393    11333        1           15        656                     
  508038   487596    20442        0           15       1710                     
  470487   456532    13955        0           15       2296                     
  329494   318160    11334        0           15       1492                     
  312000   300217    11783        0           15        617                     
  518013   497629    20384        0           15       1552                     
  518780   503636    15144        0           15       2480                     
  338471   325632    12839        0           15       1705                     
  325678   315328    10350        0           15        888                     
  447023   428204    18819        0           15       1127                     
  647182   524289   122893        0           15       2160                     
                                                                                
total hits: 291947    <--- hits are negligibly smaller
total misses: 5241444 <--- so are the misses
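The "negligible" difference can be quantified as a page cache hit ratio
(simple arithmetic over the cachetop totals from the two runs above):

```python
def hit_ratio(hits, misses):
    """Page cache hit ratio from cachetop-style hit/miss totals."""
    return hits / (hits + misses)

# Totals transcribed from the unpatched and patched cachetop runs.
print(f"unpatched: {hit_ratio(294800, 5242370):.2%}")
print(f"patched:   {hit_ratio(291947, 5241444):.2%}")
# Both come out around 5.3%, with the patched run negligibly lower.
```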





* Re: [RFC PATCH] btrfs: Remove __extent_readpages
  2018-12-03 10:25 [RFC PATCH] btrfs: Remove __extent_readpages Nikolay Borisov
  2018-12-03 14:25 ` Nikolay Borisov
@ 2018-12-05 16:58 ` Josef Bacik
  2018-12-10  8:41   ` Nikolay Borisov
  1 sibling, 1 reply; 4+ messages in thread
From: Josef Bacik @ 2018-12-05 16:58 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Mon, Dec 03, 2018 at 12:25:32PM +0200, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
> 
> If add_to_page_cache_lru doesn't fail then __extent_readpages will call
> __do_contiguous_readpages only once with the whole batch of 16.
> Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an
> example) then the contiguous page read code will be called twice.
> 
> All of this can be simplified by exploiting the fact that all pages passed
> to extent_readpages are consecutive, thus when the batch is built in
> that function it is already consecutive (barring add_to_page_cache_lru
> failures) and is ready to be submitted directly to __do_contiguous_readpages.
> Also simplify the name of the function to contiguous_readpages.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
> 
> So this patch looks like a very nice cleanup, however when doing performance 
> measurements with fio I was shocked to see that it actually is detrimental to 
> performance. Here are the results: 
> 
> The command line used for fio: 
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
>  --numjobs=1 --size=1G --runtime=600  --group_reporting --loop 10
> 
> This was tested on a vm with 4g of ram so the size of the test is smaller than 
> the memory, so pages should have been nicely readahead. 
> 

What this patch changes is that now we aren't reading all of the pages we are
passed by the readahead code, so we fall back to per-page reading when we go
to read those pages because the readahead window has already moved past them.
This isn't great behavior; what we have now is nice in that it tries to group
the entire range together as much as possible.  What your patch changes is
that as soon as we stop having a contiguous range we just bail and submit what
we have.  Testing it in isolation is likely to show very little change, but
having recently touched the fault-in code, where we definitely do not count
major faults in some cases, I'd suspect that we're not reflecting this higher
fault rate in the performance counters properly.  We should preserve the
existing behavior; what hurts a little bit on a lightly loaded box is going to
hurt a whole lot more on a heavily loaded box.  Thanks,

Josef


* Re: [RFC PATCH] btrfs: Remove __extent_readpages
  2018-12-05 16:58 ` Josef Bacik
@ 2018-12-10  8:41   ` Nikolay Borisov
  0 siblings, 0 replies; 4+ messages in thread
From: Nikolay Borisov @ 2018-12-10  8:41 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs



On 5.12.18 at 18:58, Josef Bacik wrote:
> On Mon, Dec 03, 2018 at 12:25:32PM +0200, Nikolay Borisov wrote:
>> When extent_readpages is called from the generic readahead code it first
>> builds a batch of 16 pages (which might or might not be consecutive,
>> depending on whether add_to_page_cache_lru failed) and submits them to
>> __extent_readpages. The latter ensures that the range of pages (in the
>> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
>>
>> If add_to_page_cache_lru doesn't fail then __extent_readpages will call
>> __do_contiguous_readpages only once with the whole batch of 16.
>> Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an
>> example) then the contiguous page read code will be called twice.
>>
>> All of this can be simplified by exploiting the fact that all pages passed
>> to extent_readpages are consecutive, thus when the batch is built in
>> that function it is already consecutive (barring add_to_page_cache_lru
>> failures) and is ready to be submitted directly to __do_contiguous_readpages.
>> Also simplify the name of the function to contiguous_readpages.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> ---
>>
>> So this patch looks like a very nice cleanup, however when doing performance 
>> measurements with fio I was shocked to see that it actually is detrimental to 
>> performance. Here are the results: 
>>
>> The command line used for fio: 
>> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
>>  --numjobs=1 --size=1G --runtime=600  --group_reporting --loop 10
>>
>> This was tested on a vm with 4g of ram so the size of the test is smaller than 
>> the memory, so pages should have been nicely readahead. 
>>
> 
> What this patch changes is now we aren't reading all of the pages we are passed
> by the readahead, so now we fall back to per-page reading when we go to read
> those pages because the readahead window has already moved past them.  This
> isn't great behavior, what we have is nice in that it tries to group the entire
> range together as much as possible.  What your patch changes is that as soon as
> we stop having a contiguous range we just bail and submit what we have.  Testing
> it in isolation is likely to show very little change, but having recently
> touched the fault in code where we definitely do not count major faults in some
> cases I'd suspect that we're not reflecting this higher fault rate in the
> performance counters properly.  We should preserve the existing behavior, what
> hurts a little bit on a lightly loaded box is going to hurt a whole lot more on
> a heavily loaded box.  Thanks,

Following our recent chat regarding this (and the previous prep patch) I
think neither constitutes a difference in behavior; they just make the
code a bit clearer. Also, Mel was kind enough to run more rigorous
performance tests and the results were identical.

The main concern in your statement comes from the fact that the prep
patch actually introduced a for loop which creates the batch, so the
break you see here is for this nested loop and not for the main loop
that reads all pages passed by the RA code.
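The nested-loop structure being described can be sketched as a toy model (pure
illustration with hypothetical names, not the kernel code):

```python
def read_all(pages, batch_size=16, failing=frozenset()):
    """Toy model of extent_readpages after the patch: the inner 'break'
    (taken on an add_to_page_cache_lru failure) ends only the current
    batch; the outer loop keeps draining the page list, so no page other
    than the failing one is skipped."""
    submitted = []
    i = 0
    while i < len(pages):
        batch = []
        while len(batch) < batch_size and i < len(pages):
            page = pages[i]
            i += 1
            if page in failing:
                break  # ends this batch only, mirroring the nested loop
            batch.append(page)
        if batch:
            submitted.append(batch)
    return submitted

# 40 pages with page 7 failing: three batches, everything else still read.
print([len(b) for b in read_all(list(range(40)), failing={7})])
```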

> 
> Josef
> 

