From: Nikolay Borisov <nborisov@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: [RFC PATCH] btrfs: Remove __extent_readpages
Date: Mon, 3 Dec 2018 16:25:13 +0200 [thread overview]
Message-ID: <51be8fe3-e319-6666-2511-342446297f61@suse.com> (raw)
In-Reply-To: <20181203102532.13945-1-nborisov@suse.com>
On 3.12.18 at 12:25, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
>
> If add_to_page_cache_lru doesn't fail, then __extent_readpages will call
> __do_contiguous_readpages only once with the whole batch of 16. If, for
> example, add_to_page_cache_lru fails on the 8th page, then the contiguous
> page read code will be called twice.
>
> All of this can be simplified by exploiting the fact that all pages passed
> to extent_readpages are consecutive, thus when the batch is built in
> that function it is already consecutive (barring add_to_page_cache_lru
> failures) and is ready to be submitted directly to __do_contiguous_readpages.
> Also simplify the name of the function to contiguous_readpages.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>
> So this patch looks like a very nice cleanup. However, when doing performance
> measurements with fio, I was shocked to see that it is actually detrimental
> to performance. Here are the results:
>
> The command line used for fio:
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
> --numjobs=1 --size=1G --runtime=600 --group_reporting --loop 10
>
> This was tested on a VM with 4G of RAM, so the size of the test file is
> smaller than memory and the pages should have been nicely read ahead.
>
> PATCHED:
>
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=519MiB/s][r=133k IOPS][eta 00m:00s]
> /media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=3722: Mon Dec 3 09:57:17 2018
> read: IOPS=78.4k, BW=306MiB/s (321MB/s)(10.0GiB/33444msec)
> clat (nsec): min=1703, max=9042.7k, avg=5463.97, stdev=121068.28
> lat (usec): min=2, max=9043, avg= 6.00, stdev=121.07
> clat percentiles (nsec):
> | 1.00th=[ 1848], 5.00th=[ 1896], 10.00th=[ 1912],
> | 20.00th=[ 1960], 30.00th=[ 2024], 40.00th=[ 2160],
> | 50.00th=[ 2384], 60.00th=[ 2576], 70.00th=[ 2800],
> | 80.00th=[ 3120], 90.00th=[ 3824], 95.00th=[ 4768],
> | 99.00th=[ 7968], 99.50th=[ 14912], 99.90th=[ 50944],
> | 99.95th=[ 667648], 99.99th=[5931008]
> bw ( KiB/s): min= 2768, max=544542, per=100.00%, avg=409912.68, stdev=162333.72, samples=50
> iops : min= 692, max=136135, avg=102478.08, stdev=40583.47, samples=50
> lat (usec) : 2=25.93%, 4=65.58%, 10=7.69%, 20=0.57%, 50=0.13%
> lat (usec) : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=0.01%, 4=0.01%, 10=0.05%
> cpu : usr=7.20%, sys=92.55%, ctx=396, majf=0, minf=9
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> READ: bw=306MiB/s (321MB/s), 306MiB/s-306MiB/s (321MB/s-321MB/s), io=10.0GiB (10.7GB), run=33444-33444msec
>
>
> UNPATCHED:
>
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=568MiB/s][r=145k IOPS][eta 00m:00s]
> /media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=640: Mon Dec 3 10:07:38 2018
> read: IOPS=90.4k, BW=353MiB/s (370MB/s)(10.0GiB/29008msec)
> clat (nsec): min=1418, max=12374k, avg=4816.38, stdev=109448.00
> lat (nsec): min=1836, max=12374k, avg=5284.46, stdev=109451.36
> clat percentiles (nsec):
> | 1.00th=[ 1576], 5.00th=[ 1608], 10.00th=[ 1640],
> | 20.00th=[ 1672], 30.00th=[ 1720], 40.00th=[ 1832],
> | 50.00th=[ 2096], 60.00th=[ 2288], 70.00th=[ 2480],
> | 80.00th=[ 2736], 90.00th=[ 3248], 95.00th=[ 3952],
> | 99.00th=[ 6368], 99.50th=[ 12736], 99.90th=[ 43776],
> | 99.95th=[ 798720], 99.99th=[5341184]
> bw ( KiB/s): min=34144, max=606208, per=100.00%, avg=465737.56, stdev=177637.57, samples=45
> iops : min= 8536, max=151552, avg=116434.33, stdev=44409.46, samples=45
> lat (usec) : 2=45.74%, 4=49.50%, 10=4.13%, 20=0.45%, 50=0.08%
> lat (usec) : 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.01%
> cpu : usr=7.14%, sys=92.39%, ctx=1059, majf=0, minf=9
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> READ: bw=353MiB/s (370MB/s), 353MiB/s-353MiB/s (370MB/s-370MB/s), io=10.0GiB (10.7GB), run=29008-29008msec
>
> Clearly both bandwidth and IOPS are worse. However, I'm puzzled as to why
> this is the case, given that I don't see how this patch affects the
> submission of readahead I/O.
>
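To make the mechanism concrete, here is a small userspace model of the
batch-splitting logic described in the quoted text (the names are made up for
illustration; this is not the actual kernel code). Unpatched,
__extent_readpages walks the batch that extent_readpages built and cuts it at
every gap in the page indexes, where a gap can only come from a failed
add_to_page_cache_lru. The patch drops that second walk, since
extent_readpages already knows where the gaps are at the moment it builds the
batch:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for __do_contiguous_readpages: just report the run. */
static void do_contiguous_readpages(const unsigned long *idx, size_t nr)
{
	printf("submit %zu page(s): index %lu..%lu\n", nr, idx[0], idx[nr - 1]);
}

/* Model of __extent_readpages: split a batch into contiguous runs. */
static void split_batch(const unsigned long *idx, size_t nr)
{
	size_t first = 0;
	size_t i;

	for (i = 1; i <= nr; i++) {
		/* A run ends at the end of the batch or at an index gap. */
		if (i == nr || idx[i] != idx[i - 1] + 1) {
			do_contiguous_readpages(&idx[first], i - first);
			first = i;
		}
	}
}

int main(void)
{
	/* A batch of 16 where add_to_page_cache_lru "failed" on index 7,
	 * leaving a hole, so two runs are submitted. */
	unsigned long idx[] = { 0, 1, 2, 3, 4, 5, 6,
				8, 9, 10, 11, 12, 13, 14, 15 };

	split_batch(idx, sizeof(idx) / sizeof(idx[0]));
	return 0;
}

Running this prints two runs (0..6 and 8..15), i.e. the "called twice" case
from the quoted description; with no failure it would print a single run
of 16.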
I took slightly different measurements this time, namely: the call count of
contiguous_readpages/__do_contiguous_readpages, a latency histogram of the
calls to extent_readpages, and the output of cachetop (from the bcc tools).
I also switched qemu's cache setting to 'none' for the virtio device that fio
runs against. The VM still has 4G of RAM, and this time the fio command line
is:

fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync \
    --bs=4k --numjobs=1 --size=2G --runtime=600 --group_reporting --loop 10

(I've increased the file size to 2G, rather than 1G as in the original
posting.)
So here are the results:
UNPATCHED CASE:
/media/scratch/seqread: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=470MiB/s][r=120k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=643: Mon Dec 3 13:55:09 2018
read: IOPS=82.5k, BW=322MiB/s (338MB/s)(20.0GiB/63559msec)
clat (nsec): min=1427, max=38694k, avg=5613.13, stdev=126477.18
lat (nsec): min=1819, max=38695k, avg=6061.95, stdev=126479.46
clat percentiles (nsec):
| 1.00th=[ 1480], 5.00th=[ 1512], 10.00th=[ 1528],
| 20.00th=[ 1560], 30.00th=[ 1576], 40.00th=[ 1592],
| 50.00th=[ 1624], 60.00th=[ 1656], 70.00th=[ 1720],
| 80.00th=[ 1896], 90.00th=[ 2640], 95.00th=[ 3024],
| 99.00th=[ 4768], 99.50th=[ 12352], 99.90th=[1003520],
| 99.95th=[3817472], 99.99th=[4685824]
bw ( KiB/s): min=45728, max=549616, per=100.00%, avg=460140.08, stdev=144985.90, samples=91
iops : min=11432, max=137404, avg=115035.09, stdev=36246.50, samples=91
lat (usec) : 2=82.15%, 4=16.25%, 10=1.01%, 20=0.34%, 50=0.10%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.02%
lat (msec) : 2=0.02%, 4=0.04%, 10=0.04%, 20=0.01%, 50=0.01%
cpu : usr=5.81%, sys=81.11%, ctx=5682, majf=0, minf=10
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
FUNC COUNT
__do_contiguous_readpages 327935
Here is a histogram of the extent_readpages latencies, in microseconds:
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 10 | |
64 -> 127 : 20 | |
128 -> 255 : 6 | |
256 -> 511 : 13 | |
512 -> 1023 : 11 | |
1024 -> 2047 : 10 | |
2048 -> 4095 : 2516 |****************************************|
4096 -> 8191 : 34 | |
Output from cachetop for the duration of the test:
TOTAL MISSES HITS DIRTIES BUFFERS_MB CACHED_MB
354 1 353 1 14 1063
543360 524288 19072 0 14 2232
379788 369092 10696 0 14 1956
329056 316960 12096 0 14 1147
377836 362181 15655 0 14 766
543360 524288 19072 0 14 2036
437191 423764 13427 0 14 2169
327289 317440 9849 0 14 1361
322175 307376 14799 0 14 513
543352 524284 19068 0 14 1680
481780 466852 14928 0 14 2338
343440 331528 11912 0 14 1585
344403 333824 10579 0 14 841
564139 440492 123647 0 14 2160
total hits: 294800
total misses: 5242370
PATCHED CASE:
/media/scratch/seqread: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=529MiB/s][r=135k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=620: Mon Dec 3 11:56:49 2018
read: IOPS=81.2k, BW=317MiB/s (332MB/s)(10.0GiB/32302msec)
clat (nsec): min=1428, max=152945k, avg=5927.07, stdev=246422.91
lat (nsec): min=1837, max=152946k, avg=6427.46, stdev=246426.89
clat percentiles (nsec):
| 1.00th=[ 1496], 5.00th=[ 1528], 10.00th=[ 1560],
| 20.00th=[ 1608], 30.00th=[ 1672], 40.00th=[ 1736],
| 50.00th=[ 1880], 60.00th=[ 2192], 70.00th=[ 2448],
| 80.00th=[ 2736], 90.00th=[ 3280], 95.00th=[ 4080],
| 99.00th=[ 7328], 99.50th=[ 14528], 99.90th=[ 242688],
| 99.95th=[3719168], 99.99th=[5013504]
bw ( KiB/s): min= 1432, max=549808, per=100.00%, avg=396769.62, stdev=175814.08, samples=52
iops : min= 358, max=137452, avg=99192.48, stdev=43953.67, samples=52
lat (usec) : 2=53.85%, 4=40.84%, 10=4.54%, 20=0.47%, 50=0.15%
lat (usec) : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.02%, 10=0.04%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=6.53%, sys=83.13%, ctx=2746, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=317MiB/s (332MB/s), 317MiB/s-317MiB/s (332MB/s-332MB/s), io=10.0GiB (10.7GB), run=32302-32302msec
FUNC COUNT
contiguous_readpages 327935
So the total number of times we call the low-level submission function is
identical across the two versions. This implies that the patch doesn't result
in worse "merge" behavior, i.e. submitting more but smaller contiguous chunks
via contiguous_readpages.
Cachetop:
TOTAL MISSES HITS DIRTIES BUFFERS_MB CACHED_MB
475016 459690 15326 0 15 2313
316483 309138 7345 2 15 1472
326726 315393 11333 1 15 656
508038 487596 20442 0 15 1710
470487 456532 13955 0 15 2296
329494 318160 11334 0 15 1492
312000 300217 11783 0 15 617
518013 497629 20384 0 15 1552
518780 503636 15144 0 15 2480
338471 325632 12839 0 15 1705
325678 315328 10350 0 15 888
447023 428204 18819 0 15 1127
647182 524289 122893 0 15 2160
total hits: 291947      <--- hits are negligibly smaller
total misses: 5241444   <--- so are the misses
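In relative terms, that's a page cache hit rate of 294800 / (294800 + 5242370)
≈ 5.32% unpatched vs 291947 / (291947 + 5241444) ≈ 5.28% patched, i.e. the
difference is well within noise.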