From: John Hubbard <jhubbard@nvidia.com>
To: Tom Talpey <tom@talpey.com>, <john.hubbard@gmail.com>,
<linux-mm@kvack.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
LKML <linux-kernel@vger.kernel.org>,
linux-rdma <linux-rdma@vger.kernel.org>,
<linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
Date: Tue, 20 Nov 2018 22:09:06 -0800
Message-ID: <97934904-2754-77e0-5fcb-83f2311362ee@nvidia.com>
In-Reply-To: <942cb823-9b18-69e7-84aa-557a68f9d7e9@talpey.com>
On 11/19/18 10:57 AM, Tom Talpey wrote:
> John, thanks for the discussion at LPC. One of the concerns we
> raised however was the performance test. The numbers below are
> rather obviously tainted. I think we need to get a better baseline
> before concluding anything...
>
> Here's my main concern:
>
Hi Tom,
Thanks again for looking at this!
> On 11/10/2018 3:50 AM, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>> ...
>> ------------------------------------------------------
>> WITHOUT the patch:
>> ------------------------------------------------------
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov 6 20:18:06 2018
>> read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
>
> ~14000 4KB read IOPS is really, really low for an NVMe disk.
Yes, but Jan Kara's original config file for fio is *intended* to highlight
the get_user_pages/put_user_pages changes. It was *not* intended to get max
performance, as you can see by the numjobs and direct IO parameters:
cat fio.conf
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64
So I'm thinking that this is not a "tainted" test, but rather, we're constraining
things a lot with these choices. It's hard to find a test config that supports
a clear decision, but so far, I'm not really seeing anything that says "this
is so bad that we can't afford to fix the brokenness." I think.
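As a quick cross-check of the single-job numbers quoted above, the IOPS and
bandwidth follow directly from the job size (size=1g, blocksize=4096) and the
run times fio reported (18826 msec without the patch, 19499 msec with it). The
following is only an arithmetic sketch in Python; the function and variable
names are made up for illustration and are not part of fio or of the patchset.

```python
# Recompute fio's headline numbers from job size and run time.
# Inputs are taken from the fio output quoted in this message;
# the helper name is hypothetical.

BLOCK_SIZE = 4096          # blocksize=4096
JOB_BYTES = 1 << 30        # size=1g, numjobs=1


def headline(runtime_msec):
    """Return (IOPS, MiB/s, MB/s) for one 1 GiB read job."""
    seconds = runtime_msec / 1000.0
    iops = (JOB_BYTES // BLOCK_SIZE) / seconds
    bytes_per_sec = JOB_BYTES / seconds
    return iops, bytes_per_sec / 2**20, bytes_per_sec / 1e6


without_patch = headline(18826)
with_patch = headline(19499)

for label, (iops, mib, mb) in (("without", without_patch), ("with", with_patch)):
    print(f"{label:8s} IOPS={iops / 1000:.1f}k BW={mib:.1f}MiB/s ({mb:.1f}MB/s)")

# Relative increase in run time for the patched kernel in this pair of runs.
print(f"run time increase: {(19499 / 18826 - 1) * 100:.1f}%")
```

This reproduces the 13.9k IOPS / 54.4MiB/s and 13.4k IOPS / 52.5MiB/s figures
from the fio summaries, so the difference in this single-job pair is simply the
ratio of the two run times, roughly 3.6%.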
After talking with you and reading this email, I did a bunch more test runs,
varying the following fio parameters:
-- direct
-- numjobs
-- iodepth
...with both the baseline 4.20-rc3 kernel, and with my patches applied. (btw, if
anyone cares, I'll post a github link that has a complete, testable patchset--not
ready for submission as such, but it works cleanly and will allow others to
attempt to reproduce my results).
What I'm seeing is that I can get 10x or better improvements in IOPS and BW,
just by going to 10 threads and turning off direct IO--as expected. So in the end,
for the comparison below, I kept direct IO, increased the number of threads to 20,
and also raised iodepth from 64 to 256.
Test results below...
>
>> cpu : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
>
> CPU is obviously the limiting factor. At these IOPS, it should be far
> less.
>> ------------------------------------------------------
>> OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
>> as the "without" case:
>> ------------------------------------------------------
>>
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov 6 20:01:33 2018
>> read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
>
> Similar low IOPS.
>
>> cpu : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
>
> Similar CPU saturation.
>
>>
>
> I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
> i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
> and fio version 3.1). Even then, the CPU saturates, so it's not
> necessarily a perfect test. I'd like to see your runs both get to
> "max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
> give the best comparison for making a decision.
I can get to CPU < 100% by increasing to 10 or 20 threads, although it
makes latency ever so much worse.
>
> Can you confirm what type of hardware you're running this test on?
> CPU, memory speed and capacity, and NVMe device especially?
>
> Tom.
Yes, it's a nice new system, so I don't expect any strange perf problems:
CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
(Intel X299 chipset)
Block device: nvme-Samsung_SSD_970_EVO_250GB
DRAM: 32 GB
So, here's a comparison using 20 threads, direct IO, for the baseline vs.
patched kernel (below). Highlights:
-- IOPS drop from 74.2k to 56.8k overall (the final status lines show 62.5k vs. 58.5k).
-- BW gets worse, dropping from 290 to 222 MiB/s.
-- CPU is well under 100%.
-- latency is incredibly long, but...20 threads.
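On the latency point: with numjobs=20 and iodepth=256 there can be up to 5120
IOs outstanding, and by Little's law the average latency should then be roughly
(IOs in flight) / (IOPS). The sketch below, in Python with hypothetical names,
applies that to the totals and run times in the fio output in this message. It
assumes the queues stay full for the whole run, which is only approximately true.

```python
# Little's law estimate: avg latency ~= IOs in flight / completion rate.
# Assumes all 20 jobs keep their 256-deep queues full for the whole run.

NUMJOBS = 20
IODEPTH = 256
TOTAL_IOS = 5242880        # "issued rwts: total=5242880" in both runs


def expected_latency_msec(runtime_msec):
    """Estimate average per-IO latency from total IOs and run time."""
    iops = TOTAL_IOS / (runtime_msec / 1000.0)
    in_flight = NUMJOBS * IODEPTH
    return in_flight / iops * 1000.0


# Run times and measured average "lat" (converted from usec to msec).
runs = {
    "baseline": (70644, 67.358),
    "patched": (92385, 89.301),
}

for name, (runtime_msec, measured_msec) in runs.items():
    est = expected_latency_msec(runtime_msec)
    print(f"{name:8s} estimated {est:5.1f} msec, measured {measured_msec:5.1f} msec")
```

The estimates (about 69 and 90 msec) land within a few percent of the measured
averages, which suggests the long latencies are mostly queueing from the deep
queues rather than anything specific to either kernel.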
Baseline:
$ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20
-------- Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 4 (f=4): [_(8),R(2),_(2),R(1),_(1),R(1),_(5)][95.9%][r=244MiB/s,w=0KiB/s][r=62.5k,w=0 IOPS][eta 00m:03s]
reader: (groupid=0, jobs=20): err= 0: pid=14499: Tue Nov 20 16:20:35 2018
read: IOPS=74.2k, BW=290MiB/s (304MB/s)(20.0GiB/70644msec)
slat (usec): min=26, max=48167, avg=249.27, stdev=1200.02
clat (usec): min=42, max=147792, avg=67108.56, stdev=18062.46
lat (usec): min=103, max=147943, avg=67358.10, stdev=18109.75
clat percentiles (msec):
| 1.00th=[ 21], 5.00th=[ 40], 10.00th=[ 41], 20.00th=[ 47],
| 30.00th=[ 58], 40.00th=[ 65], 50.00th=[ 70], 60.00th=[ 75],
| 70.00th=[ 79], 80.00th=[ 83], 90.00th=[ 89], 95.00th=[ 93],
| 99.00th=[ 104], 99.50th=[ 109], 99.90th=[ 121], 99.95th=[ 125],
| 99.99th=[ 134]
bw ( KiB/s): min= 9712, max=46362, per=5.11%, avg=15164.99, stdev=2242.15, samples=2742
iops : min= 2428, max=11590, avg=3790.94, stdev=560.53, samples=2742
lat (usec) : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.98%, 50=20.44%
lat (msec) : 100=76.95%, 250=1.61%
cpu : usr=1.00%, sys=57.65%, ctx=158367, majf=0, minf=5284
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=5242880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=290MiB/s (304MB/s), 290MiB/s-290MiB/s (304MB/s-304MB/s), io=20.0GiB (21.5GB), run=70644-70644msec
Disk stats (read/write):
nvme0n1: ios=5240738/7, merge=0/7, ticks=1457727/5, in_queue=1547139, util=100.00%
--------------------------------------------------------------
Patched:
<redforge> fast_256GB $ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20
-------- Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
slat (usec): min=26, max=50436, avg=337.21, stdev=1405.14
clat (usec): min=43, max=178839, avg=88963.96, stdev=21745.31
lat (usec): min=106, max=179041, avg=89301.43, stdev=21800.43
clat percentiles (msec):
| 1.00th=[ 50], 5.00th=[ 53], 10.00th=[ 55], 20.00th=[ 68],
| 30.00th=[ 79], 40.00th=[ 86], 50.00th=[ 93], 60.00th=[ 99],
| 70.00th=[ 103], 80.00th=[ 108], 90.00th=[ 114], 95.00th=[ 121],
| 99.00th=[ 134], 99.50th=[ 140], 99.90th=[ 150], 99.95th=[ 155],
| 99.99th=[ 163]
bw ( KiB/s): min= 4920, max=39733, per=5.07%, avg=11506.18, stdev=1540.18, samples=3650
iops : min= 1230, max= 9933, avg=2876.20, stdev=385.05, samples=3650
lat (usec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
lat (usec) : 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.23%, 50=1.13%
lat (msec) : 100=63.04%, 250=35.57%
cpu : usr=0.65%, sys=58.07%, ctx=188963, majf=0, minf=5303
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=5242880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=222MiB/s (232MB/s), 222MiB/s-222MiB/s (232MB/s-232MB/s), io=20.0GiB (21.5GB), run=92385-92385msec
Disk stats (read/write):
nvme0n1: ios=5240550/7, merge=0/7, ticks=1513681/4, in_queue=1636411, util=100.00%
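To put the two 20-job runs side by side, here is a small Python sketch that
computes the relative change for a few of the reported values. The numbers are
copied from the fio output above; the dictionary layout and names are just for
illustration, and a single pair of runs says nothing about run-to-run variance.

```python
# Relative change between the baseline and patched 20-job runs.
# Values are copied from the fio summaries in this message.

baseline = {"iops_k": 74.2, "bw_mib_s": 290.0, "avg_lat_usec": 67358.10,
            "sys_cpu_pct": 57.65, "runtime_msec": 70644}
patched = {"iops_k": 56.8, "bw_mib_s": 222.0, "avg_lat_usec": 89301.43,
           "sys_cpu_pct": 58.07, "runtime_msec": 92385}


def pct_change(before, after):
    """Percentage change going from 'before' to 'after'."""
    return (after - before) / before * 100.0


changes = {key: pct_change(baseline[key], patched[key]) for key in baseline}

for key, delta in changes.items():
    print(f"{key:13s} {baseline[key]:>10} -> {patched[key]:>10}  ({delta:+.1f}%)")
```

In these two runs IOPS and BW are down by about 23%, average latency is up by
about 33%, and system CPU is essentially unchanged.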
Thoughts?
thanks,
--
John Hubbard
NVIDIA