[RFC][PATCH 0/2] 9p: v9fs read and write speedup

* [RFC][PATCH 0/2] 9p: v9fs read and write speedup
@ 2016-10-10 17:24 ` Edward Shishkin
  0 siblings, 0 replies; 26+ messages in thread
From: Edward Shishkin @ 2016-10-10 17:24 UTC (permalink / raw)
  To: Eric Van Hensbergen, V9FS Developers Mailing List,
	Linux Filesystem Development List
  Cc: QEMU Developers Mailing List, ZhangWei, Claudio Fontana, Edward Shishkin

Hello everyone,

The progress in virtualization and cloud technologies has resulted in
a popularity of file sets shared on the host machines by Plan 9 File
Protocol (the sharing setup is also known as VirtFS). Another sharing
setup which uses NFS protocol is less popular because of number of
reasons.

Unfortunately performance of default VirtFS setup is poor.
We analyzed the reasons in our Labs of Huawei Technologies and found
that typical bottleneck is caused by the fact that transfer of any
portion of data by many small 9p messages is slower than transfer of
the same amount of data by a single 9p message.

It is possible to reduce number of 9P messages (and, hence, to improve
performance) by a number of ways(*), however, some "hardcoded"
bottlenecks are still present in the v9fs driver of the guest kernel.
Specifically, this is poor implementation of read-ahead and
write-behind paths of v9fs. With current implementations there is no
chances that more than PAGE_SIZE bytes of data will be transmitted at
one time.

To improve the situation we have introduced a special layer, which
allows to coalesce specified number of adjacent pages (if any) into a
single 9P message. This layer is represented by private
implementations of ->readpages() and ->writepages() address space
operations for v9fs.

To merge adjacent pages we use a special buffer of size, which depends
on specified (per mount session) msize. For read-ahead paths we
allocate such buffers by demand. For writeback paths we use a single
buffer pre-allocated at mount time. This is because at writeback paths
the file system usually responds to memory pressure notification, so
we can not afford allocation by-demand at writeback paths.

All pages which are supposed to merge are coped to the buffer at
respective offsets. Then we construct and transmit a long read(write)
9P message (**).

Thus, we managed to speedup only one writeback-ing thread. Other
concurrent threads, which failed to obtain the buffer, will go by
usual (slow) ways. If interesting, I'll implement a solution with N
pre-allocated buffers (where N is number of CPUs).

This approach allows to increase VirtFS bandwidth up to 3 times, and
thus, to make it close to the bandwidth of VirtIO-blk (see the numbers
in the Appendix A below).

Comment. Our patches improve only asynchronous operations, which
involve the page cache. Direct reads and writes will be unaffected by
obvious reasons.
NOTE, that by default v9fs works in direct mode, so in order to see an
effect, you should specify respective v9fs mount option (e.g.
"fscache").

----
(*)  Specifying larger mszie (maximal possible size of 9P message)
allow to reduce number of 9P messages in direct operations performed
by large chunks.
Disabling v9fs ACL and security labels in the guest kernel (if it is
not needed) allows to avoid extra-messages.

(**) 9P, Plan 9 File Protocol specifications
https://swtch.com/plan9port/man/man9/intro.html

Appendix A.

        iozone -e -r chunk_size -s 6G -w -f

            Throughput in mBytes/sec

operation  chunk_size   (1)       (2)       (3)

write          1M       391       127       330
read           1M       469       221       432
write          4K       410       128       297
read           4K       465       192       327
random write   1M       403       145       313
random read    1M       347       161       195
random write   4K       344       119       131
random read    4K        44        41        64

Legend:

(1): VirtIO-blk
(2): VirtIO-9p
(3): VirtIO-9p: guest kernel is patched with our stuff

Hardware & Software:

Host:  8 CPU Intel(R) Core(TM) Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
16G RAM, SSD: Noname. Throughput: Write: 410 M/sec, Read: 470 M/sec.
Fedora 24, Kernel-4.7.4-200.fc24.x86_64, kvm+qemu-2.7.0, fs: ext4

Guest: 2 CPU: GenuineIntel 3.4GHz, 2G RAM, Network model: VirtIO
Fedora 21, Kernel: 4.7.6

Settings:
VirtIO-blk: Guest FS: ext4;
VirtIO-9p:  mount option:
             "trans=virtio,version=9p2000.L,msize=131096,fscache"

Caches of the host and guest were dropped before every iozone's phase.

CC-ed QEMU developers list for possible comments and ACKs.

Please, consider for inclusion.

Thanks,
Edward.

^ permalink raw reply	[flat|nested] 26+ messages in thread