* [PATCH v4 01/11] tests/virtio-9p: add terminating null in v9fs_string_read()
  2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
@ 2020-01-20 22:29 ` Christian Schoenebeck
  2020-01-20 22:47 ` [PATCH v4 02/11] 9pfs: require msize >= 4096 Christian Schoenebeck
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-20 22:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

The 9p protocol generally sends strings without null termination
over the wire. However, for future use of this function it is
beneficial for the delivered string to be null terminated, so that
the string can be used with standard C functions, which often rely
on strings being null terminated.

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
---
 tests/qtest/virtio-9p-test.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index e7b58e3a0c..06263edb53 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -130,8 +130,9 @@ static void v9fs_string_read(P9Req *req, uint16_t *len, char **string)
         *len = local_len;
     }
     if (string) {
-        *string = g_malloc(local_len);
+        *string = g_malloc(local_len + 1);
         v9fs_memread(req, *string, local_len);
+        (*string)[local_len] = 0;
     } else {
         v9fs_memskip(req, local_len);
     }
-- 
2.20.1




* [PATCH v4 02/11] 9pfs: require msize >= 4096
  2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
  2020-01-20 22:29 ` [PATCH v4 01/11] tests/virtio-9p: add terminating null in v9fs_string_read() Christian Schoenebeck
@ 2020-01-20 22:47 ` Christian Schoenebeck
  2020-01-20 23:50 ` [PATCH v4 03/11] 9pfs: validate count sent by client with T_readdir Christian Schoenebeck
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-20 22:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

A client establishes a session by sending a Tversion request along with an
'msize' parameter, which the client uses to suggest to the server a maximum
message size to ever be used for communication (for both requests and
replies) between client and server during that session. If the client
suggests an 'msize' smaller than 4096, then the server now denies the
session immediately with an error response (Rlerror for "9P2000.L" clients
or Rerror for "9P2000.u" clients) instead of replying with Rversion.

So far any msize submitted by a client with Tversion was simply accepted by
the server without any check. Introducing some minimum msize makes sense,
because e.g. an msize < 7 would not allow any subsequent 9p operation at
all, since 7 is the size of the header section common to all 9p message
types.

A substantially higher value of 4096 was chosen though to prevent potential
issues with some message types. E.g. Rreadlink may yield up to a size of
PATH_MAX, which is usually 4096, and like almost all 9p message types,
Rreadlink must not be truncated by the 9p protocol. The chosen size also
prevents a similar issue with Rreaddir responses (provided the client
always sends an adequate 'count' parameter with Treaddir): even though
retrieval of directory entries may be split up over several T/Rreaddir
messages, an Rreaddir response must not truncate individual directory
entries. So msize should be large enough to return at least one directory
entry with the longest possible file name supported by the host. Most file
systems support a maximum file name length of 255. The largest known file
name length limit is currently that of ReiserFS with a maximum of 4032
bytes, which is also covered by this minimum msize value because
4032 + 35 < 4096.

Furthermore, 4096 is already the minimum msize of the Linux kernel's 9pfs
client.
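
For reference, here is a minimal standalone sketch of the arithmetic behind
"4032 + 35 < 4096" above (35 being the 11 byte Rreaddir header plus the 24
byte fixed part of each directory entry); the macro names are made up for
illustration only and are not part of this patch:

#include <assert.h>

/* Rreaddir header: size[4] + type[1] + tag[2] + count[4] */
#define RREADDIR_HDR       11
/* fixed per-entry overhead: qid[13] + offset[8] + type[1] + name length[2] */
#define DIRENT_FIXED_SIZE  24
/* largest known file name length limit (ReiserFS) */
#define MAX_NAME_LEN       4032

int main(void)
{
    /* one maximal directory entry must fit into the minimum msize */
    assert(RREADDIR_HDR + DIRENT_FIXED_SIZE + MAX_NAME_LEN <= 4096);
    return 0;
}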

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
---
 hw/9pfs/9p.c | 12 ++++++++++++
 hw/9pfs/9p.h | 11 +++++++++++
 2 files changed, 23 insertions(+)

diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index 520177f40c..a5fbe821d4 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -1363,8 +1363,20 @@ static void coroutine_fn v9fs_version(void *opaque)
         s->proto_version = V9FS_PROTO_2000L;
     } else {
         v9fs_string_sprintf(&version, "unknown");
+        /* skip min. msize check, reporting invalid version has priority */
+        goto marshal;
     }
 
+    if (s->msize < P9_MIN_MSIZE) {
+        err = -EMSGSIZE;
+        error_report(
+            "9pfs: Client requested msize < minimum msize ("
+            stringify(P9_MIN_MSIZE) ") supported by this server."
+        );
+        goto out;
+    }
+
+marshal:
     err = pdu_marshal(pdu, offset, "ds", s->msize, &version);
     if (err < 0) {
         goto out;
diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
index 3904f82901..6fffe44f5a 100644
--- a/hw/9pfs/9p.h
+++ b/hw/9pfs/9p.h
@@ -100,6 +100,17 @@ typedef enum P9ProtoVersion {
     V9FS_PROTO_2000L = 0x02,
 } P9ProtoVersion;
 
+/**
+ * @brief Minimum message size supported by this 9pfs server.
+ *
+ * A client establishes a session by sending a Tversion request along with a
+ * 'msize' parameter which suggests the server a maximum message size ever to be
+ * used for communication (for both requests and replies) between client and
+ * server during that session. If client suggests a 'msize' smaller than this
+ * value then session is denied by server with an error response.
+ */
+#define P9_MIN_MSIZE    4096
+
 #define P9_NOTAG    UINT16_MAX
 #define P9_NOFID    UINT32_MAX
 #define P9_MAXWELEM 16
-- 
2.20.1




* [PATCH v4 03/11] 9pfs: validate count sent by client with T_readdir
  2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
  2020-01-20 22:29 ` [PATCH v4 01/11] tests/virtio-9p: add terminating null in v9fs_string_read() Christian Schoenebeck
  2020-01-20 22:47 ` [PATCH v4 02/11] 9pfs: require msize >= 4096 Christian Schoenebeck
@ 2020-01-20 23:50 ` Christian Schoenebeck
  2020-01-22 14:11   ` Greg Kurz
  2020-01-21  0:01 ` [PATCH v4 04/11] hw/9pfs/9p-synth: added directory for readdir test Christian Schoenebeck
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-20 23:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

A good 9p client sends T_readdir with a "count" parameter that's sufficiently
smaller than the client's initially negotiated msize (maximum message size).
We perform a check for that though, to prevent bad clients from interrupting
the server with a "Failed to encode VirtFS reply type 41" transport error
message. This count value constraint uses msize - 11, because 11 is the
header size of R_readdir.
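
A minimal sketch of where the value 11 comes from and of the resulting upper
bound for 'count'; the enum and helper names below are made up for
illustration and are not part of this patch:

#include <assert.h>
#include <stdint.h>

/* R_readdir header: size[4] + type[1] + tag[2] + count[4] */
enum { RREADDIR_HDR = 4 + 1 + 2 + 4 };

/* largest 'count' a client may safely request for a given msize */
static uint32_t max_readdir_count(uint32_t msize)
{
    return msize - RREADDIR_HDR;
}

int main(void)
{
    assert(RREADDIR_HDR == 11);
    /* e.g. with the minimum msize of 4096 introduced by the previous patch */
    assert(max_readdir_count(4096) == 4085);
    return 0;
}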

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 hw/9pfs/9p.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index a5fbe821d4..18370183c4 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -2426,6 +2426,7 @@ static void coroutine_fn v9fs_readdir(void *opaque)
     int32_t count;
     uint32_t max_count;
     V9fsPDU *pdu = opaque;
+    V9fsState *s = pdu->s;
 
     retval = pdu_unmarshal(pdu, offset, "dqd", &fid,
                            &initial_offset, &max_count);
@@ -2434,6 +2435,13 @@ static void coroutine_fn v9fs_readdir(void *opaque)
     }
     trace_v9fs_readdir(pdu->tag, pdu->id, fid, initial_offset, max_count);
 
+    if (max_count > s->msize - 11) {
+        max_count = s->msize - 11;
+        warn_report_once(
+            "9p: bad client: T_readdir with count > msize - 11"
+        );
+    }
+
     fidp = get_fid(pdu, fid);
     if (fidp == NULL) {
         retval = -EINVAL;
-- 
2.20.1




* [PATCH v4 04/11] hw/9pfs/9p-synth: added directory for readdir test
  2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
                   ` (2 preceding siblings ...)
  2020-01-20 23:50 ` [PATCH v4 03/11] 9pfs: validate count sent by client with T_readdir Christian Schoenebeck
@ 2020-01-21  0:01 ` Christian Schoenebeck
  2020-01-21  0:12 ` [PATCH v4 05/11] tests/virtio-9p: added " Christian Schoenebeck
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-21  0:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

This makes the 9pfs synth driver provide the following virtual
files:

  - /ReadDirDir/ReadDirFile99
  - /ReadDirDir/ReadDirFile98
  ...
  - /ReadDirDir/ReadDirFile1
  - /ReadDirDir/ReadDirFile0

This virtual directory and its 100 virtual files will be
used by the upcoming 9pfs readdir tests.

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
---
 hw/9pfs/9p-synth.c | 19 +++++++++++++++++++
 hw/9pfs/9p-synth.h |  5 +++++
 2 files changed, 24 insertions(+)

diff --git a/hw/9pfs/9p-synth.c b/hw/9pfs/9p-synth.c
index 54239c9bbf..7eb210ffa8 100644
--- a/hw/9pfs/9p-synth.c
+++ b/hw/9pfs/9p-synth.c
@@ -578,6 +578,25 @@ static int synth_init(FsContext *ctx, Error **errp)
                                        NULL, v9fs_synth_qtest_flush_write,
                                        ctx);
         assert(!ret);
+
+        /* Directory for READDIR test */
+        {
+            V9fsSynthNode *dir = NULL;
+            ret = qemu_v9fs_synth_mkdir(
+                NULL, 0700, QTEST_V9FS_SYNTH_READDIR_DIR, &dir
+            );
+            assert(!ret);
+            for (i = 0; i < QTEST_V9FS_SYNTH_READDIR_NFILES; ++i) {
+                char *name = g_strdup_printf(
+                    QTEST_V9FS_SYNTH_READDIR_FILE, i
+                );
+                ret = qemu_v9fs_synth_add_file(
+                    dir, 0, name, NULL, NULL, ctx
+                );
+                assert(!ret);
+                g_free(name);
+            }
+        }
     }
 
     return 0;
diff --git a/hw/9pfs/9p-synth.h b/hw/9pfs/9p-synth.h
index af7a993a1e..036d7e4a5b 100644
--- a/hw/9pfs/9p-synth.h
+++ b/hw/9pfs/9p-synth.h
@@ -55,6 +55,11 @@ int qemu_v9fs_synth_add_file(V9fsSynthNode *parent, int mode,
 #define QTEST_V9FS_SYNTH_LOPEN_FILE "LOPEN"
 #define QTEST_V9FS_SYNTH_WRITE_FILE "WRITE"
 
+/* for READDIR test */
+#define QTEST_V9FS_SYNTH_READDIR_DIR "ReadDirDir"
+#define QTEST_V9FS_SYNTH_READDIR_FILE "ReadDirFile%d"
+#define QTEST_V9FS_SYNTH_READDIR_NFILES 100
+
 /* Any write to the "FLUSH" file is handled one byte at a time by the
  * backend. If the byte is zero, the backend returns success (ie, 1),
  * otherwise it forces the server to try again forever. Thus allowing
-- 
2.20.1




* [PATCH v4 05/11] tests/virtio-9p: added readdir test
  2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
                   ` (3 preceding siblings ...)
  2020-01-21  0:01 ` [PATCH v4 04/11] hw/9pfs/9p-synth: added directory for readdir test Christian Schoenebeck
@ 2020-01-21  0:12 ` Christian Schoenebeck
  2020-01-22 19:56   ` Greg Kurz
  2020-01-21  0:16 ` [PATCH v4 06/11] tests/virtio-9p: added splitted " Christian Schoenebeck
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-21  0:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

The first readdir test simply checks the number of directory
entries returned by the 9pfs server, according to the number of
virtual files created on the 9pfs synth driver side. The
subsequent readdir test then also checks whether all directory
entries have the expected file names (as created on the 9pfs
synth driver side), ignoring their precise order in the result
list though.
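
As a rough plausibility check (not part of this patch), the expected total
wire size of those entries can be estimated with the server's per-entry size
of 24 + strlen(name) bytes, which confirms that all 102 entries (100 files
plus "." and "..") indeed fit into a single R_readdir response at the test's
msize of 4096:

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* per-entry wire size: qid[13] + offset[8] + type[1] + name[s] (2 + strlen) */
static size_t entry_size(const char *name)
{
    return 24 + strlen(name);
}

int main(void)
{
    char name[32];
    size_t total = entry_size(".") + entry_size("..");

    for (int i = 0; i < 100; ++i) {
        snprintf(name, sizeof(name), "ReadDirFile%d", i);
        total += entry_size(name);
    }
    printf("estimated R_readdir payload: %zu bytes\n", total);
    /* payload plus the 11 byte R_readdir header must fit into msize 4096 */
    assert(total + 11 <= 4096);
    return 0;
}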

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 tests/qtest/virtio-9p-test.c | 152 +++++++++++++++++++++++++++++++++++
 1 file changed, 152 insertions(+)

diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 06263edb53..2167322985 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -68,6 +68,11 @@ static void v9fs_memread(P9Req *req, void *addr, size_t len)
     req->r_off += len;
 }
 
+static void v9fs_uint8_read(P9Req *req, uint8_t *val)
+{
+    v9fs_memread(req, val, 1);
+}
+
 static void v9fs_uint16_write(P9Req *req, uint16_t val)
 {
     uint16_t le_val = cpu_to_le16(val);
@@ -101,6 +106,12 @@ static void v9fs_uint32_read(P9Req *req, uint32_t *val)
     le32_to_cpus(val);
 }
 
+static void v9fs_uint64_read(P9Req *req, uint64_t *val)
+{
+    v9fs_memread(req, val, 8);
+    le64_to_cpus(val);
+}
+
 /* len[2] string[len] */
 static uint16_t v9fs_string_size(const char *string)
 {
@@ -191,6 +202,7 @@ static const char *rmessage_name(uint8_t id)
         id == P9_RLOPEN ? "RLOPEN" :
         id == P9_RWRITE ? "RWRITE" :
         id == P9_RFLUSH ? "RFLUSH" :
+        id == P9_RREADDIR ? "READDIR" :
         "<unknown>";
 }
 
@@ -348,6 +360,82 @@ static void v9fs_rwalk(P9Req *req, uint16_t *nwqid, v9fs_qid **wqid)
     v9fs_req_free(req);
 }
 
+/* size[4] Treaddir tag[2] fid[4] offset[8] count[4] */
+static P9Req *v9fs_treaddir(QVirtio9P *v9p, uint32_t fid, uint64_t offset,
+                            uint32_t count, uint16_t tag)
+{
+    P9Req *req;
+
+    req = v9fs_req_init(v9p, 4 + 8 + 4, P9_TREADDIR, tag);
+    v9fs_uint32_write(req, fid);
+    v9fs_uint64_write(req, offset);
+    v9fs_uint32_write(req, count);
+    v9fs_req_send(req);
+    return req;
+}
+
+struct V9fsDirent {
+    v9fs_qid qid;
+    uint64_t offset;
+    uint8_t type;
+    char *name;
+    struct V9fsDirent *next;
+};
+
+/* size[4] Rreaddir tag[2] count[4] data[count] */
+static void v9fs_rreaddir(P9Req *req, uint32_t *count, uint32_t *nentries,
+                          struct V9fsDirent **entries)
+{
+    uint32_t local_count;
+    struct V9fsDirent *e = NULL;
+    uint16_t slen;
+    uint32_t n = 0;
+
+    v9fs_req_recv(req, P9_RREADDIR);
+    v9fs_uint32_read(req, &local_count);
+
+    if (count) {
+        *count = local_count;
+    }
+
+    for (int32_t togo = (int32_t)local_count;
+         togo >= 13 + 8 + 1 + 2;
+         togo -= 13 + 8 + 1 + 2 + slen, ++n)
+    {
+        if (!e) {
+            e = g_malloc(sizeof(struct V9fsDirent));
+            if (entries) {
+                *entries = e;
+            }
+        } else {
+            e = e->next = g_malloc(sizeof(struct V9fsDirent));
+        }
+        e->next = NULL;
+        /* qid[13] offset[8] type[1] name[s] */
+        v9fs_memread(req, &e->qid, 13);
+        v9fs_uint64_read(req, &e->offset);
+        v9fs_uint8_read(req, &e->type);
+        v9fs_string_read(req, &slen, &e->name);
+    }
+
+    if (nentries) {
+        *nentries = n;
+    }
+
+    v9fs_req_free(req);
+}
+
+static void v9fs_free_dirents(struct V9fsDirent *e)
+{
+    struct V9fsDirent *next = NULL;
+
+    for (; e; e = next) {
+        next = e->next;
+        g_free(e->name);
+        g_free(e);
+    }
+}
+
 /* size[4] Tlopen tag[2] fid[4] flags[4] */
 static P9Req *v9fs_tlopen(QVirtio9P *v9p, uint32_t fid, uint32_t flags,
                           uint16_t tag)
@@ -480,6 +568,69 @@ static void fs_walk(void *obj, void *data, QGuestAllocator *t_alloc)
     g_free(wqid);
 }
 
+static bool fs_dirents_contain_name(struct V9fsDirent *e, const char* name)
+{
+    for (; e; e = e->next) {
+        if (!strcmp(e->name, name)) {
+            return true;
+        }
+    }
+    return false;
+}
+
+static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
+{
+    QVirtio9P *v9p = obj;
+    alloc = t_alloc;
+    char *const wnames[] = { g_strdup(QTEST_V9FS_SYNTH_READDIR_DIR) };
+    uint16_t nqid;
+    v9fs_qid qid;
+    uint32_t count, nentries;
+    struct V9fsDirent *entries = NULL;
+    P9Req *req;
+
+    fs_attach(v9p, NULL, t_alloc);
+    req = v9fs_twalk(v9p, 0, 1, 1, wnames, 0);
+    v9fs_req_wait_for_reply(req, NULL);
+    v9fs_rwalk(req, &nqid, NULL);
+    g_assert_cmpint(nqid, ==, 1);
+
+    req = v9fs_tlopen(v9p, 1, O_DIRECTORY, 0);
+    v9fs_req_wait_for_reply(req, NULL);
+    v9fs_rlopen(req, &qid, NULL);
+
+    /*
+     * submit count = msize - 11, because 11 is the header size of Rreaddir
+     */
+    req = v9fs_treaddir(v9p, 1, 0, P9_MAX_SIZE - 11, 0);
+    v9fs_req_wait_for_reply(req, NULL);
+    v9fs_rreaddir(req, &count, &nentries, &entries);
+
+    /*
+     * Assuming msize (P9_MAX_SIZE) is large enough so we can retrieve all
+     * dir entries with only one readdir request.
+     */
+    g_assert_cmpint(
+        nentries, ==,
+        QTEST_V9FS_SYNTH_READDIR_NFILES + 2 /* "." and ".." */
+    );
+
+    /*
+     * Check all file names exist in returned entries, ignore their order
+     * though.
+     */
+    g_assert_cmpint(fs_dirents_contain_name(entries, "."), ==, true);
+    g_assert_cmpint(fs_dirents_contain_name(entries, ".."), ==, true);
+    for (int i = 0; i < QTEST_V9FS_SYNTH_READDIR_NFILES; ++i) {
+        char *name = g_strdup_printf(QTEST_V9FS_SYNTH_READDIR_FILE, i);
+        g_assert_cmpint(fs_dirents_contain_name(entries, name), ==, true);
+        g_free(name);
+    }
+
+    v9fs_free_dirents(entries);
+    g_free(wnames[0]);
+}
+
 static void fs_walk_no_slash(void *obj, void *data, QGuestAllocator *t_alloc)
 {
     QVirtio9P *v9p = obj;
@@ -658,6 +809,7 @@ static void register_virtio_9p_test(void)
                  NULL);
     qos_add_test("fs/flush/ignored", "virtio-9p", fs_flush_ignored,
                  NULL);
+    qos_add_test("fs/readdir/basic", "virtio-9p", fs_readdir, NULL);
 }
 
 libqos_init(register_virtio_9p_test);
-- 
2.20.1




* [PATCH v4 06/11] tests/virtio-9p: added splitted readdir test
  2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
                   ` (4 preceding siblings ...)
  2020-01-21  0:12 ` [PATCH v4 05/11] tests/virtio-9p: added " Christian Schoenebeck
@ 2020-01-21  0:16 ` Christian Schoenebeck
  2020-01-22 21:14   ` Eric Blake
  2020-01-22 21:19   ` Greg Kurz
  2020-01-21  0:17 ` [PATCH v4 07/11] tests/virtio-9p: failing " Christian Schoenebeck
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-21  0:16 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

The previous, already existing readdir test simply used a 'count'
parameter big enough to retrieve all directory entries with a
single Treaddir request.

In this new 'split' readdir test, directory entries are retrieved
split over several Treaddir requests, by picking small 'count'
parameters which force the server to truncate the response. So the
test client sends as many Treaddir requests as necessary to get
all directory entries. Currently this test actually covers two
subtests: a sequence of Treaddir requests with count=512 and then
a subsequent sequence of Treaddir requests with count=256.
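
A back-of-the-envelope estimate (illustrative only, not a measurement) of how
many Treaddir round trips those two subtests need for the 102 entries of the
synth directory, assuming the ~37 byte wire size of a "ReadDirFileNN" entry
(24 byte fixed part plus the name):

#include <stdio.h>

int main(void)
{
    const int entry_size = 37;   /* typical "ReadDirFileNN" entry */
    const int nentries   = 102;  /* 100 files plus "." and ".."   */
    const int counts[]   = { 512, 256 };

    for (int i = 0; i < 2; ++i) {
        int per_reply = counts[i] / entry_size;
        int requests  = (nentries + per_reply - 1) / per_reply;
        printf("count=%d: ~%d entries per reply -> ~%d Treaddir requests\n",
               counts[i], per_reply, requests);
    }
    return 0;
}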

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 tests/qtest/virtio-9p-test.c | 91 ++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)

diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 2167322985..8b0d94546e 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -578,6 +578,7 @@ static bool fs_dirents_contain_name(struct V9fsDirent *e, const char* name)
     return false;
 }
 
+/* basic readdir test where reply fits into a single response message */
 static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
 {
     QVirtio9P *v9p = obj;
@@ -631,6 +632,95 @@ static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
     g_free(wnames[0]);
 }
 
+/* readdir test where overall request is splitted over several messages */
+static void fs_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc)
+{
+    QVirtio9P *v9p = obj;
+    alloc = t_alloc;
+    char *const wnames[] = { g_strdup(QTEST_V9FS_SYNTH_READDIR_DIR) };
+    uint16_t nqid;
+    v9fs_qid qid;
+    uint32_t count, nentries, npartialentries;
+    struct V9fsDirent *entries, *tail, *partialentries;
+    P9Req *req;
+    int subtest;
+    int fid;
+    uint64_t offset;
+    /* the Treaddir 'count' parameter values to be tested */
+    const uint32_t vcount[] = { 512, 256 };
+    const int nvcount = sizeof(vcount) / sizeof(uint32_t);
+
+    fs_attach(v9p, NULL, t_alloc);
+
+    /* iterate over all 'count' parameter values to be tested with Treaddir */
+    for (subtest = 0; subtest < nvcount; ++subtest) {
+        fid = subtest + 1;
+        offset = 0;
+        entries = NULL;
+        nentries = 0;
+        tail = NULL;
+
+        req = v9fs_twalk(v9p, 0, fid, 1, wnames, 0);
+        v9fs_req_wait_for_reply(req, NULL);
+        v9fs_rwalk(req, &nqid, NULL);
+        g_assert_cmpint(nqid, ==, 1);
+
+        req = v9fs_tlopen(v9p, fid, O_DIRECTORY, 0);
+        v9fs_req_wait_for_reply(req, NULL);
+        v9fs_rlopen(req, &qid, NULL);
+
+        /*
+         * send as many Treaddir requests as required to get all directory
+         * entries
+         */
+        while (true) {
+            npartialentries = 0;
+            partialentries = NULL;
+
+            req = v9fs_treaddir(v9p, fid, offset, vcount[subtest], 0);
+            v9fs_req_wait_for_reply(req, NULL);
+            v9fs_rreaddir(req, &count, &npartialentries, &partialentries);
+            if (npartialentries > 0 && partialentries) {
+                if (!entries) {
+                    entries = partialentries;
+                    nentries = npartialentries;
+                    tail = partialentries;
+                } else {
+                    tail->next = partialentries;
+                    nentries += npartialentries;
+                }
+                while (tail->next) {
+                    tail = tail->next;
+                }
+                offset = tail->offset;
+            } else {
+                break;
+            }
+        }
+
+        g_assert_cmpint(
+            nentries, ==,
+            QTEST_V9FS_SYNTH_READDIR_NFILES + 2 /* "." and ".." */
+        );
+
+        /*
+         * Check all file names exist in returned entries, ignore their order
+         * though.
+         */
+        g_assert_cmpint(fs_dirents_contain_name(entries, "."), ==, true);
+        g_assert_cmpint(fs_dirents_contain_name(entries, ".."), ==, true);
+        for (int i = 0; i < QTEST_V9FS_SYNTH_READDIR_NFILES; ++i) {
+            char *name = g_strdup_printf(QTEST_V9FS_SYNTH_READDIR_FILE, i);
+            g_assert_cmpint(fs_dirents_contain_name(entries, name), ==, true);
+            g_free(name);
+        }
+
+        v9fs_free_dirents(entries);
+    }
+
+    g_free(wnames[0]);
+}
+
 static void fs_walk_no_slash(void *obj, void *data, QGuestAllocator *t_alloc)
 {
     QVirtio9P *v9p = obj;
@@ -810,6 +900,7 @@ static void register_virtio_9p_test(void)
     qos_add_test("fs/flush/ignored", "virtio-9p", fs_flush_ignored,
                  NULL);
     qos_add_test("fs/readdir/basic", "virtio-9p", fs_readdir, NULL);
+    qos_add_test("fs/readdir/split", "virtio-9p", fs_readdir_split, NULL);
 }
 
 libqos_init(register_virtio_9p_test);
-- 
2.20.1




* [PATCH v4 07/11] tests/virtio-9p: failing splitted readdir test
  2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
                   ` (5 preceding siblings ...)
  2020-01-21  0:16 ` [PATCH v4 06/11] tests/virtio-9p: added splitted " Christian Schoenebeck
@ 2020-01-21  0:17 ` Christian Schoenebeck
  2020-01-22 22:59   ` Greg Kurz
  2020-01-21  0:23 ` [PATCH v4 08/11] 9pfs: readdir benchmark Christian Schoenebeck
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-21  0:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

This patch is not intended to be merged. It reproduces
an issue (with debug messages) where the split readdir
test fails because the server is interrupted with the
transport error "Failed to decode VirtFS request type 40",
which BTW happens both with the unoptimized and with the
optimized 9p readdir code.

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 tests/qtest/virtio-9p-test.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index 8b0d94546e..e47b286340 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -647,13 +647,14 @@ static void fs_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc)
     int fid;
     uint64_t offset;
     /* the Treaddir 'count' parameter values to be tested */
-    const uint32_t vcount[] = { 512, 256 };
+    const uint32_t vcount[] = { 512, 256, 128 };
     const int nvcount = sizeof(vcount) / sizeof(uint32_t);
 
     fs_attach(v9p, NULL, t_alloc);
 
     /* iterate over all 'count' parameter values to be tested with Treaddir */
     for (subtest = 0; subtest < nvcount; ++subtest) {
+        printf("\nsubtest[%d] with count=%d\n", subtest, vcount[subtest]);
         fid = subtest + 1;
         offset = 0;
         entries = NULL;
@@ -674,12 +675,16 @@ static void fs_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc)
          * entries
          */
         while (true) {
+            printf("\toffset=%ld\n", offset);
             npartialentries = 0;
             partialentries = NULL;
 
+            printf("Treaddir fid=%d offset=%ld count=%d\n",
+                   fid, offset, vcount[subtest]);
             req = v9fs_treaddir(v9p, fid, offset, vcount[subtest], 0);
             v9fs_req_wait_for_reply(req, NULL);
             v9fs_rreaddir(req, &count, &npartialentries, &partialentries);
+            printf("\t\tnpartial=%d nentries=%d\n", npartialentries, nentries);
             if (npartialentries > 0 && partialentries) {
                 if (!entries) {
                     entries = partialentries;
@@ -716,6 +721,8 @@ static void fs_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc)
         }
 
         v9fs_free_dirents(entries);
+
+        printf("PASSED subtest[%d]\n", subtest);
     }
 
     g_free(wnames[0]);
-- 
2.20.1




* [PATCH v4 08/11] 9pfs: readdir benchmark
  2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
                   ` (6 preceding siblings ...)
  2020-01-21  0:17 ` [PATCH v4 07/11] tests/virtio-9p: failing " Christian Schoenebeck
@ 2020-01-21  0:23 ` Christian Schoenebeck
  2020-01-23 10:34   ` Greg Kurz
  2020-01-21  0:26 ` [PATCH v4 09/11] hw/9pfs/9p-synth: avoid n-square issue in synth_readdir() Christian Schoenebeck
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-21  0:23 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

This patch is not intended to be merged. It just provides a
temporary benchmark foundation for convenient A/B comparison
of the subsequent 9p readdir optimization patches:

* hw/9pfs/9p-synth: increase the number of simulated files for
  the readdir test to 2000 files.

* tests/virtio-9p: measure the wall time that elapsed between
  sending the T_readdir request and arrival of the R_readdir
  response, and print out that measured duration as well as the
  number of directory entries received and the number of bytes
  of the response message.

* tests/virtio-9p: increase msize to 256 kiB to allow retrieving
  all 2000 files (simulated by the 9pfs synth driver) with only
  one T_readdir request.

Running this benchmark is fairly quick & simple and does not
require any guest OS installation or other prerequisites:

cd build
make && make tests/qtest/qos-test
export QTEST_QEMU_BINARY=x86_64-softmmu/qemu-system-x86_64
tests/qtest/qos-test -p $(tests/qtest/qos-test -l | grep readdir/basic)

Since this benchmark uses the 9pfs synth driver, the host
machine's I/O hardware (SSDs/HDDs) is not relevant for the
benchmark result, because the synth backend's readdir
implementation returns immediately (without any blocking I/O
that would be incurred with a real-life fs driver) and just
returns already prepared, simulated directory entries directly
from RAM. So this benchmark focuses on the efficiency of the
9pfs controller code (or top half) for readdir request handling.

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 hw/9pfs/9p-synth.h           |  2 +-
 tests/qtest/virtio-9p-test.c | 37 +++++++++++++++++++++++++++++++++++-
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/hw/9pfs/9p-synth.h b/hw/9pfs/9p-synth.h
index 036d7e4a5b..7d6cedcdac 100644
--- a/hw/9pfs/9p-synth.h
+++ b/hw/9pfs/9p-synth.h
@@ -58,7 +58,7 @@ int qemu_v9fs_synth_add_file(V9fsSynthNode *parent, int mode,
 /* for READDIR test */
 #define QTEST_V9FS_SYNTH_READDIR_DIR "ReadDirDir"
 #define QTEST_V9FS_SYNTH_READDIR_FILE "ReadDirFile%d"
-#define QTEST_V9FS_SYNTH_READDIR_NFILES 100
+#define QTEST_V9FS_SYNTH_READDIR_NFILES 2000
 
 /* Any write to the "FLUSH" file is handled one byte at a time by the
  * backend. If the byte is zero, the backend returns success (ie, 1),
diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
index e47b286340..d71b37aa6c 100644
--- a/tests/qtest/virtio-9p-test.c
+++ b/tests/qtest/virtio-9p-test.c
@@ -15,6 +15,18 @@
 #include "libqos/virtio-9p.h"
 #include "libqos/qgraph.h"
 
+/*
+ * to benchmark the real time (not CPU time) that elapsed between start of
+ * a request and arrival of its response
+ */
+static double wall_time(void)
+{
+    struct timeval t;
+    struct timezone tz;
+    gettimeofday(&t, &tz);
+    return t.tv_sec + t.tv_usec * 0.000001;
+}
+
 #define QVIRTIO_9P_TIMEOUT_US (10 * 1000 * 1000)
 static QGuestAllocator *alloc;
 
@@ -36,7 +48,7 @@ static void pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
     g_free(tag);
 }
 
-#define P9_MAX_SIZE 4096 /* Max size of a T-message or R-message */
+#define P9_MAX_SIZE (256 * 1024) /* Max size of a T-message or R-message */
 
 typedef struct {
     QTestState *qts;
@@ -600,12 +612,35 @@ static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
     v9fs_req_wait_for_reply(req, NULL);
     v9fs_rlopen(req, &qid, NULL);
 
+    const double start = wall_time();
+
     /*
      * submit count = msize - 11, because 11 is the header size of Rreaddir
      */
     req = v9fs_treaddir(v9p, 1, 0, P9_MAX_SIZE - 11, 0);
+    const double treaddir = wall_time();
     v9fs_req_wait_for_reply(req, NULL);
+    const double waitforreply = wall_time();
     v9fs_rreaddir(req, &count, &nentries, &entries);
+    const double end = wall_time();
+
+    printf("\nTime client spent on sending T_readdir: %fs\n\n",
+           treaddir - start);
+
+    printf("Time client spent for waiting for reply from server: %fs "
+           "[MOST IMPORTANT]\n", waitforreply - start);
+    printf("(This is the most important value, because it reflects the time\n"
+           "the 9p server required to process and return the result of the\n"
+           "T_readdir request.)\n\n");
+
+    printf("Total client time: %fs\n", end - start);
+    printf("(NOTE: this time is not relevant; this huge time comes from\n"
+           "inefficient qtest_memread() calls. So you can discard this\n"
+           "value as a problem of this test client implementation while\n"
+           "processing the received server T_readdir reply.)\n\n");
+
+    printf("Details of response message data: R_readddir nentries=%d "
+           "rbytes=%d\n", nentries, count);
 
     /*
      * Assuming msize (P9_MAX_SIZE) is large enough so we can retrieve all
-- 
2.20.1




* [PATCH v4 09/11] hw/9pfs/9p-synth: avoid n-square issue in synth_readdir()
  2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
                   ` (7 preceding siblings ...)
  2020-01-21  0:23 ` [PATCH v4 08/11] 9pfs: readdir benchmark Christian Schoenebeck
@ 2020-01-21  0:26 ` Christian Schoenebeck
  2020-01-23 11:13   ` Greg Kurz
  2020-01-21  0:30 ` [PATCH v4 10/11] 9pfs: T_readdir latency optimization Christian Schoenebeck
  2020-01-21  0:32 ` [PATCH v4 11/11] hw/9pfs/9p.c: benchmark time on T_readdir request Christian Schoenebeck
  10 siblings, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-21  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

This patch is just a temporary benchmark hack, not intended
to be merged!

The 9pfs synth driver's readdir() implementation has a severe
n-square performance problem: synth_get_dentry() walks the
directory's node list from its head on every call, so reading
all N entries of a directory costs O(N^2) list traversals. This
patch is a quick and dirty hack to prevent that performance
problem from tainting the readdir() benchmark results. In its
current form, this patch is not useful for anything other than
an isolated readdir benchmark.
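
For illustration (independent of the QEMU code), the difference in list
traversal work is easy to see by counting node visits for a directory of
N entries; the numbers below are just this toy model, not measurements:

#include <stdio.h>

int main(void)
{
    const long n = 2000; /* directory entries, as in the benchmark */

    /* walking the chained list from its head for every readdir() call */
    long naive = 0;
    for (long off = 0; off < n; ++off) {
        naive += off + 1;          /* visits needed to reach position 'off' */
    }

    /* resuming from the previously returned node instead */
    long cached = n;

    printf("node visits: naive=%ld (O(N^2)), cached=%ld (O(N))\n",
           naive, cached);
    return 0;
}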

NOTE: This patch would break the new readdir/split test,
because it would alter the seekdir() behaviour required for
retrieving directory entries split over several requests.

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 hw/9pfs/9p-synth.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/hw/9pfs/9p-synth.c b/hw/9pfs/9p-synth.c
index 7eb210ffa8..54dc30f37b 100644
--- a/hw/9pfs/9p-synth.c
+++ b/hw/9pfs/9p-synth.c
@@ -225,7 +225,8 @@ static void synth_direntry(V9fsSynthNode *node,
 }
 
 static struct dirent *synth_get_dentry(V9fsSynthNode *dir,
-                                            struct dirent *entry, off_t off)
+                                       struct dirent *entry, off_t off,
+                                       V9fsSynthNode **hack)
 {
     int i = 0;
     V9fsSynthNode *node;
@@ -243,16 +244,38 @@ static struct dirent *synth_get_dentry(V9fsSynthNode *dir,
         /* end of directory */
         return NULL;
     }
+    *hack = node;
     synth_direntry(node, entry, off);
     return entry;
 }
 
 static struct dirent *synth_readdir(FsContext *ctx, V9fsFidOpenState *fs)
 {
-    struct dirent *entry;
+    struct dirent *entry = NULL;
     V9fsSynthOpenState *synth_open = fs->private;
     V9fsSynthNode *node = synth_open->node;
-    entry = synth_get_dentry(node, &synth_open->dent, synth_open->offset);
+
+    /*
+     * HACK: This is just intended for benchmark, to avoid severe n-square
+     * performance problem of synth driver's readdir implementation here which
+     * would otherwise unnecessarily taint the benchmark results. By simply
+     * caching (globally) the previous node (of the previous synth_readdir()
+     * call) we can simply proceed to next node in chained list efficiently.
+     *
+     * not a good idea for any production code ;-)
+     */
+    static struct V9fsSynthNode *cachedNode;
+
+    if (!cachedNode) {
+        entry = synth_get_dentry(node, &synth_open->dent, synth_open->offset,
+                                 &cachedNode);
+    } else {
+        cachedNode = cachedNode->sibling.le_next;
+        if (cachedNode) {
+            entry = &synth_open->dent;
+            synth_direntry(cachedNode, entry, synth_open->offset + 1);
+        }
+    }
     if (entry) {
         synth_open->offset++;
     }
-- 
2.20.1




* [PATCH v4 10/11] 9pfs: T_readdir latency optimization
  2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
                   ` (8 preceding siblings ...)
  2020-01-21  0:26 ` [PATCH v4 09/11] hw/9pfs/9p-synth: avoid n-square issue in synth_readdir() Christian Schoenebeck
@ 2020-01-21  0:30 ` Christian Schoenebeck
  2020-01-23 11:33   ` Greg Kurz
  2020-03-09 14:09   ` Christian Schoenebeck
  2020-01-21  0:32 ` [PATCH v4 11/11] hw/9pfs/9p.c: benchmark time on T_readdir request Christian Schoenebeck
  10 siblings, 2 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-21  0:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

Make the top half really the top half and the bottom half really the
bottom half:

Handling of each T_readdir request currently hops between threads (the
main I/O thread and background I/O driver threads) several times for
every individual directory entry, which sums up to huge latencies for
handling just a single T_readdir request.

Instead of doing that, now collect all required directory entries
(including all potentially required stat buffers for each entry) in
one rush on a background I/O thread from the fs driver, then assemble
the entire resulting network response message for the readdir request
on the main I/O thread. The fs driver still aborts the directory entry
retrieval loop (on the background I/O thread) as soon as it would
exceed the client's requested maximum R_readdir response size. So we
should not incur any performance penalty by doing this.
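
To make the latency argument concrete, here is a tiny standalone model; the
per-hop cost and the hops-per-entry figure are assumptions for illustration
only (not measurements), while the entry count matches the readdir benchmark:

#include <stdio.h>

int main(void)
{
    /* assumptions for illustration only */
    const double hop_us         = 50.0; /* assumed cost of one thread hand-off */
    const int    nentries       = 2000; /* as in the readdir benchmark         */
    const int    hops_per_entry = 3;    /* readdir + name_to_path + lstat
                                           (inode remapping case)              */

    double before = nentries * hops_per_entry * hop_us / 1e6;
    double after  = 1 * hop_us / 1e6;

    printf("old scheme: ~%.3f s spent purely on thread hand-offs\n", before);
    printf("new scheme: ~%.6f s spent purely on thread hand-offs\n", after);
    return 0;
}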

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 hw/9pfs/9p.c    | 124 +++++++++++++++-----------------
 hw/9pfs/9p.h    |  23 ++++++
 hw/9pfs/codir.c | 183 +++++++++++++++++++++++++++++++++++++++++++++---
 hw/9pfs/coth.h  |   3 +
 4 files changed, 254 insertions(+), 79 deletions(-)

diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index 18370183c4..e0ca45d46b 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -971,30 +971,6 @@ static int coroutine_fn fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp,
     return 0;
 }
 
-static int coroutine_fn dirent_to_qid(V9fsPDU *pdu, V9fsFidState *fidp,
-                                      struct dirent *dent, V9fsQID *qidp)
-{
-    struct stat stbuf;
-    V9fsPath path;
-    int err;
-
-    v9fs_path_init(&path);
-
-    err = v9fs_co_name_to_path(pdu, &fidp->path, dent->d_name, &path);
-    if (err < 0) {
-        goto out;
-    }
-    err = v9fs_co_lstat(pdu, &path, &stbuf);
-    if (err < 0) {
-        goto out;
-    }
-    err = stat_to_qid(pdu, &stbuf, qidp);
-
-out:
-    v9fs_path_free(&path);
-    return err;
-}
-
 V9fsPDU *pdu_alloc(V9fsState *s)
 {
     V9fsPDU *pdu = NULL;
@@ -2314,7 +2290,7 @@ out_nofid:
     pdu_complete(pdu, err);
 }
 
-static size_t v9fs_readdir_data_size(V9fsString *name)
+size_t v9fs_readdir_response_size(V9fsString *name)
 {
     /*
      * Size of each dirent on the wire: size of qid (13) + size of offset (8)
@@ -2323,6 +2299,18 @@ static size_t v9fs_readdir_data_size(V9fsString *name)
     return 24 + v9fs_string_size(name);
 }
 
+static void v9fs_free_dirents(struct V9fsDirEnt *e)
+{
+    struct V9fsDirEnt *next = NULL;
+
+    for (; e; e = next) {
+        next = e->next;
+        g_free(e->dent);
+        g_free(e->st);
+        g_free(e);
+    }
+}
+
 static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
                                         int32_t max_count)
 {
@@ -2331,54 +2319,53 @@ static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
     V9fsString name;
     int len, err = 0;
     int32_t count = 0;
-    off_t saved_dir_pos;
     struct dirent *dent;
+    struct stat *st;
+    struct V9fsDirEnt *entries = NULL;
 
-    /* save the directory position */
-    saved_dir_pos = v9fs_co_telldir(pdu, fidp);
-    if (saved_dir_pos < 0) {
-        return saved_dir_pos;
-    }
-
-    while (1) {
-        v9fs_readdir_lock(&fidp->fs.dir);
+    /*
+     * inode remapping requires the device id, which in turn might be
+     * different for different directory entries, so if inode remapping is
+     * enabled we have to make a full stat for each directory entry
+     */
+    const bool dostat = pdu->s->ctx.export_flags & V9FS_REMAP_INODES;
 
-        err = v9fs_co_readdir(pdu, fidp, &dent);
-        if (err || !dent) {
-            break;
-        }
-        v9fs_string_init(&name);
-        v9fs_string_sprintf(&name, "%s", dent->d_name);
-        if ((count + v9fs_readdir_data_size(&name)) > max_count) {
-            v9fs_readdir_unlock(&fidp->fs.dir);
+    /*
+     * Fetch all required directory entries altogether on a background IO
+     * thread from fs driver. We don't want to do that for each entry
+     * individually, because hopping between threads (this main IO thread
+     * and background IO driver thread) would sum up to huge latencies.
+     */
+    count = v9fs_co_readdir_lowlat(pdu, fidp, &entries, max_count, dostat);
+    if (count < 0) {
+        err = count;
+        count = 0;
+        goto out;
+    }
+    count = 0;
 
-            /* Ran out of buffer. Set dir back to old position and return */
-            v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
-            v9fs_string_free(&name);
-            return count;
-        }
+    for (struct V9fsDirEnt *e = entries; e; e = e->next) {
+        dent = e->dent;
 
         if (pdu->s->ctx.export_flags & V9FS_REMAP_INODES) {
-            /*
-             * dirent_to_qid() implies expensive stat call for each entry,
-             * we must do that here though since inode remapping requires
-             * the device id, which in turn might be different for
-             * different entries; we cannot make any assumption to avoid
-             * that here.
-             */
-            err = dirent_to_qid(pdu, fidp, dent, &qid);
+            st = e->st;
+            /* e->st should never be NULL, but just to be sure */
+            if (!st) {
+                err = -1;
+                break;
+            }
+
+            /* remap inode */
+            err = stat_to_qid(pdu, st, &qid);
             if (err < 0) {
-                v9fs_readdir_unlock(&fidp->fs.dir);
-                v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
-                v9fs_string_free(&name);
-                return err;
+                break;
             }
         } else {
             /*
              * Fill up just the path field of qid because the client uses
              * only that. To fill the entire qid structure we will have
              * to stat each dirent found, which is expensive. For the
-             * latter reason we don't call dirent_to_qid() here. Only drawback
+             * latter reason we don't call stat_to_qid() here. Only drawback
              * is that no multi-device export detection of stat_to_qid()
              * would be done and provided as error to the user here. But
              * user would get that error anyway when accessing those
@@ -2391,25 +2378,26 @@ static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
             qid.version = 0;
         }
 
+        v9fs_string_init(&name);
+        v9fs_string_sprintf(&name, "%s", dent->d_name);
+
         /* 11 = 7 + 4 (7 = start offset, 4 = space for storing count) */
         len = pdu_marshal(pdu, 11 + count, "Qqbs",
                           &qid, dent->d_off,
                           dent->d_type, &name);
 
-        v9fs_readdir_unlock(&fidp->fs.dir);
+        v9fs_string_free(&name);
 
         if (len < 0) {
-            v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
-            v9fs_string_free(&name);
-            return len;
+            err = len;
+            break;
         }
+
         count += len;
-        v9fs_string_free(&name);
-        saved_dir_pos = dent->d_off;
     }
 
-    v9fs_readdir_unlock(&fidp->fs.dir);
-
+out:
+    v9fs_free_dirents(entries);
     if (err < 0) {
         return err;
     }
diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
index 6fffe44f5a..1dbb0ad189 100644
--- a/hw/9pfs/9p.h
+++ b/hw/9pfs/9p.h
@@ -215,6 +215,28 @@ static inline void v9fs_readdir_init(V9fsDir *dir)
     qemu_mutex_init(&dir->readdir_mutex);
 }
 
+/*
+ * Type for 9p fs drivers' (a.k.a. 9p backends) result of readdir requests,
+ * which is a chained list of directory entries.
+ */
+typedef struct V9fsDirEnt {
+    /* mandatory (must not be NULL) information for all readdir requests */
+    struct dirent *dent;
+    /*
+     * optional (may be NULL): A full stat of each directory entry is just
+     * done if explicitly told to fs driver.
+     */
+    struct stat *st;
+    /*
+     * instead of an array, directory entries are always returned as
+     * chained list, that's because the amount of entries retrieved by fs
+     * drivers is dependent on the individual entries' name (since response
+     * messages are size limited), so the final amount cannot be estimated
+     * before hand
+     */
+    struct V9fsDirEnt *next;
+} V9fsDirEnt;
+
 /*
  * Filled by fs driver on open and other
  * calls.
@@ -419,6 +441,7 @@ void v9fs_path_init(V9fsPath *path);
 void v9fs_path_free(V9fsPath *path);
 void v9fs_path_sprintf(V9fsPath *path, const char *fmt, ...);
 void v9fs_path_copy(V9fsPath *dst, const V9fsPath *src);
+size_t v9fs_readdir_response_size(V9fsString *name);
 int v9fs_name_to_path(V9fsState *s, V9fsPath *dirpath,
                       const char *name, V9fsPath *path);
 int v9fs_device_realize_common(V9fsState *s, const V9fsTransport *t,
diff --git a/hw/9pfs/codir.c b/hw/9pfs/codir.c
index 73f9a751e1..6ce7dc8cde 100644
--- a/hw/9pfs/codir.c
+++ b/hw/9pfs/codir.c
@@ -18,28 +18,189 @@
 #include "qemu/main-loop.h"
 #include "coth.h"
 
+/*
+ * This is solely executed on a background IO thread.
+ */
+static int do_readdir(V9fsPDU *pdu, V9fsFidState *fidp, struct dirent **dent)
+{
+    int err = 0;
+    V9fsState *s = pdu->s;
+    struct dirent *entry;
+
+    errno = 0;
+    entry = s->ops->readdir(&s->ctx, &fidp->fs);
+    if (!entry && errno) {
+        *dent = NULL;
+        err = -errno;
+    } else {
+        *dent = entry;
+    }
+    return err;
+}
+
+/*
+ * TODO: This will be removed for performance reasons.
+ * Use v9fs_co_readdir_lowlat() instead.
+ */
 int coroutine_fn v9fs_co_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
                                  struct dirent **dent)
 {
     int err;
-    V9fsState *s = pdu->s;
 
     if (v9fs_request_cancelled(pdu)) {
         return -EINTR;
     }
-    v9fs_co_run_in_worker(
-        {
-            struct dirent *entry;
+    v9fs_co_run_in_worker({
+        err = do_readdir(pdu, fidp, dent);
+    });
+    return err;
+}
+
+/*
+ * This is solely executed on a background IO thread.
+ *
+ * See v9fs_co_readdir_lowlat() (as its only user) below for details.
+ */
+static int do_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
+                             struct V9fsDirEnt **entries,
+                             int32_t maxsize, bool dostat)
+{
+    V9fsState *s = pdu->s;
+    V9fsString name;
+    int len, err = 0;
+    int32_t size = 0;
+    off_t saved_dir_pos;
+    struct dirent *dent;
+    struct V9fsDirEnt *e = NULL;
+    V9fsPath path;
+    struct stat stbuf;
+
+    *entries = NULL;
+    v9fs_path_init(&path);
+
+    /*
+     * TODO: Here should be a warn_report_once() if lock failed.
+     *
+     * With a good 9p client we should not get into concurrency here,
+     * because a good client would not use the same fid for concurrent
+     * requests. We do the lock here for safety reasons though. However
+     * the client would then suffer performance issues, so better log that
+     * issue here.
+     */
+    v9fs_readdir_lock(&fidp->fs.dir);
+
+    /* save the directory position */
+    saved_dir_pos = s->ops->telldir(&s->ctx, &fidp->fs);
+    if (saved_dir_pos < 0) {
+        err = saved_dir_pos;
+        goto out;
+    }
+
+    while (true) {
+        /* get directory entry from fs driver */
+        err = do_readdir(pdu, fidp, &dent);
+        if (err || !dent) {
+            break;
+        }
+
+        /*
+         * stop this loop as soon as it would exceed the allowed maximum
+         * response message size for the directory entries collected so far,
+         * because anything beyond that size would need to be discarded by
+         * 9p controller (main thread / top half) anyway
+         */
+        v9fs_string_init(&name);
+        v9fs_string_sprintf(&name, "%s", dent->d_name);
+        len = v9fs_readdir_response_size(&name);
+        v9fs_string_free(&name);
+        if (size + len > maxsize) {
+            /* this is not an error case actually */
+            break;
+        }
+
+        /* append next node to result chain */
+        if (!e) {
+            *entries = e = g_malloc0(sizeof(V9fsDirEnt));
+        } else {
+            e = e->next = g_malloc0(sizeof(V9fsDirEnt));
+        }
+        e->dent = g_malloc0(sizeof(struct dirent));
+        memcpy(e->dent, dent, sizeof(struct dirent));
 
-            errno = 0;
-            entry = s->ops->readdir(&s->ctx, &fidp->fs);
-            if (!entry && errno) {
+        /* perform a full stat() for directory entry if requested by caller */
+        if (dostat) {
+            err = s->ops->name_to_path(
+                &s->ctx, &fidp->path, dent->d_name, &path
+            );
+            if (err < 0) {
                 err = -errno;
-            } else {
-                *dent = entry;
-                err = 0;
+                break;
             }
-        });
+
+            err = s->ops->lstat(&s->ctx, &path, &stbuf);
+            if (err < 0) {
+                err = -errno;
+                break;
+            }
+
+            e->st = g_malloc0(sizeof(struct stat));
+            memcpy(e->st, &stbuf, sizeof(struct stat));
+        }
+
+        size += len;
+        saved_dir_pos = dent->d_off;
+    }
+
+    /* restore (last) saved position */
+    s->ops->seekdir(&s->ctx, &fidp->fs, saved_dir_pos);
+
+out:
+    v9fs_readdir_unlock(&fidp->fs.dir);
+    v9fs_path_free(&path);
+    if (err < 0) {
+        return err;
+    }
+    return size;
+}
+
+/**
+ * @brief Low latency variant of fs driver readdir handling.
+ *
+ * Retrieves the requested (max. amount of) directory entries from the fs
+ * driver. This function must only be called by the main IO thread (top half).
+ * Internally this function call will be dispatched to a background IO thread
+ * (bottom half) where it is eventually executed by the fs driver.
+ *
+ * The old readdir implementation above retrieves just one directory entry
+ * per call. The problem with that implementation is that latency is added
+ * for (the retrieval of) each directory entry, which in practice leads to
+ * latencies of several hundred ms for a readdir of only one directory.
+ *
+ * This is avoided in this function by letting the fs driver retrieve all
+ * required directory entries with only one call of this function and hence
+ * with only a single fs driver request.
+ *
+ * @param pdu - the causing 9p (T_readdir) client request
+ * @param fidp - already opened directory where readdir shall be performed on
+ * @param entries - output for directory entries (must not be NULL)
+ * @param maxsize - maximum result message body size (in bytes)
+ * @param dostat - whether a stat() should be performed and returned for
+ *                 each directory entry
+ * @returns resulting response message body size (in bytes) on success,
+ *          negative error code otherwise
+ */
+int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
+                                        struct V9fsDirEnt **entries,
+                                        int32_t maxsize, bool dostat)
+{
+    int err = 0;
+
+    if (v9fs_request_cancelled(pdu)) {
+        return -EINTR;
+    }
+    v9fs_co_run_in_worker({
+        err = do_readdir_lowlat(pdu, fidp, entries, maxsize, dostat);
+    });
     return err;
 }
 
diff --git a/hw/9pfs/coth.h b/hw/9pfs/coth.h
index c2cdc7a9ea..1249dbe6df 100644
--- a/hw/9pfs/coth.h
+++ b/hw/9pfs/coth.h
@@ -49,6 +49,9 @@
 void co_run_in_worker_bh(void *);
 int coroutine_fn v9fs_co_readlink(V9fsPDU *, V9fsPath *, V9fsString *);
 int coroutine_fn v9fs_co_readdir(V9fsPDU *, V9fsFidState *, struct dirent **);
+int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *, V9fsFidState *,
+                                        struct V9fsDirEnt **,
+                                        int32_t, bool);
 off_t coroutine_fn v9fs_co_telldir(V9fsPDU *, V9fsFidState *);
 void coroutine_fn v9fs_co_seekdir(V9fsPDU *, V9fsFidState *, off_t);
 void coroutine_fn v9fs_co_rewinddir(V9fsPDU *, V9fsFidState *);
-- 
2.20.1




* [PATCH v4 11/11] hw/9pfs/9p.c: benchmark time on T_readdir request
  2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
                   ` (9 preceding siblings ...)
  2020-01-21  0:30 ` [PATCH v4 10/11] 9pfs: T_readdir latency optimization Christian Schoenebeck
@ 2020-01-21  0:32 ` Christian Schoenebeck
  10 siblings, 0 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-21  0:32 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

This patch is not intended to be merged. It measures and prints
the time the 9p server spends on handling a T_readdir request:
both the total time spent on handling the request and the time
spent on I/O (fs driver) only.

Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
---
 hw/9pfs/9p.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index e0ca45d46b..9cd494edd7 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -2311,6 +2311,15 @@ static void v9fs_free_dirents(struct V9fsDirEnt *e)
     }
 }
 
+static double wall_time(void)
+{
+    struct timeval t;
+    struct timezone tz;
+    gettimeofday(&t, &tz);
+    return t.tv_sec + t.tv_usec * 0.000001;
+}
+
+
 static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
                                         int32_t max_count)
 {
@@ -2330,6 +2339,8 @@ static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
      */
     const bool dostat = pdu->s->ctx.export_flags & V9FS_REMAP_INODES;
 
+    const double start = wall_time();
+
     /*
      * Fetch all required directory entries altogether on a background IO
      * thread from fs driver. We don't want to do that for each entry
@@ -2344,6 +2355,10 @@ static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
     }
     count = 0;
 
+    const double end = wall_time();
+    printf("\n\nTime 9p server spent on synth_readdir() I/O only (synth "
+           "driver): %fs\n", end - start);
+
     for (struct V9fsDirEnt *e = entries; e; e = e->next) {
         dent = e->dent;
 
@@ -2416,6 +2431,8 @@ static void coroutine_fn v9fs_readdir(void *opaque)
     V9fsPDU *pdu = opaque;
     V9fsState *s = pdu->s;
 
+    const double start = wall_time();
+
     retval = pdu_unmarshal(pdu, offset, "dqd", &fid,
                            &initial_offset, &max_count);
     if (retval < 0) {
@@ -2459,6 +2476,10 @@ out:
     put_fid(pdu, fidp);
 out_nofid:
     pdu_complete(pdu, retval);
+
+    const double end = wall_time();
+    printf("Time 9p server spent on entire T_readdir request: %fs "
+           "[IMPORTANT]\n", end - start);
 }
 
 static int v9fs_xattr_write(V9fsState *s, V9fsPDU *pdu, V9fsFidState *fidp,
-- 
2.20.1




* [PATCH v4 00/11] 9pfs: readdir optimization
@ 2020-01-21  0:36 Christian Schoenebeck
  2020-01-20 22:29 ` [PATCH v4 01/11] tests/virtio-9p: add terminating null in v9fs_string_read() Christian Schoenebeck
                   ` (10 more replies)
  0 siblings, 11 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-21  0:36 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

As previously mentioned, I was investigating performance issues with 9pfs.
Raw file read/write of 9pfs is actually quite good, provided that client
picked a reasonable high msize (maximum message size). I would recommend
to log a warning on 9p server side if a client attached with a small msize
that would cause performance issues for that reason.

However there are other aspects where 9pfs currently performs suboptimally,
especially readdir handling of 9pfs is extremely slow, a simple readdir
request of a guest typically blocks for several hundred milliseconds or
even several seconds, no matter how powerful the underlying hardware is.
The reason for this performance issue: latency.
Currently 9pfs is heavily dispatching a T_readdir request numerous times
between main I/O thread and a background I/O thread back and forth; in fact
it is actually hopping between threads even multiple times for every single
directory entry during T_readdir request handling which leads in total to
huge latencies for a single T_readdir request.

This patch series aims to address this severe performance issue of 9pfs
T_readdir request handling. The actual performance fix is patch 10. I also
provided a convenient benchmark for comparing the performance improvements,
based on the 9pfs "synth" driver (see patch 8 for instructions on how to
run the benchmark), so no guest OS installation is required to perform this
benchmark A/B comparison. With patch 10 I achieved a performance improvement
of a factor of 40 on my test machine.

** NOTE: ** As outlined by patch 7 there seems to be an outstanding issue
(both with the current, unoptimized readdir code and with the new, optimized
readdir code) causing a transport error with split readdir requests. This
issue only occurs if patch 7 is applied. I haven't investigated the cause of
this issue yet; it looks like a memory issue though. I am not sure if it is a
problem with the actual 9p server or rather "just" with the test environment.
Apart from that issue, split readdir itself seems to work well with the new,
performance optimized readdir code.

v3->v4:

  * Rebased to master (SHA-1 43d1455c).

  * Adjusted commit log message [patch 2], [patch 3], [patch 8].

  * Fixed using Rreaddir header size of 11 (instead of P9_IOHDRSZ) for
    limiting 'count' parameter of Treaddir [patch 3], [patch 5].

Christian Schoenebeck (11):
  tests/virtio-9p: add terminating null in v9fs_string_read()
  9pfs: require msize >= 4096
  9pfs: validate count sent by client with T_readdir
  hw/9pfs/9p-synth: added directory for readdir test
  tests/virtio-9p: added readdir test
  tests/virtio-9p: added splitted readdir test
  tests/virtio-9p: failing splitted readdir test
  9pfs: readdir benchmark
  hw/9pfs/9p-synth: avoid n-square issue in synth_readdir()
  9pfs: T_readdir latency optimization
  hw/9pfs/9p.c: benchmark time on T_readdir request

 hw/9pfs/9p-synth.c           |  48 +++++-
 hw/9pfs/9p-synth.h           |   5 +
 hw/9pfs/9p.c                 | 163 ++++++++++++--------
 hw/9pfs/9p.h                 |  34 ++++
 hw/9pfs/codir.c              | 183 ++++++++++++++++++++--
 hw/9pfs/coth.h               |   3 +
 tests/qtest/virtio-9p-test.c | 290 ++++++++++++++++++++++++++++++++++-
 7 files changed, 643 insertions(+), 83 deletions(-)

-- 
2.20.1



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 03/11] 9pfs: validate count sent by client with T_readdir
  2020-01-20 23:50 ` [PATCH v4 03/11] 9pfs: validate count sent by client with T_readdir Christian Schoenebeck
@ 2020-01-22 14:11   ` Greg Kurz
  2020-01-22 14:26     ` Christian Schoenebeck
  0 siblings, 1 reply; 39+ messages in thread
From: Greg Kurz @ 2020-01-22 14:11 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Tue, 21 Jan 2020 00:50:33 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> A good 9p client sends T_readdir with "count" parameter that's sufficiently
> smaller than client's initially negotiated msize (maximum message size).
> We perform a check for that though to avoid the server to be interrupted
> with a "Failed to encode VirtFS reply type 41" transport error message by
> bad clients. This count value constraint uses msize - 11, because 11 is the
> header size of R_readdir.
> 

This would be worth a comment...

> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---
>  hw/9pfs/9p.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> index a5fbe821d4..18370183c4 100644
> --- a/hw/9pfs/9p.c
> +++ b/hw/9pfs/9p.c
> @@ -2426,6 +2426,7 @@ static void coroutine_fn v9fs_readdir(void *opaque)
>      int32_t count;
>      uint32_t max_count;
>      V9fsPDU *pdu = opaque;
> +    V9fsState *s = pdu->s;
>  
>      retval = pdu_unmarshal(pdu, offset, "dqd", &fid,
>                             &initial_offset, &max_count);
> @@ -2434,6 +2435,13 @@ static void coroutine_fn v9fs_readdir(void *opaque)
>      }
>      trace_v9fs_readdir(pdu->tag, pdu->id, fid, initial_offset, max_count);
>  

... here. Something like:

    /* Enough space for a R_readdir header: size[4] Rreaddir tag[2] count[4] */

I can fix this in my tree, and actually done so since I've
applied patches 1 to 3.
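
i.e. just to illustrate, the check would then look roughly like this
(comment wording as above, code as in your patch):

    /* Enough space for a R_readdir header: size[4] Rreaddir tag[2] count[4] */
    if (max_count > s->msize - 11) {
        max_count = s->msize - 11;
        warn_report_once(
            "9p: bad client: T_readdir with count > msize - 11"
        );
    }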

Reviewed-by: Greg Kurz <groug@kaod.org>

> +    if (max_count > s->msize - 11) {
> +        max_count = s->msize - 11;
> +        warn_report_once(
> +            "9p: bad client: T_readdir with count > msize - 11"
> +        );
> +    }
> +
>      fidp = get_fid(pdu, fid);
>      if (fidp == NULL) {
>          retval = -EINVAL;



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 03/11] 9pfs: validate count sent by client with T_readdir
  2020-01-22 14:11   ` Greg Kurz
@ 2020-01-22 14:26     ` Christian Schoenebeck
  0 siblings, 0 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-22 14:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Mittwoch, 22. Januar 2020 15:11:07 CET Greg Kurz wrote:
> On Tue, 21 Jan 2020 00:50:33 +0100
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > A good 9p client sends T_readdir with "count" parameter that's
> > sufficiently
> > smaller than client's initially negotiated msize (maximum message size).
> > We perform a check for that though to avoid the server to be interrupted
> > with a "Failed to encode VirtFS reply type 41" transport error message by
> > bad clients. This count value constraint uses msize - 11, because 11 is
> > the
> > header size of R_readdir.
> 
> This would be worth a comment...
> 
> > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > ---
> > 
> >  hw/9pfs/9p.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> > index a5fbe821d4..18370183c4 100644
> > --- a/hw/9pfs/9p.c
> > +++ b/hw/9pfs/9p.c
> > @@ -2426,6 +2426,7 @@ static void coroutine_fn v9fs_readdir(void *opaque)
> > 
> >      int32_t count;
> >      uint32_t max_count;
> >      V9fsPDU *pdu = opaque;
> > 
> > +    V9fsState *s = pdu->s;
> > 
> >      retval = pdu_unmarshal(pdu, offset, "dqd", &fid,
> >      
> >                             &initial_offset, &max_count);
> > 
> > @@ -2434,6 +2435,13 @@ static void coroutine_fn v9fs_readdir(void *opaque)
> > 
> >      }
> >      trace_v9fs_readdir(pdu->tag, pdu->id, fid, initial_offset,
> >      max_count);
> 
> ... here. Something like:
> 
>     /* Enough space for a R_readdir header: size[4] Rreaddir tag[2] count[4]
> */
> 
> I can fix this in my tree, and actually done so since I've
> applied patches 1 to 3.

Fine with me, thanks Greg!

> Reviewed-by: Greg Kurz <groug@kaod.org>

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 05/11] tests/virtio-9p: added readdir test
  2020-01-21  0:12 ` [PATCH v4 05/11] tests/virtio-9p: added " Christian Schoenebeck
@ 2020-01-22 19:56   ` Greg Kurz
  0 siblings, 0 replies; 39+ messages in thread
From: Greg Kurz @ 2020-01-22 19:56 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Tue, 21 Jan 2020 01:12:00 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> The first readdir test simply checks the amount of directory
> entries returned by 9pfs server, according to the created amount
> of virtual files on 9pfs synth driver side. Then the subsequent
> readdir test also checks whether all directory entries have the
> expected file names (as created on 9pfs synth driver side),
> ignoring their precise order in result list though.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---

LGTM

Reviewed-by: Greg Kurz <groug@kaod.org>

I've applied patches 4 and 5 as well.

>  tests/qtest/virtio-9p-test.c | 152 +++++++++++++++++++++++++++++++++++
>  1 file changed, 152 insertions(+)
> 
> diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
> index 06263edb53..2167322985 100644
> --- a/tests/qtest/virtio-9p-test.c
> +++ b/tests/qtest/virtio-9p-test.c
> @@ -68,6 +68,11 @@ static void v9fs_memread(P9Req *req, void *addr, size_t len)
>      req->r_off += len;
>  }
>  
> +static void v9fs_uint8_read(P9Req *req, uint8_t *val)
> +{
> +    v9fs_memread(req, val, 1);
> +}
> +
>  static void v9fs_uint16_write(P9Req *req, uint16_t val)
>  {
>      uint16_t le_val = cpu_to_le16(val);
> @@ -101,6 +106,12 @@ static void v9fs_uint32_read(P9Req *req, uint32_t *val)
>      le32_to_cpus(val);
>  }
>  
> +static void v9fs_uint64_read(P9Req *req, uint64_t *val)
> +{
> +    v9fs_memread(req, val, 8);
> +    le64_to_cpus(val);
> +}
> +
>  /* len[2] string[len] */
>  static uint16_t v9fs_string_size(const char *string)
>  {
> @@ -191,6 +202,7 @@ static const char *rmessage_name(uint8_t id)
>          id == P9_RLOPEN ? "RLOPEN" :
>          id == P9_RWRITE ? "RWRITE" :
>          id == P9_RFLUSH ? "RFLUSH" :
> +        id == P9_RREADDIR ? "READDIR" :
>          "<unknown>";
>  }
>  
> @@ -348,6 +360,82 @@ static void v9fs_rwalk(P9Req *req, uint16_t *nwqid, v9fs_qid **wqid)
>      v9fs_req_free(req);
>  }
>  
> +/* size[4] Treaddir tag[2] fid[4] offset[8] count[4] */
> +static P9Req *v9fs_treaddir(QVirtio9P *v9p, uint32_t fid, uint64_t offset,
> +                            uint32_t count, uint16_t tag)
> +{
> +    P9Req *req;
> +
> +    req = v9fs_req_init(v9p, 4 + 8 + 4, P9_TREADDIR, tag);
> +    v9fs_uint32_write(req, fid);
> +    v9fs_uint64_write(req, offset);
> +    v9fs_uint32_write(req, count);
> +    v9fs_req_send(req);
> +    return req;
> +}
> +
> +struct V9fsDirent {
> +    v9fs_qid qid;
> +    uint64_t offset;
> +    uint8_t type;
> +    char *name;
> +    struct V9fsDirent *next;
> +};
> +
> +/* size[4] Rreaddir tag[2] count[4] data[count] */
> +static void v9fs_rreaddir(P9Req *req, uint32_t *count, uint32_t *nentries,
> +                          struct V9fsDirent **entries)
> +{
> +    uint32_t local_count;
> +    struct V9fsDirent *e = NULL;
> +    uint16_t slen;
> +    uint32_t n = 0;
> +
> +    v9fs_req_recv(req, P9_RREADDIR);
> +    v9fs_uint32_read(req, &local_count);
> +
> +    if (count) {
> +        *count = local_count;
> +    }
> +
> +    for (int32_t togo = (int32_t)local_count;
> +         togo >= 13 + 8 + 1 + 2;
> +         togo -= 13 + 8 + 1 + 2 + slen, ++n)
> +    {
> +        if (!e) {
> +            e = g_malloc(sizeof(struct V9fsDirent));
> +            if (entries) {
> +                *entries = e;
> +            }
> +        } else {
> +            e = e->next = g_malloc(sizeof(struct V9fsDirent));
> +        }
> +        e->next = NULL;
> +        /* qid[13] offset[8] type[1] name[s] */
> +        v9fs_memread(req, &e->qid, 13);
> +        v9fs_uint64_read(req, &e->offset);
> +        v9fs_uint8_read(req, &e->type);
> +        v9fs_string_read(req, &slen, &e->name);
> +    }
> +
> +    if (nentries) {
> +        *nentries = n;
> +    }
> +
> +    v9fs_req_free(req);
> +}
> +
> +static void v9fs_free_dirents(struct V9fsDirent *e)
> +{
> +    struct V9fsDirent *next = NULL;
> +
> +    for (; e; e = next) {
> +        next = e->next;
> +        g_free(e->name);
> +        g_free(e);
> +    }
> +}
> +
>  /* size[4] Tlopen tag[2] fid[4] flags[4] */
>  static P9Req *v9fs_tlopen(QVirtio9P *v9p, uint32_t fid, uint32_t flags,
>                            uint16_t tag)
> @@ -480,6 +568,69 @@ static void fs_walk(void *obj, void *data, QGuestAllocator *t_alloc)
>      g_free(wqid);
>  }
>  
> +static bool fs_dirents_contain_name(struct V9fsDirent *e, const char* name)
> +{
> +    for (; e; e = e->next) {
> +        if (!strcmp(e->name, name)) {
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
> +static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
> +{
> +    QVirtio9P *v9p = obj;
> +    alloc = t_alloc;
> +    char *const wnames[] = { g_strdup(QTEST_V9FS_SYNTH_READDIR_DIR) };
> +    uint16_t nqid;
> +    v9fs_qid qid;
> +    uint32_t count, nentries;
> +    struct V9fsDirent *entries = NULL;
> +    P9Req *req;
> +
> +    fs_attach(v9p, NULL, t_alloc);
> +    req = v9fs_twalk(v9p, 0, 1, 1, wnames, 0);
> +    v9fs_req_wait_for_reply(req, NULL);
> +    v9fs_rwalk(req, &nqid, NULL);
> +    g_assert_cmpint(nqid, ==, 1);
> +
> +    req = v9fs_tlopen(v9p, 1, O_DIRECTORY, 0);
> +    v9fs_req_wait_for_reply(req, NULL);
> +    v9fs_rlopen(req, &qid, NULL);
> +
> +    /*
> +     * submit count = msize - 11, because 11 is the header size of Rreaddir
> +     */
> +    req = v9fs_treaddir(v9p, 1, 0, P9_MAX_SIZE - 11, 0);
> +    v9fs_req_wait_for_reply(req, NULL);
> +    v9fs_rreaddir(req, &count, &nentries, &entries);
> +
> +    /*
> +     * Assuming msize (P9_MAX_SIZE) is large enough so we can retrieve all
> +     * dir entries with only one readdir request.
> +     */
> +    g_assert_cmpint(
> +        nentries, ==,
> +        QTEST_V9FS_SYNTH_READDIR_NFILES + 2 /* "." and ".." */
> +    );
> +
> +    /*
> +     * Check all file names exist in returned entries, ignore their order
> +     * though.
> +     */
> +    g_assert_cmpint(fs_dirents_contain_name(entries, "."), ==, true);
> +    g_assert_cmpint(fs_dirents_contain_name(entries, ".."), ==, true);
> +    for (int i = 0; i < QTEST_V9FS_SYNTH_READDIR_NFILES; ++i) {
> +        char *name = g_strdup_printf(QTEST_V9FS_SYNTH_READDIR_FILE, i);
> +        g_assert_cmpint(fs_dirents_contain_name(entries, name), ==, true);
> +        g_free(name);
> +    }
> +
> +    v9fs_free_dirents(entries);
> +    g_free(wnames[0]);
> +}
> +
>  static void fs_walk_no_slash(void *obj, void *data, QGuestAllocator *t_alloc)
>  {
>      QVirtio9P *v9p = obj;
> @@ -658,6 +809,7 @@ static void register_virtio_9p_test(void)
>                   NULL);
>      qos_add_test("fs/flush/ignored", "virtio-9p", fs_flush_ignored,
>                   NULL);
> +    qos_add_test("fs/readdir/basic", "virtio-9p", fs_readdir, NULL);
>  }
>  
>  libqos_init(register_virtio_9p_test);



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 06/11] tests/virtio-9p: added splitted readdir test
  2020-01-21  0:16 ` [PATCH v4 06/11] tests/virtio-9p: added splitted " Christian Schoenebeck
@ 2020-01-22 21:14   ` Eric Blake
  2020-01-22 21:29     ` Christian Schoenebeck
  2020-01-22 21:19   ` Greg Kurz
  1 sibling, 1 reply; 39+ messages in thread
From: Eric Blake @ 2020-01-22 21:14 UTC (permalink / raw)
  To: Christian Schoenebeck, qemu-devel; +Cc: Greg Kurz

On 1/20/20 6:16 PM, Christian Schoenebeck wrote:
> The previous, already existing readdir test simply used a 'count'
> parameter big enough to retrieve all directory entries with a
> single Treaddir request.
> 
> In this new 'splitted' readdir test, directory entries are

English is weird; the past tense of 'split' is 'split', not 'splitted'

> retrieved, splitted over several Treaddir requests by picking small

and again

> 'count' parameters which force the server to truncate the response.
> So the test client sends as many Treaddir requests as necessary to
> get all directory entries. Currently this test covers actually two
> tests: a sequence of Treaddir requests with count=512 and then a
> subsequent test with a sequence of Treaddir requests with count=256.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---
>   tests/qtest/virtio-9p-test.c | 91 ++++++++++++++++++++++++++++++++++++
>   1 file changed, 91 insertions(+)
> 

>   
> +/* readdir test where overall request is splitted over several messages */

and again

> +static void fs_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc)
> +{
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 06/11] tests/virtio-9p: added splitted readdir test
  2020-01-21  0:16 ` [PATCH v4 06/11] tests/virtio-9p: added splitted " Christian Schoenebeck
  2020-01-22 21:14   ` Eric Blake
@ 2020-01-22 21:19   ` Greg Kurz
  2020-01-22 22:36     ` Christian Schoenebeck
  1 sibling, 1 reply; 39+ messages in thread
From: Greg Kurz @ 2020-01-22 21:19 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Tue, 21 Jan 2020 01:16:21 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> The previous, already existing readdir test simply used a 'count'
> parameter big enough to retrieve all directory entries with a
> single Treaddir request.
> 
> In this new 'splitted' readdir test, directory entries are
> retrieved, splitted over several Treaddir requests by picking small
> 'count' parameters which force the server to truncate the response.
> So the test client sends as many Treaddir requests as necessary to
> get all directory entries. Currently this test covers actually two
> tests: a sequence of Treaddir requests with count=512 and then a
> subsequent test with a sequence of Treaddir requests with count=256.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---

Not sure it is really needed to check with multiple values for 'count',
but it doesn't eat too many cycles so I guess it doesn't hurt.

Applied as well.

>  tests/qtest/virtio-9p-test.c | 91 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 91 insertions(+)
> 
> diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
> index 2167322985..8b0d94546e 100644
> --- a/tests/qtest/virtio-9p-test.c
> +++ b/tests/qtest/virtio-9p-test.c
> @@ -578,6 +578,7 @@ static bool fs_dirents_contain_name(struct V9fsDirent *e, const char* name)
>      return false;
>  }
>  
> +/* basic readdir test where reply fits into a single response message */
>  static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
>  {
>      QVirtio9P *v9p = obj;
> @@ -631,6 +632,95 @@ static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
>      g_free(wnames[0]);
>  }
>  
> +/* readdir test where overall request is splitted over several messages */
> +static void fs_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc)
> +{
> +    QVirtio9P *v9p = obj;
> +    alloc = t_alloc;
> +    char *const wnames[] = { g_strdup(QTEST_V9FS_SYNTH_READDIR_DIR) };
> +    uint16_t nqid;
> +    v9fs_qid qid;
> +    uint32_t count, nentries, npartialentries;
> +    struct V9fsDirent *entries, *tail, *partialentries;
> +    P9Req *req;
> +    int subtest;
> +    int fid;
> +    uint64_t offset;
> +    /* the Treaddir 'count' parameter values to be tested */
> +    const uint32_t vcount[] = { 512, 256 };
> +    const int nvcount = sizeof(vcount) / sizeof(uint32_t);
> +
> +    fs_attach(v9p, NULL, t_alloc);
> +
> +    /* iterate over all 'count' parameter values to be tested with Treaddir */
> +    for (subtest = 0; subtest < nvcount; ++subtest) {
> +        fid = subtest + 1;
> +        offset = 0;
> +        entries = NULL;
> +        nentries = 0;
> +        tail = NULL;
> +
> +        req = v9fs_twalk(v9p, 0, fid, 1, wnames, 0);
> +        v9fs_req_wait_for_reply(req, NULL);
> +        v9fs_rwalk(req, &nqid, NULL);
> +        g_assert_cmpint(nqid, ==, 1);
> +
> +        req = v9fs_tlopen(v9p, fid, O_DIRECTORY, 0);
> +        v9fs_req_wait_for_reply(req, NULL);
> +        v9fs_rlopen(req, &qid, NULL);
> +
> +        /*
> +         * send as many Treaddir requests as required to get all directory
> +         * entries
> +         */
> +        while (true) {
> +            npartialentries = 0;
> +            partialentries = NULL;
> +
> +            req = v9fs_treaddir(v9p, fid, offset, vcount[subtest], 0);
> +            v9fs_req_wait_for_reply(req, NULL);
> +            v9fs_rreaddir(req, &count, &npartialentries, &partialentries);
> +            if (npartialentries > 0 && partialentries) {
> +                if (!entries) {
> +                    entries = partialentries;
> +                    nentries = npartialentries;
> +                    tail = partialentries;
> +                } else {
> +                    tail->next = partialentries;
> +                    nentries += npartialentries;
> +                }
> +                while (tail->next) {
> +                    tail = tail->next;
> +                }
> +                offset = tail->offset;
> +            } else {
> +                break;
> +            }
> +        }
> +
> +        g_assert_cmpint(
> +            nentries, ==,
> +            QTEST_V9FS_SYNTH_READDIR_NFILES + 2 /* "." and ".." */
> +        );
> +
> +        /*
> +         * Check all file names exist in returned entries, ignore their order
> +         * though.
> +         */
> +        g_assert_cmpint(fs_dirents_contain_name(entries, "."), ==, true);
> +        g_assert_cmpint(fs_dirents_contain_name(entries, ".."), ==, true);
> +        for (int i = 0; i < QTEST_V9FS_SYNTH_READDIR_NFILES; ++i) {
> +            char *name = g_strdup_printf(QTEST_V9FS_SYNTH_READDIR_FILE, i);
> +            g_assert_cmpint(fs_dirents_contain_name(entries, name), ==, true);
> +            g_free(name);
> +        }
> +
> +        v9fs_free_dirents(entries);
> +    }
> +
> +    g_free(wnames[0]);
> +}
> +
>  static void fs_walk_no_slash(void *obj, void *data, QGuestAllocator *t_alloc)
>  {
>      QVirtio9P *v9p = obj;
> @@ -810,6 +900,7 @@ static void register_virtio_9p_test(void)
>      qos_add_test("fs/flush/ignored", "virtio-9p", fs_flush_ignored,
>                   NULL);
>      qos_add_test("fs/readdir/basic", "virtio-9p", fs_readdir, NULL);
> +    qos_add_test("fs/readdir/split", "virtio-9p", fs_readdir_split, NULL);
>  }
>  
>  libqos_init(register_virtio_9p_test);



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 06/11] tests/virtio-9p: added splitted readdir test
  2020-01-22 21:14   ` Eric Blake
@ 2020-01-22 21:29     ` Christian Schoenebeck
  2020-01-23  6:59       ` Greg Kurz
  0 siblings, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-22 21:29 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, Greg Kurz

On Mittwoch, 22. Januar 2020 22:14:39 CET Eric Blake wrote:
> On 1/20/20 6:16 PM, Christian Schoenebeck wrote:
> > The previous, already existing readdir test simply used a 'count'
> > parameter big enough to retrieve all directory entries with a
> > single Treaddir request.
> > 
> > In this new 'splitted' readdir test, directory entries are
> 
> English is weird; the past tense of 'split' is 'split', not 'splitted'

Just an attempt to fix a language defect. You're right of course Eric.

Greg, should I resend that patch fixed or can you just awk it? Whatever is 
more comfortable to you.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 06/11] tests/virtio-9p: added splitted readdir test
  2020-01-22 21:19   ` Greg Kurz
@ 2020-01-22 22:36     ` Christian Schoenebeck
  2020-01-23 10:30       ` Greg Kurz
  0 siblings, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-22 22:36 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Mittwoch, 22. Januar 2020 22:19:05 CET Greg Kurz wrote:
> On Tue, 21 Jan 2020 01:16:21 +0100
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > The previous, already existing readdir test simply used a 'count'
> > parameter big enough to retrieve all directory entries with a
> > single Treaddir request.
> > 
> > In this new 'splitted' readdir test, directory entries are
> > retrieved, splitted over several Treaddir requests by picking small
> > 'count' parameters which force the server to truncate the response.
> > So the test client sends as many Treaddir requests as necessary to
> > get all directory entries. Currently this test covers actually two
> > tests: a sequence of Treaddir requests with count=512 and then a
> > subsequent test with a sequence of Treaddir requests with count=256.
> > 
> > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > ---
> 
> Not sure it is really needed to check with multiple values for 'count',
> but it doesn't eat too many cycles so I guess it doesn't hurt.

Yes, it is a cheap test, nobody will feel the difference. One could argue 
about the precise 'count' values being used ...

> 
> Applied as well.
> 
> >  tests/qtest/virtio-9p-test.c | 91 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 91 insertions(+)
> > 
> > diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
> > index 2167322985..8b0d94546e 100644
> > --- a/tests/qtest/virtio-9p-test.c
> > +++ b/tests/qtest/virtio-9p-test.c
> > @@ -578,6 +578,7 @@ static bool fs_dirents_contain_name(struct V9fsDirent
> > *e, const char* name)> 
> >      return false;
> >  
> >  }
> > 
> > +/* basic readdir test where reply fits into a single response message */
> > 
> >  static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
> >  {
> >  
> >      QVirtio9P *v9p = obj;
> > 
> > @@ -631,6 +632,95 @@ static void fs_readdir(void *obj, void *data,
> > QGuestAllocator *t_alloc)> 
> >      g_free(wnames[0]);
> >  
> >  }
> > 
> > +/* readdir test where overall request is splitted over several messages
> > */
> > +static void fs_readdir_split(void *obj, void *data, QGuestAllocator
> > *t_alloc) +{
> > +    QVirtio9P *v9p = obj;
> > +    alloc = t_alloc;
> > +    char *const wnames[] = { g_strdup(QTEST_V9FS_SYNTH_READDIR_DIR) };
> > +    uint16_t nqid;
> > +    v9fs_qid qid;
> > +    uint32_t count, nentries, npartialentries;
> > +    struct V9fsDirent *entries, *tail, *partialentries;
> > +    P9Req *req;
> > +    int subtest;
> > +    int fid;
> > +    uint64_t offset;
> > +    /* the Treaddir 'count' parameter values to be tested */
> > +    const uint32_t vcount[] = { 512, 256 };

... here. But IMO it does make sense preserving the function's overall 
structure to allow testing with different 'count' values if necessary. Because 
that way this test could e.g. guard potential bugs when server's Treaddir 
handler is rolling back (or not) the dir offset for instance, which server has 
to do (or not) according to this 'count' value and the precise file name 
length of the individual directory entries.

Whatever precise 'count' tests are desired, it would only mean a one line 
change here.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 07/11] tests/virtio-9p: failing splitted readdir test
  2020-01-21  0:17 ` [PATCH v4 07/11] tests/virtio-9p: failing " Christian Schoenebeck
@ 2020-01-22 22:59   ` Greg Kurz
  2020-01-23 11:36     ` Christian Schoenebeck
  0 siblings, 1 reply; 39+ messages in thread
From: Greg Kurz @ 2020-01-22 22:59 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Tue, 21 Jan 2020 01:17:35 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> This patch is not intended to be merged. It resembles
> an issue (with debug messages) where the splitted
> readdir test fails because server is interrupted with
> transport error "Failed to decode VirtFS request type 40",

Ok. When we send a new request, we call:

uint32_t qvirtqueue_add(QTestState *qts, QVirtQueue *vq, uint64_t data,
                        uint32_t len, bool write, bool next)
{
    uint16_t flags = 0;
    vq->num_free--;

[...]

    return vq->free_head++; /* Return and increase, in this order */
}

where vq->num_free is the number of available buffers (aka. requests) in
the vq and vq->free_head the index of the next available buffer. The size
of the vq of the virtio-9p device is MAX_REQ (128) buffers. The driver
is very simple and doesn't bother handling the scenario of a full vq,
ie, num_free == 0 and free_head is past the vq->desc[] array. It seems
that count=128 generates enough extra requests to reach the end of the
vq. Hence the "decode" error you get. Maybe an assert(vq->num_free) in
qvirtqueue_add() would make that clearer?

Not sure it is worth addressing this limitation though, especially since
count=128 isn't really a recommended choice in the first place. It is more
likely to cause a disconnect if the server needs to return a longer file
name (which is expected, since most filesystems allow file names of up to
255 characters).
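
Just to illustrate the idea (untested sketch, mirroring the function as
quoted above; the elided part stays unchanged):

uint32_t qvirtqueue_add(QTestState *qts, QVirtQueue *vq, uint64_t data,
                        uint32_t len, bool write, bool next)
{
    uint16_t flags = 0;

    /* catch a full vq instead of silently overrunning vq->desc[] */
    g_assert(vq->num_free);   /* or a plain assert(vq->num_free) */
    vq->num_free--;

[...]

    return vq->free_head++; /* Return and increase, in this order */
}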

> which BTW fails both with the unoptimized and with the
> optimized 9p readdir code.
> 

Yes, this is the client's fault.

> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---
>  tests/qtest/virtio-9p-test.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
> index 8b0d94546e..e47b286340 100644
> --- a/tests/qtest/virtio-9p-test.c
> +++ b/tests/qtest/virtio-9p-test.c
> @@ -647,13 +647,14 @@ static void fs_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc)
>      int fid;
>      uint64_t offset;
>      /* the Treaddir 'count' parameter values to be tested */
> -    const uint32_t vcount[] = { 512, 256 };
> +    const uint32_t vcount[] = { 512, 256, 128 };
>      const int nvcount = sizeof(vcount) / sizeof(uint32_t);
>  
>      fs_attach(v9p, NULL, t_alloc);
>  
>      /* iterate over all 'count' parameter values to be tested with Treaddir */
>      for (subtest = 0; subtest < nvcount; ++subtest) {
> +        printf("\nsubtest[%d] with count=%d\n", subtest, vcount[subtest]);
>          fid = subtest + 1;
>          offset = 0;
>          entries = NULL;
> @@ -674,12 +675,16 @@ static void fs_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc)
>           * entries
>           */
>          while (true) {
> +            printf("\toffset=%ld\n", offset);
>              npartialentries = 0;
>              partialentries = NULL;
>  
> +            printf("Treaddir fid=%d offset=%ld count=%d\n",
> +                   fid, offset, vcount[subtest]);
>              req = v9fs_treaddir(v9p, fid, offset, vcount[subtest], 0);
>              v9fs_req_wait_for_reply(req, NULL);
>              v9fs_rreaddir(req, &count, &npartialentries, &partialentries);
> +            printf("\t\tnpartial=%d nentries=%d\n", npartialentries, nentries);
>              if (npartialentries > 0 && partialentries) {
>                  if (!entries) {
>                      entries = partialentries;
> @@ -716,6 +721,8 @@ static void fs_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc)
>          }
>  
>          v9fs_free_dirents(entries);
> +
> +        printf("PASSED subtest[%d]\n", subtest);
>      }
>  
>      g_free(wnames[0]);



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 06/11] tests/virtio-9p: added splitted readdir test
  2020-01-22 21:29     ` Christian Schoenebeck
@ 2020-01-23  6:59       ` Greg Kurz
  0 siblings, 0 replies; 39+ messages in thread
From: Greg Kurz @ 2020-01-23  6:59 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Wed, 22 Jan 2020 22:29:12 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Mittwoch, 22. Januar 2020 22:14:39 CET Eric Blake wrote:
> > On 1/20/20 6:16 PM, Christian Schoenebeck wrote:
> > > The previous, already existing readdir test simply used a 'count'
> > > parameter big enough to retrieve all directory entries with a
> > > single Treaddir request.
> > > 
> > > In this new 'splitted' readdir test, directory entries are
> > 
> > English is weird; the past tense of 'split' is 'split', not 'splitted'
> 

Heh.. My irregular verb detector module is a bit tired. I should
replace it ;-)

> Just an attempt to fix a language defect. You're right of course Eric.
> 
> Greg, should I resend that patch fixed or can you just awk it? Whatever is 
> more comfortable to you.
> 

Yeah, I'll fix this in my tree.

> Best regards,
> Christian Schoenebeck
> 
> 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 06/11] tests/virtio-9p: added splitted readdir test
  2020-01-22 22:36     ` Christian Schoenebeck
@ 2020-01-23 10:30       ` Greg Kurz
  2020-01-23 13:07         ` Christian Schoenebeck
  0 siblings, 1 reply; 39+ messages in thread
From: Greg Kurz @ 2020-01-23 10:30 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Wed, 22 Jan 2020 23:36:22 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Mittwoch, 22. Januar 2020 22:19:05 CET Greg Kurz wrote:
> > On Tue, 21 Jan 2020 01:16:21 +0100
> > 
> > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > The previous, already existing readdir test simply used a 'count'
> > > parameter big enough to retrieve all directory entries with a
> > > single Treaddir request.
> > > 
> > > In this new 'splitted' readdir test, directory entries are
> > > retrieved, splitted over several Treaddir requests by picking small
> > > 'count' parameters which force the server to truncate the response.
> > > So the test client sends as many Treaddir requests as necessary to
> > > get all directory entries. Currently this test covers actually two
> > > tests: a sequence of Treaddir requests with count=512 and then a
> > > subsequent test with a sequence of Treaddir requests with count=256.
> > > 
> > > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > > ---
> > 
> > Not sure it is really needed to check with multiple values for 'count',
> > but it doesn't eat too many cycles so I guess it doesn't hurt.
> 
> Yes, it is a cheap test, nobody will feel the difference. One could argue 
> about the precise 'count' values being used ...
> 
> > 
> > Applied as well.
> > 
> > >  tests/qtest/virtio-9p-test.c | 91 ++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 91 insertions(+)
> > > 
> > > diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
> > > index 2167322985..8b0d94546e 100644
> > > --- a/tests/qtest/virtio-9p-test.c
> > > +++ b/tests/qtest/virtio-9p-test.c
> > > @@ -578,6 +578,7 @@ static bool fs_dirents_contain_name(struct V9fsDirent
> > > *e, const char* name)> 
> > >      return false;
> > >  
> > >  }
> > > 
> > > +/* basic readdir test where reply fits into a single response message */
> > > 
> > >  static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
> > >  {
> > >  
> > >      QVirtio9P *v9p = obj;
> > > 
> > > @@ -631,6 +632,95 @@ static void fs_readdir(void *obj, void *data,
> > > QGuestAllocator *t_alloc)> 
> > >      g_free(wnames[0]);
> > >  
> > >  }
> > > 
> > > +/* readdir test where overall request is splitted over several messages
> > > */
> > > +static void fs_readdir_split(void *obj, void *data, QGuestAllocator
> > > *t_alloc) +{
> > > +    QVirtio9P *v9p = obj;
> > > +    alloc = t_alloc;
> > > +    char *const wnames[] = { g_strdup(QTEST_V9FS_SYNTH_READDIR_DIR) };
> > > +    uint16_t nqid;
> > > +    v9fs_qid qid;
> > > +    uint32_t count, nentries, npartialentries;
> > > +    struct V9fsDirent *entries, *tail, *partialentries;
> > > +    P9Req *req;
> > > +    int subtest;
> > > +    int fid;
> > > +    uint64_t offset;
> > > +    /* the Treaddir 'count' parameter values to be tested */
> > > +    const uint32_t vcount[] = { 512, 256 };
> 
> ... here. But IMO it does make sense preserving the function's overall 
> structure to allow testing with different 'count' values if necessary. Because 
> that way this test could e.g. guard potential bugs when server's Treaddir 
> handler is rolling back (or not) the dir offset for instance, which server has 
> to do (or not) according to this 'count' value and the precise file name 
> length of the individual directory entries.
> 

I still agree it is useful to be able to check different values, but I
now realize that it shouldn't be done like this, because adding new
values to vcount[] doesn't scale well with the MAX_REQ limitation I
mentioned in another mail. More values, especially small ones, are
likely to trigger "Failed to decode VirtFS request type 40" at some
point.

I now think that fs_readdir_split() should rather take 'count' as
an argument and only do one run. By default we would only call
this with an appropriate value, ideally derived from the test
environment (number of files, size of file names, etc.).
If someone needs to test a specific value, it is as easy as adding
a new qos_add_test() line.

This would ensure at least that each run starts with the same
fixed number of possible requests, ie. MAX_REQ minus the cost of
fs_attach().
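
Roughly something like this (hypothetical names, just to sketch the idea):

static void do_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc,
                             uint32_t count)
{
    /*
     * body of the current fs_readdir_split(), but driven by the 'count'
     * argument instead of iterating over the vcount[] array
     */
}

static void fs_readdir_split_256(void *obj, void *data,
                                 QGuestAllocator *t_alloc)
{
    do_readdir_split(obj, data, t_alloc, 256);
}

/* and in register_virtio_9p_test(): */
qos_add_test("fs/readdir/split_256", "virtio-9p", fs_readdir_split_256, NULL);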

So I'll revert this patch for now.

> Whatever precise 'count' tests are desired, it would only mean a one line 
> change here.
> 
> Best regards,
> Christian Schoenebeck
> 
> 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 08/11] 9pfs: readdir benchmark
  2020-01-21  0:23 ` [PATCH v4 08/11] 9pfs: readdir benchmark Christian Schoenebeck
@ 2020-01-23 10:34   ` Greg Kurz
  2020-01-23 13:20     ` Christian Schoenebeck
  0 siblings, 1 reply; 39+ messages in thread
From: Greg Kurz @ 2020-01-23 10:34 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Tue, 21 Jan 2020 01:23:55 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> This patch is not intended to be merged. It just provides a

Well I like the idea of having such a benchmark available.
It could probably be merged after a few changes...

> temporary benchmark foundation for coneniently A/B comparison
> of the subsequent 9p readdir optimization patches:
> 
> * hw/9pfs/9p-synth: increase amount of simulated files for
>   readdir test to 2000 files.
> 

... the 9p-synth backend could maybe always create this number
of files?

> * tests/virtio-9p: measure wall time that elapsed between
>   sending T_readdir request and arrival of R_readdir response
>   and print out that measured duration, as well as amount of
>   directory entries received, and the amount of bytes of the
>   response message.
> 

... maybe we should make the printing optional and off by
default so that it doesn't pollute CI logs.
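
(e.g. the printf()s could be gated behind an environment variable, name
made up, something like:)

    /* only print benchmark details when explicitly asked for */
    if (g_getenv("QTEST_V9FS_READDIR_BENCH")) {
        printf("Time client spent for waiting for reply from server: %fs\n",
               waitforreply - start);
    }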

> * tests/virtio-9p: increased msize to 256kiB to allow
>   retrieving all 2000 files (simulated by 9pfs synth driver)
>   with only one T_readdir request.
> 

... always use 256k for both the basic test and a revisited
split test?

> Running this benchmark is fairly quick & simple and does not
> require any guest OS installation or other prerequisites:
> 
> cd build
> make && make tests/qtest/qos-test
> export QTEST_QEMU_BINARY=x86_64-softmmu/qemu-system-x86_64
> tests/qtest/qos-test -p $(tests/qtest/qos-test -l | grep readdir/basic)
> 
> Since this benchmark uses the 9pfs synth driver, the host
> machine's I/O hardware (SSDs/HDDs) is not relevant for the
> benchmark result, because the synth backend's readdir
> implementation returns immediately (without any blocking I/O
> that would incur with a real-life fs driver) and just returns
> already prepared, simulated directory entries directly from RAM.
> So this benchmark focuses on the efficiency of the 9pfs
> controller code (or top half) for readdir request handling.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---
>  hw/9pfs/9p-synth.h           |  2 +-
>  tests/qtest/virtio-9p-test.c | 37 +++++++++++++++++++++++++++++++++++-
>  2 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/9pfs/9p-synth.h b/hw/9pfs/9p-synth.h
> index 036d7e4a5b..7d6cedcdac 100644
> --- a/hw/9pfs/9p-synth.h
> +++ b/hw/9pfs/9p-synth.h
> @@ -58,7 +58,7 @@ int qemu_v9fs_synth_add_file(V9fsSynthNode *parent, int mode,
>  /* for READDIR test */
>  #define QTEST_V9FS_SYNTH_READDIR_DIR "ReadDirDir"
>  #define QTEST_V9FS_SYNTH_READDIR_FILE "ReadDirFile%d"
> -#define QTEST_V9FS_SYNTH_READDIR_NFILES 100
> +#define QTEST_V9FS_SYNTH_READDIR_NFILES 2000
>  
>  /* Any write to the "FLUSH" file is handled one byte at a time by the
>   * backend. If the byte is zero, the backend returns success (ie, 1),
> diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
> index e47b286340..d71b37aa6c 100644
> --- a/tests/qtest/virtio-9p-test.c
> +++ b/tests/qtest/virtio-9p-test.c
> @@ -15,6 +15,18 @@
>  #include "libqos/virtio-9p.h"
>  #include "libqos/qgraph.h"
>  
> +/*
> + * to benchmark the real time (not CPU time) that elapsed between start of
> + * a request and arrival of its response
> + */
> +static double wall_time(void)
> +{
> +    struct timeval t;
> +    struct timezone tz;
> +    gettimeofday(&t, &tz);
> +    return t.tv_sec + t.tv_usec * 0.000001;
> +}
> +
>  #define QVIRTIO_9P_TIMEOUT_US (10 * 1000 * 1000)
>  static QGuestAllocator *alloc;
>  
> @@ -36,7 +48,7 @@ static void pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
>      g_free(tag);
>  }
>  
> -#define P9_MAX_SIZE 4096 /* Max size of a T-message or R-message */
> +#define P9_MAX_SIZE (256 * 1024) /* Max size of a T-message or R-message */
>  
>  typedef struct {
>      QTestState *qts;
> @@ -600,12 +612,35 @@ static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
>      v9fs_req_wait_for_reply(req, NULL);
>      v9fs_rlopen(req, &qid, NULL);
>  
> +    const double start = wall_time();
> +
>      /*
>       * submit count = msize - 11, because 11 is the header size of Rreaddir
>       */
>      req = v9fs_treaddir(v9p, 1, 0, P9_MAX_SIZE - 11, 0);
> +    const double treaddir = wall_time();
>      v9fs_req_wait_for_reply(req, NULL);
> +    const double waitforreply = wall_time();
>      v9fs_rreaddir(req, &count, &nentries, &entries);
> +    const double end = wall_time();
> +
> +    printf("\nTime client spent on sending T_readdir: %fs\n\n",
> +           treaddir - start);
> +
> +    printf("Time client spent for waiting for reply from server: %fs "
> +           "[MOST IMPORTANT]\n", waitforreply - start);
> +    printf("(This is the most important value, because it reflects the time\n"
> +           "the 9p server required to process and return the result of the\n"
> +           "T_readdir request.)\n\n");
> +
> +    printf("Total client time: %fs\n", end - start);
> +    printf("(NOTE: this time is not relevant; this huge time comes from\n"
> +           "inefficient qtest_memread() calls. So you can discard this\n"
> +           "value as a problem of this test client implementation while\n"
> +           "processing the received server T_readdir reply.)\n\n");
> +
> +    printf("Details of response message data: R_readddir nentries=%d "
> +           "rbytes=%d\n", nentries, count);
>  
>      /*
>       * Assuming msize (P9_MAX_SIZE) is large enough so we can retrieve all



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 09/11] hw/9pfs/9p-synth: avoid n-square issue in synth_readdir()
  2020-01-21  0:26 ` [PATCH v4 09/11] hw/9pfs/9p-synth: avoid n-square issue in synth_readdir() Christian Schoenebeck
@ 2020-01-23 11:13   ` Greg Kurz
  2020-01-23 12:40     ` Christian Schoenebeck
  0 siblings, 1 reply; 39+ messages in thread
From: Greg Kurz @ 2020-01-23 11:13 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Tue, 21 Jan 2020 01:26:15 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> This patch is just a temporary benchmark hack, not intended
> to be merged!
> 
> 9pfs synth driver's readdir() implementation has a severe
> n-square performance problem. This patch is a quick and dirty
> hack to prevent that performance problem from tainting the
> readdir() benchmark results. In its current form, this patch
> is not useful for anything else than for an isolated readdir
> benchmark.
> 
> NOTE: This patch would break the new readdir/split test,
> because it would alter the behaviour of seekdir() required
> for retrieving directory entries splitted over several
> requests.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---

Honestly it doesn't seem to change anything significant for me.
Mean time calculated over 100 runs:

Without this patch:

[greg@bahia qemu-9p]$ (cd .mbuild-$(stg branch)/obj ; export QTEST_QEMU_BINARY='x86_64-softmmu/qemu-system-x86_64'; make all tests/qtest/qos-test && for i in {1..100}; do tests/qtest/qos-test -p $(tests/qtest/qos-test -l | grep readdir/basic); done) |& awk '/IMPORTANT/ { print $10 }' | sed -e 's/s//' -e 's/^/n+=1;x+=/;$ascale=6;x/n' | bc
.055654

With this patch:

[greg@bahia qemu-9p]$ (cd .mbuild-$(stg branch)/obj ; export QTEST_QEMU_BINARY='x86_64-softmmu/qemu-system-x86_64'; make all tests/qtest/qos-test && for i in {1..100}; do tests/qtest/qos-test -p $(tests/qtest/qos-test -l | grep readdir/basic); done) |& awk '/IMPORTANT/ { print $10 }' | sed -e 's/s//' -e 's/^/n+=1;x+=/;$ascale=6;x/n' | bc
.058786

>  hw/9pfs/9p-synth.c | 29 ++++++++++++++++++++++++++---
>  1 file changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/9pfs/9p-synth.c b/hw/9pfs/9p-synth.c
> index 7eb210ffa8..54dc30f37b 100644
> --- a/hw/9pfs/9p-synth.c
> +++ b/hw/9pfs/9p-synth.c
> @@ -225,7 +225,8 @@ static void synth_direntry(V9fsSynthNode *node,
>  }
>  
>  static struct dirent *synth_get_dentry(V9fsSynthNode *dir,
> -                                            struct dirent *entry, off_t off)
> +                                       struct dirent *entry, off_t off,
> +                                       V9fsSynthNode **hack)
>  {
>      int i = 0;
>      V9fsSynthNode *node;
> @@ -243,16 +244,38 @@ static struct dirent *synth_get_dentry(V9fsSynthNode *dir,
>          /* end of directory */
>          return NULL;
>      }
> +    *hack = node;
>      synth_direntry(node, entry, off);
>      return entry;
>  }
>  
>  static struct dirent *synth_readdir(FsContext *ctx, V9fsFidOpenState *fs)
>  {
> -    struct dirent *entry;
> +    struct dirent *entry = NULL;
>      V9fsSynthOpenState *synth_open = fs->private;
>      V9fsSynthNode *node = synth_open->node;
> -    entry = synth_get_dentry(node, &synth_open->dent, synth_open->offset);
> +
> +    /*
> +     * HACK: This is just intended for benchmark, to avoid severe n-square
> +     * performance problem of synth driver's readdir implementation here which
> +     * would otherwise unncessarily taint the benchmark results. By simply
> +     * caching (globally) the previous node (of the previous synth_readdir()
> +     * call) we can simply proceed to next node in chained list efficiently.
> +     *
> +     * not a good idea for any production code ;-)
> +     */
> +    static struct V9fsSynthNode *cachedNode;
> +
> +    if (!cachedNode) {
> +        entry = synth_get_dentry(node, &synth_open->dent, synth_open->offset,
> +                                 &cachedNode);
> +    } else {
> +        cachedNode = cachedNode->sibling.le_next;
> +        if (cachedNode) {
> +            entry = &synth_open->dent;
> +            synth_direntry(cachedNode, entry, synth_open->offset + 1);
> +        }
> +    }
>      if (entry) {
>          synth_open->offset++;
>      }



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 10/11] 9pfs: T_readdir latency optimization
  2020-01-21  0:30 ` [PATCH v4 10/11] 9pfs: T_readdir latency optimization Christian Schoenebeck
@ 2020-01-23 11:33   ` Greg Kurz
  2020-01-23 12:57     ` Christian Schoenebeck
  2020-03-09 14:09   ` Christian Schoenebeck
  1 sibling, 1 reply; 39+ messages in thread
From: Greg Kurz @ 2020-01-23 11:33 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Tue, 21 Jan 2020 01:30:10 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> Make top half really top half and bottom half really bottom half:
> 
> Each T_readdir request handling is hopping between threads (main
> I/O thread and background I/O driver threads) several times for
> every individual directory entry, which sums up to huge latencies
> for handling just a single T_readdir request.
> 
> Instead of doing that, collect now all required directory entries
> (including all potentially required stat buffers for each entry) in
> one rush on a background I/O thread from fs driver, then assemble
> the entire resulting network response message for the readdir
> request on main I/O thread. The fs driver is still aborting the
> directory entry retrieval loop (on the background I/O thread) as
> soon as it would exceed the client's requested maximum R_readdir
> response size. So we should not have any performance penalty by
> doing this.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---

Ok so this is it. I haven't reviewed this huge patch yet but I could at
least give it a try. The gain is impressive indeed:

[greg@bahia qemu-9p]$ (cd .mbuild-$(stg branch)/obj ; export QTEST_QEMU_BINARY='x86_64-softmmu/qemu-system-x86_64'; make all tests/qtest/qos-test && for i in {1..100}; do tests/qtest/qos-test -p $(tests/qtest/qos-test -l | grep readdir/basic); done) |& awk '/IMPORTANT/ { print $10 }' | sed -e 's/s//' -e 's/^/n+=1;x+=/;$ascale=6;x/n' | bc
.009806

instead of .055654, i.e. nearly 6 times faster! This sounds promising :)

Now I need to find time to do a decent review... :-\

>  hw/9pfs/9p.c    | 124 +++++++++++++++-----------------
>  hw/9pfs/9p.h    |  23 ++++++
>  hw/9pfs/codir.c | 183 +++++++++++++++++++++++++++++++++++++++++++++---
>  hw/9pfs/coth.h  |   3 +
>  4 files changed, 254 insertions(+), 79 deletions(-)
> 
> diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> index 18370183c4..e0ca45d46b 100644
> --- a/hw/9pfs/9p.c
> +++ b/hw/9pfs/9p.c
> @@ -971,30 +971,6 @@ static int coroutine_fn fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp,
>      return 0;
>  }
>  
> -static int coroutine_fn dirent_to_qid(V9fsPDU *pdu, V9fsFidState *fidp,
> -                                      struct dirent *dent, V9fsQID *qidp)
> -{
> -    struct stat stbuf;
> -    V9fsPath path;
> -    int err;
> -
> -    v9fs_path_init(&path);
> -
> -    err = v9fs_co_name_to_path(pdu, &fidp->path, dent->d_name, &path);
> -    if (err < 0) {
> -        goto out;
> -    }
> -    err = v9fs_co_lstat(pdu, &path, &stbuf);
> -    if (err < 0) {
> -        goto out;
> -    }
> -    err = stat_to_qid(pdu, &stbuf, qidp);
> -
> -out:
> -    v9fs_path_free(&path);
> -    return err;
> -}
> -
>  V9fsPDU *pdu_alloc(V9fsState *s)
>  {
>      V9fsPDU *pdu = NULL;
> @@ -2314,7 +2290,7 @@ out_nofid:
>      pdu_complete(pdu, err);
>  }
>  
> -static size_t v9fs_readdir_data_size(V9fsString *name)
> +size_t v9fs_readdir_response_size(V9fsString *name)
>  {
>      /*
>       * Size of each dirent on the wire: size of qid (13) + size of offset (8)
> @@ -2323,6 +2299,18 @@ static size_t v9fs_readdir_data_size(V9fsString *name)
>      return 24 + v9fs_string_size(name);
>  }
>  
> +static void v9fs_free_dirents(struct V9fsDirEnt *e)
> +{
> +    struct V9fsDirEnt *next = NULL;
> +
> +    for (; e; e = next) {
> +        next = e->next;
> +        g_free(e->dent);
> +        g_free(e->st);
> +        g_free(e);
> +    }
> +}
> +
>  static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
>                                          int32_t max_count)
>  {
> @@ -2331,54 +2319,53 @@ static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
>      V9fsString name;
>      int len, err = 0;
>      int32_t count = 0;
> -    off_t saved_dir_pos;
>      struct dirent *dent;
> +    struct stat *st;
> +    struct V9fsDirEnt *entries = NULL;
>  
> -    /* save the directory position */
> -    saved_dir_pos = v9fs_co_telldir(pdu, fidp);
> -    if (saved_dir_pos < 0) {
> -        return saved_dir_pos;
> -    }
> -
> -    while (1) {
> -        v9fs_readdir_lock(&fidp->fs.dir);
> +    /*
> +     * inode remapping requires the device id, which in turn might be
> +     * different for different directory entries, so if inode remapping is
> +     * enabled we have to make a full stat for each directory entry
> +     */
> +    const bool dostat = pdu->s->ctx.export_flags & V9FS_REMAP_INODES;
>  
> -        err = v9fs_co_readdir(pdu, fidp, &dent);
> -        if (err || !dent) {
> -            break;
> -        }
> -        v9fs_string_init(&name);
> -        v9fs_string_sprintf(&name, "%s", dent->d_name);
> -        if ((count + v9fs_readdir_data_size(&name)) > max_count) {
> -            v9fs_readdir_unlock(&fidp->fs.dir);
> +    /*
> +     * Fetch all required directory entries altogether on a background IO
> +     * thread from fs driver. We don't want to do that for each entry
> +     * individually, because hopping between threads (this main IO thread
> +     * and background IO driver thread) would sum up to huge latencies.
> +     */
> +    count = v9fs_co_readdir_lowlat(pdu, fidp, &entries, max_count, dostat);
> +    if (count < 0) {
> +        err = count;
> +        count = 0;
> +        goto out;
> +    }
> +    count = 0;
>  
> -            /* Ran out of buffer. Set dir back to old position and return */
> -            v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
> -            v9fs_string_free(&name);
> -            return count;
> -        }
> +    for (struct V9fsDirEnt *e = entries; e; e = e->next) {
> +        dent = e->dent;
>  
>          if (pdu->s->ctx.export_flags & V9FS_REMAP_INODES) {
> -            /*
> -             * dirent_to_qid() implies expensive stat call for each entry,
> -             * we must do that here though since inode remapping requires
> -             * the device id, which in turn might be different for
> -             * different entries; we cannot make any assumption to avoid
> -             * that here.
> -             */
> -            err = dirent_to_qid(pdu, fidp, dent, &qid);
> +            st = e->st;
> +            /* e->st should never be NULL, but just to be sure */
> +            if (!st) {
> +                err = -1;
> +                break;
> +            }
> +
> +            /* remap inode */
> +            err = stat_to_qid(pdu, st, &qid);
>              if (err < 0) {
> -                v9fs_readdir_unlock(&fidp->fs.dir);
> -                v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
> -                v9fs_string_free(&name);
> -                return err;
> +                break;
>              }
>          } else {
>              /*
>               * Fill up just the path field of qid because the client uses
>               * only that. To fill the entire qid structure we will have
>               * to stat each dirent found, which is expensive. For the
> -             * latter reason we don't call dirent_to_qid() here. Only drawback
> +             * latter reason we don't call stat_to_qid() here. Only drawback
>               * is that no multi-device export detection of stat_to_qid()
>               * would be done and provided as error to the user here. But
>               * user would get that error anyway when accessing those
> @@ -2391,25 +2378,26 @@ static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
>              qid.version = 0;
>          }
>  
> +        v9fs_string_init(&name);
> +        v9fs_string_sprintf(&name, "%s", dent->d_name);
> +
>          /* 11 = 7 + 4 (7 = start offset, 4 = space for storing count) */
>          len = pdu_marshal(pdu, 11 + count, "Qqbs",
>                            &qid, dent->d_off,
>                            dent->d_type, &name);
>  
> -        v9fs_readdir_unlock(&fidp->fs.dir);
> +        v9fs_string_free(&name);
>  
>          if (len < 0) {
> -            v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
> -            v9fs_string_free(&name);
> -            return len;
> +            err = len;
> +            break;
>          }
> +
>          count += len;
> -        v9fs_string_free(&name);
> -        saved_dir_pos = dent->d_off;
>      }
>  
> -    v9fs_readdir_unlock(&fidp->fs.dir);
> -
> +out:
> +    v9fs_free_dirents(entries);
>      if (err < 0) {
>          return err;
>      }
> diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
> index 6fffe44f5a..1dbb0ad189 100644
> --- a/hw/9pfs/9p.h
> +++ b/hw/9pfs/9p.h
> @@ -215,6 +215,28 @@ static inline void v9fs_readdir_init(V9fsDir *dir)
>      qemu_mutex_init(&dir->readdir_mutex);
>  }
>  
> +/*
> + * Type for 9p fs drivers' (a.k.a. 9p backends) result of readdir requests,
> + * which is a chained list of directory entries.
> + */
> +typedef struct V9fsDirEnt {
> +    /* mandatory (must not be NULL) information for all readdir requests */
> +    struct dirent *dent;
> +    /*
> +     * optional (may be NULL): A full stat of each directory entry is just
> +     * done if explicitly told to fs driver.
> +     */
> +    struct stat *st;
> +    /*
> +     * instead of an array, directory entries are always returned as
> +     * chained list, that's because the amount of entries retrieved by fs
> +     * drivers is dependent on the individual entries' name (since response
> +     * messages are size limited), so the final amount cannot be estimated
> +     * before hand
> +     */
> +    struct V9fsDirEnt *next;
> +} V9fsDirEnt;
> +
>  /*
>   * Filled by fs driver on open and other
>   * calls.
> @@ -419,6 +441,7 @@ void v9fs_path_init(V9fsPath *path);
>  void v9fs_path_free(V9fsPath *path);
>  void v9fs_path_sprintf(V9fsPath *path, const char *fmt, ...);
>  void v9fs_path_copy(V9fsPath *dst, const V9fsPath *src);
> +size_t v9fs_readdir_response_size(V9fsString *name);
>  int v9fs_name_to_path(V9fsState *s, V9fsPath *dirpath,
>                        const char *name, V9fsPath *path);
>  int v9fs_device_realize_common(V9fsState *s, const V9fsTransport *t,
> diff --git a/hw/9pfs/codir.c b/hw/9pfs/codir.c
> index 73f9a751e1..6ce7dc8cde 100644
> --- a/hw/9pfs/codir.c
> +++ b/hw/9pfs/codir.c
> @@ -18,28 +18,189 @@
>  #include "qemu/main-loop.h"
>  #include "coth.h"
>  
> +/*
> + * This is solely executed on a background IO thread.
> + */
> +static int do_readdir(V9fsPDU *pdu, V9fsFidState *fidp, struct dirent **dent)
> +{
> +    int err = 0;
> +    V9fsState *s = pdu->s;
> +    struct dirent *entry;
> +
> +    errno = 0;
> +    entry = s->ops->readdir(&s->ctx, &fidp->fs);
> +    if (!entry && errno) {
> +        *dent = NULL;
> +        err = -errno;
> +    } else {
> +        *dent = entry;
> +    }
> +    return err;
> +}
> +
> +/*
> + * TODO: This will be removed for performance reasons.
> + * Use v9fs_co_readdir_lowlat() instead.
> + */
>  int coroutine_fn v9fs_co_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
>                                   struct dirent **dent)
>  {
>      int err;
> -    V9fsState *s = pdu->s;
>  
>      if (v9fs_request_cancelled(pdu)) {
>          return -EINTR;
>      }
> -    v9fs_co_run_in_worker(
> -        {
> -            struct dirent *entry;
> +    v9fs_co_run_in_worker({
> +        err = do_readdir(pdu, fidp, dent);
> +    });
> +    return err;
> +}
> +
> +/*
> + * This is solely executed on a background IO thread.
> + *
> + * See v9fs_co_readdir_lowlat() (as its only user) below for details.
> + */
> +static int do_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
> +                             struct V9fsDirEnt **entries,
> +                             int32_t maxsize, bool dostat)
> +{
> +    V9fsState *s = pdu->s;
> +    V9fsString name;
> +    int len, err = 0;
> +    int32_t size = 0;
> +    off_t saved_dir_pos;
> +    struct dirent *dent;
> +    struct V9fsDirEnt *e = NULL;
> +    V9fsPath path;
> +    struct stat stbuf;
> +
> +    *entries = NULL;
> +    v9fs_path_init(&path);
> +
> +    /*
> +     * TODO: Here should be a warn_report_once() if lock failed.
> +     *
> +     * With a good 9p client we should not get into concurrency here,
> +     * because a good client would not use the same fid for concurrent
> +     * requests. We do the lock here for safety reasons though. However
> +     * the client would then suffer performance issues, so better log that
> +     * issue here.
> +     */
> +    v9fs_readdir_lock(&fidp->fs.dir);
> +
> +    /* save the directory position */
> +    saved_dir_pos = s->ops->telldir(&s->ctx, &fidp->fs);
> +    if (saved_dir_pos < 0) {
> +        err = saved_dir_pos;
> +        goto out;
> +    }
> +
> +    while (true) {
> +        /* get directory entry from fs driver */
> +        err = do_readdir(pdu, fidp, &dent);
> +        if (err || !dent) {
> +            break;
> +        }
> +
> +        /*
> +         * stop this loop as soon as it would exceed the allowed maximum
> +         * response message size for the directory entries collected so far,
> +         * because anything beyond that size would need to be discarded by
> +         * 9p controller (main thread / top half) anyway
> +         */
> +        v9fs_string_init(&name);
> +        v9fs_string_sprintf(&name, "%s", dent->d_name);
> +        len = v9fs_readdir_response_size(&name);
> +        v9fs_string_free(&name);
> +        if (size + len > maxsize) {
> +            /* this is not an error case actually */
> +            break;
> +        }
> +
> +        /* append next node to result chain */
> +        if (!e) {
> +            *entries = e = g_malloc0(sizeof(V9fsDirEnt));
> +        } else {
> +            e = e->next = g_malloc0(sizeof(V9fsDirEnt));
> +        }
> +        e->dent = g_malloc0(sizeof(struct dirent));
> +        memcpy(e->dent, dent, sizeof(struct dirent));
>  
> -            errno = 0;
> -            entry = s->ops->readdir(&s->ctx, &fidp->fs);
> -            if (!entry && errno) {
> +        /* perform a full stat() for directory entry if requested by caller */
> +        if (dostat) {
> +            err = s->ops->name_to_path(
> +                &s->ctx, &fidp->path, dent->d_name, &path
> +            );
> +            if (err < 0) {
>                  err = -errno;
> -            } else {
> -                *dent = entry;
> -                err = 0;
> +                break;
>              }
> -        });
> +
> +            err = s->ops->lstat(&s->ctx, &path, &stbuf);
> +            if (err < 0) {
> +                err = -errno;
> +                break;
> +            }
> +
> +            e->st = g_malloc0(sizeof(struct stat));
> +            memcpy(e->st, &stbuf, sizeof(struct stat));
> +        }
> +
> +        size += len;
> +        saved_dir_pos = dent->d_off;
> +    }
> +
> +    /* restore (last) saved position */
> +    s->ops->seekdir(&s->ctx, &fidp->fs, saved_dir_pos);
> +
> +out:
> +    v9fs_readdir_unlock(&fidp->fs.dir);
> +    v9fs_path_free(&path);
> +    if (err < 0) {
> +        return err;
> +    }
> +    return size;
> +}
> +
> +/**
> + * @brief Low latency variant of fs driver readdir handling.
> + *
> + * Retrieves the requested (max. amount of) directory entries from the fs
> + * driver. This function must only be called by the main IO thread (top half).
> + * Internally this function call will be dispatched to a background IO thread
> + * (bottom half) where it is eventually executed by the fs driver.
> + *
> + * The old readdir implementation above just retrieves always one dir entry
> + * per call. The problem of that implementation above is that latency is
> + * added for (retrieval of) each directory entry, which in practice lead to
> + * latencies of several hundred ms for readdir of only one directory.
> + *
> + * This is avoided in this function by letting the fs driver retrieve all
> + * required directory entries with only call of this function and hence with
> + * only a single fs driver request.
> + *
> + * @param pdu - the causing 9p (T_readdir) client request
> + * @param fidp - already opened directory where readdir shall be performed on
> + * @param entries - output for directory entries (must not be NULL)
> + * @param maxsize - maximum result message body size (in bytes)
> + * @param dostat - whether a stat() should be performed and returned for
> + *                 each directory entry
> + * @returns resulting response message body size (in bytes) on success,
> + *          negative error code otherwise
> + */
> +int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
> +                                        struct V9fsDirEnt **entries,
> +                                        int32_t maxsize, bool dostat)
> +{
> +    int err = 0;
> +
> +    if (v9fs_request_cancelled(pdu)) {
> +        return -EINTR;
> +    }
> +    v9fs_co_run_in_worker({
> +        err = do_readdir_lowlat(pdu, fidp, entries, maxsize, dostat);
> +    });
>      return err;
>  }
>  
> diff --git a/hw/9pfs/coth.h b/hw/9pfs/coth.h
> index c2cdc7a9ea..1249dbe6df 100644
> --- a/hw/9pfs/coth.h
> +++ b/hw/9pfs/coth.h
> @@ -49,6 +49,9 @@
>  void co_run_in_worker_bh(void *);
>  int coroutine_fn v9fs_co_readlink(V9fsPDU *, V9fsPath *, V9fsString *);
>  int coroutine_fn v9fs_co_readdir(V9fsPDU *, V9fsFidState *, struct dirent **);
> +int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *, V9fsFidState *,
> +                                        struct V9fsDirEnt **,
> +                                        int32_t, bool);
>  off_t coroutine_fn v9fs_co_telldir(V9fsPDU *, V9fsFidState *);
>  void coroutine_fn v9fs_co_seekdir(V9fsPDU *, V9fsFidState *, off_t);
>  void coroutine_fn v9fs_co_rewinddir(V9fsPDU *, V9fsFidState *);



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 07/11] tests/virtio-9p: failing splitted readdir test
  2020-01-22 22:59   ` Greg Kurz
@ 2020-01-23 11:36     ` Christian Schoenebeck
  2020-01-23 12:08       ` Greg Kurz
  0 siblings, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-23 11:36 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Mittwoch, 22. Januar 2020 23:59:54 CET Greg Kurz wrote:
> On Tue, 21 Jan 2020 01:17:35 +0100
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > This patch is not intended to be merged. It resembles
> > an issue (with debug messages) where the splitted
> > readdir test fails because server is interrupted with
> > transport error "Failed to decode VirtFS request type 40",
> 
> Ok. When we send a new request, we call:
> 
> uint32_t qvirtqueue_add(QTestState *qts, QVirtQueue *vq, uint64_t data,
>                         uint32_t len, bool write, bool next)
> {
>     uint16_t flags = 0;
>     vq->num_free--;
> 
> [...]
> 
>     return vq->free_head++; /* Return and increase, in this order */
> }

Ah, I see!

> where vq->num_free is the number of available buffers (aka. requests) in
> the vq and vq->free_head the index of the next available buffer. The size
> of the vq of the virtio-9p device is MAX_REQ (128) buffers. The driver
> is very simple and doesn't care to handle the scenario of a full vq,
> ie, num_free == 0 and free_head is past the vq->desc[] array. It seems
> that count=128 generates enough extra requests to reach the end of the
> vq. Hence the "decode" error you get. Maybe an assert(vq->num_free) in
> qvirtqueue_add() would make that more clear ?
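Just to be sure I understand the suggestion, a rough sketch of where such a
guard would go (the assertion macro and its exact placement are my assumption,
based on the qvirtqueue_add() excerpt quoted above):

uint32_t qvirtqueue_add(QTestState *qts, QVirtQueue *vq, uint64_t data,
                        uint32_t len, bool write, bool next)
{
    uint16_t flags = 0;

    /* fail loudly once the vq is exhausted, instead of silently running
     * past the end of vq->desc[] */
    g_assert_cmpint(vq->num_free, >, 0);

    vq->num_free--;

[...]

    return vq->free_head++; /* Return and increase, in this order */
}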

So just so I get this right: currently the 9pfs test suite writes to a
ringbuffer with every request (decreasing the free space in the ringbuffer),
but it never frees up that space in the ringbuffer?

> Not sure it is worth to address this limitation though. Especially since
> count=128 isn't really a recommended choice in the first place. 

Well, if that's what happens with the ringbuffer, it would need to be 
addressed somehow anyway, otherwise it would be impossible to add more 9pfs 
tests, since they would hit the ringbuffer limit as well at a certain point, 
no matter how simple the requests are.

Wouldn't it make sense to reset the ringbuffer after every successful,
individual 9pfs test?

> It has
> more chances to cause a disconnect if the server needs to return a longer
> file name (which is expected since most fs have 255 character long file
> names).

Well, this test depends on what exactly the synth driver provides anyway. So I
don't see the value 128 as a problem here. The readdir/split test could even
determine the maximum file name length provided by the synth driver if you are
concerned about that, because the file name template macro
QTEST_V9FS_SYNTH_READDIR_FILE used by the synth driver is public.
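
For illustration, roughly like this (just a sketch, not part of any patch; it
only relies on the two public macros from 9p-synth.h):

/* e.g. somewhere inside the readdir/split test: determine the longest
 * file name the synth driver will ever produce for this test */
char *longest = g_strdup_printf(QTEST_V9FS_SYNTH_READDIR_FILE,
                                QTEST_V9FS_SYNTH_READDIR_NFILES - 1);
size_t max_name_len = strlen(longest);
g_free(longest);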

And BTW it is not really this specific 'count' value (128) that triggers this
issue; if you just run the readdir/split test with e.g.:

	const uint32_t vcount[] = { 128 };

then you won't trigger this test environment issue.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 07/11] tests/virtio-9p: failing splitted readdir test
  2020-01-23 11:36     ` Christian Schoenebeck
@ 2020-01-23 12:08       ` Greg Kurz
  0 siblings, 0 replies; 39+ messages in thread
From: Greg Kurz @ 2020-01-23 12:08 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Thu, 23 Jan 2020 12:36:12 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Mittwoch, 22. Januar 2020 23:59:54 CET Greg Kurz wrote:
> > On Tue, 21 Jan 2020 01:17:35 +0100
> > 
> > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > This patch is not intended to be merged. It resembles
> > > an issue (with debug messages) where the splitted
> > > readdir test fails because server is interrupted with
> > > transport error "Failed to decode VirtFS request type 40",
> > 
> > Ok. When we send a new request, we call:
> > 
> > uint32_t qvirtqueue_add(QTestState *qts, QVirtQueue *vq, uint64_t data,
> >                         uint32_t len, bool write, bool next)
> > {
> >     uint16_t flags = 0;
> >     vq->num_free--;
> > 
> > [...]
> > 
> >     return vq->free_head++; /* Return and increase, in this order */
> > }
> 
> Ah, I see!
> 
> > where vq->num_free is the number of available buffers (aka. requests) in
> > the vq and vq->free_head the index of the next available buffer. The size
> > of the vq of the virtio-9p device is MAX_REQ (128) buffers. The driver
> > is very simple and doesn't care to handle the scenario of a full vq,
> > ie, num_free == 0 and free_head is past the vq->desc[] array. It seems
> > that count=128 generates enough extra requests to reach the end of the
> > vq. Hence the "decode" error you get. Maybe an assert(vq->num_free) in
> > qvirtqueue_add() would make that more clear ?
> 
> So just so I get this right: currently the 9pfs test suite writes to a
> ringbuffer with every request (decreasing the free space in the ringbuffer),
> but it never frees up that space in the ringbuffer?
> 

Correct.

> > Not sure it is worth to address this limitation though. Especially since
> > count=128 isn't really a recommended choice in the first place. 
> 
> Well, if that's what happens with the ringbuffer, it would need to be 
> addressed somehow anyway, otherwise it would be impossible to add more 9pfs 
> tests, since they would hit the ringbuffer limit as well at a certain point, 
> no matter how simple the requests are.
> 

This just means that a single test shouldn't generate more than
128 requests. I guess this is enough for a variety of tests.

> Wouldn't it make sense to reset the ringbuffer after every successful,
> individual 9pfs test?
> 

This is the case, hence my suggestion to pass count to fs_readdir_split()
instead of having a vcount[] array.
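
Schematically something like this (just a sketch; the helper and test names
are placeholders and the body is only indicated):

static void do_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc,
                             uint32_t count)
{
    /* ... body of the current fs_readdir_split(), but issuing its sequence
     *     of Treaddir requests with the given 'count' instead of looping
     *     over the vcount[] array ... */
}

static void fs_readdir_split_512(void *obj, void *data,
                                 QGuestAllocator *t_alloc)
{
    do_readdir_split(obj, data, t_alloc, 512);
}

/* and, in the test registration function, one line per 'count' value of
 * interest (registration shown schematically): */
qos_add_test("fs/readdir/split_512", "virtio-9p", fs_readdir_split_512, NULL);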

> > It has
> > more chances to cause a disconnect if the server needs to return a longer
> > file name (which is expected since most fs have 255 character long file
> > names).
> 
> Well, this test depends on what exactly the synth driver provides anyway. So I
> don't see the value 128 as a problem here. The readdir/split test could even
> determine the maximum file name length provided by the synth driver if you are
> concerned about that, because the file name template macro
> QTEST_V9FS_SYNTH_READDIR_FILE used by the synth driver is public.
> 

It would make sense to use this knowledge and come up with
a _good_ default value for 'count'.

> And BTW it is not really this specific 'count' value (128) that triggers this
> issue; if you just run the readdir/split test with e.g.:
> 
> 	const uint32_t vcount[] = { 128 };
> 
> then you won't trigger this test environment issue.
> 

I mean that I don't really care to check small values because
they're likely never used by real clients, and we already know
what we might get in the end: the server disconnects.

> Best regards,
> Christian Schoenebeck
> 
> 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 09/11] hw/9pfs/9p-synth: avoid n-square issue in synth_readdir()
  2020-01-23 11:13   ` Greg Kurz
@ 2020-01-23 12:40     ` Christian Schoenebeck
  0 siblings, 0 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-23 12:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Donnerstag, 23. Januar 2020 12:13:51 CET Greg Kurz wrote:
> Honestly it doesn't seem to change anything significant for me.
> Mean time calculated over 100 runs:
> 
> Without this patch:
> 
> [greg@bahia qemu-9p]$ (cd .mbuild-$(stg branch)/obj ; export
> QTEST_QEMU_BINARY='x86_64-softmmu/qemu-system-x86_64'; make all
> tests/qtest/qos-test && for i in {1..100}; do tests/qtest/qos-test -p
> $(tests/qtest/qos-test -l | grep readdir/basic); done) |& awk '/IMPORTANT/
> { print $10 }' | sed -e 's/s//' -e 's/^/n+=1;x+=/;$ascale=6;x/n' | bc
> .055654
> 
> With this patch:
> 
> [greg@bahia qemu-9p]$ (cd .mbuild-$(stg branch)/obj ; export
> QTEST_QEMU_BINARY='x86_64-softmmu/qemu-system-x86_64'; make all
> tests/qtest/qos-test && for i in {1..100}; do tests/qtest/qos-test -p
> $(tests/qtest/qos-test -l | grep readdir/basic); done) |& awk '/IMPORTANT/
> { print $10 }' | sed -e 's/s//' -e 's/^/n+=1;x+=/;$ascale=6;x/n' | bc
> .058786

:))) Mhmm, that's because you have run this test WITHOUT the actual readdir
optimization patch. In this scenario the readdir latency issue is so bad that
the driver's n-square issue does not even matter, so it is the same here:

Unoptimized readdir, no n-square correction hack:
Time client spent for waiting for reply from server: 0.082831s [MOST 
IMPORTANT]

Unoptimized readdir, with n-square correction hack:
Time client spent for waiting for reply from server: 0.082539s [MOST 
IMPORTANT]

BUT, now look at this *with* readdir optimization applied:

Optimized readdir, no n-square correction hack:
Time 9p server spent on synth_readdir() I/O only (synth driver): 0.021304s
Time 9p server spent on entire T_readdir request: 0.022033s [IMPORTANT]
Time client spent for waiting for reply from server: 0.022408s [MOST 
IMPORTANT]

Optimized readdir, with n-square correction hack:
Time 9p server spent on synth_readdir() I/O only (synth driver): 0.001576s
Time 9p server spent on entire T_readdir request: 0.002244s [IMPORTANT]
Time client spent for waiting for reply from server: 0.002566s [MOST 
IMPORTANT]

Got it? :)

I had good reasons for printing out the time spent on raw driver only.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 10/11] 9pfs: T_readdir latency optimization
  2020-01-23 11:33   ` Greg Kurz
@ 2020-01-23 12:57     ` Christian Schoenebeck
  0 siblings, 0 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-23 12:57 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Donnerstag, 23. Januar 2020 12:33:42 CET Greg Kurz wrote:
> On Tue, 21 Jan 2020 01:30:10 +0100
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > Make top half really top half and bottom half really bottom half:
> > 
> > Each T_readdir request handling is hopping between threads (main
> > I/O thread and background I/O driver threads) several times for
> > every individual directory entry, which sums up to huge latencies
> > for handling just a single T_readdir request.
> > 
> > Instead of doing that, collect now all required directory entries
> > (including all potentially required stat buffers for each entry) in
> > one rush on a background I/O thread from fs driver, then assemble
> > the entire resulting network response message for the readdir
> > request on main I/O thread. The fs driver is still aborting the
> > directory entry retrieval loop (on the background I/O thread) as
> > soon as it would exceed the client's requested maximum R_readdir
> > response size. So we should not have any performance penalty by
> > doing this.
> > 
> > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > ---
> 
> Ok so this is it. Not reviewed this huge patch yet but I could at
> least give a try. The gain is impressive indeed:

Tseses, so much scepticism. :)

> [greg@bahia qemu-9p]$ (cd .mbuild-$(stg branch)/obj ; export
> QTEST_QEMU_BINARY='x86_64-softmmu/qemu-system-x86_64'; make all
> tests/qtest/qos-test && for i in {1..100}; do tests/qtest/qos-test -p
> $(tests/qtest/qos-test -l | grep readdir/basic); done) |& awk '/IMPORTANT/
> { print $10 }' | sed -e 's/s//' -e 's/^/n+=1;x+=/;$ascale=6;x/n' | bc
> .009806
> 
> instead of .055654, i.e. nearly 6 times faster ! This sounds promising :)

As mentioned in the other email, the performance improvement by this patch is
actually far more than a factor of 6, since you probably just dropped the
n-square driver hack in your benchmarks (which tainted your benchmark results):

Unoptimized readdir, with n-square correction hack:
Time client spent for waiting for reply from server: 0.082539s [MOST 
IMPORTANT]

Optimized readdir, with n-square correction hack:
Time 9p server spent on synth_readdir() I/O only (synth driver): 0.001576s
Time 9p server spent on entire T_readdir request: 0.002244s [IMPORTANT]
Time client spent for waiting for reply from server: 0.002566s [MOST 
IMPORTANT]

So in this particular test run the performance improvement is around a factor
of 32, but I have also observed factors around 40 in earlier tests.
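
(That factor follows directly from the two "MOST IMPORTANT" numbers quoted
above: 0.082539 s / 0.002566 s ≈ 32.2.)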

> Now I need to find time to do a decent review... :-\

Sure, take your time! But as you can see, it is really worth it.

And it's not just the performance improvement. This patch also reduces program
flow complexity significantly, e.g. there is just one lock and one unlock; the
entry name allocation is freed immediately, without any potential branch in
between, and much more. In other words: it adds safety.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 06/11] tests/virtio-9p: added splitted readdir test
  2020-01-23 10:30       ` Greg Kurz
@ 2020-01-23 13:07         ` Christian Schoenebeck
  0 siblings, 0 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-23 13:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Donnerstag, 23. Januar 2020 11:30:58 CET Greg Kurz wrote:
> On Wed, 22 Jan 2020 23:36:22 +0100
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > On Mittwoch, 22. Januar 2020 22:19:05 CET Greg Kurz wrote:
> > > On Tue, 21 Jan 2020 01:16:21 +0100
> > > 
> > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > > The previous, already existing readdir test simply used a 'count'
> > > > parameter big enough to retrieve all directory entries with a
> > > > single Treaddir request.
> > > > 
> > > > In this new 'splitted' readdir test, directory entries are
> > > > retrieved, splitted over several Treaddir requests by picking small
> > > > 'count' parameters which force the server to truncate the response.
> > > > So the test client sends as many Treaddir requests as necessary to
> > > > get all directory entries. Currently this test covers actually two
> > > > tests: a sequence of Treaddir requests with count=512 and then a
> > > > subsequent test with a sequence of Treaddir requests with count=256.
> > > > 
> > > > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > > > ---
> > > 
> > > Not sure it is really needed to check with multiple values for 'count',
> > > but it doesn't eat too many cycles so I guess it doesn't hurt.
> > 
> > Yes, it is a cheap test, nobody will feel the difference. One could argue
> > about the precise 'count' values being used ...
> > 
> > > Applied as well.
> > > 
> > > >  tests/qtest/virtio-9p-test.c | 91 ++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 91 insertions(+)
> > > > 
> > > > diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
> > > > index 2167322985..8b0d94546e 100644
> > > > --- a/tests/qtest/virtio-9p-test.c
> > > > +++ b/tests/qtest/virtio-9p-test.c
> > > > @@ -578,6 +578,7 @@ static bool fs_dirents_contain_name(struct V9fsDirent *e, const char* name)
> > > >      return false;
> > > >  }
> > > >  
> > > > +/* basic readdir test where reply fits into a single response message */
> > > >  static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
> > > >  {
> > > >      QVirtio9P *v9p = obj;
> > > > @@ -631,6 +632,95 @@ static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
> > > >      g_free(wnames[0]);
> > > >  }
> > > >  
> > > > +/* readdir test where overall request is splitted over several messages */
> > > > +static void fs_readdir_split(void *obj, void *data, QGuestAllocator *t_alloc)
> > > > +{
> > > > +    QVirtio9P *v9p = obj;
> > > > +    alloc = t_alloc;
> > > > +    char *const wnames[] = { g_strdup(QTEST_V9FS_SYNTH_READDIR_DIR) };
> > > > +    uint16_t nqid;
> > > > +    v9fs_qid qid;
> > > > +    uint32_t count, nentries, npartialentries;
> > > > +    struct V9fsDirent *entries, *tail, *partialentries;
> > > > +    P9Req *req;
> > > > +    int subtest;
> > > > +    int fid;
> > > > +    uint64_t offset;
> > > > +    /* the Treaddir 'count' parameter values to be tested */
> > > > +    const uint32_t vcount[] = { 512, 256 };
> > 
> > ... here. But IMO it does make sense preserving the function's overall
> > structure to allow testing with different 'count' values if necessary.
> > Because that way this test could e.g. guard potential bugs when server's
> > Treaddir handler is rolling back (or not) the dir offset for instance,
> > which server has to do (or not) according to this 'count' value and the
> > precise file name length of the individual directory entries.
> 
> I still agree it is useful to be able to check different values but I
> now realize that it shouldn't be done like this because adding new
> values to vcount[] doesn't scale well with the MAX_REQ limitation I
> mentioned in another mail. More values, especially small ones, are
> likely to trigger "Failed to decode VirtFS request type 40" at some
> point.
> 
> I now think that fs_readdir_split() should rather get count as
> an argument and only do one run. By default we would only call
> this with an appropriate value, ideally derived from the test
> environment (number of files, size of filenames... etc...).
> If someone needs to test a specific value, it is easy as adding
> a new qos_add_test() line.

I actually had this variant before, exactly the way you are suggesting here
(not submitted to the list though): a version of this function with a count
argument, plus 3 separate qtest cases. But that variant failed in the exact
same way.

Furthermore, I even get some strange "could not push stack" error messages 
from the qtest environment when I just add some empty noop tests to the 9pfs 
test suite. I am not sure if that's related to this ringbuffer issue here, but 
it was a show stopper for this 'separate test for each count' variant, so I 
reverted it to this suggested array solution for now.

So to avoid any misapprehension: it seems to me as if you were thinking that
the ringbuffer is freed between every individual test. That's not what I am
observing here. It rather seems as if space in the ringbuffer is never freed
(until the entire 9pfs test suite has completed).

> This would ensure at least that each run starts with the same
> fixed number of possible requests, ie. MAX_REQ minus the cost of
> fs_attach().
> 
> So I'll revert this patch for now.
> 
> > Whatever precise 'count' tests are desired, it would only mean a one line
> > change here.
> > 
> > Best regards,
> > Christian Schoenebeck


Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 08/11] 9pfs: readdir benchmark
  2020-01-23 10:34   ` Greg Kurz
@ 2020-01-23 13:20     ` Christian Schoenebeck
  0 siblings, 0 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-01-23 13:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Donnerstag, 23. Januar 2020 11:34:15 CET Greg Kurz wrote:
> On Tue, 21 Jan 2020 01:23:55 +0100
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > This patch is not intended to be merged. It just provides a
> 
> Well I like the idea of having such a benchmark available.
> It could probably be merged after a few changes...

I never thought about this benchmark patch being useful in general, because it
is actually just a few lines of code to benchmark readdir performance. Plus,
as you just found out, it does matter whether the synth driver's n-square
issue taints the benchmark results.

So sure, I could add a boolean macro P9_BENCHMARK_READDIR or whatever, which
one could just enable if desired, but my concern would be that people would
not interpret the printed values correctly, since they are certainly unaware
of the impact the driver performance has on these values.

I mean if you missed that point, then others definitely will as well.
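
A rough sketch of what I mean (macro name and placement are just for
illustration, this is not part of the actual patch):

/* compile-time switch, off by default, so the benchmark output does not
 * pollute regular CI logs */
#ifdef P9_BENCHMARK_READDIR
    printf("Time client spent for waiting for reply from server: %fs "
           "[MOST IMPORTANT]\n", waitforreply - start);
#endif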

> > temporary benchmark foundation for coneniently A/B comparison
> > of the subsequent 9p readdir optimization patches:
> > 
> > * hw/9pfs/9p-synth: increase amount of simulated files for
> > 
> >   readdir test to 2000 files.
> 
> ... the 9p-synth backend could maybe always create this number
> of files ?

That's up to you. I don't mind what the precise value is in the production
version. The tests just take more time to execute if there are 2000 files.

> > * tests/virtio-9p: measure wall time that elapsed between
> > 
> >   sending T_readdir request and arrival of R_readdir response
> >   and print out that measured duration, as well as amount of
> >   directory entries received, and the amount of bytes of the
> >   response message.
> 
> ... maybe we should make the printing optional and off by
> default so that it doesn't pollute CI logs.
> 
> > * tests/virtio-9p: increased msize to 256kiB to allow
> > 
> >   retrieving all 2000 files (simulated by 9pfs synth driver)
> >   with only one T_readdir request.
> 
> ... always use 256k for both the basic test and a revisited
> split test ?

Same thing, I don't mind, it's up to you to decide.

Actually I would find it more important to document somewhere how to actually
run these tests and/or how to run one specific individual test, because I
don't find it obvious for external people how to do that. That might explain
why there are so few 9p tests so far. I also did many tests manually before
(with a real guest OS, fs rollbacks, etc.), simply because I did not know.

> > Running this benchmark is fairly quick & simple and does not
> > require any guest OS installation or other prerequisites:
> > 
> > cd build
> > make && make tests/qtest/qos-test
> > export QTEST_QEMU_BINARY=x86_64-softmmu/qemu-system-x86_64
> > tests/qtest/qos-test -p $(tests/qtest/qos-test -l | grep readdir/basic)
> > 
> > Since this benchmark uses the 9pfs synth driver, the host
> > machine's I/O hardware (SSDs/HDDs) is not relevant for the
> > benchmark result, because the synth backend's readdir
> > implementation returns immediately (without any blocking I/O
> > that would incur with a real-life fs driver) and just returns
> > already prepared, simulated directory entries directly from RAM.
> > So this benchmark focuses on the efficiency of the 9pfs
> > controller code (or top half) for readdir request handling.
> > 
> > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > ---
> > 
> >  hw/9pfs/9p-synth.h           |  2 +-
> >  tests/qtest/virtio-9p-test.c | 37 +++++++++++++++++++++++++++++++++++-
> >  2 files changed, 37 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/9pfs/9p-synth.h b/hw/9pfs/9p-synth.h
> > index 036d7e4a5b..7d6cedcdac 100644
> > --- a/hw/9pfs/9p-synth.h
> > +++ b/hw/9pfs/9p-synth.h
> > @@ -58,7 +58,7 @@ int qemu_v9fs_synth_add_file(V9fsSynthNode *parent, int mode,
> >  /* for READDIR test */
> >  #define QTEST_V9FS_SYNTH_READDIR_DIR "ReadDirDir"
> >  #define QTEST_V9FS_SYNTH_READDIR_FILE "ReadDirFile%d"
> > 
> > -#define QTEST_V9FS_SYNTH_READDIR_NFILES 100
> > +#define QTEST_V9FS_SYNTH_READDIR_NFILES 2000
> > 
> >  /* Any write to the "FLUSH" file is handled one byte at a time by the
> >  
> >   * backend. If the byte is zero, the backend returns success (ie, 1),
> > 
> > diff --git a/tests/qtest/virtio-9p-test.c b/tests/qtest/virtio-9p-test.c
> > index e47b286340..d71b37aa6c 100644
> > --- a/tests/qtest/virtio-9p-test.c
> > +++ b/tests/qtest/virtio-9p-test.c
> > @@ -15,6 +15,18 @@
> > 
> >  #include "libqos/virtio-9p.h"
> >  #include "libqos/qgraph.h"
> > 
> > +/*
> > + * to benchmark the real time (not CPU time) that elapsed between start of
> > + * a request and arrival of its response
> > + */
> > +static double wall_time(void)
> > +{
> > +    struct timeval t;
> > +    struct timezone tz;
> > +    gettimeofday(&t, &tz);
> > +    return t.tv_sec + t.tv_usec * 0.000001;
> > +}
> > +
> > 
> >  #define QVIRTIO_9P_TIMEOUT_US (10 * 1000 * 1000)
> >  static QGuestAllocator *alloc;
> > 
> > @@ -36,7 +48,7 @@ static void pci_config(void *obj, void *data, QGuestAllocator *t_alloc)
> >      g_free(tag);
> >  
> >  }
> > 
> > -#define P9_MAX_SIZE 4096 /* Max size of a T-message or R-message */
> > +#define P9_MAX_SIZE (256 * 1024) /* Max size of a T-message or R-message */
> > 
> >  typedef struct {
> >  
> >      QTestState *qts;
> > 
> > @@ -600,12 +612,35 @@ static void fs_readdir(void *obj, void *data, QGuestAllocator *t_alloc)
> >      v9fs_req_wait_for_reply(req, NULL);
> >      v9fs_rlopen(req, &qid, NULL);
> > 
> > +    const double start = wall_time();
> > +
> > 
> >      /*
> >       * submit count = msize - 11, because 11 is the header size of Rreaddir
> >       */
> >      req = v9fs_treaddir(v9p, 1, 0, P9_MAX_SIZE - 11, 0);
> > 
> > +    const double treaddir = wall_time();
> > 
> >      v9fs_req_wait_for_reply(req, NULL);
> > 
> > +    const double waitforreply = wall_time();
> > 
> >      v9fs_rreaddir(req, &count, &nentries, &entries);
> > 
> > +    const double end = wall_time();
> > +
> > +    printf("\nTime client spent on sending T_readdir: %fs\n\n",
> > +           treaddir - start);
> > +
> > +    printf("Time client spent for waiting for reply from server: %fs "
> > +           "[MOST IMPORTANT]\n", waitforreply - start);
> > +    printf("(This is the most important value, because it reflects the
> > time\n" +           "the 9p server required to process and return the
> > result of the\n" +           "T_readdir request.)\n\n");
> > +
> > +    printf("Total client time: %fs\n", end - start);
> > +    printf("(NOTE: this time is not relevant; this huge time comes
> > from\n"
> > +           "inefficient qtest_memread() calls. So you can discard this\n"
> > +           "value as a problem of this test client implementation
> > while\n"
> > +           "processing the received server T_readdir reply.)\n\n");
> > +
> > +    printf("Details of response message data: R_readddir nentries=%d "
> > +           "rbytes=%d\n", nentries, count);
> > 
> >      /*
> >       * Assuming msize (P9_MAX_SIZE) is large enough so we can retrieve all

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 10/11] 9pfs: T_readdir latency optimization
  2020-01-21  0:30 ` [PATCH v4 10/11] 9pfs: T_readdir latency optimization Christian Schoenebeck
  2020-01-23 11:33   ` Greg Kurz
@ 2020-03-09 14:09   ` Christian Schoenebeck
  2020-03-09 15:42     ` Greg Kurz
  1 sibling, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-03-09 14:09 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz, Eric Blake

On Dienstag, 21. Januar 2020 01:30:10 CET Christian Schoenebeck wrote:
> Make top half really top half and bottom half really bottom half:
> 
> Each T_readdir request handling is hopping between threads (main
> I/O thread and background I/O driver threads) several times for
> every individual directory entry, which sums up to huge latencies
> for handling just a single T_readdir request.
> 
> Instead of doing that, collect now all required directory entries
> (including all potentially required stat buffers for each entry) in
> one rush on a background I/O thread from fs driver, then assemble
> the entire resulting network response message for the readdir
> request on main I/O thread. The fs driver is still aborting the
> directory entry retrieval loop (on the background I/O thread) as
> soon as it would exceed the client's requested maximum R_readdir
> response size. So we should not have any performance penalty by
> doing this.
> 
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> ---

PING! Idle for 7 weeks.

Could anybody help out reviewing this particular patch 10?

Complete thread:
https://lists.gnu.org/archive/html/qemu-devel/2020-01/msg04418.html

Patches 1..5 are already on master. The subsequent patches have already been
reviewed; this patch 10 is the only one that has not been reviewed at all yet.

Test cases and benchmark patches are provided in this thread, as well as
instructions to run them.

>  hw/9pfs/9p.c    | 124 +++++++++++++++-----------------
>  hw/9pfs/9p.h    |  23 ++++++
>  hw/9pfs/codir.c | 183 +++++++++++++++++++++++++++++++++++++++++++++---
>  hw/9pfs/coth.h  |   3 +
>  4 files changed, 254 insertions(+), 79 deletions(-)
> 
> diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> index 18370183c4..e0ca45d46b 100644
> --- a/hw/9pfs/9p.c
> +++ b/hw/9pfs/9p.c
> @@ -971,30 +971,6 @@ static int coroutine_fn fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp,
>      return 0;
>  }
> 
> -static int coroutine_fn dirent_to_qid(V9fsPDU *pdu, V9fsFidState *fidp,
> -                                      struct dirent *dent, V9fsQID *qidp)
> -{
> -    struct stat stbuf;
> -    V9fsPath path;
> -    int err;
> -
> -    v9fs_path_init(&path);
> -
> -    err = v9fs_co_name_to_path(pdu, &fidp->path, dent->d_name, &path);
> -    if (err < 0) {
> -        goto out;
> -    }
> -    err = v9fs_co_lstat(pdu, &path, &stbuf);
> -    if (err < 0) {
> -        goto out;
> -    }
> -    err = stat_to_qid(pdu, &stbuf, qidp);
> -
> -out:
> -    v9fs_path_free(&path);
> -    return err;
> -}
> -
>  V9fsPDU *pdu_alloc(V9fsState *s)
>  {
>      V9fsPDU *pdu = NULL;
> @@ -2314,7 +2290,7 @@ out_nofid:
>      pdu_complete(pdu, err);
>  }
> 
> -static size_t v9fs_readdir_data_size(V9fsString *name)
> +size_t v9fs_readdir_response_size(V9fsString *name)
>  {
>      /*
>       * Size of each dirent on the wire: size of qid (13) + size of offset (8)
> @@ -2323,6 +2299,18 @@ static size_t v9fs_readdir_data_size(V9fsString *name)
>      return 24 + v9fs_string_size(name);
>  }
> 
> +static void v9fs_free_dirents(struct V9fsDirEnt *e)
> +{
> +    struct V9fsDirEnt *next = NULL;
> +
> +    for (; e; e = next) {
> +        next = e->next;
> +        g_free(e->dent);
> +        g_free(e->st);
> +        g_free(e);
> +    }
> +}
> +
>  static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
>                                          int32_t max_count)
>  {
> @@ -2331,54 +2319,53 @@ static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
>      V9fsString name;
>      int len, err = 0;
>      int32_t count = 0;
> -    off_t saved_dir_pos;
>      struct dirent *dent;
> +    struct stat *st;
> +    struct V9fsDirEnt *entries = NULL;
> 
> -    /* save the directory position */
> -    saved_dir_pos = v9fs_co_telldir(pdu, fidp);
> -    if (saved_dir_pos < 0) {
> -        return saved_dir_pos;
> -    }
> -
> -    while (1) {
> -        v9fs_readdir_lock(&fidp->fs.dir);
> +    /*
> +     * inode remapping requires the device id, which in turn might be
> +     * different for different directory entries, so if inode remapping is
> +     * enabled we have to make a full stat for each directory entry
> +     */
> +    const bool dostat = pdu->s->ctx.export_flags & V9FS_REMAP_INODES;
> 
> -        err = v9fs_co_readdir(pdu, fidp, &dent);
> -        if (err || !dent) {
> -            break;
> -        }
> -        v9fs_string_init(&name);
> -        v9fs_string_sprintf(&name, "%s", dent->d_name);
> -        if ((count + v9fs_readdir_data_size(&name)) > max_count) {
> -            v9fs_readdir_unlock(&fidp->fs.dir);
> +    /*
> +     * Fetch all required directory entries altogether on a background IO
> +     * thread from fs driver. We don't want to do that for each entry
> +     * individually, because hopping between threads (this main IO thread
> +     * and background IO driver thread) would sum up to huge latencies.
> +     */
> +    count = v9fs_co_readdir_lowlat(pdu, fidp, &entries, max_count, dostat);
> +    if (count < 0) {
> +        err = count;
> +        count = 0;
> +        goto out;
> +    }
> +    count = 0;
> 
> -            /* Ran out of buffer. Set dir back to old position and return */
> -            v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
> -            v9fs_string_free(&name);
> -            return count;
> -        }
> +    for (struct V9fsDirEnt *e = entries; e; e = e->next) {
> +        dent = e->dent;
> 
>          if (pdu->s->ctx.export_flags & V9FS_REMAP_INODES) {
> -            /*
> -             * dirent_to_qid() implies expensive stat call for each entry,
> -             * we must do that here though since inode remapping requires
> -             * the device id, which in turn might be different for
> -             * different entries; we cannot make any assumption to avoid
> -             * that here.
> -             */
> -            err = dirent_to_qid(pdu, fidp, dent, &qid);
> +            st = e->st;
> +            /* e->st should never be NULL, but just to be sure */
> +            if (!st) {
> +                err = -1;
> +                break;
> +            }
> +
> +            /* remap inode */
> +            err = stat_to_qid(pdu, st, &qid);
>              if (err < 0) {
> -                v9fs_readdir_unlock(&fidp->fs.dir);
> -                v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
> -                v9fs_string_free(&name);
> -                return err;
> +                break;
>              }
>          } else {
>              /*
>               * Fill up just the path field of qid because the client uses
>               * only that. To fill the entire qid structure we will have
>               * to stat each dirent found, which is expensive. For the
> -             * latter reason we don't call dirent_to_qid() here. Only drawback
> +             * latter reason we don't call stat_to_qid() here. Only drawback
>               * is that no multi-device export detection of stat_to_qid()
>               * would be done and provided as error to the user here. But
>               * user would get that error anyway when accessing those
> @@ -2391,25 +2378,26 @@ static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
>              qid.version = 0;
>          }
> 
> +        v9fs_string_init(&name);
> +        v9fs_string_sprintf(&name, "%s", dent->d_name);
> +
>          /* 11 = 7 + 4 (7 = start offset, 4 = space for storing count) */
>          len = pdu_marshal(pdu, 11 + count, "Qqbs",
>                            &qid, dent->d_off,
>                            dent->d_type, &name);
> 
> -        v9fs_readdir_unlock(&fidp->fs.dir);
> +        v9fs_string_free(&name);
> 
>          if (len < 0) {
> -            v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
> -            v9fs_string_free(&name);
> -            return len;
> +            err = len;
> +            break;
>          }
> +
>          count += len;
> -        v9fs_string_free(&name);
> -        saved_dir_pos = dent->d_off;
>      }
> 
> -    v9fs_readdir_unlock(&fidp->fs.dir);
> -
> +out:
> +    v9fs_free_dirents(entries);
>      if (err < 0) {
>          return err;
>      }
> diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
> index 6fffe44f5a..1dbb0ad189 100644
> --- a/hw/9pfs/9p.h
> +++ b/hw/9pfs/9p.h
> @@ -215,6 +215,28 @@ static inline void v9fs_readdir_init(V9fsDir *dir)
>      qemu_mutex_init(&dir->readdir_mutex);
>  }
> 
> +/*
> + * Type for 9p fs drivers' (a.k.a. 9p backends) result of readdir requests,
> + * which is a chained list of directory entries.
> + */
> +typedef struct V9fsDirEnt {
> +    /* mandatory (must not be NULL) information for all readdir requests */
> +    struct dirent *dent;
> +    /*
> +     * optional (may be NULL): A full stat of each directory entry is just
> +     * done if explicitly told to fs driver.
> +     */
> +    struct stat *st;
> +    /*
> +     * instead of an array, directory entries are always returned as
> +     * chained list, that's because the amount of entries retrieved by fs
> +     * drivers is dependent on the individual entries' name (since response
> +     * messages are size limited), so the final amount cannot be estimated
> +     * before hand
> +     */
> +    struct V9fsDirEnt *next;
> +} V9fsDirEnt;
> +
>  /*
>   * Filled by fs driver on open and other
>   * calls.
> @@ -419,6 +441,7 @@ void v9fs_path_init(V9fsPath *path);
>  void v9fs_path_free(V9fsPath *path);
>  void v9fs_path_sprintf(V9fsPath *path, const char *fmt, ...);
>  void v9fs_path_copy(V9fsPath *dst, const V9fsPath *src);
> +size_t v9fs_readdir_response_size(V9fsString *name);
>  int v9fs_name_to_path(V9fsState *s, V9fsPath *dirpath,
>                        const char *name, V9fsPath *path);
>  int v9fs_device_realize_common(V9fsState *s, const V9fsTransport *t,
> diff --git a/hw/9pfs/codir.c b/hw/9pfs/codir.c
> index 73f9a751e1..6ce7dc8cde 100644
> --- a/hw/9pfs/codir.c
> +++ b/hw/9pfs/codir.c
> @@ -18,28 +18,189 @@
>  #include "qemu/main-loop.h"
>  #include "coth.h"
> 
> +/*
> + * This is solely executed on a background IO thread.
> + */
> +static int do_readdir(V9fsPDU *pdu, V9fsFidState *fidp, struct dirent **dent)
> +{
> +    int err = 0;
> +    V9fsState *s = pdu->s;
> +    struct dirent *entry;
> +
> +    errno = 0;
> +    entry = s->ops->readdir(&s->ctx, &fidp->fs);
> +    if (!entry && errno) {
> +        *dent = NULL;
> +        err = -errno;
> +    } else {
> +        *dent = entry;
> +    }
> +    return err;
> +}
> +
> +/*
> + * TODO: This will be removed for performance reasons.
> + * Use v9fs_co_readdir_lowlat() instead.
> + */
>  int coroutine_fn v9fs_co_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
>                                   struct dirent **dent)
>  {
>      int err;
> -    V9fsState *s = pdu->s;
> 
>      if (v9fs_request_cancelled(pdu)) {
>          return -EINTR;
>      }
> -    v9fs_co_run_in_worker(
> -        {
> -            struct dirent *entry;
> +    v9fs_co_run_in_worker({
> +        err = do_readdir(pdu, fidp, dent);
> +    });
> +    return err;
> +}
> +
> +/*
> + * This is solely executed on a background IO thread.
> + *
> + * See v9fs_co_readdir_lowlat() (as its only user) below for details.
> + */
> +static int do_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
> +                             struct V9fsDirEnt **entries,
> +                             int32_t maxsize, bool dostat)
> +{
> +    V9fsState *s = pdu->s;
> +    V9fsString name;
> +    int len, err = 0;
> +    int32_t size = 0;
> +    off_t saved_dir_pos;
> +    struct dirent *dent;
> +    struct V9fsDirEnt *e = NULL;
> +    V9fsPath path;
> +    struct stat stbuf;
> +
> +    *entries = NULL;
> +    v9fs_path_init(&path);
> +
> +    /*
> +     * TODO: Here should be a warn_report_once() if lock failed.
> +     *
> +     * With a good 9p client we should not get into concurrency here,
> +     * because a good client would not use the same fid for concurrent
> +     * requests. We do the lock here for safety reasons though. However
> +     * the client would then suffer performance issues, so better log that
> +     * issue here.
> +     */
> +    v9fs_readdir_lock(&fidp->fs.dir);
> +
> +    /* save the directory position */
> +    saved_dir_pos = s->ops->telldir(&s->ctx, &fidp->fs);
> +    if (saved_dir_pos < 0) {
> +        err = saved_dir_pos;
> +        goto out;
> +    }
> +
> +    while (true) {
> +        /* get directory entry from fs driver */
> +        err = do_readdir(pdu, fidp, &dent);
> +        if (err || !dent) {
> +            break;
> +        }
> +
> +        /*
> +         * stop this loop as soon as it would exceed the allowed maximum
> +         * response message size for the directory entries collected so far,
> +         * because anything beyond that size would need to be discarded by
> +         * 9p controller (main thread / top half) anyway
> +         */
> +        v9fs_string_init(&name);
> +        v9fs_string_sprintf(&name, "%s", dent->d_name);
> +        len = v9fs_readdir_response_size(&name);
> +        v9fs_string_free(&name);
> +        if (size + len > maxsize) {
> +            /* this is not an error case actually */
> +            break;
> +        }
> +
> +        /* append next node to result chain */
> +        if (!e) {
> +            *entries = e = g_malloc0(sizeof(V9fsDirEnt));
> +        } else {
> +            e = e->next = g_malloc0(sizeof(V9fsDirEnt));
> +        }
> +        e->dent = g_malloc0(sizeof(struct dirent));
> +        memcpy(e->dent, dent, sizeof(struct dirent));
> 
> -            errno = 0;
> -            entry = s->ops->readdir(&s->ctx, &fidp->fs);
> -            if (!entry && errno) {
> +        /* perform a full stat() for directory entry if requested by caller */
> +        if (dostat) {
> +            err = s->ops->name_to_path(
> +                &s->ctx, &fidp->path, dent->d_name, &path
> +            );
> +            if (err < 0) {
>                  err = -errno;
> -            } else {
> -                *dent = entry;
> -                err = 0;
> +                break;
>              }
> -        });
> +
> +            err = s->ops->lstat(&s->ctx, &path, &stbuf);
> +            if (err < 0) {
> +                err = -errno;
> +                break;
> +            }
> +
> +            e->st = g_malloc0(sizeof(struct stat));
> +            memcpy(e->st, &stbuf, sizeof(struct stat));
> +        }
> +
> +        size += len;
> +        saved_dir_pos = dent->d_off;
> +    }
> +
> +    /* restore (last) saved position */
> +    s->ops->seekdir(&s->ctx, &fidp->fs, saved_dir_pos);
> +
> +out:
> +    v9fs_readdir_unlock(&fidp->fs.dir);
> +    v9fs_path_free(&path);
> +    if (err < 0) {
> +        return err;
> +    }
> +    return size;
> +}
> +
> +/**
> + * @brief Low latency variant of fs driver readdir handling.
> + *
> + * Retrieves the requested (max. amount of) directory entries from the fs
> + * driver. This function must only be called by the main IO thread (top half).
> + * Internally this function call will be dispatched to a background IO thread
> + * (bottom half) where it is eventually executed by the fs driver.
> + *
> + * The old readdir implementation above just retrieves always one dir entry
> + * per call. The problem of that implementation above is that latency is
> + * added for (retrieval of) each directory entry, which in practice lead to
> + * latencies of several hundred ms for readdir of only one directory.
> + *
> + * This is avoided in this function by letting the fs driver retrieve all
> + * required directory entries with only call of this function and hence with
> + * only a single fs driver request.
> + *
> + * @param pdu - the causing 9p (T_readdir) client request
> + * @param fidp - already opened directory where readdir shall be performed on
> + * @param entries - output for directory entries (must not be NULL)
> + * @param maxsize - maximum result message body size (in bytes)
> + * @param dostat - whether a stat() should be performed and returned for
> + *                 each directory entry
> + * @returns resulting response message body size (in bytes) on success,
> + *          negative error code otherwise
> + */
> +int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
> +                                        struct V9fsDirEnt **entries,
> +                                        int32_t maxsize, bool dostat)
> +{
> +    int err = 0;
> +
> +    if (v9fs_request_cancelled(pdu)) {
> +        return -EINTR;
> +    }
> +    v9fs_co_run_in_worker({
> +        err = do_readdir_lowlat(pdu, fidp, entries, maxsize, dostat);
> +    });
>      return err;
>  }
> 
> diff --git a/hw/9pfs/coth.h b/hw/9pfs/coth.h
> index c2cdc7a9ea..1249dbe6df 100644
> --- a/hw/9pfs/coth.h
> +++ b/hw/9pfs/coth.h
> @@ -49,6 +49,9 @@
>  void co_run_in_worker_bh(void *);
>  int coroutine_fn v9fs_co_readlink(V9fsPDU *, V9fsPath *, V9fsString *);
>  int coroutine_fn v9fs_co_readdir(V9fsPDU *, V9fsFidState *, struct dirent **);
> +int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *, V9fsFidState *,
> +                                        struct V9fsDirEnt **,
> +                                        int32_t, bool);
>  off_t coroutine_fn v9fs_co_telldir(V9fsPDU *, V9fsFidState *);
>  void coroutine_fn v9fs_co_seekdir(V9fsPDU *, V9fsFidState *, off_t);
>  void coroutine_fn v9fs_co_rewinddir(V9fsPDU *, V9fsFidState *);




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 10/11] 9pfs: T_readdir latency optimization
  2020-03-09 14:09   ` Christian Schoenebeck
@ 2020-03-09 15:42     ` Greg Kurz
  2020-03-10 15:10       ` Christian Schoenebeck
  0 siblings, 1 reply; 39+ messages in thread
From: Greg Kurz @ 2020-03-09 15:42 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Mon, 09 Mar 2020 15:09:21 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Dienstag, 21. Januar 2020 01:30:10 CET Christian Schoenebeck wrote:
> > Make top half really top half and bottom half really bottom half:
> > 
> > Each T_readdir request handling is hopping between threads (main
> > I/O thread and background I/O driver threads) several times for
> > every individual directory entry, which sums up to huge latencies
> > for handling just a single T_readdir request.
> > 
> > Instead of doing that, collect now all required directory entries
> > (including all potentially required stat buffers for each entry) in
> > one rush on a background I/O thread from fs driver, then assemble
> > the entire resulting network response message for the readdir
> > request on main I/O thread. The fs driver is still aborting the
> > directory entry retrieval loop (on the background I/O thread) as
> > soon as it would exceed the client's requested maximum R_readdir
> > response size. So we should not have any performance penalty by
> > doing this.
> > 
> > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > ---
> 
> PING! Idle for 7 weeks.
> 

Yeah sorry for that, I couldn't find enough time yet...

> Could anybody help out reviewing this particular patch 10?
> 

The main issue I see with this patch is that it is too big. At
least from my POV, with the little time I have for 9p, that's
the first thing that pushes me back... So if you could split
this patch down to some reviewable chunks, this may help
going forward.

Anyway, here's a first high level shot.

> Complete thread:
> https://lists.gnu.org/archive/html/qemu-devel/2020-01/msg04418.html
> 
> Patches 1..5 are already on master. The subsequent patches have already been
> reviewed; this patch 10 is the only one that has not been reviewed at all yet.
> 
> Test cases and benchmark patches are provided in this thread, as well as
> instructions to run them.
> 
> >  hw/9pfs/9p.c    | 124 +++++++++++++++-----------------
> >  hw/9pfs/9p.h    |  23 ++++++
> >  hw/9pfs/codir.c | 183 +++++++++++++++++++++++++++++++++++++++++++++---
> >  hw/9pfs/coth.h  |   3 +
> >  4 files changed, 254 insertions(+), 79 deletions(-)
> > 
> > diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> > index 18370183c4..e0ca45d46b 100644
> > --- a/hw/9pfs/9p.c
> > +++ b/hw/9pfs/9p.c
> > @@ -971,30 +971,6 @@ static int coroutine_fn fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp,
> >      return 0;
> >  }
> > 
> > -static int coroutine_fn dirent_to_qid(V9fsPDU *pdu, V9fsFidState *fidp,
> > -                                      struct dirent *dent, V9fsQID *qidp)
> > -{
> > -    struct stat stbuf;
> > -    V9fsPath path;
> > -    int err;
> > -
> > -    v9fs_path_init(&path);
> > -
> > -    err = v9fs_co_name_to_path(pdu, &fidp->path, dent->d_name, &path);
> > -    if (err < 0) {
> > -        goto out;
> > -    }
> > -    err = v9fs_co_lstat(pdu, &path, &stbuf);
> > -    if (err < 0) {
> > -        goto out;
> > -    }
> > -    err = stat_to_qid(pdu, &stbuf, qidp);
> > -
> > -out:
> > -    v9fs_path_free(&path);
> > -    return err;
> > -}
> > -
> >  V9fsPDU *pdu_alloc(V9fsState *s)
> >  {
> >      V9fsPDU *pdu = NULL;
> > @@ -2314,7 +2290,7 @@ out_nofid:
> >      pdu_complete(pdu, err);
> >  }
> > 
> > -static size_t v9fs_readdir_data_size(V9fsString *name)
> > +size_t v9fs_readdir_response_size(V9fsString *name)
> >  {
> >      /*
> >       * Size of each dirent on the wire: size of qid (13) + size of offset (8)
> > @@ -2323,6 +2299,18 @@ static size_t v9fs_readdir_data_size(V9fsString *name)
> >      return 24 + v9fs_string_size(name);
> >  }
> > 
> > +static void v9fs_free_dirents(struct V9fsDirEnt *e)
> > +{
> > +    struct V9fsDirEnt *next = NULL;
> > +
> > +    for (; e; e = next) {
> > +        next = e->next;
> > +        g_free(e->dent);
> > +        g_free(e->st);
> > +        g_free(e);
> > +    }
> > +}
> > +
> >  static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
> >                                          int32_t max_count)
> >  {
> > @@ -2331,54 +2319,53 @@ static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
> >      V9fsString name;
> >      int len, err = 0;
> >      int32_t count = 0;
> > -    off_t saved_dir_pos;
> >      struct dirent *dent;
> > +    struct stat *st;
> > +    struct V9fsDirEnt *entries = NULL;
> > 
> > -    /* save the directory position */
> > -    saved_dir_pos = v9fs_co_telldir(pdu, fidp);
> > -    if (saved_dir_pos < 0) {
> > -        return saved_dir_pos;
> > -    }
> > -
> > -    while (1) {
> > -        v9fs_readdir_lock(&fidp->fs.dir);
> > +    /*
> > +     * inode remapping requires the device id, which in turn might be
> > +     * different for different directory entries, so if inode remapping is
> > +     * enabled we have to make a full stat for each directory entry
> > +     */
> > +    const bool dostat = pdu->s->ctx.export_flags & V9FS_REMAP_INODES;
> > 
> > -        err = v9fs_co_readdir(pdu, fidp, &dent);
> > -        if (err || !dent) {
> > -            break;
> > -        }
> > -        v9fs_string_init(&name);
> > -        v9fs_string_sprintf(&name, "%s", dent->d_name);
> > -        if ((count + v9fs_readdir_data_size(&name)) > max_count) {
> > -            v9fs_readdir_unlock(&fidp->fs.dir);
> > +    /*
> > +     * Fetch all required directory entries altogether on a background IO
> > +     * thread from fs driver. We don't want to do that for each entry
> > +     * individually, because hopping between threads (this main IO thread
> > +     * and background IO driver thread) would sum up to huge latencies.
> > +     */
> > +    count = v9fs_co_readdir_lowlat(pdu, fidp, &entries, max_count, dostat);
> > +    if (count < 0) {
> > +        err = count;
> > +        count = 0;
> > +        goto out;
> > +    }
> > +    count = 0;
> > 
> > -            /* Ran out of buffer. Set dir back to old position and return */
> > -            v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
> > -            v9fs_string_free(&name);
> > -            return count;
> > -        }
> > +    for (struct V9fsDirEnt *e = entries; e; e = e->next) {
> > +        dent = e->dent;
> > 
> >          if (pdu->s->ctx.export_flags & V9FS_REMAP_INODES) {
> > -            /*
> > -             * dirent_to_qid() implies expensive stat call for each entry,
> > -             * we must do that here though since inode remapping requires
> > -             * the device id, which in turn might be different for
> > -             * different entries; we cannot make any assumption to avoid
> > -             * that here.
> > -             */
> > -            err = dirent_to_qid(pdu, fidp, dent, &qid);
> > +            st = e->st;
> > +            /* e->st should never be NULL, but just to be sure */
> > +            if (!st) {
> > +                err = -1;
> > +                break;
> > +            }
> > +
> > +            /* remap inode */
> > +            err = stat_to_qid(pdu, st, &qid);
> >              if (err < 0) {
> > -                v9fs_readdir_unlock(&fidp->fs.dir);
> > -                v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
> > -                v9fs_string_free(&name);
> > -                return err;
> > +                break;
> >              }
> >          } else {
> >              /*
> >               * Fill up just the path field of qid because the client uses
> >               * only that. To fill the entire qid structure we will have
> >               * to stat each dirent found, which is expensive. For the
> > -             * latter reason we don't call dirent_to_qid() here. Only drawback
> > +             * latter reason we don't call stat_to_qid() here. Only drawback
> >               * is that no multi-device export detection of stat_to_qid()
> >               * would be done and provided as error to the user here. But
> >               * user would get that error anyway when accessing those
> > @@ -2391,25 +2378,26 @@ static int coroutine_fn v9fs_do_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
> >              qid.version = 0;
> >          }
> > 
> > +        v9fs_string_init(&name);
> > +        v9fs_string_sprintf(&name, "%s", dent->d_name);
> > +
> >          /* 11 = 7 + 4 (7 = start offset, 4 = space for storing count) */
> >          len = pdu_marshal(pdu, 11 + count, "Qqbs",
> >                            &qid, dent->d_off,
> >                            dent->d_type, &name);
> > 
> > -        v9fs_readdir_unlock(&fidp->fs.dir);
> > +        v9fs_string_free(&name);
> > 
> >          if (len < 0) {
> > -            v9fs_co_seekdir(pdu, fidp, saved_dir_pos);
> > -            v9fs_string_free(&name);
> > -            return len;
> > +            err = len;
> > +            break;
> >          }
> > +
> >          count += len;
> > -        v9fs_string_free(&name);
> > -        saved_dir_pos = dent->d_off;
> >      }
> > 
> > -    v9fs_readdir_unlock(&fidp->fs.dir);
> > -
> > +out:
> > +    v9fs_free_dirents(entries);
> >      if (err < 0) {
> >          return err;
> >      }
> > diff --git a/hw/9pfs/9p.h b/hw/9pfs/9p.h
> > index 6fffe44f5a..1dbb0ad189 100644
> > --- a/hw/9pfs/9p.h
> > +++ b/hw/9pfs/9p.h
> > @@ -215,6 +215,28 @@ static inline void v9fs_readdir_init(V9fsDir *dir)
> >      qemu_mutex_init(&dir->readdir_mutex);
> >  }
> > 
> > +/*
> > + * Type for 9p fs drivers' (a.k.a. 9p backends) result of readdir requests,
> > + * which is a chained list of directory entries.
> > + */
> > +typedef struct V9fsDirEnt {
> > +    /* mandatory (must not be NULL) information for all readdir requests */
> > +    struct dirent *dent;
> > +    /*
> > +     * optional (may be NULL): A full stat of each directory entry is just
> > +     * done if explicitly told to fs driver.
> > +     */
> > +    struct stat *st;
> > +    /*
> > +     * instead of an array, directory entries are always returned as
> > +     * chained list, that's because the amount of entries retrieved by fs
> > +     * drivers is dependent on the individual entries' name (since response
> > +     * messages are size limited), so the final amount cannot be estimated
> > +     * before hand
> > +     */
> > +    struct V9fsDirEnt *next;
> > +} V9fsDirEnt;
> > +
> >  /*
> >   * Filled by fs driver on open and other
> >   * calls.
> > @@ -419,6 +441,7 @@ void v9fs_path_init(V9fsPath *path);
> >  void v9fs_path_free(V9fsPath *path);
> >  void v9fs_path_sprintf(V9fsPath *path, const char *fmt, ...);
> >  void v9fs_path_copy(V9fsPath *dst, const V9fsPath *src);
> > +size_t v9fs_readdir_response_size(V9fsString *name);
> >  int v9fs_name_to_path(V9fsState *s, V9fsPath *dirpath,
> >                        const char *name, V9fsPath *path);
> >  int v9fs_device_realize_common(V9fsState *s, const V9fsTransport *t,
> > diff --git a/hw/9pfs/codir.c b/hw/9pfs/codir.c
> > index 73f9a751e1..6ce7dc8cde 100644
> > --- a/hw/9pfs/codir.c
> > +++ b/hw/9pfs/codir.c
> > @@ -18,28 +18,189 @@
> >  #include "qemu/main-loop.h"
> >  #include "coth.h"
> > 
> > +/*
> > + * This is solely executed on a background IO thread.
> > + */
> > +static int do_readdir(V9fsPDU *pdu, V9fsFidState *fidp, struct dirent **dent)
> > +{
> > +    int err = 0;
> > +    V9fsState *s = pdu->s;
> > +    struct dirent *entry;
> > +
> > +    errno = 0;
> > +    entry = s->ops->readdir(&s->ctx, &fidp->fs);
> > +    if (!entry && errno) {
> > +        *dent = NULL;
> > +        err = -errno;
> > +    } else {
> > +        *dent = entry;
> > +    }
> > +    return err;
> > +}
> > +
> > +/*
> > + * TODO: This will be removed for performance reasons.
> > + * Use v9fs_co_readdir_lowlat() instead.

Oh, so we'd still have the current implementation being used, even
with this patch applied... This would be okay for a RFC patch but
in the end I'd really like to merge something that also converts
v9fs_do_readdir_with_stat().

> > + */
> >  int coroutine_fn v9fs_co_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
> >                                   struct dirent **dent)
> >  {
> >      int err;
> > -    V9fsState *s = pdu->s;
> > 
> >      if (v9fs_request_cancelled(pdu)) {
> >          return -EINTR;
> >      }
> > -    v9fs_co_run_in_worker(
> > -        {
> > -            struct dirent *entry;
> > +    v9fs_co_run_in_worker({
> > +        err = do_readdir(pdu, fidp, dent);
> > +    });
> > +    return err;
> > +}
> > +
> > +/*
> > + * This is solely executed on a background IO thread.
> > + *
> > + * See v9fs_co_readdir_lowlat() (as its only user) below for details.

Since the only user is quite small, maybe just open-code this.

> > + */
> > +static int do_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
> > +                             struct V9fsDirEnt **entries,
> > +                             int32_t maxsize, bool dostat)
> > +{
> > +    V9fsState *s = pdu->s;
> > +    V9fsString name;
> > +    int len, err = 0;
> > +    int32_t size = 0;
> > +    off_t saved_dir_pos;
> > +    struct dirent *dent;
> > +    struct V9fsDirEnt *e = NULL;
> > +    V9fsPath path;
> > +    struct stat stbuf;
> > +
> > +    *entries = NULL;
> > +    v9fs_path_init(&path);
> > +
> > +    /*
> > +     * TODO: Here should be a warn_report_once() if lock failed.
> > +     *
> > +     * With a good 9p client we should not get into concurrency here,
> > +     * because a good client would not use the same fid for concurrent
> > +     * requests. We do the lock here for safety reasons though. However
> > +     * the client would then suffer performance issues, so better log that
> > +     * issue here.
> > +     */

This should probably be done with qemu_mutex_trylock() in a loop directly
in v9fs_readdir_lock().

> > +    v9fs_readdir_lock(&fidp->fs.dir);
> > +
> > +    /* save the directory position */
> > +    saved_dir_pos = s->ops->telldir(&s->ctx, &fidp->fs);
> > +    if (saved_dir_pos < 0) {
> > +        err = saved_dir_pos;
> > +        goto out;
> > +    }
> > +
> > +    while (true) {
> > +        /* get directory entry from fs driver */
> > +        err = do_readdir(pdu, fidp, &dent);
> > +        if (err || !dent) {
> > +            break;
> > +        }
> > +
> > +        /*
> > +         * stop this loop as soon as it would exceed the allowed maximum
> > +         * response message size for the directory entries collected so far,
> > +         * because anything beyond that size would need to be discarded by
> > +         * 9p controller (main thread / top half) anyway
> > +         */
> > +        v9fs_string_init(&name);
> > +        v9fs_string_sprintf(&name, "%s", dent->d_name);
> > +        len = v9fs_readdir_response_size(&name);
> > +        v9fs_string_free(&name);
> > +        if (size + len > maxsize) {
> > +            /* this is not an error case actually */
> > +            break;
> > +        }
> > +
> > +        /* append next node to result chain */
> > +        if (!e) {
> > +            *entries = e = g_malloc0(sizeof(V9fsDirEnt));
> > +        } else {
> > +            e = e->next = g_malloc0(sizeof(V9fsDirEnt));
> > +        }
> > +        e->dent = g_malloc0(sizeof(struct dirent));
> > +        memcpy(e->dent, dent, sizeof(struct dirent));
> > 
> > -            errno = 0;
> > -            entry = s->ops->readdir(&s->ctx, &fidp->fs);
> > -            if (!entry && errno) {
> > +        /* perform a full stat() for directory entry if requested by caller */
> > +        if (dostat) {
> > +            err = s->ops->name_to_path(
> > +                &s->ctx, &fidp->path, dent->d_name, &path
> > +            );
> > +            if (err < 0) {
> >                  err = -errno;
> > -            } else {
> > -                *dent = entry;
> > -                err = 0;
> > +                break;
> >              }
> > -        });
> > +
> > +            err = s->ops->lstat(&s->ctx, &path, &stbuf);
> > +            if (err < 0) {
> > +                err = -errno;
> > +                break;
> > +            }
> > +
> > +            e->st = g_malloc0(sizeof(struct stat));
> > +            memcpy(e->st, &stbuf, sizeof(struct stat));
> > +        }
> > +
> > +        size += len;
> > +        saved_dir_pos = dent->d_off;
> > +    }
> > +
> > +    /* restore (last) saved position */
> > +    s->ops->seekdir(&s->ctx, &fidp->fs, saved_dir_pos);
> > +
> > +out:
> > +    v9fs_readdir_unlock(&fidp->fs.dir);
> > +    v9fs_path_free(&path);
> > +    if (err < 0) {
> > +        return err;
> > +    }
> > +    return size;
> > +}
> > +
> > +/**
> > + * @brief Low latency variant of fs driver readdir handling.
> > + *

Not sure it brings much to mention latency here... telling what this
function actually does, i.e. "read multiple directory entries in a row",
would be more informative.

> > + * Retrieves the requested (max. amount of) directory entries from the fs
> > + * driver. This function must only be called by the main IO thread (top half).
> > + * Internally this function call will be dispatched to a background IO thread
> > + * (bottom half) where it is eventually executed by the fs driver.
> > + *
> > + * The old readdir implementation above just retrieves always one dir entry
> > + * per call. The problem of that implementation above is that latency is
> > + * added for (retrieval of) each directory entry, which in practice lead to
> > + * latencies of several hundred ms for readdir of only one directory.
> > + *
> > + * This is avoided in this function by letting the fs driver retrieve all
> > + * required directory entries with only call of this function and hence with
> > + * only a single fs driver request.

True but these kind of considerations rather belong to the changelog...
otherwise, we'll have to not forget to kill these lines when the "old
readdir implementation" is no more.

> > + *
> > + * @param pdu - the causing 9p (T_readdir) client request
> > + * @param fidp - already opened directory where readdir shall be performed on
> > + * @param entries - output for directory entries (must not be NULL)
> > + * @param maxsize - maximum result message body size (in bytes)
> > + * @param dostat - whether a stat() should be performed and returned for
> > + *                 each directory entry
> > + * @returns resulting response message body size (in bytes) on success,
> > + *          negative error code otherwise
> > + */
> > +int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,

Not really a fan of the _lowlat (low fat ?) wording.

v9fs_co_readdir_many() or anything else that suggests we're going to
read several entries in a row.


> > +                                        struct V9fsDirEnt **entries,
> > +                                        int32_t maxsize, bool dostat)

s/maxsize/max_count since this is the wording used in the caller.

> > +{
> > +    int err = 0;
> > +
> > +    if (v9fs_request_cancelled(pdu)) {
> > +        return -EINTR;
> > +    }
> > +    v9fs_co_run_in_worker({
> > +        err = do_readdir_lowlat(pdu, fidp, entries, maxsize, dostat);
> > +    });
> >      return err;
> >  }
> > 
> > diff --git a/hw/9pfs/coth.h b/hw/9pfs/coth.h
> > index c2cdc7a9ea..1249dbe6df 100644
> > --- a/hw/9pfs/coth.h
> > +++ b/hw/9pfs/coth.h
> > @@ -49,6 +49,9 @@
> >  void co_run_in_worker_bh(void *);
> >  int coroutine_fn v9fs_co_readlink(V9fsPDU *, V9fsPath *, V9fsString *);
> >  int coroutine_fn v9fs_co_readdir(V9fsPDU *, V9fsFidState *, struct dirent **);
> > +int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *, V9fsFidState *,
> > +                                        struct V9fsDirEnt **,
> > +                                        int32_t, bool);
> >  off_t coroutine_fn v9fs_co_telldir(V9fsPDU *, V9fsFidState *);
> >  void coroutine_fn v9fs_co_seekdir(V9fsPDU *, V9fsFidState *, off_t);
> >  void coroutine_fn v9fs_co_rewinddir(V9fsPDU *, V9fsFidState *);
> 
> 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 10/11] 9pfs: T_readdir latency optimization
  2020-03-09 15:42     ` Greg Kurz
@ 2020-03-10 15:10       ` Christian Schoenebeck
  2020-03-10 18:33         ` Greg Kurz
  0 siblings, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-03-10 15:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Montag, 9. März 2020 16:42:59 CET Greg Kurz wrote:
> The main issue I see with this patch is that it is too big. At
> least from my POV, with the little time I have for 9p, that's
> the first thing that pushes me back... So if you could split
> this patch down to some reviewable chunks, this may help
> going forward.

This patch is also too big for my preference, but I don't see a viable way to 
split it further into separate patches. I already separated all the patches I 
could. If you have suggestions, very much appreciated!

The reason for this is that in order to fix these issues with the current
T_readdir implementation, it requires separating what's currently one task
(i.e. one function) into two separate tasks (i.e. two functions). There is
no sensible way to split that separation itself into smaller steps.

But don't be too scared by this patch; if you look just at the diff
directly then yes, the diff is not very human readable. However if you apply
the patch and look at the resulting code, you'll notice that there are
actually only 2 functions (v9fs_do_readdir() in 9p.c and do_readdir_lowlat()
in codir.c) which require careful reviewing, and that their resulting code is
actually very straightforward, and not long either.

The current code on master is much more tricky and error prone due to the
large number of potential branches, individual error/cleanup handlers,
frequent thread dispatching and much more. In the patched version all these
code complexities and error sources are removed.
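
To illustrate that structure with something compilable outside of QEMU, here
is a small self-contained toy model (purely illustrative; the names, the
fixed 24-byte per-entry overhead and the helpers are assumptions, not the
actual patch code): one batched collection pass bounded by a byte budget
builds a chained list, and a second pass only consumes that list.

/*
 * Toy model only -- not QEMU code. Names, sizes and helpers below are
 * illustrative assumptions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct toy_dirent {
    char name[64];
    struct toy_dirent *next;
};

/* "bottom half": collect as many entries as fit into the byte budget */
static int collect_entries(const char *const *names, size_t n,
                           int budget, struct toy_dirent **out)
{
    int used = 0;
    struct toy_dirent **tail = out;

    for (size_t i = 0; i < n; i++) {
        int len = 24 + (int)strlen(names[i]);   /* assumed per-entry size */
        if (used + len > budget) {
            break;                              /* budget reached, no error */
        }
        struct toy_dirent *e = calloc(1, sizeof(*e));
        snprintf(e->name, sizeof(e->name), "%s", names[i]);
        *tail = e;
        tail = &e->next;
        used += len;
    }
    return used;
}

int main(void)
{
    const char *names[] = { "a", "bb", "a_rather_long_name" };
    struct toy_dirent *list = NULL;

    /* one batched "driver" call instead of one call per entry */
    int used = collect_entries(names, 3, 4096, &list);
    printf("collected %d bytes worth of entries\n", used);

    /* "top half": pure marshalling loop, no further driver calls */
    for (struct toy_dirent *e = list, *next; e; e = next) {
        next = e->next;
        printf("marshal %s\n", e->name);
        free(e);
    }
    return 0;
}

The point of the split is that the expensive part (fetching the entries) can
then be dispatched to a worker thread exactly once per T_readdir request.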

> > > diff --git a/hw/9pfs/codir.c b/hw/9pfs/codir.c
> > > index 73f9a751e1..6ce7dc8cde 100644
> > > --- a/hw/9pfs/codir.c
> > > +++ b/hw/9pfs/codir.c
> > > @@ -18,28 +18,189 @@
> > > 
> > >  #include "qemu/main-loop.h"
> > >  #include "coth.h"
> > > 
> > > +/*
> > > + * This is solely executed on a background IO thread.
> > > + */
> > > +static int do_readdir(V9fsPDU *pdu, V9fsFidState *fidp, struct dirent
> > > **dent) +{
> > > +    int err = 0;
> > > +    V9fsState *s = pdu->s;
> > > +    struct dirent *entry;
> > > +
> > > +    errno = 0;
> > > +    entry = s->ops->readdir(&s->ctx, &fidp->fs);
> > > +    if (!entry && errno) {
> > > +        *dent = NULL;
> > > +        err = -errno;
> > > +    } else {
> > > +        *dent = entry;
> > > +    }
> > > +    return err;
> > > +}
> > > +
> > > +/*
> > > + * TODO: This will be removed for performance reasons.
> > > + * Use v9fs_co_readdir_lowlat() instead.
> 
> Oh, so we'd still have the current implementation being used, even
> with this patch applied... This would be okay for a RFC patch but
> in the end I'd really like to merge something that also converts
> v9fs_do_readdir_with_stat().

Yes, I know, but I would not recommend mixing these things at this point, 
because it would be an entire effort on its own.

v9fs_do_readdir_with_stat() is used for 9P2000.u, while v9fs_do_readdir() is
used for 9P2000.L. They behave very differently, so it would not only require
me to update v9fs_do_readdir_with_stat() and v9fs_read(), I would also need
to write their own test cases (plural, since there are none at all yet) and
benchmarks, and of course somebody would need to review all that additional
code, and who would actually test it? I am actively testing my 9P2000.L
changes, but I am not actually using 9P2000.u, so I could not guarantee the
same quality.

Considering the current commitment situation regarding 9pfs, we can only bring 
things forward with compromises. Hence I suggest I concentrate on fixing the 
worst performance issues on 9P2000.L side first, and later on if there are 
really people interested in seeing these improvements on 9P2000.u side and 
willing to at least help out with testing, then I can definitely also adjust 
9P2000.u side accordingly as well.

But to also make this clear: this patch 10 is not introducing any behaviour 
change for 9P2000.u side.

> > >  int coroutine_fn v9fs_co_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
> > >  
> > >                                   struct dirent **dent)
> > >  
> > >  {
> > >  
> > >      int err;
> > > 
> > > -    V9fsState *s = pdu->s;
> > > 
> > >      if (v9fs_request_cancelled(pdu)) {
> > >      
> > >          return -EINTR;
> > >      
> > >      }
> > > 
> > > -    v9fs_co_run_in_worker(
> > > -        {
> > > -            struct dirent *entry;
> > > +    v9fs_co_run_in_worker({
> > > +        err = do_readdir(pdu, fidp, dent);
> > > +    });
> > > +    return err;
> > > +}
> > > +
> > > +/*
> > > + * This is solely executed on a background IO thread.
> > > + *
> > > + * See v9fs_co_readdir_lowlat() (as its only user) below for details.
> 
> Since the only user is quite small, maybe just open-code this.

Mmm... maybe later on, as a cleanup patch or something. The current version
is tested. I would like to avoid making things more complicated than they
already are (i.e. accidentally introducing some bug just due to this).

> 
> > > + */
> > > +static int do_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
> > > +                             struct V9fsDirEnt **entries,
> > > +                             int32_t maxsize, bool dostat)
> > > +{
> > > +    V9fsState *s = pdu->s;
> > > +    V9fsString name;
> > > +    int len, err = 0;
> > > +    int32_t size = 0;
> > > +    off_t saved_dir_pos;
> > > +    struct dirent *dent;
> > > +    struct V9fsDirEnt *e = NULL;
> > > +    V9fsPath path;
> > > +    struct stat stbuf;
> > > +
> > > +    *entries = NULL;
> > > +    v9fs_path_init(&path);
> > > +
> > > +    /*
> > > +     * TODO: Here should be a warn_report_once() if lock failed.
> > > +     *
> > > +     * With a good 9p client we should not get into concurrency here,
> > > +     * because a good client would not use the same fid for concurrent
> > > +     * requests. We do the lock here for safety reasons though. However
> > > +     * the client would then suffer performance issues, so better log
> > > that
> > > +     * issue here.
> > > +     */
> 
> This should probably be done with qemu_mutex_trylock() in a loop directly
> in v9fs_readdir_lock().

No, definitely not in the loop. That's intentionally *one* lock *outside* of
the loop. The normal case does not require a lock at all, like the comment
describes. Taking the lock repeatedly inside the loop would needlessly reduce
performance.

About *_trylock() instead of v9fs_readdir_lock(): that might be a candidate
to address the TODO comment, but I would like to postpone it for the same
reasons: encountering concurrency here is an odd/ill case. So for all
well-implemented clients this is not an issue at all, and security
(concerning malicious clients) is already provided by this lock. Addressing
this TODO right now would potentially introduce bugs due to the added
complexity, so I would really postpone that.

> > > +/**
> > > + * @brief Low latency variant of fs driver readdir handling.
> > > + *
> 
> Not sure it brings much to mention latency here... telling what this
> function does actually ie. "read multiple directory entries in a row"
> would be more informative.

NP

> > > + * Retrieves the requested (max. amount of) directory entries from the
> > > fs
> > > + * driver. This function must only be called by the main IO thread (top
> > > half). + * Internally this function call will be dispatched to a
> > > background
> > > IO thread + * (bottom half) where it is eventually executed by the fs
> > > driver. + *
> > > + * The old readdir implementation above just retrieves always one dir
> > > entry + * per call. The problem of that implementation above is that
> > > latency is + * added for (retrieval of) each directory entry, which in
> > > practice lead to + * latencies of several hundred ms for readdir of
> > > only one directory. + * + * This is avoided in this function by letting
> > > the fs driver retrieve all + * required directory entries with only
> > > call of this function and hence with + * only a single fs driver
> > > request.
> 
> True but these kind of considerations rather belong to the changelog...
> otherwise, we'll have to not forget to kill these lines when the "old
> readdir implementation" is no more.

Mmm... I do think this overall latency issue should be clarified somewhere
in a comment. Where exactly is not that important to me. For instance it
could also be described at the v9fs_co_run_in_worker() macro definition in
coth.h instead, as a general rule of thumb for dispatching stuff.
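
Something along these lines, for instance (just a draft wording of such a
comment, i.e. an assumption, not text that exists in coth.h today):

/*
 * Draft note for coth.h: every v9fs_co_run_in_worker() invocation suspends
 * the coroutine, hops to a worker thread and back again, so each call adds
 * two thread switches worth of latency. Batch as much work as possible into
 * a single dispatch instead of dispatching once per small item (e.g. once
 * per directory entry).
 */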

The thing is: for >10 years several developers obviously did not realize the 
severe negative performance impact of frivolously dispatching tasks to 
different threads too often. It looks like linear code, right? But no, it is 
not.

> > > + *
> > > + * @param pdu - the causing 9p (T_readdir) client request
> > > + * @param fidp - already opened directory where readdir shall be
> > > performed
> > > on + * @param entries - output for directory entries (must not be NULL)
> > > + *
> > > @param maxsize - maximum result message body size (in bytes)
> > > + * @param dostat - whether a stat() should be performed and returned
> > > for
> > > + *                 each directory entry
> > > + * @returns resulting response message body size (in bytes) on success,
> > > + *          negative error code otherwise
> > > + */
> > > +int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *pdu, V9fsFidState
> > > *fidp,
> 
> Not really a fan of the _lowlat (low fat ?) wording.
> 
> v9fs_co_readdir_many() or anything else that suggests we're going to
> read several entries in a row.

NP

> 
> > > +                                        struct V9fsDirEnt **entries,
> > > +                                        int32_t maxsize, bool dostat)
> 
> s/maxsize/max_count since this is the wording use in the caller.

I wouldn't do that. This is the driver side, not the 9p protocol request
handler. As you might have realized before, "max_count" is semantically
completely wrong. This variable does not count whole entries; it is rather
the maximum size (in bytes) of the destination buffer being filled.

On the 9p request handler side it is ok to use "max_count" instead, because
the protocol definition uses exactly this term for the request parameter, but
on the driver side it's IMO inappropriate.
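
As a rough illustration of why this is a byte budget rather than an entry
count (standalone example with made-up names; the 24 bytes + name length per
entry follows v9fs_readdir_response_size() quoted earlier in this thread):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* assumed per-entry wire size: qid + offset + type + string header + name */
static size_t entry_wire_size(const char *name)
{
    return 24 + strlen(name);
}

int main(void)
{
    const char *names[] = { "short", "a_much_longer_file_name.txt" };
    int32_t budget = 4096;   /* byte budget handed down by the caller */

    for (size_t i = 0; i < 2; i++) {
        size_t per_entry = entry_wire_size(names[i]);
        printf("'%s': %zu bytes per entry, ~%d entries fit into %d bytes\n",
               names[i], per_entry, (int)(budget / per_entry), (int)budget);
    }
    return 0;
}

So the same budget admits a different number of entries depending on the
name lengths, which is exactly why counting entries would be wrong here.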

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 10/11] 9pfs: T_readdir latency optimization
  2020-03-10 15:10       ` Christian Schoenebeck
@ 2020-03-10 18:33         ` Greg Kurz
  2020-03-11  1:18           ` Christian Schoenebeck
  0 siblings, 1 reply; 39+ messages in thread
From: Greg Kurz @ 2020-03-10 18:33 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Tue, 10 Mar 2020 16:10:41 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Montag, 9. März 2020 16:42:59 CET Greg Kurz wrote:
> > The main issue I see with this patch is that it is too big. At
> > least from my POV, with the little time I have for 9p, that's
> > the first thing that pushes me back... So if you could split
> > this patch down to some reviewable chunks, this may help
> > going forward.
> 
> This patch is also too big for my preference, but I don't see a viable way to 
> split it further into separate patches. I already separated all the patches I 
> could. If you have suggestions, very much appreciated!
> 

Well, the patch could be split in two or three parts at least:

(1) introduce the new function that reads multiple entries in codir.c

(2) use it from 9p.c

(3) remove unused stuff if anything remains

This doesn't seem to change much but the smaller diffstats
for each individual patch make them less scary :) and with
only (1) applied it is easier to compare what the old code
in 9p.c and the new one in codir.c do.

> The reason for this is that in order to fix these issues with current 
> T_readdir implementation, it requires to separate what's currently one task 
> (i.e. one function) into two separate tasks (i.e. two functions). There is no 
> sensible way to do that.
> 

Yeah, I won't ask for finer grain.

> But don't be too scared about this patch; if you look just at the diff 
> directly then yes, the diff is not very human readable. However if you apply 
> the patch and look at the resulting code, you'll notice that there are 
> actually only 2 functions (v9fs_do_readdir() in 9p.c and do_readdir_lowlat() 
> in codir.c) which require careful reviewing and that their resulting code is 
> actually very, very straight forward, and not long either.
> 

These are personal opinions. Careful reviewing can take a lot of time :)

> Current code on master is much more tricky and error prone due to the huge 
> amount of potential branches, individual error/cleanup handlers, high amount 
> of thread dispatching and much more. In the patched version all these code 
> complexities and error sources are removed.
> 

Come on :) The 9pfs code has been a can of worms from the beginning.
It produced more than the average amount of security-related bugs,
and most sadly, due to the overall lack of interest, it bitrotted
and missed a lot of cool improvements like appropriate support for
unlinked files, QOM-ification of fsdev, conversion of proxy fsdev to
vhost-user, hot plug/unplug support, live migration support and
"much more"... The performance aspect of things is a full time job
I never had the opportunity to even consider. So yes, your changes
are likely beneficial but the code base is still extremely fragile
and sensitive to huge changes... not breaking things that happen
to work, even in a sub-optimal way, is essentially what I care about
these days.

> > > > diff --git a/hw/9pfs/codir.c b/hw/9pfs/codir.c
> > > > index 73f9a751e1..6ce7dc8cde 100644
> > > > --- a/hw/9pfs/codir.c
> > > > +++ b/hw/9pfs/codir.c
> > > > @@ -18,28 +18,189 @@
> > > > 
> > > >  #include "qemu/main-loop.h"
> > > >  #include "coth.h"
> > > > 
> > > > +/*
> > > > + * This is solely executed on a background IO thread.
> > > > + */
> > > > +static int do_readdir(V9fsPDU *pdu, V9fsFidState *fidp, struct dirent
> > > > **dent) +{
> > > > +    int err = 0;
> > > > +    V9fsState *s = pdu->s;
> > > > +    struct dirent *entry;
> > > > +
> > > > +    errno = 0;
> > > > +    entry = s->ops->readdir(&s->ctx, &fidp->fs);
> > > > +    if (!entry && errno) {
> > > > +        *dent = NULL;
> > > > +        err = -errno;
> > > > +    } else {
> > > > +        *dent = entry;
> > > > +    }
> > > > +    return err;
> > > > +}
> > > > +
> > > > +/*
> > > > + * TODO: This will be removed for performance reasons.
> > > > + * Use v9fs_co_readdir_lowlat() instead.
> > 
> > Oh, so we'd still have the current implementation being used, even
> > with this patch applied... This would be okay for a RFC patch but
> > in the end I'd really like to merge something that also converts
> > v9fs_do_readdir_with_stat().
> 
> Yes, I know, but I would not recommend mixing these things at this point, 
> because it would be an entire effort on its own.
> 
> v9fs_do_readdir_with_stat() is used for 9P2000.u, while v9fs_do_readdir() is 
> used for 9P2000.L. They're behaving very differently, so it would not only 
> require me to update v9fs_do_readdir_with_stat() and v9fs_read(), I would also 
> need to write their own test cases (plural, since there are none at all yet) 
> and benchmarks, and of course somebody what need to review all that additional 
> amount of code, and who would actually test it? I am actively testing my 
> 9P2000.L changes, but I am not actually using 9P2000.u, so I could not 
> guarantee for the same quality.
> 
> Considering the current commitment situation regarding 9pfs, we can only bring 
> things forward with compromises. Hence I suggest I concentrate on fixing the 
> worst performance issues on 9P2000.L side first, and later on if there are 
> really people interested in seeing these improvements on 9P2000.u side and 
> willing to at least help out with testing, then I can definitely also adjust 
> 9P2000.u side accordingly as well.
> 
> But to also make this clear: this patch 10 is not introducing any behaviour 
> change for 9P2000.u side.
> 

Ok, fair enough to leave 9p2000.U behind for now but I had to ask :)
/me is even wondering if we should start deprecating 9p2000.U since
most clients can use 9p2000.L now...

> > > >  int coroutine_fn v9fs_co_readdir(V9fsPDU *pdu, V9fsFidState *fidp,
> > > >  
> > > >                                   struct dirent **dent)
> > > >  
> > > >  {
> > > >  
> > > >      int err;
> > > > 
> > > > -    V9fsState *s = pdu->s;
> > > > 
> > > >      if (v9fs_request_cancelled(pdu)) {
> > > >      
> > > >          return -EINTR;
> > > >      
> > > >      }
> > > > 
> > > > -    v9fs_co_run_in_worker(
> > > > -        {
> > > > -            struct dirent *entry;
> > > > +    v9fs_co_run_in_worker({
> > > > +        err = do_readdir(pdu, fidp, dent);
> > > > +    });
> > > > +    return err;
> > > > +}
> > > > +
> > > > +/*
> > > > + * This is solely executed on a background IO thread.
> > > > + *
> > > > + * See v9fs_co_readdir_lowlat() (as its only user) below for details.
> > 
> > Since the only user is quite small, maybe just open-code this.
> 
> Mmm... maybe later on, as a cleanup patch or something. Current version is 
> tested. I would like to avoid to make things more complicated than they 
> already are (i.e. accidental introduction of some bug just due to this).
> 

It seems we might agree on not breaking things that work ;-)

> > 
> > > > + */
> > > > +static int do_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
> > > > +                             struct V9fsDirEnt **entries,
> > > > +                             int32_t maxsize, bool dostat)
> > > > +{
> > > > +    V9fsState *s = pdu->s;
> > > > +    V9fsString name;
> > > > +    int len, err = 0;
> > > > +    int32_t size = 0;
> > > > +    off_t saved_dir_pos;
> > > > +    struct dirent *dent;
> > > > +    struct V9fsDirEnt *e = NULL;
> > > > +    V9fsPath path;
> > > > +    struct stat stbuf;
> > > > +
> > > > +    *entries = NULL;
> > > > +    v9fs_path_init(&path);
> > > > +
> > > > +    /*
> > > > +     * TODO: Here should be a warn_report_once() if lock failed.
> > > > +     *
> > > > +     * With a good 9p client we should not get into concurrency here,
> > > > +     * because a good client would not use the same fid for concurrent
> > > > +     * requests. We do the lock here for safety reasons though. However
> > > > +     * the client would then suffer performance issues, so better log
> > > > that
> > > > +     * issue here.
> > > > +     */
> > 
> > This should probably be done with qemu_mutex_trylock() in a loop directly
> > in v9fs_readdir_lock().
> 
> No definitely not in the loop. That's intentionally *one* lock *outside* of 

Not sure we're talking about the same loop here...

I'm just suggesting that v9fs_readdir_lock() could be something like:

static inline void v9fs_readdir_lock(V9fsDir *dir)
{
    while (qemu_mutex_trylock(&dir->readdir_mutex)) {
        warn_report_once("blah");
    }
}

> the loop. The normal case is not requiring a lock at all, like the comment 
> describes. Doing multiple locks inside the loop unnecessaririly reduces 
> performance for no reason.
> 
> About *_trylock() instead of v9fs_readdir_lock(): might be a candidate to 

Hmm... are you suggesting that do_readdir_lowlat() should return if it
can't take the lock ?

> address the TODO comment, but I would like to pospone that for the same 
> reasons: it is an odd/ill case encountering concurrency here. So for all well 
> implemented clients this is not an issue at all, and security (concerning 
> malicious clients) is provided already by this lock. Addressing this TODO 
> already now would potentially unnecessarily introduce bugs due to added 
> complexity, so really I would postpone that.
> 

:)

> > > > +/**
> > > > + * @brief Low latency variant of fs driver readdir handling.
> > > > + *
> > 
> > Not sure it brings much to mention latency here... telling what this
> > function does actually ie. "read multiple directory entries in a row"
> > would be more informative.
> 
> NP
> 
> > > > + * Retrieves the requested (max. amount of) directory entries from the
> > > > fs
> > > > + * driver. This function must only be called by the main IO thread (top
> > > > half). + * Internally this function call will be dispatched to a
> > > > background
> > > > IO thread + * (bottom half) where it is eventually executed by the fs
> > > > driver. + *
> > > > + * The old readdir implementation above just retrieves always one dir
> > > > entry + * per call. The problem of that implementation above is that
> > > > latency is + * added for (retrieval of) each directory entry, which in
> > > > practice lead to + * latencies of several hundred ms for readdir of
> > > > only one directory. + * + * This is avoided in this function by letting
> > > > the fs driver retrieve all + * required directory entries with only
> > > > call of this function and hence with + * only a single fs driver
> > > > request.
> > 
> > True but these kind of considerations rather belong to the changelog...
> > otherwise, we'll have to not forget to kill these lines when the "old
> > readdir implementation" is no more.
> 
> Mmm... I do think this overall latency issue should be clarified somewhere as 

The issue itself is best described in a changelog, but...

> a comment. Where exactly is not that important to me. For instance it could 
> also be described on v9fs_co_run_in_worker() macro definition in coth.h 
> instead, as general rule of thumb when dispatching stuff.

... if you have useful recommendations that don't involve referring to a
piece of code that might go away, a comment in coth.h is certainly a good
idea.

> The thing is: for >10 years several developers obviously did not realize the 
> severe negative performance impact of frivolously dispatching tasks to 

I won't dare to comment on some people I don't know not doing the obvious
right things at the time... what I do know is that for >10 years, nobody
cared for this code. Period. You're lucky it is still functional ;-)

> different threads too often. It looks like linear code, right? But no, it is 
> not.
> 
> > > > + *
> > > > + * @param pdu - the causing 9p (T_readdir) client request
> > > > + * @param fidp - already opened directory where readdir shall be
> > > > performed
> > > > on + * @param entries - output for directory entries (must not be NULL)
> > > > + *
> > > > @param maxsize - maximum result message body size (in bytes)
> > > > + * @param dostat - whether a stat() should be performed and returned
> > > > for
> > > > + *                 each directory entry
> > > > + * @returns resulting response message body size (in bytes) on success,
> > > > + *          negative error code otherwise
> > > > + */
> > > > +int coroutine_fn v9fs_co_readdir_lowlat(V9fsPDU *pdu, V9fsFidState
> > > > *fidp,
> > 
> > Not really a fan of the _lowlat (low fat ?) wording.
> > 
> > v9fs_co_readdir_many() or anything else that suggests we're going to
> > read several entries in a row.
> 
> NP
> 
> > 
> > > > +                                        struct V9fsDirEnt **entries,
> > > > +                                        int32_t maxsize, bool dostat)
> > 
> > s/maxsize/max_count since this is the wording use in the caller.
> 
> Wouldn't do that. This is the driver side, not the 9p protocol request 
> handler. As you might have relized before, "max_count" is semantically 
> completely wrong. This variable does not count integral entries, it is rather 
> a maximum size (in bytes) of the destination buffer being filled.
> 
> On 9p request handler side it is ok to use "max_count" instead, because the 
> protocol definition uses exactly this term for the request parameter, but on 
> driver side it's IMO inappropriate.
> 

I agree the max_count wording sucks and I don't care that much to
name variables according to protocol definitions, but I do care
to use the same name in caller/callee when it makes sense, for a
better grep or cscope experience.

> Best regards,
> Christian Schoenebeck
> 
> 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 10/11] 9pfs: T_readdir latency optimization
  2020-03-10 18:33         ` Greg Kurz
@ 2020-03-11  1:18           ` Christian Schoenebeck
       [not found]             ` <20200311171408.3b3a2dfa@bahia.home>
  0 siblings, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-03-11  1:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Dienstag, 10. März 2020 19:33:36 CET Greg Kurz wrote:
> > This patch is also too big for my preference, but I don't see a viable way
> > to split it further into separate patches. I already separated all the
> > patches I could. If you have suggestions, very much appreciated!
> 
> Well, the patch could be split in two or three parts at least:
> 
> (1) introduce the new function that reads multiple entries in codir.c
> 
> (2) use it from 9p.c
> 
> (3) remove unused stuff if anything remains
> 
> This doesn't seem to change much but the smaller diffstats
> for each individual patch make them less scary :) and with
> (1) applied only it is easier to compare what the old code
> in 9p.c and the new one in codir.c do.
> 
> > The reason for this is that in order to fix these issues with current
> > T_readdir implementation, it requires to separate what's currently one
> > task
> > (i.e. one function) into two separate tasks (i.e. two functions). There is
> > no sensible way to do that.
> 
> Yeah, I won't ask for finer grain.

Me confused. Does that mean your split suggestion was just theoretical, or do 
you need it?

> > But don't be too scared about this patch; if you look just at the diff
> > directly then yes, the diff is not very human readable. However if you
> > apply the patch and look at the resulting code, you'll notice that there
> > are actually only 2 functions (v9fs_do_readdir() in 9p.c and
> > do_readdir_lowlat() in codir.c) which require careful reviewing and that
> > their resulting code is actually very, very straight forward, and not
> > long either.
> 
> These are personal opinions. Careful reviewing can take a lot of time :)

Well, that depends on what precisely you mean by opinion. :) That this patch
actually reduces code complexity is a fact (fewer branches, fewer locks,
fewer dispatches, fewer side effects, fewer error/cleanup handlers). These
are objectively measurable quantities.

But yes, nevertheless reviewing it costs time.

> > Current code on master is much more tricky and error prone due to the huge
> > amount of potential branches, individual error/cleanup handlers, high
> > amount of thread dispatching and much more. In the patched version all
> > these code complexities and error sources are removed.
> 
> Come on :) The 9pfs code has been a can of worms from the beginning.
> It produced more than the average amount of security-related bugs,
> and most sadly, due to the overall lack of interest, it bitrotted
> and missed a lot of cool improvements like an appropriate support of
> unlinked files, QOM-ification of fsdev, conversion of proxy fsdev to
> vhost-user, hot plug/unplug support, live migration support and
> "much more"... The performance aspect of things is a full time job

No, the performance issues are actually very manageable in the case of 9pfs.
I already addressed readdir with this patch (by far the worst performance
issue), and then there would just be one more severe performance issue:
walkdir.

My intention is not to squeeze out the last fractional percent of performance
for 9pfs, but you certainly agree that a simple "ls" blocking for more than 1
second is something that should be fixed, and fortunately the amount of
changes involved is far smaller than I originally feared it would be.

> I never had the opportunity to even consider. So yes, your changes
> are likely beneficial but the code base is still extremely fragile
> and sensible to huge changes... not breaking things that happen
> to work, even in a sub-optimal way, is essentially what I care for
> these days.

And I think I'm also very careful not to break anything. I carefully
consider what to touch and what not to. I wrote test cases and I am actively
testing my changes with real installations and snapshots as well.

I think the cause of the disagreements we have is solely our use cases of
9pfs: your personal use case does not seem to require any performance
considerations or multi-user aspects, whereas my use case requires at least
some minimum performance grade for utilizing 9pfs for server applications.

> > > Oh, so we'd still have the current implementation being used, even
> > > with this patch applied... This would be okay for a RFC patch but
> > > in the end I'd really like to merge something that also converts
> > > v9fs_do_readdir_with_stat().
> > 
> > Yes, I know, but I would not recommend mixing these things at this point,
> > because it would be an entire effort on its own.
> > 
> > v9fs_do_readdir_with_stat() is used for 9P2000.u, while v9fs_do_readdir()
> > is used for 9P2000.L. They're behaving very differently, so it would not
> > only require me to update v9fs_do_readdir_with_stat() and v9fs_read(), I
> > would also need to write their own test cases (plural, since there are
> > none at all yet) and benchmarks, and of course somebody what need to
> > review all that additional amount of code, and who would actually test
> > it? I am actively testing my 9P2000.L changes, but I am not actually
> > using 9P2000.u, so I could not guarantee for the same quality.
> > 
> > Considering the current commitment situation regarding 9pfs, we can only
> > bring things forward with compromises. Hence I suggest I concentrate on
> > fixing the worst performance issues on 9P2000.L side first, and later on
> > if there are really people interested in seeing these improvements on
> > 9P2000.u side and willing to at least help out with testing, then I can
> > definitely also adjust 9P2000.u side accordingly as well.
> > 
> > But to also make this clear: this patch 10 is not introducing any
> > behaviour
> > change for 9P2000.u side.
> 
> Ok, fair enough to leave 9p2000.U behind for now but I had to ask :)
> /me is even wondering if we should start deprecating 9p2000.U since
> most clients can use 9p2000.L now...

I probably wouldn't. On macOS for instance there's only 9p2000.u. In the end
9p2000.L is Linux specific. Deprecating 9p2000.u would therefore mean a shift
towards only supporting Linux guests.

> > Mmm... maybe later on, as a cleanup patch or something. Current version is
> > tested. I would like to avoid to make things more complicated than they
> > already are (i.e. accidental introduction of some bug just due to this).
> 
> It seems we might agree on not breaking things that work ;-)

Of course! All the things I work on in 9pfs are just fixes. But that's
probably where we start to disagree. :)

> > > > > +static int do_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
> > > > > +                             struct V9fsDirEnt **entries,
> > > > > +                             int32_t maxsize, bool dostat)
> > > > > +{
> > > > > +    V9fsState *s = pdu->s;
> > > > > +    V9fsString name;
> > > > > +    int len, err = 0;
> > > > > +    int32_t size = 0;
> > > > > +    off_t saved_dir_pos;
> > > > > +    struct dirent *dent;
> > > > > +    struct V9fsDirEnt *e = NULL;
> > > > > +    V9fsPath path;
> > > > > +    struct stat stbuf;
> > > > > +
> > > > > +    *entries = NULL;
> > > > > +    v9fs_path_init(&path);
> > > > > +
> > > > > +    /*
> > > > > +     * TODO: Here should be a warn_report_once() if lock failed.
> > > > > +     *
> > > > > +     * With a good 9p client we should not get into concurrency
> > > > > here,
> > > > > +     * because a good client would not use the same fid for
> > > > > concurrent
> > > > > +     * requests. We do the lock here for safety reasons though.
> > > > > However
> > > > > +     * the client would then suffer performance issues, so better
> > > > > log
> > > > > that
> > > > > +     * issue here.
> > > > > +     */
> > > 
> > > This should probably be done with qemu_mutex_trylock() in a loop
> > > directly
> > > in v9fs_readdir_lock().
> > 
> > No definitely not in the loop. That's intentionally *one* lock *outside*
> > of
> 
> Not sure we're talking about the same loop here...
> 
> I'm just suggesting that v9fs_readdir_lock() could be something like:
> 
> static inline void v9fs_readdir_lock(V9fsDir *dir)
> {
>     while (qemu_mutex_trylock(&dir->readdir_mutex)) {
>         warn_report_once("blah");
>     }
> }

Ah yeah, in fact I got you wrong on this one. I thought you meant to move the 
lock within do_readdir_lowlat(). About your actual suggestion above ...

> > the loop. The normal case is not requiring a lock at all, like the comment
> > describes. Doing multiple locks inside the loop unnecessaririly reduces
> > performance for no reason.
> > 
> > About *_trylock() instead of v9fs_readdir_lock(): might be a candidate to
> 
> Hmm... are you suggesting that do_readdir_lowlat() should return if it
> can't take the lock ?

... yep, that's what I had in mind for it later on. I would not mind running
trylock in a loop like you suggested; if it fails, log a warning and return.
Because if the lock fails, the client is buggy. The user can read the warning
in the logs and report the bug so the client gets fixed. It is not worth
having the server handle that case in any fancy way.
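
A rough sketch of that variant (an assumption of how it could look, not code
that has been merged; it builds on the existing QEMU helpers
qemu_mutex_trylock() and warn_report_once() mentioned above):

/* Sketch only: bail out instead of blocking on a misbehaving client. */
static inline int v9fs_readdir_trylock(V9fsDir *dir)
{
    if (qemu_mutex_trylock(&dir->readdir_mutex)) {
        warn_report_once("9p: client used the same fid for concurrent "
                         "T_readdir requests");
        return -EBUSY;
    }
    return 0;
}

The readdir worker would then simply propagate that error to the client
instead of waiting on the lock.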

> > > > > + * Retrieves the requested (max. amount of) directory entries from
> > > > > the
> > > > > fs
> > > > > + * driver. This function must only be called by the main IO thread
> > > > > (top
> > > > > half). + * Internally this function call will be dispatched to a
> > > > > background
> > > > > IO thread + * (bottom half) where it is eventually executed by the
> > > > > fs
> > > > > driver. + *
> > > > > + * The old readdir implementation above just retrieves always one
> > > > > dir
> > > > > entry + * per call. The problem of that implementation above is that
> > > > > latency is + * added for (retrieval of) each directory entry, which
> > > > > in
> > > > > practice lead to + * latencies of several hundred ms for readdir of
> > > > > only one directory. + * + * This is avoided in this function by
> > > > > letting
> > > > > the fs driver retrieve all + * required directory entries with only
> > > > > call of this function and hence with + * only a single fs driver
> > > > > request.
> > > 
> > > True but these kind of considerations rather belong to the changelog...
> > > otherwise, we'll have to not forget to kill these lines when the "old
> > > readdir implementation" is no more.
> > 
> > Mmm... I do think this overall latency issue should be clarified somewhere
> > as
> The issue itself is best described in a changelog, but...
> 
> > a comment. Where exactly is not that important to me. For instance it
> > could
> > also be described on v9fs_co_run_in_worker() macro definition in coth.h
> > instead, as general rule of thumb when dispatching stuff.
> 
> ... if you have useful recommendations that don't involve referring to a
> piece of code that might go away, a comment in coth.h is certainly a good
> idea.

Ok, I'll move it to coth.h then.

> > The thing is: for >10 years several developers obviously did not realize
> > the severe negative performance impact of frivolously dispatching tasks
> > to
> I won't dare to comment on some people I don't know not doing the obvious
> right things at the time... what I do know is that for >10 years, nobody
> cared for this code. Period. You're lucky it is still functional ;-)

This was not about judging individuals, just to point out that the latency
impact of dispatching does not seem to be obvious to people, and to try to
avoid similar mistakes by placing appropriate comment(s).

> > > > > +                                        struct V9fsDirEnt
> > > > > **entries,
> > > > > +                                        int32_t maxsize, bool
> > > > > dostat)
> > > 
> > > s/maxsize/max_count since this is the wording use in the caller.
> > 
> > Wouldn't do that. This is the driver side, not the 9p protocol request
> > handler. As you might have relized before, "max_count" is semantically
> > completely wrong. This variable does not count integral entries, it is
> > rather a maximum size (in bytes) of the destination buffer being filled.
> > 
> > On 9p request handler side it is ok to use "max_count" instead, because
> > the
> > protocol definition uses exactly this term for the request parameter, but
> > on driver side it's IMO inappropriate.
> 
> I agree the max_count wording sucks and I don't care that much to
> name variables according to protocol definitions, but I do care
> to use the same name in caller/callee when it makes sense, for a
> better grep or cscope experience.

So you insist on that max_count change?

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 10/11] 9pfs: T_readdir latency optimization
       [not found]             ` <20200311171408.3b3a2dfa@bahia.home>
@ 2020-03-11 19:54               ` Christian Schoenebeck
  2020-03-17 14:14                 ` Greg Kurz
  0 siblings, 1 reply; 39+ messages in thread
From: Christian Schoenebeck @ 2020-03-11 19:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Mittwoch, 11. März 2020 17:14:08 CET Greg Kurz wrote:
> On Wed, 11 Mar 2020 02:18:04 +0100
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > On Dienstag, 10. März 2020 19:33:36 CET Greg Kurz wrote:
> > > > This patch is also too big for my preference, but I don't see a viable
> > > > way
> > > > to split it further into separate patches. I already separated all the
> > > > patches I could. If you have suggestions, very much appreciated!
> > > 
> > > Well, the patch could be split in two or three parts at least:
> > > 
> > > (1) introduce the new function that reads multiple entries in codir.c
> > > 
> > > (2) use it from 9p.c
> > > 
> > > (3) remove unused stuff if anything remains
> > > 
> > > This doesn't seem to change much but the smaller diffstats
> > > for each individual patch make them less scary :) and with
> > > (1) applied only it is easier to compare what the old code
> > > in 9p.c and the new one in codir.c do.
> > > 
> > > > The reason for this is that in order to fix these issues with current
> > > > T_readdir implementation, it requires to separate what's currently one
> > > > task
> > > > (i.e. one function) into two separate tasks (i.e. two functions).
> > > > There is
> > > > no sensible way to do that.
> > > 
> > > Yeah, I won't ask for finer grain.
> > 
> > Me confused. Does that mean your split suggestion was just theoretical, or
> > do you need it?
> 
> I need it and I won't ask for more splitting. Promised ! :)

Okay then. :)

> > > > Current code on master is much more tricky and error prone due to the
> > > > huge
> > > > amount of potential branches, individual error/cleanup handlers, high
> > > > amount of thread dispatching and much more. In the patched version all
> > > > these code complexities and error sources are removed.
> > > 
> > > Come on :) The 9pfs code has been a can of worms from the beginning.
> > > It produced more than the average amount of security-related bugs,
> > > and most sadly, due to the overall lack of interest, it bitrotted
> > > and missed a lot of cool improvements like an appropriate support of
> > > unlinked files, QOM-ification of fsdev, conversion of proxy fsdev to
> > > vhost-user, hot plug/unplug support, live migration support and
> > > "much more"... The performance aspect of things is a full time job
> > 
> > No, the performance issues are actually very managable in case of 9pfs.
> > I already addressed readdir with this patch (by far the worst performance
> 
> They're very manageable if someone cares and spends time. Thanks again
> for doing this.

Thanks!

> > My intention is not to squeeze out the last fractional percent of
> > performance for 9pfs, but you certainly agree that a simple "ls" blocking
> > for more than 1 second is something that should be fixed, and fortunately
> > the amount of
> I never observed such timeouts with my personal toy use of 9p but
> you did and this motivated you to step in, which is very welcome.

You probably don't notice it much because of the dirent cache on the guest
side. If the guest's dirent cache is cold and you do a readdir() ("ls") on
some directory with e.g. several hundred entries, you should notice it.
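
For instance, a small guest-side reproducer along these lines (an
illustrative sketch; the directory path is an assumption) makes the effect
easy to measure once the guest's caches are cold:

#include <dirent.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";  /* a dir on the 9p mount */
    struct timespec t0, t1;
    int n = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    DIR *d = opendir(path);
    if (!d) {
        perror("opendir");
        return 1;
    }
    while (readdir(d)) {
        n++;
    }
    closedir(d);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("%d entries in %.1f ms\n", n, ms);
    return 0;
}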

> > I think the cause of disagreements we have are solely our use cases of
> > 9pfs: your personal use case does not seem to require any performance
> > considerations or multi-user aspects, whereas my use case requires at
> > least some minimum performance grade for utilizing 9pfs for server
> > applications.
> 
> Your point about the personal use case is right indeed but our
> disagreements, if any, aren't uniquely related to that. It's more about
> maintainership and available time really. I'm 100% benevolent "Odd fixer"
> now and I just try to avoid being forced into fixing things after my PR is
> merged. If you want to go faster, then you're welcome to upgrade to
> maintainer and send PRs. This would make sense since you care for 9p, you
> showed a good understanding of the code and you provided beneficial
> contributions so far :)

That maintainership upgrade is planned. The question is just when. What was 
your idea, co-maintainership?

> > > > > Oh, so we'd still have the current implementation being used, even
> > > > > with this patch applied... This would be okay for a RFC patch but
> > > > > in the end I'd really like to merge something that also converts
> > > > > v9fs_do_readdir_with_stat().
> > > > 
> > > > Yes, I know, but I would not recommend mixing these things at this
> > > > point,
> > > > because it would be an entire effort on its own.
> > > > 
> > > > v9fs_do_readdir_with_stat() is used for 9P2000.u, while
> > > > v9fs_do_readdir()
> > > > is used for 9P2000.L. They're behaving very differently, so it would
> > > > not
> > > > only require me to update v9fs_do_readdir_with_stat() and v9fs_read(),
> > > > I
> > > > would also need to write their own test cases (plural, since there are
> > > > none at all yet) and benchmarks, and of course somebody would need to
> > > > review all that additional amount of code, and who would actually test
> > > > it? I am actively testing my 9P2000.L changes, but I am not actually
> > > > using 9P2000.u, so I could not guarantee for the same quality.
> > > > 
> > > > Considering the current commitment situation regarding 9pfs, we can
> > > > only
> > > > bring things forward with compromises. Hence I suggest I concentrate
> > > > on
> > > > fixing the worst performance issues on 9P2000.L side first, and later
> > > > on
> > > > if there are really people interested in seeing these improvements on
> > > > 9P2000.u side and willing to at least help out with testing, then I
> > > > can
> > > > definitely also adjust 9P2000.u side accordingly as well.
> > > > 
> > > > But to also make this clear: this patch 10 is not introducing any
> > > > behaviour
> > > > change for 9P2000.u side.
> > > 
> > > Ok, fair enough to leave 9p2000.U behind for now but I had to ask :)
> > > /me is even wondering if we should start deprecating 9p2000.U since
> > > most clients can use 9p2000.L now...
> > 
> > I probably wouldn't. On macOS for instance there's only 9p2000.u. In the
> > end
> Hmm... the only thing I've heard about is:
> 
> https://github.com/benavento/mac9p
> 
> and unless I'm missing something, this seems to only have a socket based
> transport... the server in QEMU is for virtio-9p and Xen 9p devices only.
> Unless mac9p seriously plans to implement a driver for either of those,
> I'm still not convinced it is worth keeping .U around... and BTW, our
> deprecation policy imposes a 2 QEMU versions cooling period before
> actually removing the code. This would leave ample time for someone
> to speak up.

Well, I leave that up to you. I could consider adding a transport for macOS 
guests, but that's definitely not going to happen in the near future.

> > > Not sure we're talking about the same loop here...
> > > 
> > > I'm just suggesting that v9fs_readdir_lock() could be something like:
> > > 
> > > static inline void v9fs_readdir_lock(V9fsDir *dir)
> > > {
> > >     while (qemu_mutex_trylock(&dir->readdir_mutex)) {
> > >         warn_report_once("blah");
> > >     }
> > > }
> > 
> > Ah yeah, in fact I got you wrong on this one. I thought you meant to move
> > the lock within do_readdir_lowlat(). About your actual suggestion above
> > ...> 
> > > > the loop. The normal case is not requiring a lock at all, like the
> > > > comment
> > > > describes. Doing multiple locks inside the loop unnecessarily
> > > > reduces
> > > > performance for no reason.
> > > > 
> > > > About *_trylock() instead of v9fs_readdir_lock(): might be a candidate
> > > > to
> > > 
> > > Hmm... are you suggesting that do_readdir_lowlat() should return if it
> > > can't take the lock ?
> > 
> > ... yep, that's what I had in mind for it later on. I would not mind
> > running trylock in a loop like you suggested; if it fails, log a warning
> > and return. Because if the lock fails, then client is buggy. User can
> > read the warning in the logs and report the bug for client to be fixed.
> > Not worth to handle that case in any fancy way by server.
> 
> The locking is just a safety net because readdir(3) isn't necessarily
> thread safe. It was never my intention to add an error path for that.
> And thinking again, sending concurrent Treaddir requests on the same
> fid is certainly a weird thing to do but is it really a bug ?

Well, I was unresolved on that issue as well, hence my decision to only leave 
a TODO comment for now. My expectation would be a separate fid for concurrent 
readdir, but you are right of course: Treaddir (unlike its 9p2000.u 
counterpart) is reentrant by design, since it has an offset parameter, so from 
a protocol perspective concurrent Treaddir is not per se impossible.
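
Just for reference, this is roughly how the protocol lays out the two messages 
(which is what makes resuming a directory listing at an arbitrary offset 
possible in the first place):

    size[4] Treaddir tag[2] fid[4] offset[8] count[4]
    size[4] Rreaddir tag[2] count[4] data[count]
        where data is a sequence of: qid[13] offset[8] type[1] name[s]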

The lock would not be noticeable either with this patch, since unlike on 
master, the lock is now only held on the driver side, which returns very 
quickly even on large directories, as you can see from the benchmark output.

So yes, returning from the function with an error is probably not the best 
choice. My tendency would still be to at least log a message. I don't care 
enough about it right now, though, to deal with trylock() in a loop or 
something similar. There are more important things to take care of ATM.
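
Just to make explicit what I mean by "log a warning and return", a rough 
sketch of that option (hypothetical, not what this patch does; it assumes 
qemu/error-report.h is available where the helper lives, i.e. next to the 
existing v9fs_readdir_lock()):

/* Sketch only: reject a concurrent Treaddir on the same fid instead of
 * blocking on readdir_mutex. qemu_mutex_trylock() returns 0 on success. */
static inline bool v9fs_readdir_trylock(V9fsDir *dir)
{
    if (qemu_mutex_trylock(&dir->readdir_mutex)) {
        warn_report_once("9pfs: concurrent Treaddir on the same fid");
        return false;   /* caller would then fail or delay the request */
    }
    return true;
}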

> > avoid similar mistakes by placing appropriate comment(s).
> > 
> > > > > > > +                                        struct V9fsDirEnt **entries,
> > > > > > > +                                        int32_t maxsize, bool dostat)
> > > > > 
> > > > > s/maxsize/max_count since this is the wording used in the caller.
> > > > 
> > > > Wouldn't do that. This is the driver side, not the 9p protocol request
> > > > handler. As you might have realized before, "max_count" is semantically
> > > > completely wrong. This variable does not count integral entries, it is
> > > > rather a maximum size (in bytes) of the destination buffer being
> > > > filled.
> > > > 
> > > > On 9p request handler side it is ok to use "max_count" instead,
> > > > because
> > > > the
> > > > protocol definition uses exactly this term for the request parameter,
> > > > but
> > > > on driver side it's IMO inappropriate.
> > > 
> > > I agree the max_count wording sucks and I don't care that much to
> > > name variables according to protocol definitions, but I do care
> > > to use the same name in caller/callee when it makes sense, for a
> > > better grep or cscope experience.
> > 
> > So you insist on that max_count change?
> 
> Rather the opposite. Kill max_count and make it maxsize, in a preparatory
> patch with the reasons you've exposed.

Ah got it. OK then.
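
So the driver-side helper would end up looking roughly like this (the leading 
parameters are approximated, only 'entries', 'maxsize' and 'dostat' are taken 
from the hunk quoted above):

/* Rough shape only, not the actual patch hunk: 'maxsize' is a byte limit
 * for the destination buffer, not a number of directory entries. */
int do_readdir_lowlat(V9fsPDU *pdu, V9fsFidState *fidp,
                      struct V9fsDirEnt **entries,
                      int32_t maxsize, bool dostat);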

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 10/11] 9pfs: T_readdir latency optimization
  2020-03-11 19:54               ` Christian Schoenebeck
@ 2020-03-17 14:14                 ` Greg Kurz
  2020-03-17 16:09                   ` Christian Schoenebeck
  0 siblings, 1 reply; 39+ messages in thread
From: Greg Kurz @ 2020-03-17 14:14 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Wed, 11 Mar 2020 20:54:58 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Mittwoch, 11. März 2020 17:14:08 CET Greg Kurz wrote:
> > On Wed, 11 Mar 2020 02:18:04 +0100
> > 
> > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > On Dienstag, 10. März 2020 19:33:36 CET Greg Kurz wrote:
> > > > > This patch is also too big for my preference, but I don't see a viable
> > > > > way
> > > > > to split it further into separate patches. I already separated all the
> > > > > patches I could. If you have suggestions, very much appreciated!
> > > > 
> > > > Well, the patch could be split in two or three parts at least:
> > > > 
> > > > (1) introduce the new function that reads multiple entries in codir.c
> > > > 
> > > > (2) use it from 9p.c
> > > > 
> > > > (3) remove unused stuff if anything remains
> > > > 
> > > > This doesn't seem to change much but the smaller diffstats
> > > > for each individual patch make them less scary :) and with
> > > > (1) applied only it is easier to compare what the old code
> > > > in 9p.c and the new one in codir.c do.
> > > > 
> > > > > The reason for this is that in order to fix these issues with current
> > > > > T_readdir implementation, it requires to separate what's currently one
> > > > > task
> > > > > (i.e. one function) into two separate tasks (i.e. two functions).
> > > > > There is
> > > > > no sensible way to do that.
> > > > 
> > > > Yeah, I won't ask for finer grain.
> > > 
> > > Me confused. Does that mean your split suggestion was just theoretical, or
> > > do you need it?
> > 
> > I need it and I won't ask for more splitting. Promised ! :)
> 
> Okay then. :)
> 
> > > > > Current code on master is much more tricky and error prone due to the
> > > > > huge
> > > > > amount of potential branches, individual error/cleanup handlers, high
> > > > > amount of thread dispatching and much more. In the patched version all
> > > > > these code complexities and error sources are removed.
> > > > 
> > > > Come on :) The 9pfs code has been a can of worms from the beginning.
> > > > It produced more than the average amount of security-related bugs,
> > > > and most sadly, due to the overall lack of interest, it bitrotted
> > > > and missed a lot of cool improvements like an appropriate support of
> > > > unlinked files, QOM-ification of fsdev, conversion of proxy fsdev to
> > > > vhost-user, hot plug/unplug support, live migration support and
> > > > "much more"... The performance aspect of things is a full time job
> > > 
> > > No, the performance issues are actually very manageable in case of 9pfs.
> > > I already addressed readdir with this patch (by far the worst performance
> > 
> > They're very manageable if someone cares and spends time. Thanks again
> > for doing this.
> 
> Thanks!
> 
> > > My intention is not to squeeze out the last fractional percent of
> > > performance for 9pfs, but you certainly agree that a simple "ls" blocking
> > > for more than 1 second is something that should be fixed, and fortunately
> > > the amount of
> > I never observed such timeouts with my personal toy use of 9p but
> > you did and this motivated you to step in, which is very welcome.
> 
> Probably you don't notice it much because of the dirent cache on the guest side. 
> If the guest's dirent cache is cold and you do a readdir() ("ls") on some 
> directory with e.g. several hundred entries, you should notice it.
> 
> > > I think the cause of disagreements we have are solely our use cases of
> > > 9pfs: your personal use case does not seem to require any performance
> > > considerations or multi-user aspects, whereas my use case requires at
> > > least some minimum performance grade for utilizing 9pfs for server
> > > applications.
> > 
> > Your point about the personal use case is right indeed but our
> > disagreements, if any, aren't uniquely related to that. It's more about
> > maintainership and available time really. I'm 100% benevolent "Odd fixer"
> > now and I just try to avoid being forced into fixing things after my PR is
> > merged. If you want to go faster, then you're welcome to upgrade to
> > maintainer and send PRs. This would make sense since you care for 9p, you
> > showed a good understanding of the code and you provided beneficial
> > contributions so far :)
> 
> That maintainership upgrade is planned. The question is just when. What was 
> your idea, co-maintainership?
> 

Basically yes.

> > > > > > Oh, so we'd still have the current implementation being used, even
> > > > > > with this patch applied... This would be okay for a RFC patch but
> > > > > > in the end I'd really like to merge something that also converts
> > > > > > v9fs_do_readdir_with_stat().
> > > > > 
> > > > > Yes, I know, but I would not recommend mixing these things at this
> > > > > point,
> > > > > because it would be an entire effort on its own.
> > > > > 
> > > > > v9fs_do_readdir_with_stat() is used for 9P2000.u, while
> > > > > v9fs_do_readdir()
> > > > > is used for 9P2000.L. They're behaving very differently, so it would
> > > > > not
> > > > > only require me to update v9fs_do_readdir_with_stat() and v9fs_read(),
> > > > > I
> > > > > would also need to write their own test cases (plural, since there are
> > > > > none at all yet) and benchmarks, and of course somebody would need to
> > > > > review all that additional amount of code, and who would actually test
> > > > > it? I am actively testing my 9P2000.L changes, but I am not actually
> > > > > using 9P2000.u, so I could not guarantee for the same quality.
> > > > > 
> > > > > Considering the current commitment situation regarding 9pfs, we can
> > > > > only
> > > > > bring things forward with compromises. Hence I suggest I concentrate
> > > > > on
> > > > > fixing the worst performance issues on 9P2000.L side first, and later
> > > > > on
> > > > > if there are really people interested in seeing these improvements on
> > > > > 9P2000.u side and willing to at least help out with testing, then I
> > > > > can
> > > > > definitely also adjust 9P2000.u side accordingly as well.
> > > > > 
> > > > > But to also make this clear: this patch 10 is not introducing any
> > > > > behaviour
> > > > > change for 9P2000.u side.
> > > > 
> > > > Ok, fair enough to leave 9p2000.U behind for now but I had to ask :)
> > > > /me is even wondering if we should start deprecating 9p2000.U since
> > > > most clients can use 9p2000.L now...
> > > 
> > > I probably wouldn't. On macOS for instance there's only 9p2000.u. In the
> > > end
> > Hmm... the only thing I've heard about is:
> > 
> > https://github.com/benavento/mac9p
> > 
> > and unless I'm missing something, this seems to only have a socket based
> > transport... the server in QEMU is for virtio-9p and Xen 9p devices only.
> > Unless mac9p seriously plans to implement a driver for either of those,
> > I'm still not convinced it is worth keeping .U around... and BTW, our
> > deprecation policy imposes a 2 QEMU versions cooling period before
> > actually removing the code. This would leave ample time for someone
> > to speak up.
> 
> Well, I leave that up to you. I could consider adding a transport for macOS 
> guests, but that's definitely not going to happen in the near future.
> 

The whole idea behind dropping support for .U is really just about
reducing the potential attack surface. But I'm also fine to keep
things as is until the next CVE since I'm confident someone will
help ;-)

> > > > Not sure we're talking about the same loop here...
> > > > 
> > > > I'm just suggesting that v9fs_readdir_lock() could be something like:
> > > > 
> > > > static inline void v9fs_readdir_lock(V9fsDir *dir)
> > > > {
> > > >     while (qemu_mutex_trylock(&dir->readdir_mutex)) {
> > > >         warn_report_once("blah");
> > > >     }
> > > > }
> > > 
> > > Ah yeah, in fact I got you wrong on this one. I thought you meant to move
> > > the lock within do_readdir_lowlat(). About your actual suggestion above
> > > ...> 
> > > > > the loop. The normal case is not requiring a lock at all, like the
> > > > > comment
> > > > > describes. Doing multiple locks inside the loop unnecessarily
> > > > > reduces
> > > > > performance for no reason.
> > > > > 
> > > > > About *_trylock() instead of v9fs_readdir_lock(): might be a candidate
> > > > > to
> > > > 
> > > > Hmm... are you suggesting that do_readdir_lowlat() should return if it
> > > > can't take the lock ?
> > > 
> > > ... yep, that's what I had in mind for it later on. I would not mind
> > > running trylock in a loop like you suggested; if it fails, log a warning
> > > and return. Because if the lock fails, then client is buggy. User can
> > > read the warning in the logs and report the bug for client to be fixed.
> > > Not worth to handle that case in any fancy way by server.
> > 
> > The locking is just a safety net because readdir(3) isn't necessarily
> > thread safe. It was never my intention to add an error path for that.
> > And thinking again, sending concurrent Treaddir requests on the same
> > fid is certainly a weird thing to do but is it really a bug ?
> 
> Well, I was unresolved on that issue as well, hence my decision to only leave 
> a TODO comment for now. My expectation would be a separate fid for concurrent 
> readdir, but you are right of course: Treaddir (unlike its 9p2000.u 
> counterpart) is reentrant by design, since it has an offset parameter, so from 
> a protocol perspective concurrent Treaddir is not per se impossible.
> 
> The lock would not be noticeable either with this patch, since unlike on 
> master, the lock is now only held on the driver side, which returns very 
> quickly even on large directories, as you can see from the benchmark output.
> 
> So yes, returning from the function with an error is probably not the best 
> choice. My tendency would still be to at least log a message. I don't care 
> enough about it right now, though, to deal with trylock() in a loop or 
> something similar. There are more important things to take care of ATM.
> 

Yeah, that can wait.

> > > avoid similar mistakes by placing appropriate comment(s).
> > > 
> > > > > > > > +                                        struct V9fsDirEnt **entries,
> > > > > > > > +                                        int32_t maxsize, bool dostat)
> > > > > > 
> > > > > > s/maxsize/max_count since this is the wording used in the caller.
> > > > > 
> > > > > Wouldn't do that. This is the driver side, not the 9p protocol request
> > > > > handler. As you might have realized before, "max_count" is semantically
> > > > > completely wrong. This variable does not count integral entries, it is
> > > > > rather a maximum size (in bytes) of the destination buffer being
> > > > > filled.
> > > > > 
> > > > > On 9p request handler side it is ok to use "max_count" instead,
> > > > > because
> > > > > the
> > > > > protocol definition uses exactly this term for the request parameter,
> > > > > but
> > > > > on driver side it's IMO inappropriate.
> > > > 
> > > > I agree the max_count wording sucks and I don't care that much to
> > > > name variables according to protocol definitions, but I do care
> > > > to use the same name in caller/callee when it makes sense, for a
> > > > better grep or cscope experience.
> > > 
> > > So you insist on that max_count change?
> > 
> > Rather the opposite. Kill max_count and make it maxsize, in a preparatory
> > patch with the reasons you've exposed.
> 
> Ah got it. OK then.
> 
> Best regards,
> Christian Schoenebeck
> 
> 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 10/11] 9pfs: T_readdir latency optimization
  2020-03-17 14:14                 ` Greg Kurz
@ 2020-03-17 16:09                   ` Christian Schoenebeck
  0 siblings, 0 replies; 39+ messages in thread
From: Christian Schoenebeck @ 2020-03-17 16:09 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

On Dienstag, 17. März 2020 15:14:23 CET Greg Kurz wrote:
> > > > I think the cause of disagreements we have are solely our use cases of
> > > > 9pfs: your personal use case does not seem to require any performance
> > > > considerations or multi-user aspects, whereas my use case requires at
> > > > least some minimum performance grade for utilizing 9pfs for server
> > > > applications.
> > > 
> > > Your point about the personal use case is right indeed but our
> > > disagreements, if any, aren't uniquely related to that. It's more about
> > > maintainership and available time really. I'm 100% benevolent "Odd
> > > fixer"
> > > now and I just try to avoid being forced into fixing things after my PR
> > > is
> > > merged. If you want to go faster, then you're welcome to upgrade to
> > > maintainer and send PRs. This would make sense since you care for 9p,
> > > you
> > > showed a good understanding of the code and you provided beneficial
> > > contributions so far :)
> > 
> > That maintainership upgrade is planned. The question is just when. What
> > was
> > your idea, co-maintainership?
> 
> Basically yes.

Ok, I'll send out a MAINTAINERS patch tomorrow or so to make that
co-maintainer upgrade official. But I obviously still need a while to learn 
the bureaucracy involved for PRs, signing, ML tools, Wiki, etc.

> > > > > Ok, fair enough to leave 9p2000.U behind for now but I had to ask :)
> > > > > /me is even wondering if we should start deprecating 9p2000.U since
> > > > > most clients can use 9p2000.L now...
> > > > 
> > > > I probably wouldn't. On macOS for instance there's only 9p2000.u. In
> > > > the
> > > > end
> > > 
> > > Hmm... the only thing I've heard about is:
> > > 
> > > https://github.com/benavento/mac9p
> > > 
> > > and unless I'm missing something, this seems to only have a socket based
> > > transport... the server in QEMU is for virtio-9p and Xen 9p devices
> > > only.
> > > Unless mac9p seriously plans to implement a driver for either of those,
> > > I'm still not convinced it is worth keeping .U around... and BTW, our
> > > deprecation policy imposes a 2 QEMU versions cooling period before
> > > actually removing the code. This would leave ample time for someone
> > > to speak up.
> > 
> > Well, I leave that up to you. I could consider adding a transport for
> > macOS guests, but that's definitely not going to happen in the near future.
> 
> The whole idea behind dropping support for .U is really just about
> reducing the potential attack surface. But I'm also fine to keep
> things as is until the next CVE since I'm confident someone will
> help ;-)

Sure, sounds like a plan!

> > > > > > the loop. The normal case is not requiring a lock at all, like the
> > > > > > comment
> > > > > > describes. Doing multiple locks inside the loop unnecessarily
> > > > > > reduces
> > > > > > performance for no reason.
> > > > > > 
> > > > > > About *_trylock() instead of v9fs_readdir_lock(): might be a
> > > > > > candidate
> > > > > > to
> > > > > 
> > > > > Hmm... are you suggesting that do_readdir_lowlat() should return if
> > > > > it
> > > > > can't take the lock ?
> > > > 
> > > > ... yep, that's what I had in mind for it later on. I would not mind
> > > > running trylock in a loop like you suggested; if it fails, log a
> > > > warning
> > > > and return. Because if the lock fails, then client is buggy. User can
> > > > read the warning in the logs and report the bug for client to be
> > > > fixed.
> > > > Not worth to handle that case in any fancy way by server.
> > > 
> > > The locking is just a safety net because readdir(3) isn't necessarily
> > > thread safe. It was never my intention to add an error path for that.
> > > And thinking again, sending concurrent Treaddir requests on the same
> > > fid is certainly a weird thing to do but is it really a bug ?
> > 
> > Well, I was unresolved on that issue as well, hence my decision to only
> > leave a TODO comment for now. My expectation would be a separate fid for
> > concurrent readdir, but you are right of course: Treaddir (unlike its
> > 9p2000.u counterpart) is reentrant by design, since it has an offset
> > parameter, so from a protocol perspective concurrent Treaddir is not per se
> > impossible.
> > 
> > The lock would not be noticeable either with this patch, since unlike on
> > master, the lock is now only held on the driver side, which returns very
> > quickly even on large directories, as you can see from the benchmark output.
> > 
> > So yes, returning from the function with an error is probably not the best
> > choice. My tendency would still be to at least log a message. I don't care
> > enough about it right now, though, to deal with trylock() in a loop or
> > something similar. There are more important things to take care of ATM.
> 
> Yeah, that can wait.

Okay.

The only thing I actually still need to figure out for this patch series to be 
complete (at least from my side) is how to fix the test environment bug 
discussed earlier. Probably I can reset the test environment's buffer after 
every test somehow ...

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2020-03-17 16:28 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-21  0:36 [PATCH v4 00/11] 9pfs: readdir optimization Christian Schoenebeck
2020-01-20 22:29 ` [PATCH v4 01/11] tests/virtio-9p: add terminating null in v9fs_string_read() Christian Schoenebeck
2020-01-20 22:47 ` [PATCH v4 02/11] 9pfs: require msize >= 4096 Christian Schoenebeck
2020-01-20 23:50 ` [PATCH v4 03/11] 9pfs: validate count sent by client with T_readdir Christian Schoenebeck
2020-01-22 14:11   ` Greg Kurz
2020-01-22 14:26     ` Christian Schoenebeck
2020-01-21  0:01 ` [PATCH v4 04/11] hw/9pfs/9p-synth: added directory for readdir test Christian Schoenebeck
2020-01-21  0:12 ` [PATCH v4 05/11] tests/virtio-9p: added " Christian Schoenebeck
2020-01-22 19:56   ` Greg Kurz
2020-01-21  0:16 ` [PATCH v4 06/11] tests/virtio-9p: added splitted " Christian Schoenebeck
2020-01-22 21:14   ` Eric Blake
2020-01-22 21:29     ` Christian Schoenebeck
2020-01-23  6:59       ` Greg Kurz
2020-01-22 21:19   ` Greg Kurz
2020-01-22 22:36     ` Christian Schoenebeck
2020-01-23 10:30       ` Greg Kurz
2020-01-23 13:07         ` Christian Schoenebeck
2020-01-21  0:17 ` [PATCH v4 07/11] tests/virtio-9p: failing " Christian Schoenebeck
2020-01-22 22:59   ` Greg Kurz
2020-01-23 11:36     ` Christian Schoenebeck
2020-01-23 12:08       ` Greg Kurz
2020-01-21  0:23 ` [PATCH v4 08/11] 9pfs: readdir benchmark Christian Schoenebeck
2020-01-23 10:34   ` Greg Kurz
2020-01-23 13:20     ` Christian Schoenebeck
2020-01-21  0:26 ` [PATCH v4 09/11] hw/9pfs/9p-synth: avoid n-square issue in synth_readdir() Christian Schoenebeck
2020-01-23 11:13   ` Greg Kurz
2020-01-23 12:40     ` Christian Schoenebeck
2020-01-21  0:30 ` [PATCH v4 10/11] 9pfs: T_readdir latency optimization Christian Schoenebeck
2020-01-23 11:33   ` Greg Kurz
2020-01-23 12:57     ` Christian Schoenebeck
2020-03-09 14:09   ` Christian Schoenebeck
2020-03-09 15:42     ` Greg Kurz
2020-03-10 15:10       ` Christian Schoenebeck
2020-03-10 18:33         ` Greg Kurz
2020-03-11  1:18           ` Christian Schoenebeck
     [not found]             ` <20200311171408.3b3a2dfa@bahia.home>
2020-03-11 19:54               ` Christian Schoenebeck
2020-03-17 14:14                 ` Greg Kurz
2020-03-17 16:09                   ` Christian Schoenebeck
2020-01-21  0:32 ` [PATCH v4 11/11] hw/9pfs/9p.c: benchmark time on T_readdir request Christian Schoenebeck
