From: Roman Pen <roman.penyaev@profitbricks.com>
To: linux-block@vger.kernel.org, linux-rdma@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>,
	Christoph Hellwig <hch@infradead.org>,
	Sagi Grimberg <sagi@grimberg.me>,
	Bart Van Assche <bart.vanassche@sandisk.com>,
	Or Gerlitz <ogerlitz@mellanox.com>,
	Doug Ledford <dledford@redhat.com>,
	Swapnil Ingle <swapnil.ingle@profitbricks.com>,
	Danil Kipnis <danil.kipnis@profitbricks.com>,
	Jack Wang <jinpu.wang@profitbricks.com>,
	Roman Pen <roman.penyaev@profitbricks.com>
Subject: [PATCH v2 15/26] ibtrs: a bit of documentation
Date: Fri, 18 May 2018 15:04:02 +0200
Message-ID: <20180518130413.16997-16-roman.penyaev@profitbricks.com>
In-Reply-To: <20180518130413.16997-1-roman.penyaev@profitbricks.com>

README with description of major sysfs entries.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/infiniband/ulp/ibtrs/README | 358 ++++++++++++++++++++++++++++++++++++
 1 file changed, 358 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/README

diff --git a/drivers/infiniband/ulp/ibtrs/README b/drivers/infiniband/ulp/ibtrs/README
new file mode 100644
index 000000000000..010a93b02d9c
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/README
@@ -0,0 +1,358 @@
+****************************
+InfiniBand Transport (IBTRS)
+****************************
+
+IBTRS (InfiniBand Transport) is a reliable high-speed transport library
+which establishes an optimal number of connections between client and
+server machines using an RDMA (InfiniBand, RoCE, iWARP) transport. It is
+optimized to transfer (read/write) IO blocks.
+
+In its core interface it follows the BIO semantics of providing the
+possibility to either write data from an sg list to the remote side
+or to request ("read") data transfer from the remote side into a given
+sg list.
+
+IBTRS provides I/O fail-over and load-balancing capabilities by using
+multipath I/O (see "add_path" and "mp_policy" configuration entries).
+
+IBTRS is used by the IBNBD (InfiniBand Network Block Device) modules.
+
+======================
+Client Sysfs Interface
+======================
+
+This chapter describes only the most important files of the sysfs interface
+on the client side.
+
+Entries under /sys/devices/virtual/ibtrs-client/
+================================================
+
+When a user of the IBTRS API creates a new session, a directory entry with
+the name of that session is created.
+
+Entries under /sys/devices/virtual/ibtrs-client/<session-name>/
+===============================================================
+
+add_path (RW)
+-------------
+
+Adds a new path (connection) to an existing session. Expected format is the
+following:
+
+  <[source addr,]destination addr>
+
+  *addr ::= [ ip:<ipv4|ipv6> | gid:<gid> ]
+
+max_reconnect_attempts (RW)
+---------------------------
+
+Maximum number of reconnect attempts the client should make before giving
+up after a connection breaks unexpectedly.
+
+mp_policy (RW)
+--------------
+
+Multipath policy specifies which path should be selected on each IO:
+
+   round-robin (0):
+       select the path in a per-CPU round-robin manner.
+
+   min-inflight (1):
+       select the path with the fewest inflight IOs.
+
+Entries under /sys/devices/virtual/ibtrs-client/<session-name>/paths/
+=====================================================================
+
+
+Each path belonging to a given session is listed here by its destination
+address. When a new path is added to a session by writing to the "add_path"
+entry, a directory with the corresponding destination address is created.
+
+Entries under /sys/devices/virtual/ibtrs-client/<session-name>/paths/<dest-addr>/
+=================================================================================
+
+state (R)
+---------
+
+Contains "connected" if the session is connected to the peer and fully
+functional.  Otherwise the file contains "disconnected".
+
+reconnect (RW)
+--------------
+
+Write "1" to the file in order to reconnect the path.
+Operation is blocking and returns 0 if reconnect was successful.
+
+disconnect (RW)
+---------------
+
+Write "1" to the file in order to disconnect the path.
+Operation blocks until IBTRS path is disconnected.
+
+remove_path (RW)
+----------------
+
+Write "1" to the file in order to disconnect and remove the path
+from the session.  Operation blocks until the path is disconnected
+and removed from the session.
+
+Entries under /sys/devices/virtual/ibtrs-client/<session-name>/paths/<dest-addr>/stats/
+=======================================================================================
+
+Write "0" to any file in that directory to reset the corresponding statistics.
+
+reset_all (RW)
+--------------
+
+Reading returns usage help; writing "0" clears all the statistics.
+
+sg_entries (RW)
+---------------
+
+Data to be transferred via RDMA is passed to IBTRS as a scatter-gather
+list. A scatter-gather list can contain multiple entries.
+Scatter-gather lists with fewer entries require less processing power
+and can therefore be transferred faster. The file sg_entries outputs a
+per-CPU distribution table for the number of entries in the
+scatter-gather lists that were passed to the IBTRS API function
+ibtrs_clt_request (READ or WRITE).
+
+cpu_migration (RW)
+------------------
+
+IBTRS expects that each HCA IRQ is pinned to a separate CPU. If that is
+not the case, an I/O response could be processed on a different CPU
+than the one on which the request was originally submitted.  This file
+shows how many interrupts were generated on an unexpected CPU.
+"from:" is the CPU on which the IRQ was expected, but not generated.
+"to:" is the CPU on which the IRQ was generated, but not expected.
+
+reconnects (RW)
+---------------
+
+Contains 2 unsigned int values: the first records the number of successful
+reconnects in the path lifetime, the second records the number of failed
+reconnects in the path lifetime.
+
+rdma_lat (RW)
+-------------
+
+Latency distribution of IBTRS requests.
+The format is:
+   1 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   2 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   4 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+   8 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  16 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  ...
+  65536 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  >= 65536 ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+  maximum ms: <CNT-LAT-READ> <CNT-LAT-WRITE>
+
+wc_completion (RW)
+------------------
+
+Contains 2 unsigned int values: the first records the max number of work
+requests processed in work_completion in the session lifetime, the second
+records the average number of work requests processed in work_completion
+in the session lifetime.
+
+rdma (RW)
+---------
+
+Contains statistics regarding rdma operations and inflight operations.
+The output consists of 6 values:
+
+<read-count> <read-total-size> <write-count> <write-total-size> \
+<inflights> <failovered>
+
+======================
+Server Sysfs Interface
+======================
+
+Entries under /sys/devices/virtual/ibtrs-server/
+================================================
+
+When a user of the IBTRS API creates a new session on the client side, a
+directory entry with the name of that session is created here.
+
+Entries under /sys/devices/virtual/ibtrs-server/<session-name>/paths/
+=====================================================================
+
+When a new path is created by writing to the "add_path" entry on the client
+side, a directory entry with the source address is created on the server.
+
+Entries under /sys/devices/virtual/ibtrs-server/<session-name>/paths/<source-addr>/
+===================================================================================
+
+disconnect (RW)
+---------------
+
+When "1" is written to the file, the IBTRS session is disconnected.
+The operation is non-blocking and returns control immediately to the caller.
+
+hca_name (R)
+------------
+
+Contains the name of the HCA the connection is established on.
+
+hca_port (R)
+------------
+
+Contains the port number of the active port the traffic is going through.
+
+Entries under /sys/devices/virtual/ibtrs-server/<session-name>/paths/<source-addr>/stats/
+=========================================================================================
+
+When "0" is written to a file in this directory, the corresponding counters
+will be reset.
+
+reset_all (RW)
+--------------
+
+Reading returns usage help; writing "0" clears all the statistics
+counters.
+
+rdma (RW)
+---------
+
+Contains statistics regarding rdma operations and inflight operations.
+The output consists of 5 values:
+
+<read-count> <read-total-size> <write-count> <write-total-size> <inflights>
+
+wc_completion (RW)
+------------------
+
+Contains 3 values: the first one, an int, records the max number of work
+requests processed in work_completion in the session lifetime; the second
+one, a long int, records the total number of work requests processed in
+work_completion in the session lifetime; and the 3rd one, a long int, records
+the total number of calls to the cq completion handler. Dividing the 2nd
+number by the 3rd gives the average number of completions processed
+per call to the completion handler.
+
+==================
+Transport protocol
+==================
+
+Overview
+--------
+An established connection between a client and a server is called an IBTRS
+session. A session is associated with a set of memory chunks reserved on the
+server side for a given client for rdma transfer. A session
+consists of multiple paths, each representing a separate physical link
+between client and server. Those are used for load balancing and failover.
+Each path consists of as many connections (QPs) as there are CPUs on
+the client.
+
+When processing an incoming rdma write or read request, the IBTRS client uses
+memory chunks reserved for it on the server side. Their number, size and
+addresses need to be exchanged between client and server during the connection
+establishment phase. Apart from the memory-related information, the client
+needs to inform the server about the session name and identify each path and
+connection individually.
+
+On an established session the client sends write or read messages to the
+server. The server uses the immediate field to tell the client which request
+is being acknowledged and to report the errno. The client uses the immediate
+field to tell the server which of the memory chunks has been accessed and at
+which offset the message can be found.
+
+Connection establishment
+------------------------
+
+1. Client starts establishing connections belonging to a path of a session one
+by one by attaching IBTRS_MSG_CON_REQ messages to the rdma_connect requests.
+Those include the uuid of the session and the uuid of the path to be
+established. They are used by the server to find an existing session/path or
+to create a new one when necessary. The message also contains the protocol
+version and magic for compatibility, the total number of connections per
+session (as many as CPUs on the client), the id of the current connection and
+the reconnect counter, which is used to resolve situations where the
+client is trying to reconnect a path while the server is still destroying the
+old one.
+
+2. Server accepts the connection requests one by one and attaches
+IBTRS_MSG_CON_RSP messages to the rdma_accept. Apart from magic and
+protocol version, the messages include an error code, the queue depth
+supported by the server (number of memory chunks which are going to be
+allocated for that session) and the maximum size of one IO.
+
+3. After all connections of a path are established, the client sends the
+IBTRS_MSG_INFO_REQ message, containing the name of the session, to the server.
+This message requests the address information from the server.
+
+4. Server replies to the session info request message with IBTRS_MSG_INFO_RSP,
+which contains the addresses and keys of the RDMA buffers allocated for that
+session.
+
+5. Session becomes connected after all paths to be established are connected
+(i.e. steps 1-4 finished for all paths requested for a session).
+
+6. Server and client periodically exchange heartbeat messages (empty rdma
+messages with an immediate field) which are used to detect a crash on the
+remote side or a network outage in the absence of IO.
+
+7. On any RDMA related error or in the case of a heartbeat timeout, the
+corresponding path is disconnected, all the inflight IO are failed over to a
+healthy path, if any, and the reconnect mechanism is triggered.
+
+CLT                                     SRV
+*for each connection belonging to a path and for each path:
+IBTRS_MSG_CON_REQ  ------------------->
+                   <------------------- IBTRS_MSG_CON_RSP
+...
+*after all connections are established:
+IBTRS_MSG_INFO_REQ ------------------->
+                   <------------------- IBTRS_MSG_INFO_RSP
+*heartbeat is started from both sides:
+                   -------------------> [IBTRS_HB_MSG_IMM]
+[IBTRS_HB_MSG_ACK] <-------------------
+[IBTRS_HB_MSG_IMM] <-------------------
+                   -------------------> [IBTRS_HB_MSG_ACK]
+
+IO path
+-------
+
+* Write *
+
+1. When processing a write request, the client selects one of the memory
+chunks on the server side and rdma writes the user data, user header and the
+IBTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the message
+only contains the size of the user header. The client tells the server which
+chunk has been accessed and at what offset the IBTRS_MSG_RDMA_WRITE can be
+found by using the IMM field.
+
+2. When confirming a write request, the server sends an "empty" rdma message
+with an immediate field. The 32 bit field is used to specify the outstanding
+inflight IO and the error code.
+
+CLT                                                          SRV
+usr_data + usr_hdr + ibtrs_msg_rdma_write -----------------> [IBTRS_IO_REQ_IMM]
+[IBTRS_IO_RSP_IMM]                        <----------------- (id + errno)
+
+* Read *
+
+1. When processing a read request, the client selects one of the memory chunks
+on the server side and rdma writes the user header and the
+IBTRS_MSG_RDMA_READ message there. This message contains the type (read), the
+size of the user header, flags (specifying if memory invalidation is necessary)
+and the list of addresses along with keys for the data to be read into.
+
+2. When confirming a read request, the server transfers the requested data
+first, attaches an invalidation message if requested and finally sends an
+"empty" rdma message with an immediate field. The 32 bit field is used to
+specify the outstanding inflight IO and the error code.
+
+CLT                                           SRV
+usr_hdr + ibtrs_msg_rdma_read --------------> [IBTRS_IO_REQ_IMM]
+[IBTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
+or, in case the client requested invalidation:
+[IBTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
+
+
+Contact
+-------
+
+Mailing list: "IBNBD/IBTRS Storage Team" <ibnbd@profitbricks.com>
-- 
2.13.1
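
The "add_path" entry in the README above expects a string of the form
<[source addr,]destination addr>, where each address is ip:<ipv4|ipv6> or
gid:<gid>. A minimal user-space sketch of checking such a string before
writing it to sysfs could look as follows. The names PathSpec, _check_addr and
parse_add_path are hypothetical and not part of IBTRS, and treating a GID as
IPv6-style text is an assumption; the kernel-side parser may accept or reject
different input.

```python
import ipaddress
from typing import NamedTuple, Optional


class PathSpec(NamedTuple):
    """Hypothetical container for one add_path string."""
    source: Optional[str]   # None when the optional source address is omitted
    destination: str


def _check_addr(addr: str) -> str:
    """Validate one 'ip:<addr>' or 'gid:<gid>' token and return it unchanged."""
    kind, sep, value = addr.partition(":")
    if not sep or kind not in ("ip", "gid"):
        raise ValueError(f"expected 'ip:' or 'gid:' prefix in {addr!r}")
    if kind == "ip":
        ipaddress.ip_address(value)    # raises ValueError on a bad IPv4/IPv6
    else:
        ipaddress.IPv6Address(value)   # assumption: GID written like IPv6
    return addr


def parse_add_path(text: str) -> PathSpec:
    """Split '[source addr,]destination addr' into its two parts."""
    parts = [p.strip() for p in text.strip().split(",")]
    if len(parts) == 1:
        return PathSpec(None, _check_addr(parts[0]))
    if len(parts) == 2:
        return PathSpec(_check_addr(parts[0]), _check_addr(parts[1]))
    raise ValueError("expected '[source addr,]destination addr'")
```

For example, parse_add_path("ip:10.0.0.1,ip:10.0.0.2") yields a source and a
destination, while parse_add_path("ip:192.168.1.2") leaves the source unset.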

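The README above states that, for the server-side "wc_completion" entry,
dividing the 2nd value by the 3rd gives the average number of completions per
call to the completion handler. A small sketch of that arithmetic follows,
assuming the three values are whitespace-separated on read (the exact output
layout is not specified in the README) and using a hypothetical helper name.

```python
def wc_completion_average(contents: str) -> float:
    """Average completions per handler call from a server wc_completion read.

    Assumes three whitespace-separated integers:
    <max per call> <total work requests> <total handler calls>.
    """
    _max_wc, total_wc, total_calls = (int(v) for v in contents.split())
    # Avoid dividing by zero right after the counters were reset.
    return total_wc / total_calls if total_calls else 0.0
```

With the contents "16 1000 250" this gives 1000 / 250 = 4.0 completions per
call.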