From: Bart Van Assche
Subject: Re: [PATCH v4 06/25] ibtrs: client: main functionality
Date: Mon, 23 Sep 2019 14:51:43 -0700
To: Jack Wang, linux-block@vger.kernel.org, linux-rdma@vger.kernel.org
Cc: axboe@kernel.dk, hch@infradead.org, sagi@grimberg.me,
 jgg@mellanox.com, dledford@redhat.com, danil.kipnis@cloud.ionos.com,
 rpenyaev@suse.de, Jack Wang
References: <20190620150337.7847-1-jinpuwang@gmail.com>
 <20190620150337.7847-7-jinpuwang@gmail.com>
In-Reply-To: <20190620150337.7847-7-jinpuwang@gmail.com>

On 6/20/19 8:03 AM, Jack Wang wrote:
> +static const struct ibtrs_ib_dev_pool_ops dev_pool_ops;
> +static struct ibtrs_ib_dev_pool dev_pool = {
> +	.ops = &dev_pool_ops
> +};

Can the definitions in this file be reordered such that the forward
declaration of dev_pool_ops can be removed?
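For example, something like the following (an untested sketch; the two
callback member names are placeholders of mine, since the actual
members of ibtrs_ib_dev_pool_ops are not visible in this hunk):

static const struct ibtrs_ib_dev_pool_ops dev_pool_ops = {
	/* Placeholder callbacks, defined earlier in the file. */
	.init	= ibtrs_clt_ib_dev_init,
	.deinit	= ibtrs_clt_ib_dev_deinit,
};

static struct ibtrs_ib_dev_pool dev_pool = {
	.ops = &dev_pool_ops,
};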
> +static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con);
> +static int ibtrs_clt_rdma_cm_handler(struct rdma_cm_id *cm_id,
> +				     struct rdma_cm_event *ev);
> +static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
> +static void complete_rdma_req(struct ibtrs_clt_io_req *req, int errno,
> +			      bool notify, bool can_wait);
> +static int ibtrs_clt_write_req(struct ibtrs_clt_io_req *req);
> +static int ibtrs_clt_read_req(struct ibtrs_clt_io_req *req);

Please also remove these forward declarations.

> +bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess)
> +{
> +	return sess->state == IBTRS_CLT_CONNECTED;
> +}

Is it really useful to introduce a one-line function for testing the
session state?

> +static inline struct ibtrs_tag *
> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
> +{
> +	size_t max_depth = clt->queue_depth;
> +	struct ibtrs_tag *tag;
> +	int cpu, bit;
> +
> +	cpu = get_cpu();
> +	do {
> +		bit = find_first_zero_bit(clt->tags_map, max_depth);
> +		if (unlikely(bit >= max_depth)) {
> +			put_cpu();
> +			return NULL;
> +		}
> +
> +	} while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
> +	put_cpu();
> +
> +	tag = GET_TAG(clt, bit);
> +	WARN_ON(tag->mem_id != bit);
> +	tag->cpu_id = cpu;
> +	tag->con_type = con_type;
> +
> +	return tag;
> +}

What is the role of the get_cpu() and put_cpu() calls in this function?
How can it make sense to assign the CPU number to tag->cpu_id after
put_cpu() has been called?

> +static inline void ibtrs_clt_init_req(struct ibtrs_clt_io_req *req,
> +				      struct ibtrs_clt_sess *sess,
> +				      ibtrs_conf_fn *conf,
> +				      struct ibtrs_tag *tag, void *priv,
> +				      const struct kvec *vec, size_t usr_len,
> +				      struct scatterlist *sg, size_t sg_cnt,
> +				      size_t data_len, int dir)
> +{
> +	struct iov_iter iter;
> +	size_t len;
> +
> +	req->tag = tag;
> +	req->in_use = true;
> +	req->usr_len = usr_len;
> +	req->data_len = data_len;
> +	req->sglist = sg;
> +	req->sg_cnt = sg_cnt;
> +	req->priv = priv;
> +	req->dir = dir;
> +	req->con = ibtrs_tag_to_clt_con(sess, tag);
> +	req->conf = conf;
> +	req->need_inv = false;
> +	req->need_inv_comp = false;
> +	req->inv_errno = 0;
> +
> +	iov_iter_kvec(&iter, READ, vec, 1, usr_len);
> +	len = _copy_from_iter(req->iu->buf, usr_len, &iter);
> +	WARN_ON(len != usr_len);
> +
> +	reinit_completion(&req->inv_comp);
> +	if (sess->stats.enable_rdma_lat)
> +		req->start_jiffies = jiffies;
> +}

A comment that explains what "req" stands for would be welcome. Since
this function copies the entire payload, I assume that it is only used
for control messages and not for reading or writing data from a block
device?

> +static int ibtrs_clt_failover_req(struct ibtrs_clt *clt,
> +				  struct ibtrs_clt_io_req *fail_req)
> +{
> +	struct ibtrs_clt_sess *alive_sess;
> +	struct ibtrs_clt_io_req *req;
> +	int err = -ECONNABORTED;
> +	struct path_it it;
> +
> +	do_each_path(alive_sess, clt, &it) {
> +		if (unlikely(alive_sess->state != IBTRS_CLT_CONNECTED))
> +			continue;
> +		req = ibtrs_clt_get_copy_req(alive_sess, fail_req);
> +		if (req->dir == DMA_TO_DEVICE)
> +			err = ibtrs_clt_write_req(req);
> +		else
> +			err = ibtrs_clt_read_req(req);
> +		if (unlikely(err)) {
> +			req->in_use = false;
> +			continue;
> +		}
> +		/* Success path */
> +		ibtrs_clt_inc_failover_cnt(&alive_sess->stats);
> +		break;
> +	} while_each_path(&it);
> +
> +	return err;
> +}

A comment that explains the purpose of this function would be welcome
here as well.
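Something along the lines of the following kernel-doc header (a rough
sketch based on my reading of the code) would already help:

/**
 * ibtrs_clt_failover_req() - resubmit a failed request on another path
 * @clt: Client session set to pick an alternative path from.
 * @fail_req: Request that failed on its original path.
 *
 * Iterates over the connected paths of @clt, copies @fail_req onto the
 * first path that accepts it and resubmits it there.
 *
 * Return: 0 on success or a negative errno if no path could take over
 * the request.
 */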
> +static void fail_all_outstanding_reqs(struct ibtrs_clt_sess *sess)
> +{
> +	struct ibtrs_clt *clt = sess->clt;
> +	struct ibtrs_clt_io_req *req;
> +	int i, err;
> +
> +	if (!sess->reqs)
> +		return;
> +	for (i = 0; i < sess->queue_depth; ++i) {
> +		req = &sess->reqs[i];
> +		if (!req->in_use)
> +			continue;
> +
> +		/*
> +		 * Safely (without notification) complete failed request.
> +		 * After completion this request is still usebale and can
> +		 * be failovered to another path.
> +		 */
> +		complete_rdma_req(req, -ECONNABORTED, false, true);
> +
> +		err = ibtrs_clt_failover_req(clt, req);
> +		if (unlikely(err))
> +			/* Failover failed, notify anyway */
> +			req->conf(req->priv, err);
> +	}
> +}

What guarantees that this function does not call complete_rdma_req()
while complete_rdma_req() is called from the regular completion path?

> +static bool __ibtrs_clt_change_state(struct ibtrs_clt_sess *sess,
> +				     enum ibtrs_clt_state new_state)
> +{
> +	enum ibtrs_clt_state old_state;
> +	bool changed = false;
> +
> +	old_state = sess->state;
> +	switch (new_state) {

Please use lockdep_assert_held() inside this function to verify at
runtime that session state changes are serialized properly.

> +static enum ibtrs_clt_state ibtrs_clt_state(struct ibtrs_clt_sess *sess)
> +{
> +	enum ibtrs_clt_state state;
> +
> +	spin_lock_irq(&sess->state_wq.lock);
> +	state = sess->state;
> +	spin_unlock_irq(&sess->state_wq.lock);
> +
> +	return state;
> +}

Please remove this function and read sess->state without holding
state_wq.lock.

> +static void ibtrs_clt_hb_err_handler(struct ibtrs_con *c, int err)
> +{
> +	struct ibtrs_clt_con *con;
> +
> +	(void)err;
> +	con = container_of(c, typeof(*con), c);
> +	ibtrs_rdma_error_recovery(con);
> +}

Can "(void)err" be left out? Can the declaration and assignment of
'con' be merged into a single line of code?

> +static int create_con(struct ibtrs_clt_sess *sess, unsigned int cid)
> +{
> +	struct ibtrs_clt_con *con;
> +
> +	con = kzalloc(sizeof(*con), GFP_KERNEL);
> +	if (unlikely(!con))
> +		return -ENOMEM;
> +
> +	/* Map first two connections to the first CPU */
> +	con->cpu = (cid ? cid - 1 : 0) % nr_cpu_ids;
> +	con->c.cid = cid;
> +	con->c.sess = &sess->s;
> +	atomic_set(&con->io_cnt, 0);
> +
> +	sess->s.con[cid] = &con->c;
> +
> +	return 0;
> +}

The code that maps a connection ID onto a CPU occurs multiple times.
Has it been considered to introduce a function for that mapping?
Although one-line inline functions are not recommended in general, such
a function would also make it easier to experiment with other mapping
approaches, e.g. mapping hyperthread siblings onto the same connection
ID.

> +static inline bool xchg_sessions(struct ibtrs_clt_sess __rcu **rcu_ppcpu_path,
> +				 struct ibtrs_clt_sess *sess,
> +				 struct ibtrs_clt_sess *next)
> +{
> +	struct ibtrs_clt_sess **ppcpu_path;
> +
> +	/* Call cmpxchg() without sparse warnings */
> +	ppcpu_path = (typeof(ppcpu_path))rcu_ppcpu_path;
> +	return (sess == cmpxchg(ppcpu_path, sess, next));
> +}

This looks suspicious. Has it been considered to protect changes of
rcu_ppcpu_path with a mutex and to protect reads with an RCU read lock?
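As an untested sketch of what I mean (the helper names are mine, and I
am assuming that clt->paths_mutex is the lock that serializes path
updates):

/* Writers hold clt->paths_mutex, so the compare and the publish do not
 * need atomics; rcu_assign_pointer() replaces cmpxchg() and the sparse
 * cast goes away. */
static bool clt_replace_pcpu_path(struct ibtrs_clt *clt,
				  struct ibtrs_clt_sess __rcu **rcu_ppcpu_path,
				  struct ibtrs_clt_sess *sess,
				  struct ibtrs_clt_sess *next)
{
	lockdep_assert_held(&clt->paths_mutex);
	if (rcu_dereference_protected(*rcu_ppcpu_path,
			lockdep_is_held(&clt->paths_mutex)) != sess)
		return false;
	rcu_assign_pointer(*rcu_ppcpu_path, next);
	return true;
}

/* Readers must call this inside an RCU read-side critical section. */
static struct ibtrs_clt_sess *
clt_get_pcpu_path(struct ibtrs_clt_sess __rcu **rcu_ppcpu_path)
{
	return rcu_dereference(*rcu_ppcpu_path);
}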
> +static void ibtrs_clt_add_path_to_arr(struct ibtrs_clt_sess *sess,
> +				      struct ibtrs_addr *addr)
> +{
> +	struct ibtrs_clt *clt = sess->clt;
> +
> +	mutex_lock(&clt->paths_mutex);
> +	clt->paths_num++;
> +
> +	/*
> +	 * Firstly increase paths_num, wait for GP and then
> +	 * add path to the list. Why? Since we add path with
> +	 * !CONNECTED state explanation is similar to what has
> +	 * been written in ibtrs_clt_remove_path_from_arr().
> +	 */
> +	synchronize_rcu();
> +
> +	list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
> +	mutex_unlock(&clt->paths_mutex);
> +}

synchronize_rcu() while a mutex is being held? Really?

> +static void ibtrs_clt_close_work(struct work_struct *work)
> +{
> +	struct ibtrs_clt_sess *sess;
> +
> +	sess = container_of(work, struct ibtrs_clt_sess, close_work);
> +
> +	cancel_delayed_work_sync(&sess->reconnect_dwork);
> +	ibtrs_clt_stop_and_destroy_conns(sess);
> +	/*
> +	 * Sounds stupid, huh? No, it is not. Consider this sequence:
> +	 *
> +	 *   #CPU0                            #CPU1
> +	 *   1. CONNECTED->RECONNECTING
> +	 *   2.                               RECONNECTING->CLOSING
> +	 *   3. queue_work(&reconnect_dwork)
> +	 *   4.                               queue_work(&close_work);
> +	 *   5. reconnect_work();             close_work();
> +	 *
> +	 * To avoid that case do cancel twice: before and after.
> +	 */
> +	cancel_delayed_work_sync(&sess->reconnect_dwork);
> +	ibtrs_clt_change_state(sess, IBTRS_CLT_CLOSED);
> +}

The above code looks suspicious to me. I think there should be an
additional state change at the start of this function to prevent
reconnect_dwork from being requeued after it has been canceled.

> +static void ibtrs_clt_dev_release(struct device *dev)
> +{
> +	/* Nobody plays with device references, so nop */
> +}

That comment sounds wrong. Have you reviewed all of the device driver
core code and checked that there is no code in there that manipulates
struct device refcounts? I think the code that frees struct ibtrs_clt
should be moved from free_clt() into the above function.

Bart.
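P.S. Regarding ibtrs_clt_dev_release(): as an untested sketch, and
assuming that the struct device is embedded in struct ibtrs_clt as a
member called 'dev', what I have in mind is something like this:

static void ibtrs_clt_dev_release(struct device *dev)
{
	struct ibtrs_clt *clt = container_of(dev, struct ibtrs_clt, dev);

	/* The last device reference is gone, so freeing is safe now. */
	kfree(clt);
}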