From: Jason Gunthorpe <jgg@ziepe.ca>
To: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Cc: Jack Wang <jinpu.wang@cloud.ionos.com>,
	linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
	Jens Axboe <axboe@kernel.dk>,
	Christoph Hellwig <hch@infradead.org>,
	Sagi Grimberg <sagi@grimberg.me>,
	Bart Van Assche <bvanassche@acm.org>,
	Leon Romanovsky <leon@kernel.org>,
	Doug Ledford <dledford@redhat.com>,
	Roman Penyaev <rpenyaev@suse.de>,
	Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Subject: Re: [PATCH v10 06/26] RDMA/rtrs: client: main functionality
Date: Thu, 12 Mar 2020 14:25:17 -0300	[thread overview]
Message-ID: <20200312172517.GU31668@ziepe.ca> (raw)
In-Reply-To: <CAHg0HuziyOuUZ48Rp5S_-A9osB==UFOTfWH0+35omiqVjogqww@mail.gmail.com>

On Thu, Mar 12, 2020 at 06:10:06PM +0100, Danil Kipnis wrote:
> Hi Jason,
> 
> On Wed, Mar 11, 2020 at 8:01 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Mar 11, 2020 at 05:12:20PM +0100, Jack Wang wrote:
> > > +static void rtrs_clt_remove_path_from_arr(struct rtrs_clt_sess *sess)
> > > +{
> > > +     struct rtrs_clt *clt = sess->clt;
> > > +     struct rtrs_clt_sess *next;
> > > +     bool wait_for_grace = false;
> > > +     int cpu;
> > > +
> > > +     mutex_lock(&clt->paths_mutex);
> > > +     list_del_rcu(&sess->s.entry);
> > > +
> > > +     /* Make sure everybody observes path removal. */
> > > +     synchronize_rcu();
> > > +
> > > +     /*
> > > +      * At this point nobody sees @sess in the list, but still we have
> > > +      * dangling pointer @pcpu_path which _can_ point to @sess.  Since
> > > +      * nobody can observe @sess in the list, we guarantee that IO path
> > > +      * will not assign @sess to @pcpu_path, i.e. @pcpu_path can be equal
> > > +      * to @sess, but can never again become @sess.
> > > +      */
> > > +
> > > +     /*
> > > +      * Decrement paths number only after grace period, because
> > > +      * caller of do_each_path() must firstly observe list without
> > > +      * path and only then decremented paths number.
> > > +      *
> > > +      * Otherwise there can be the following situation:
> > > +      *    o Two paths exist and IO is coming.
> > > +      *    o One path is removed:
> > > +      *      CPU#0                          CPU#1
> > > +      *      do_each_path():                rtrs_clt_remove_path_from_arr():
> > > +      *          path = get_next_path()
> > > +      *          ^^^                            list_del_rcu(path)
> > > +      *          [!CONNECTED path]              clt->paths_num--
> > > +      *                                              ^^^^^^^^^
> > > +      *          load clt->paths_num                 from 2 to 1
> > > +      *                    ^^^^^^^^^
> > > +      *                    sees 1
> > > +      *
> > > +      *      path is observed as !CONNECTED, but do_each_path() loop
> > > +      *      ends, because expression i < clt->paths_num is false.
> > > +      */
> > > +     clt->paths_num--;
> > > +
> > > +     /*
> > > +      * Get @next connection from current @sess which is going to be
> > > +      * removed.  If @sess is the last element, then @next is NULL.
> > > +      */
> > > +     next = list_next_or_null_rr_rcu(&clt->paths_list, &sess->s.entry,
> > > +                                     typeof(*next), s.entry);
> >
> > Calling an RCU list iteration helper without holding the RCU read lock is wrong.
> This function (add_path) along with the corresponding
> remove_path_from_arr() are the only functions modifying the
> paths_list. In both functions paths_mutex is taken so that they are
> serialized. Since the modification of the paths_list is protected by
> the mutex, the rcu_read_lock is superfluous here.

Then don't use the _rcu functions.
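For the writer side, which already holds clt->paths_mutex, a plain (non-_rcu) successor lookup could look like the sketch below. This is not compilable standalone, the helper name next_path_locked() is hypothetical, and it glosses over the fact that the original code walks from an entry that was just list_del_rcu()'ed (whose ->next pointer is deliberately preserved):

```c
/* Hypothetical non-RCU round-robin successor; caller holds
 * clt->paths_mutex, so no RCU read-side protection is implied.
 */
static struct rtrs_clt_sess *
next_path_locked(struct rtrs_clt *clt, struct rtrs_clt_sess *sess)
{
	struct list_head *n = sess->s.entry.next;

	lockdep_assert_held(&clt->paths_mutex);

	if (n == &clt->paths_list)	/* wrap past the list head */
		n = clt->paths_list.next;
	if (n == &clt->paths_list ||	/* list is now empty */
	    n == &sess->s.entry)	/* @sess was the only entry */
		return NULL;
	return list_entry(n, struct rtrs_clt_sess, s.entry);
}
```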

> >
> > > +     /*
> > > +      * @pcpu paths can still point to the path which is going to be
> > > +      * removed, so change the pointer manually.
> > > +      */
> > > +     for_each_possible_cpu(cpu) {
> > > +             struct rtrs_clt_sess __rcu **ppcpu_path;
> > > +
> > > +             ppcpu_path = per_cpu_ptr(clt->pcpu_path, cpu);
> > > +             if (rcu_dereference(*ppcpu_path) != sess)
> >
> > Calling rcu_dereference() without holding the RCU read lock is wrong.
> We only need a READ_ONCE semantic here. ppcpu_path is pointing to the
> last path used for an IO and is used for the round robin multipath
> policy. I guess the call can be changed to rcu_dereference_raw to
> avoid rcu_lockdep warning. The round-robin algorithm has been reviewed
> by Paul E. McKenney, he wrote a litmus test for it:
> https://lkml.org/lkml/2018/5/28/2080.

You can't call RCU-expecting functions without holding the RCU read
lock - use READ_ONCE()/etc if that is what is really going on.

> >
> > > +static void rtrs_clt_add_path_to_arr(struct rtrs_clt_sess *sess,
> > > +                                   struct rtrs_addr *addr)
> > > +{
> > > +     struct rtrs_clt *clt = sess->clt;
> > > +
> > > +     mutex_lock(&clt->paths_mutex);
> > > +     clt->paths_num++;
> > > +
> > > +     /*
> > > +      * Firstly increase paths_num, wait for GP and then
> > > +      * add path to the list.  Why?  Since we add path with
> > > +      * !CONNECTED state explanation is similar to what has
> > > +      * been written in rtrs_clt_remove_path_from_arr().
> > > +      */
> > > +     synchronize_rcu();
> >
> > This makes no sense to me. RCU readers cannot observe the element in
> > the list without also observing paths_num++
> Paths_num is only used to make sure a reader doesn't look for a
> CONNECTED path in the list for ever - instead he makes at most
> paths_num attempts. The reader can in fact observe paths_num++ without
> observing new element in the paths_list, but this is OK. When adding a
> new path we first increase the paths_num and then add the element to
> the list to make sure the reader will also iterate over it. When
> removing the path - the logic is opposite: we first remove element
> from the list and only then decrement the paths_num.

I don't understand how this explains why synchronize_rcu() would be
needed here.

> > > +static void rtrs_clt_close_work(struct work_struct *work)
> > > +{
> > > +     struct rtrs_clt_sess *sess;
> > > +
> > > +     sess = container_of(work, struct rtrs_clt_sess, close_work);
> > > +
> > > +     cancel_delayed_work_sync(&sess->reconnect_dwork);
> > > +     rtrs_clt_stop_and_destroy_conns(sess);
> > > +     /*
> > > +      * Sounds stupid, huh?  No, it is not.  Consider this sequence:
> >
> > It sounds stupid because it is stupid. cancel_work is a giant race if
> > some other action hasn't been taken to block parallel threads from
> > calling queue_work before calling cancel_work.
> Will double check. It might be possible to avoid the second call to
> the cancel_delayed_work_sync().

I would have guessed the first call.. Before doing cancel_work something
must have prevented new work from being queued.
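The usual shape of a race-free cancel is to publish a closing state under a lock first, so every queue site bails out, and only then cancel. A sketch with hypothetical field and state names (state_lock, RTRS_CLT_CLOSING, rtrs_wq are assumptions, not the patch's actual names):

```c
/* Close side: mark the session closing under the lock first... */
spin_lock_irq(&sess->state_lock);
sess->state = RTRS_CLT_CLOSING;
spin_unlock_irq(&sess->state_lock);

/* ...so that after this returns, no new work can appear. */
cancel_delayed_work_sync(&sess->reconnect_dwork);

/* Every queue site checks the state under the same lock: */
spin_lock_irq(&sess->state_lock);
if (sess->state != RTRS_CLT_CLOSING)
	queue_delayed_work(rtrs_wq, &sess->reconnect_dwork, delay);
spin_unlock_irq(&sess->state_lock);
```

Without the state check on the queue side, a second cancel_delayed_work_sync() only narrows the window; it cannot close it.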

> > > +     err = rtrs_clt_create_sysfs_root_folders(clt);
> >
> > sysfs creation that is not done as part of device_register races with
> > udev.
> We only use device_register() to create
> /sys/class/rtrs_client/<sessionname> sysfs directory. We then create
> some folders and files inside this directory (i.e. paths/,
> multipath_policy, etc.). Do you mean that the uevent is generated
> before we create those subdirectories? How can the creation of this
> subdirectories and files be integrated into the device_register()
> call?

Yes, the uevent..

Limited types of sysfs files can be created with the attribute group
scheme.

Others unfortunately need to manipulate the uevent - see how IB device
registration works.
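For the simple per-device files, the group scheme means hanging the attributes off dev->groups before device_register(), so they exist by the time the uevent fires. A sketch (the dev_attr_* names are invented for illustration; ATTRIBUTE_GROUPS() is the real kernel macro):

```c
/* Declared at file scope, next to the DEVICE_ATTR_RW()/RO() definitions. */
static struct attribute *rtrs_clt_attrs[] = {
	&dev_attr_max_reconnect_attempts.attr,	/* hypothetical */
	&dev_attr_mpath_policy.attr,		/* hypothetical */
	NULL,
};
ATTRIBUTE_GROUPS(rtrs_clt);	/* defines rtrs_clt_groups */

/* In the setup path, before registration: */
clt->dev.groups = rtrs_clt_groups;
err = device_register(&clt->dev);
/* The KOBJ_ADD uevent now fires with the attributes already present. */
```

Subdirectories like paths/ don't fit this scheme, which is where suppressing and re-emitting the uevent (as the IB device registration code does) comes in.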

> > > +struct rtrs_clt *rtrs_clt_open(struct rtrs_clt_ops *ops,
> > > +                              const char *sessname,
> > > +                              const struct rtrs_addr *paths,
> > > +                              size_t paths_num,
> > > +                              u16 port,
> > > +                              size_t pdu_sz, u8 reconnect_delay_sec,
> > > +                              u16 max_segments,
> > > +                              s16 max_reconnect_attempts)
> > > +{
> > > +     struct rtrs_clt_sess *sess, *tmp;
> > > +     struct rtrs_clt *clt;
> > > +     int err, i;
> > > +
> > > +     clt = alloc_clt(sessname, paths_num, port, pdu_sz, ops->priv,
> > > +                     ops->link_ev,
> > > +                     max_segments, reconnect_delay_sec,
> > > +                     max_reconnect_attempts);
> > > +     if (IS_ERR(clt)) {
> > > +             err = PTR_ERR(clt);
> > > +             goto out;
> > > +     }
> > > +     for (i = 0; i < paths_num; i++) {
> > > +             struct rtrs_clt_sess *sess;
> > > +
> > > +             sess = alloc_sess(clt, &paths[i], nr_cpu_ids,
> > > +                               max_segments);
> > > +             if (IS_ERR(sess)) {
> > > +                     err = PTR_ERR(sess);
> > > +                     goto close_all_sess;
> > > +             }
> > > +             list_add_tail_rcu(&sess->s.entry, &clt->paths_list);
> > > +
> > > +             err = init_sess(sess);
> > > +             if (err)
> > > +                     goto close_all_sess;
> > > +
> > > +             err = rtrs_clt_create_sess_files(sess);
> > > +             if (err)
> > > +                     goto close_all_sess;
> > > +     }
> > > +     err = alloc_permits(clt);
> > > +     if (err)
> > > +             goto close_all_sess;
> > > +     err = rtrs_clt_create_sysfs_root_files(clt);
> > > +     if (err)
> > > +             goto close_all_sess;
> > > +
> > > +     /*
> > > +      * There is a race if someone decides to completely remove just
> > > +      * newly created path using sysfs entry.  To avoid the race we
> > > +      * use simple 'opened' flag, see rtrs_clt_remove_path_from_sysfs().
> > > +      */
> > > +     clt->opened = true;
> >
> > A race solution without locks?
> We wanted to make sure that a path belonging to a session currently
> being established can't be removed from sysfs before the establishment
> is finished.

There are still no locks, so this solution to races is probably racy.
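One way to make the flag meaningful is to put both sides under a lock the code already has, e.g. clt->paths_mutex. A sketch of the shape (the -EBUSY choice and the exact placement are assumptions):

```c
/* Open side: publish the flag under the mutex. */
mutex_lock(&clt->paths_mutex);
clt->opened = true;
mutex_unlock(&clt->paths_mutex);

/* In rtrs_clt_remove_path_from_sysfs(): test it under the same mutex,
 * and keep holding the mutex across the removal so the decision cannot
 * go stale.
 */
mutex_lock(&clt->paths_mutex);
if (!clt->opened) {
	mutex_unlock(&clt->paths_mutex);
	return -EBUSY;	/* session still being established */
}
/* ...proceed with path removal while still holding the mutex... */
mutex_unlock(&clt->paths_mutex);
```

A bare boolean read and write with no ordering between them guarantees nothing about what the sysfs path observes.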

Jason


Thread overview: 39+ messages
2020-03-11 16:12 [PATCH v10 00/26] RTRS (former IBTRS) RDMA Transport Library and RNBD (former IBNBD) RDMA Network Block Device Jack Wang
2020-03-11 16:12 ` [PATCH v10 01/26] sysfs: export sysfs_remove_file_self() Jack Wang
2020-03-11 16:12 ` [PATCH v10 02/26] RDMA/rtrs: public interface header to establish RDMA connections Jack Wang
2020-03-11 18:45   ` Jason Gunthorpe
2020-03-12  9:43     ` Jinpu Wang
2020-03-11 16:12 ` [PATCH v10 03/26] RDMA/rtrs: private headers with rtrs protocol structs and helpers Jack Wang
2020-03-11 16:12 ` [PATCH v10 04/26] RDMA/rtrs: core: lib functions shared between client and server modules Jack Wang
2020-03-11 16:12 ` [PATCH v10 05/26] RDMA/rtrs: client: private header with client structs and functions Jack Wang
2020-03-11 16:12 ` [PATCH v10 06/26] RDMA/rtrs: client: main functionality Jack Wang
2020-03-11 19:01   ` Jason Gunthorpe
2020-03-12 17:10     ` Danil Kipnis
2020-03-12 17:25       ` Jason Gunthorpe [this message]
2020-03-13 12:18         ` Danil Kipnis
2020-03-13 12:25           ` Jason Gunthorpe
2020-03-17  6:46             ` Danil Kipnis
2020-03-18 15:04             ` Jinpu Wang
2020-03-11 16:12 ` [PATCH v10 07/26] RDMA/rtrs: client: statistics functions Jack Wang
2020-03-11 16:12 ` [PATCH v10 08/26] RDMA/rtrs: client: sysfs interface functions Jack Wang
2020-03-11 16:12 ` [PATCH v10 09/26] RDMA/rtrs: server: private header with server structs and functions Jack Wang
2020-03-11 16:12 ` [PATCH v10 10/26] RDMA/rtrs: server: main functionality Jack Wang
2020-03-11 16:12 ` [PATCH v10 11/26] RDMA/rtrs: server: statistics functions Jack Wang
2020-03-11 16:12 ` [PATCH v10 12/26] RDMA/rtrs: server: sysfs interface functions Jack Wang
2020-03-11 16:12 ` [PATCH v10 13/26] RDMA/rtrs: include client and server modules into kernel compilation Jack Wang
2020-03-11 19:03   ` Jason Gunthorpe
2020-03-12 10:50     ` Danil Kipnis
2020-03-12 12:03       ` Jason Gunthorpe
2020-03-11 16:12 ` [PATCH v10 14/26] RDMA/rtrs: a bit of documentation Jack Wang
2020-03-11 16:12 ` [PATCH v10 15/26] block: reexport bio_map_kern Jack Wang
2020-03-11 16:12 ` [PATCH v10 16/26] block/rnbd: private headers with rnbd protocol structs and helpers Jack Wang
2020-03-11 16:12 ` [PATCH v10 17/26] block/rnbd: client: private header with client structs and functions Jack Wang
2020-03-11 16:12 ` [PATCH v10 18/26] block/rnbd: client: main functionality Jack Wang
2020-03-11 16:12 ` [PATCH v10 19/26] block/rnbd: client: sysfs interface functions Jack Wang
2020-03-11 16:12 ` [PATCH v10 20/26] block/rnbd: server: private header with server structs and functions Jack Wang
2020-03-11 16:12 ` [PATCH v10 21/26] block/rnbd: server: main functionality Jack Wang
2020-03-11 16:12 ` [PATCH v10 22/26] block/rnbd: server: functionality for IO submission to file or block dev Jack Wang
2020-03-11 16:12 ` [PATCH v10 23/26] block/rnbd: server: sysfs interface functions Jack Wang
2020-03-11 16:12 ` [PATCH v10 24/26] block/rnbd: include client and server modules into kernel compilation Jack Wang
2020-03-11 16:12 ` [PATCH v10 25/26] block/rnbd: a bit of documentation Jack Wang
2020-03-11 16:12 ` [PATCH v10 26/26] MAINTAINERS: Add maintainers for RNBD/RTRS modules Jack Wang
