From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752512AbcGMHzG (ORCPT <rfc822;w@1wt.eu>);
	Wed, 13 Jul 2016 03:55:06 -0400
Received: from metis.ext.4.pengutronix.de ([92.198.50.35]:54387 "EHLO
	metis.ext.4.pengutronix.de" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1751448AbcGMHy6 (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Wed, 13 Jul 2016 03:54:58 -0400
From: Markus Pargmann <mpa@pengutronix.de>
To: Pranay Srivastava <pranjas@gmail.com>
Cc: nbd-general@lists.sourceforge.net, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 3/5]nbd: make nbd device wait for its users
Date: Wed, 13 Jul 2016 09:54:55 +0200
Message-ID: <5216356.aDXQUPkGB7@adelgunde>
User-Agent: KMail/4.14.1 (Linux/4.6.0-0.bpo.1-amd64; KDE/4.14.2; x86_64; ; )
In-Reply-To: <CA+aCy1Fg7fzU302r-KCePLevHzjGzRu-=sOV99fe4T==BSdDJw@mail.gmail.com>
References: <1467284524-15676-1-git-send-email-pranjas@gmail.com> <6092424.rvLJmOdVvL@galactica.lan> <CA+aCy1Fg7fzU302r-KCePLevHzjGzRu-=sOV99fe4T==BSdDJw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="nextPart29976552.ZJ6eRY0aMg"; micalg="pgp-sha256"; protocol="application/pgp-signature"
X-SA-Exim-Connect-IP: 2001:67c:670:100:a61f:72ff:fe68:75ba
X-SA-Exim-Mail-From: mpa@pengutronix.de
X-SA-Exim-Scanned: No (on metis.ext.pengutronix.de); SAEximRunCond expanded to false
X-PTX-Original-Recipient: linux-kernel@vger.kernel.org
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


--nextPart29976552.ZJ6eRY0aMg
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="us-ascii"

On Sunday 10 July 2016 21:32:07 Pranay Srivastava wrote:
> On Sun, Jul 10, 2016 at 6:32 PM, Markus Pargmann <mpa@pengutronix.de>=
 wrote:
> > On 2016 M06 30, Thu 14:02:03 CEST Pranay Kr. Srivastava wrote:
> >> When a timeout occurs or a recv fails, then
> >> instead of abruplty killing nbd block device
> >> wait for its users to finish.
> >>
> >> This is more required when filesystem(s) like
> >> ext2 or ext3 don't expect their buffer heads to
> >> disappear while the filesystem is mounted.
> >>
> >> Each open of a nbd device is refcounted, while
> >> the userland program [nbd-client] doing the
> >> NBD_DO_IT ioctl would now wait for any other users
> >> of this device before invalidating the nbd device.
> >>
> >> A timedout or a disconnected device, if in use, can't
> >> be used until it has been resetted. The reset happens
> >> when all tasks having this bdev open closes this bdev.
> >>
> >> Signed-off-by: Pranay Kr. Srivastava <pranjas@gmail.com>
> >> ---
> >>  drivers/block/nbd.c | 106
> >> ++++++++++++++++++++++++++++++++++++++++++---------- 1 file change=
d, 87
> >> insertions(+), 19 deletions(-)
> >>
> >> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> >> index e362d44..fb56dd2 100644
> >> --- a/drivers/block/nbd.c
> >> +++ b/drivers/block/nbd.c
> >> @@ -72,6 +72,8 @@ struct nbd_device {
> >>  #endif
> >>       /* This is specifically for calling sock_shutdown, for now. =
*/
> >>       struct work_struct ws_shutdown;
> >> +     struct kref users;
> >> +     struct completion user_completion;
> >>  };
> >>
> >>  #if IS_ENABLED(CONFIG_DEBUG_FS)
> >> @@ -99,6 +101,8 @@ static int max_part;
> >>  static DEFINE_SPINLOCK(nbd_lock);
> >>
> >>  static void nbd_ws_func_shutdown(struct work_struct *);
> >> +static void nbd_kref_release(struct kref *);
> >> +static int nbd_size_clear(struct nbd_device *, struct block_devic=
e *);
> >
> > More function signatures. Why?
>=20
> To avoid code move. But do let me know why is code signature(s)
> like this are bad , just asking to avoid such things.
>=20
> >
> >>
> >>  static inline struct device *nbd_to_dev(struct nbd_device *nbd)
> >>  {
> >> @@ -145,11 +149,9 @@ static int nbd_size_set(struct nbd_device *nb=
d, struct
> >> block_device *bdev, int blocksize, int nr_blocks)
> >>  {
> >>       int ret;
> >> -
> >>       ret =3D set_blocksize(bdev, blocksize);
> >>       if (ret)
> >>               return ret;
> >> -
> >
> > Unrelated.
> >
> >>       nbd->blksize =3D blocksize;
> >>       nbd->bytesize =3D (loff_t)blocksize * (loff_t)nr_blocks;
> >>
> >> @@ -197,6 +199,9 @@ static void nbd_xmit_timeout(unsigned long arg=
)
> >>  {
> >>       struct nbd_device *nbd =3D (struct nbd_device *)arg;
> >>
> >> +     if (nbd->timedout)
> >> +             return;
> >> +
> >
> > What does this have to do with the patch?
>=20
> to avoid re-scheduling the work function. Apparently that did
> cause some trouble with ext4 and 10K dd processes.

Ah, interesting. What was the timeout in this scenario?

>=20
> >
> >>       if (list_empty(&nbd->queue_head))
> >>               return;
> >>
> >> @@ -472,8 +477,6 @@ static int nbd_thread_recv(struct nbd_device *=
nbd,
> >> struct block_device *bdev) nbd_end_request(nbd, req);
> >>       }
> >>
> >> -     nbd_size_clear(nbd, bdev);
> >> -
> >>       device_remove_file(disk_to_dev(nbd->disk), &dev_attr_pid);
> >>
> >>       nbd->task_recv =3D NULL;
> >> @@ -650,12 +653,13 @@ static int nbd_set_socket(struct nbd_device =
*nbd,
> >> struct socket *sock) int ret =3D 0;
> >>
> >>       spin_lock(&nbd->sock_lock);
> >> -     if (nbd->sock)
> >> +
> >> +     if (nbd->sock || nbd->timedout)
> >>               ret =3D -EBUSY;
> >
> > nbd->timedout is already checked in __nbd_ioctl(), no need to check=
 it twice.
> >
> >>       else
> >>               nbd->sock =3D sock;
> >> -     spin_unlock(&nbd->sock_lock);
> >>
> >> +     spin_unlock(&nbd->sock_lock);
> >
> > random modification.
> >
> >>       return ret;
> >>  }
> >>
> >> @@ -670,6 +674,7 @@ static void nbd_reset(struct nbd_device *nbd)
> >>       nbd->flags =3D 0;
> >>       nbd->xmit_timeout =3D 0;
> >>       INIT_WORK(&nbd->ws_shutdown, nbd_ws_func_shutdown);
> >> +     init_completion(&nbd->user_completion);
> >>       queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->que=
ue);
> >>       del_timer_sync(&nbd->timeout_timer);
> >>  }
> >> @@ -704,6 +709,9 @@ static void nbd_dev_dbg_close(struct nbd_devic=
e *nbd);
> >>  static int __nbd_ioctl(struct block_device *bdev, struct nbd_devi=
ce *nbd,
> >>                      unsigned int cmd, unsigned long arg)
> >>  {
> >> +     if (nbd->timedout || nbd->disconnect)
> >> +             return -EBUSY;
> >> +
> >>       switch (cmd) {
> >>       case NBD_DISCONNECT: {
> >>               struct request sreq;
> >> @@ -733,7 +741,6 @@ static int __nbd_ioctl(struct block_device *bd=
ev, struct
> >> nbd_device *nbd, nbd_clear_que(nbd);
> >>               BUG_ON(!list_empty(&nbd->queue_head));
> >>               BUG_ON(!list_empty(&nbd->waiting_queue));
> >> -             kill_bdev(bdev);
> >>               return 0;
> >>
> >>       case NBD_SET_SOCK: {
> >> @@ -752,7 +759,6 @@ static int __nbd_ioctl(struct block_device *bd=
ev, struct
> >> nbd_device *nbd,
> >>
> >>       case NBD_SET_BLKSIZE: {
> >>               loff_t bsize =3D div_s64(nbd->bytesize, arg);
> >> -
> >
> > random modification.
> >
> >>               return nbd_size_set(nbd, bdev, arg, bsize);
> >>       }
> >>
> >> @@ -804,22 +810,29 @@ static int __nbd_ioctl(struct block_device *=
bdev,
> >> struct nbd_device *nbd, error =3D nbd_thread_recv(nbd, bdev);
> >>               nbd_dev_dbg_close(nbd);
> >>               kthread_stop(thread);
> >> -             sock_shutdown(nbd);
> >> -
> >> -             mutex_lock(&nbd->tx_lock);
> >> -             nbd->task_recv =3D NULL;
> >>
> >> -             nbd_clear_que(nbd);
> >> -             kill_bdev(bdev);
> >> -             nbd_bdev_reset(bdev);
> >> +             sock_shutdown(nbd);
> >>
> >>               if (nbd->disconnect) /* user requested, ignore socke=
t errors */
> >>                       error =3D 0;
> >>               if (nbd->timedout)
> >>                       error =3D -ETIMEDOUT;
> >>
> >> -             nbd_reset(nbd);
> >> +             mutex_lock(&nbd->tx_lock);
> >> +             nbd_clear_que(nbd);
> >> +             nbd->disconnect =3D true; /* To kill bdev*/
> >> +             mutex_unlock(&nbd->tx_lock);
> >> +             cancel_work_sync(&nbd->ws_shutdown);
> >> +             kref_put(&nbd->users, nbd_kref_release);
> >> +             wait_for_completion(&nbd->user_completion);
> >>
> >> +             mutex_lock(&bdev->bd_mutex);
> >> +             if (!kref_get_unless_zero(&nbd->users))
> >> +                     kref_init(&nbd->users);
> >
> > This kref usage simply looks wrong and confusing. I commented last =
time
> > already
> > that I think atomics will work better. Please discuss with me what =
you think
> > before sending out a new version. Otherwise this patch series will =
increase in
> > version forever.
>=20
> Alright let's go with atomics.
> But why this looks wrong, are you referring to partitioned device?

No, it looks wrong with respect to what kref was designed for. At first
I really thought that kref would work great for this setup, as we have
normal users that request this resource and put it back at some point
(using close). But it didn't turn out so well because of the ioctl
thread that keeps the file descriptor open.

So the code probably does work, but the normal kref workflow with
kref_init() and kref_put() simply doesn't fit here.
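
To make the atomics suggestion a bit more concrete, here is a rough
userspace model using C11 atomics. It is not kernel code and not a
proposed implementation. The names nbd_model, nbd_model_open() and
nbd_model_release() are made up for illustration; in the driver this
would presumably be an atomic_t with atomic_inc() and
atomic_dec_and_test(). The only point is that the count may drop to
zero and go up again without any re-initialisation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the relevant nbd_device fields. */
struct nbd_model {
	atomic_int users;	/* number of openers, starts at 0 */
	bool disconnect;	/* set once the connection is gone */
	int resets;		/* how often the device was reset */
};

/* Called for every open of the device. */
static void nbd_model_open(struct nbd_model *nbd)
{
	atomic_fetch_add(&nbd->users, 1);
}

/*
 * Called for every close. Returns true if this was the last user of a
 * disconnected device, i.e. the point where the reset would happen.
 */
static bool nbd_model_release(struct nbd_model *nbd)
{
	int old = atomic_fetch_sub(&nbd->users, 1);

	assert(old > 0);	/* release without open is a caller bug */
	if (old == 1 && nbd->disconnect) {
		nbd->disconnect = false;
		nbd->resets++;
		return true;
	}
	return false;
}
```

A later open simply increments the counter again, so no
kref_init()-after-zero dance is needed.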

>=20
> >
> >> +             mutex_unlock(&bdev->bd_mutex);
> >> +
> >> +             mutex_lock(&nbd->tx_lock);
> >> +             nbd_reset(nbd);
> >>               return error;
> >>       }
> >>
> >> @@ -857,19 +870,74 @@ static int nbd_ioctl(struct block_device *bd=
ev,
> >> fmode_t mode,
> >>
> >>       return error;
> >>  }
> >> +static void nbd_kref_release(struct kref *kref_users)
> >> +{
> >> +     struct nbd_device *nbd =3D container_of(kref_users, struct n=
bd_device,
> >> +                                             users
> >> +                                             );
> >> +     schedule_work(&nbd->ws_shutdown);
> >
> > Do we need to schedule work here?
>=20
> Yes this is for the kill_bdev part. This is the final kick to bdev wh=
ich happens
> after the wait in NBD_DO_IT.

Sorry, what I meant was whether we can call the appropriate function
directly here, without using schedule_work(). Is that possible? Or are
we in some context that does not allow that?

>=20
> >
> >> +}
> >> +
> >> +static int nbd_open(struct block_device *bdev, fmode_t mode)
> >> +{
> >> +     struct nbd_device *nbd_dev =3D bdev->bd_disk->private_data;
> >> +
> >> +     if (!kref_get_unless_zero(&nbd_dev->users))
> >> +             kref_init(&nbd_dev->users);
> >> +
> >> +     pr_debug("Opening nbd_dev %s. Active users =3D %u\n",
> >> +                     bdev->bd_disk->disk_name,
> >> +                     atomic_read(&nbd_dev->users.refcount)
> >> +             );
> >> +     return 0;
> >> +}
> >> +
> >> +static void nbd_release(struct gendisk *disk, fmode_t mode)
> >> +{
> >> +     struct nbd_device *nbd_dev =3D disk->private_data;
> >> +
> >> +     kref_put(&nbd_dev->users,  nbd_kref_release);
> >> +
> >> +     pr_debug("Closing nbd_dev %s. Active users =3D %u\n",
> >> +                     disk->disk_name,
> >> +                     atomic_read(&nbd_dev->users.refcount)
> >> +             );
> >> +}
> >>
> >>  static const struct block_device_operations nbd_fops =3D {
> >>       .owner =3D        THIS_MODULE,
> >>       .ioctl =3D        nbd_ioctl,
> >>       .compat_ioctl =3D nbd_ioctl,
> >> +     .open =3D         nbd_open,
> >> +     .release =3D      nbd_release
> >>  };
> >>
> >> +
> >
> > random modification
> >
> >>  static void nbd_ws_func_shutdown(struct work_struct *ws_nbd)
> >>  {
> >>       struct nbd_device *nbd_dev =3D container_of(ws_nbd, struct n=
bd_device,
> >> -                     ws_shutdown);
> >> -
> >> -     sock_shutdown(nbd_dev);
> >> +                                                     ws_shutdown
> >> +                                             );
> >
> > ...???
>=20
> Tried to match the brackets... that's what you meant earlier?

Sorry, it seems I was unclear about that. This is what I meant:

=09struct nbd_device *nbd_dev =3D container_of(ws_nbd, struct nbd_devic=
e,
=09=09=09=09=09=09  ws_shutdown);

After the line break, the continuation line should be aligned with the
first character after the opening bracket. Closing brackets do not have
to be on a separate line.

>=20
> >
> >> +
> >> +     struct block_device *bdev =3D bdget(part_devt(
> >> +                                             dev_to_part(nbd_to_d=
ev(nbd_dev))
> >> +                                             )
> >> +                                     );
> >> +     BUG_ON(!bdev);
> >
> > A simple check would be enough. Or a warning.
>=20
> Ok, but that's really a bug.

Yes, but BUG_ON will kill the process, which in this case is a worker.
I think there is no need to affect anything else in the kernel, as this
is an nbd issue.

>=20
> >
> >> +     if (nbd_dev->timedout)
> >> +             sock_shutdown(nbd_dev);
> >
> > This timeout check seems unnecessary. If we do not timeout and the =
socket was
> > already closed, the sock_shutdown() will do nothing.
> >
> >
> > So if I understand you correctly you are trying to block all ioctls=
 while you
> > are shutting down which is a well a behaviour change of the ioctl i=
nterface.
> > Why do you think it is better not to allow any changes until everyo=
ne closed
> > the blockdevice? Shouldn't there be some control left for the user,=
 for
> > example
> > CLEAR_SOCK?
>=20
> Ah... Yes that's indeed what I'm trying to do. Now say if this block
> device is mounted
> and another nbd-client is trying to disconnect it [CLEAR + DISCONNECT=
]
> then clear
> is doing a kill_bdev. Socket already has been disconnected but the
> device is just not
> usable in this case.
>=20
> If however we are trying to provide for an error recovery, like live
> mounted device
> and there's was timeout with all connections teared down and then som=
eone does
> a set socket on this? Is this supported currently ?

This is currently not supported, but the client has implemented
something like this. So if we change this here, we should consider
allowing nbd-client to react to a timeout, for example by setting a new
socket.

>=20
> A change in the CLEAR, like not actually killing bdev would also not =
be good. So
> better avoid such ioctl if device is in use, no?

Currently there are nbd-client users who expect the device to be usable
immediately after 'nbd-client -d'. Using this patch as you proposed
would change this behaviour.


One idea to fix the bug that we currently have (filesystems on a block
device that gets killed):

We could implement the killing in CLEAR_SOCK. CLEAR_SOCK is effectively
an explicit statement that the current socket and connection should be
removed.
nbd-client currently calls CLEAR_SOCK after NBD_DO_IT, so from the
user's perspective nothing changes with an old nbd-client.
'nbd-client -d' disconnects the client and leaves the block device
open. The following CLEAR_SOCK will kill the block device, and the user
does not notice a difference.

A newer nbd-client implementation could then use this new feature
properly, not use CLEAR_SOCK anymore and offer something like
'nbd-client -d --force' instead. This would still give the user the
possibility to get the old behaviour, but the new behaviour (keeping
the block device open) would be the default.
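
As a rough illustration of the idea, here is a small userspace model of
the two client behaviours. It is only a sketch and not driver code:
struct nbd_state and all function names are hypothetical, and the real
ioctl handling involves locking and queue cleanup that is left out
here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the device state relevant to the proposal. */
struct nbd_state {
	bool connected;		/* socket set and NBD_DO_IT running */
	bool bdev_alive;	/* block device not yet killed */
};

/* Disconnect: tear down the connection, keep the block device. */
static void model_disconnect(struct nbd_state *nbd)
{
	nbd->connected = false;
}

/* CLEAR_SOCK as proposed: this is where the block device is killed. */
static void model_clear_sock(struct nbd_state *nbd)
{
	nbd->connected = false;
	nbd->bdev_alive = false;
}

/* Old client: CLEAR_SOCK always follows the return of NBD_DO_IT. */
static void old_client_disconnect(struct nbd_state *nbd)
{
	model_disconnect(nbd);
	model_clear_sock(nbd);
}

/* Newer client: only clears the socket when asked to with --force. */
static void new_client_disconnect(struct nbd_state *nbd, bool force)
{
	model_disconnect(nbd);
	if (force)
		model_clear_sock(nbd);
}
```

With the old client the block device is gone after the disconnect, as
it is today. With the newer client it stays alive unless --force is
given.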


Another possibility is to replace NBD_DO_IT with a new ioctl that does
things differently.

Best Regards,

Markus

=2D-=20
Pengutronix e.K.                           |                           =
  |
Industrial Linux Solutions                 | http://www.pengutronix.de/=
  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0  =
  |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-555=
5 |

--nextPart29976552.ZJ6eRY0aMg
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: This is a digitally signed message part.
Content-Transfer-Encoding: 7Bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAABCAAGBQJXhfPPAAoJENnm3voMNZulJxgP/RsOM1ELPut/91KVyBQSd0xl
qFuuO+HLerNwZO4pa2+dNGGl274GkLxfiUSfl/RAnUflFwMfOMSk0UlkEuOMFcWS
jEIyU3uSFW080hCjyq/JLsiyMbgqdkO/PEaFDf/ougtT448SRp0TZmNjn1J88fbZ
CTfCC2NPw4aRcLhMKUUSm+BLQzz+xiRxX+XzQ4ayifiCl8zrhx3L3nYbbgY/49df
oy5TfhEkNHJpkhfuV3EK35Go4sVuFpDrdFie2rDLtvP7K+ivIT8DPAmuXt6PWzSH
j1v+XHrq/rkZPDrUn+Ds89XoiQn24Dpfdm4OEoCfI73kc6HRXQ4y0dq+c8G7tY3Y
jHmFhQrG1RD25amHiPL4bAw59WKiN6nE7J5cYel0AD6M3a3ocua0fzF3b07sLR3f
JvaPK+OfOIZmlG2fxvOYtwSrMM8SmqWXSM41CGlq1w3d4NLlwIP+SHYpjCJpCg2i
9SJvd1F6D4U4/OJS5CDqAzh2Iy0viKYD4rJCE4EH1WnWyKWTNJcThtKSod+XLxAn
vAMG/bEgeqCSikMNMesLAaA1I61xWFHOLCIDkrZRjPwZt2wNsY2qd+WGhDSs5OMi
exhxfcBps1eaKYq9dRA9PDnrs5DUc8lmZUVdX6+P14A0S0ygJStMMigsopW4Ml51
IRuLEfkgGvKCB6VF5QUd
=b9rt
-----END PGP SIGNATURE-----

--nextPart29976552.ZJ6eRY0aMg--