* [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal
From: Bob Peterson @ 2020-12-22 20:38 UTC
  To: cluster-devel.redhat.com

Hi,

Before this patch, journal recovery was done by a workqueue function that
operated on a per-journal basis. The problem is that multiple recovery
instances could run simultaneously, and they all used the same bio,
sd_log_bio, to write to the various journals. These operations overwrote
one another, eventually causing memory corruption.

This patch makes the recovery workqueue operate on a per-superblock basis,
which means that a mount point using, for example, journal0 can do recovery
for all journals that need recovery. The journals are recovered consecutively,
so sd_log_bio is only referenced by one recovery at a time, which avoids the
corruption.

Since journal recovery requests can arrive in any order and at unpredictable
times, the new work function loops until there are no more journals to
recover.

Since multiple processes may request recovery of a journal, and since they
all now queue the same sdp-based work item, it is okay for queue_work to
return false: that just means recovery work is already queued.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/incore.h     |  2 +-
 fs/gfs2/ops_fstype.c |  2 +-
 fs/gfs2/recovery.c   | 32 ++++++++++++++++++++++++++++----
 3 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 8e1ab8ed4abc..b393cbf9efeb 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -529,7 +529,6 @@ struct gfs2_jdesc {
 	struct list_head jd_list;
 	struct list_head extent_list;
 	unsigned int nr_extents;
-	struct work_struct jd_work;
 	struct inode *jd_inode;
 	unsigned long jd_flags;
 #define JDF_RECOVERY 1
@@ -746,6 +745,7 @@ struct gfs2_sbd {
 	struct completion sd_locking_init;
 	struct completion sd_wdack;
 	struct delayed_work sd_control_work;
+	struct work_struct sd_recovery_work;
 
 	/* Inode Stuff */
 
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 61fce59cb4d3..3d9a6d6d42cb 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -93,6 +93,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
 	init_completion(&sdp->sd_locking_init);
 	init_completion(&sdp->sd_wdack);
 	spin_lock_init(&sdp->sd_statfs_spin);
+	INIT_WORK(&sdp->sd_recovery_work, gfs2_recover_func);
 
 	spin_lock_init(&sdp->sd_rindex_spin);
 	sdp->sd_rindex_tree.rb_node = NULL;
@@ -586,7 +587,6 @@ static int gfs2_jindex_hold(struct gfs2_sbd *sdp, struct gfs2_holder *ji_gh)
 		INIT_LIST_HEAD(&jd->extent_list);
 		INIT_LIST_HEAD(&jd->jd_revoke_list);
 
-		INIT_WORK(&jd->jd_work, gfs2_recover_func);
 		jd->jd_inode = gfs2_lookupi(sdp->sd_jindex, &name, 1);
 		if (IS_ERR_OR_NULL(jd->jd_inode)) {
 			if (!jd->jd_inode)
diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index c26c68ebd29d..cd3e66cdb560 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -399,9 +399,8 @@ static void recover_local_statfs(struct gfs2_jdesc *jd,
 	return;
 }
 
-void gfs2_recover_func(struct work_struct *work)
+static void gfs2_recover_one(struct gfs2_jdesc *jd)
 {
-	struct gfs2_jdesc *jd = container_of(work, struct gfs2_jdesc, jd_work);
 	struct gfs2_inode *ip = GFS2_I(jd->jd_inode);
 	struct gfs2_sbd *sdp = GFS2_SB(jd->jd_inode);
 	struct gfs2_log_header_host head;
@@ -562,16 +561,41 @@ void gfs2_recover_func(struct work_struct *work)
 	wake_up_bit(&jd->jd_flags, JDF_RECOVERY);
 }
 
+void gfs2_recover_func(struct work_struct *work)
+{
+	struct gfs2_sbd *sdp = container_of(work, struct gfs2_sbd,
+					    sd_recovery_work);
+	struct gfs2_jdesc *jd;
+	int count, recovered = 0;
+
+	do {
+		count = 0;
+		spin_lock(&sdp->sd_jindex_spin);
+		list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
+			if (test_bit(JDF_RECOVERY, &jd->jd_flags)) {
+				spin_unlock(&sdp->sd_jindex_spin);
+				gfs2_recover_one(jd);
+				spin_lock(&sdp->sd_jindex_spin);
+				count++;
+				recovered++;
+			}
+		}
+		spin_unlock(&sdp->sd_jindex_spin);
+	} while (count);
+	if (recovered > 1)
+		fs_err(sdp, "Journals recovered: %d\n", recovered);
+}
+
 int gfs2_recover_journal(struct gfs2_jdesc *jd, bool wait)
 {
+	struct gfs2_sbd *sdp = GFS2_SB(jd->jd_inode);
 	int rv;
 
 	if (test_and_set_bit(JDF_RECOVERY, &jd->jd_flags))
 		return -EBUSY;
 
 	/* we have JDF_RECOVERY, queue should always succeed */
-	rv = queue_work(gfs_recovery_wq, &jd->jd_work);
-	BUG_ON(!rv);
+	rv = queue_work(gfs_recovery_wq, &sdp->sd_recovery_work);
 
 	if (wait)
 		wait_on_bit(&jd->jd_flags, JDF_RECOVERY,
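
A sketch of what the changed queue_work() semantics mean for a caller
(illustrative only; the pr_debug() call is not part of the patch):

	/* With a single work item per sdp, a second gfs2_recover_journal()
	 * call while recovery is already queued gets false back from
	 * queue_work().  That is harmless: the running worker rescans the
	 * jindex list and picks up any newly flagged journal. */
	if (!queue_work(gfs_recovery_wq, &sdp->sd_recovery_work))
		pr_debug("gfs2: recovery work already queued\n");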




* [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal
From: Andreas Gruenbacher @ 2020-12-22 23:27 UTC
  To: cluster-devel.redhat.com

On Tue, Dec 22, 2020 at 9:39 PM Bob Peterson <rpeterso@redhat.com> wrote:
>
> [...]
>
> +void gfs2_recover_func(struct work_struct *work)
> +{
> +       struct gfs2_sbd *sdp = container_of(work, struct gfs2_sbd,
> +                                           sd_recovery_work);
> +       struct gfs2_jdesc *jd;
> +       int count, recovered = 0;
> +
> +       do {
> +               count = 0;
> +               spin_lock(&sdp->sd_jindex_spin);
> +               list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
> +                       if (test_bit(JDF_RECOVERY, &jd->jd_flags)) {
> +                               spin_unlock(&sdp->sd_jindex_spin);
> +                               gfs2_recover_one(jd);
> +                               spin_lock(&sdp->sd_jindex_spin);
> +                               count++;
> +                               recovered++;
> +                       }
> +               }
> +               spin_unlock(&sdp->sd_jindex_spin);
> +       } while (count);

Can't this be a simple loop like below?

repeat:
        spin_lock(&sdp->sd_jindex_spin);
        list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
                if (test_bit(JDF_RECOVERY, &jd->jd_flags)) {
                        spin_unlock(&sdp->sd_jindex_spin);
                        gfs2_recover_one(jd);
                        goto repeat;
                }
        }
        spin_unlock(&sdp->sd_jindex_spin);

> +       if (recovered > 1)
> +               fs_err(sdp, "Journals recovered: %d\n", recovered);
> +}
> [...]

Thanks,
Andreas




* [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal
From: Bob Peterson @ 2020-12-23  2:47 UTC
  To: cluster-devel.redhat.com

----- Original Message -----
> On Tue, Dec 22, 2020 at 9:39 PM Bob Peterson <rpeterso@redhat.com> wrote:
> >
> > [...]
> 
> Can't this be a simple loop like below?
> 
> repeat:
>         spin_lock(&sdp->sd_jindex_spin);
>         list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
>                 if (test_bit(JDF_RECOVERY, &jd->jd_flags)) {
>                         spin_unlock(&sdp->sd_jindex_spin);
>                         gfs2_recover_one(jd);
>                         goto repeat;
>                 }
>         }
>         spin_unlock(&sdp->sd_jindex_spin);
> 

Yes, that's just as effective. I just hate gotos. Whichever you prefer is fine with me.
I'm okay with not reporting the count too.

Bob




* [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal
From: Steven Whitehouse @ 2021-01-04  9:13 UTC
  To: cluster-devel.redhat.com

Hi,

On 22/12/2020 20:38, Bob Peterson wrote:
> Hi,
>
> Before this patch, journal recovery was done by a workqueue function that
> operated on a per-journal basis. The problem is that multiple recovery
> instances could run simultaneously, and they all used the same bio,
> sd_log_bio, to write to the various journals. These operations overwrote
> one another, eventually causing memory corruption.

Why not just add more bios so that this issue goes away? It would make 
more sense than preventing recovery from running in parallel. In general 
recovery should be spread among nodes anyway, so the case of having 
multiple recoveries running on the same node in parallel should be 
fairly rare too,

Steve.



* [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal
From: Bob Peterson @ 2021-01-04 16:09 UTC
  To: cluster-devel.redhat.com

Hi,

----- Original Message -----
> Hi,
> 
> On 22/12/2020 20:38, Bob Peterson wrote:
> > [...]
> 
> Why not just add more bios so that this issue goes away? It would make
> more sense than preventing recovery from running in parallel. In general
> recovery should be spread among nodes anyway, so the case of having
> multiple recoveries running on the same node in parallel should be
> fairly rare too,
> 
> Steve.

As I understand it, if we allocate a bio from the same bio_set (as bio_alloc
does), we need to submit the previous bio before getting the next one, which
means recovery processes cannot work in parallel, even if they use different
bio pointers.

We can, of course, allocate several bio_sets, one for each journal, but I
remember Jeff Moyer telling me each bio_set would use about 1MB of memory,
which seems high. (I've not verified that.) I'm testing up to 60 mounts
times 5 cluster nodes (5 journals), which would add up to 300MB of memory.
That's not horrible, but I remember we decided not to allocate separate
per-mount rb_trees for glock indexing because of the memory required, and
that requirement seems much smaller by comparison.
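
For reference, per-journal bio_sets would look something like the sketch
below; the jd_bio_set field is hypothetical and does not exist in
struct gfs2_jdesc:

	/* Sketch only: a private bio_set per journal, so one journal's
	 * recovery cannot starve another's allocations.  jd_bio_set is
	 * an assumed new field in struct gfs2_jdesc. */
	static int gfs2_jdesc_bioset_init(struct gfs2_jdesc *jd)
	{
		/* bioset_init() preallocates a mempool; this is where
		 * the per-journal memory overhead would come from. */
		return bioset_init(&jd->jd_bio_set, BIO_POOL_SIZE, 0,
				   BIOSET_NEED_BVECS);
	}

	static struct bio *gfs2_jdesc_bio_alloc(struct gfs2_jdesc *jd,
						unsigned int nr_vecs)
	{
		return bio_alloc_bioset(GFP_NOIO, nr_vecs, &jd->jd_bio_set);
	}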

We could also introduce new locking (and multiple bio pointers) to prevent
the bio from being used by multiple recoveries at the same time. I actually
tried that on an earlier attempt and immediately ran into deadlock issues,
probably because our journal writes also use the same bio.

This way is pretty simple and there are fewer recovery processes to worry
about when analyzing vmcores.

Bob




* [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal
From: Andreas Gruenbacher @ 2021-01-19 15:23 UTC
  To: cluster-devel.redhat.com

On Mon, Jan 4, 2021 at 5:09 PM Bob Peterson <rpeterso@redhat.com> wrote:
>
> [...]
>
> As I understand it, if we allocate a bio from the same bio_set (as bio_alloc does),
> we need to submit the previous bio before getting the next one, which means
> recovery processes cannot work in parallel, even if they use different bio pointers.

Each recovery worker submits the current bio before allocating the
next, so in the worst possible case, the recovery workers will end up
getting serialized (that is, they will sleep in bio_alloc until they
get their turn).
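
In other words (sketch only; the helper name is made up, and this uses
the bio_alloc() signature of this kernel era):

	/* Submit the worker's current bio before allocating the next.
	 * With GFP_NOIO, bio_alloc() may sleep on the bio_set's mempool,
	 * but it will not fail, so every worker eventually gets a bio
	 * and makes progress. */
	static struct bio *recovery_next_bio(struct bio *prev)
	{
		if (prev)
			submit_bio(prev);
		return bio_alloc(GFP_NOIO, BIO_MAX_PAGES);
	}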

Andreas




* [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal
From: Bob Peterson @ 2021-01-19 15:44 UTC
  To: cluster-devel.redhat.com

----- Original Message -----
> On Mon, Jan 4, 2021 at 5:09 PM Bob Peterson <rpeterso@redhat.com> wrote:
> > [...]
> 
> Each recovery worker submits the current bio before allocating the
> next, so in the worst possible case, the recovery workers will end up
> getting serialized (that is, they will sleep in bio_alloc until they
> get their turn).
> 
> Andreas

Sure, the recovery workers' bio allocation and submission may be serialized,
but that's where it ends. The recovery workers don't prevent races with each
other when using the variable common to all of them: sdp->sd_log_bio.
This is the case when there are, for example, 5 journals with 5 different
recovery workers, all trying to use the same sdp->sd_log_bio at the same time.
My choices were between using 5 different pointers or 1 single point of use.
I chose the latter. If you like, I can temporarily revert the patch and try
to somehow prove this is what happens, but it seems like a waste of time.
The patch made the problem go away.
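
Simplified, the race looks like this (hypothetical interleaving of two
recovery workers on the same sdp):

	/*   worker A: sdp->sd_log_bio = bio_alloc(...);    journal0
	 *   worker B: sdp->sd_log_bio = bio_alloc(...);    journal3, clobbers A's pointer
	 *   worker A: bio_add_page(sdp->sd_log_bio, ...);  adds journal0 pages to B's bio
	 *
	 * A's journal0 writes land in B's bio, and A's original bio is lost. */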

Bob




* [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal
From: Andreas Gruenbacher @ 2021-01-19 17:36 UTC
  To: cluster-devel.redhat.com

On Tue, Jan 19, 2021 at 4:44 PM Bob Peterson <rpeterso@redhat.com> wrote:
> Sure, the recovery workers' bio allocation and submission may be serialized,
> but that's where it ends. The recovery workers don't prevent races with each
> other when using the variable common to all of them: sdp->sd_log_bio.
> This is the case when there are, for example, 5 journals with 5 different
> recovery workers, all trying to use the same sdp->sd_log_bio at the same time.

Well, sdp->sd_log_bio obviously needs to be moved to a per-journal context.
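
That is (sketch only; jd_log_bio would be a new field):

	/* in struct gfs2_jdesc:
	 *	struct bio *jd_log_bio;    - per-journal, replacing the
	 *	                             shared sdp->sd_log_bio
	 */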

Thanks,
Andreas




* [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal
From: Bob Peterson @ 2021-01-19 18:18 UTC
  To: cluster-devel.redhat.com

----- Original Message -----
> On Tue, Jan 19, 2021 at 4:44 PM Bob Peterson <rpeterso@redhat.com> wrote:
> > [...]
> 
> Well, sdp->sd_log_bio obviously needs to be moved to a per-journal context.

I tried that and it didn't end well. If we keep multiple bio pointers, each
recovery worker still needs to make sure all the other bios are submitted
before allocating a new one. Sure, it could make sure _its_ previous bio was
submitted, and the others would be serialized, but there are cases in which
they can run out of bios. Yes, I saw this. This can happen, for example,
when you have 60 gfs2 mounts times 5 nodes, with lots of workers requesting
lots of bios at the same time. Unless, of course, we allocate unique bio_sets
that get their own slabs, etc. We can introduce spinlock locking or something
to manage this, but when I tried it, I found multiple scenarios that deadlock.
It gets ugly really fast.

In practice, when multiple nodes in a cluster go down, their journals are
recovered by several of the remaining cluster nodes, which means they happen
simultaneously anyway, and pretty quickly. In my case, I've got 5 nodes and
2 of them get shot, so the remaining 3 nodes do the journal recovery, and
I've never seen them conflict with one another. Their glocks seem to distribute
the work well.

The only time you're really going to see multiple journals recovered by the
same node (for the same file systems anyway) is when the cluster loses quorum.
Then when quorum is regained, there is often a burst of requests to recover
multiple journals on the same few nodes. Then the same node often tries to
recover several journals for several file systems.

So the circumstances are unusual to begin with, but they are also very easy to recreate.

What's wrong with a single worker that handles them all? What's your actual
concern with doing it this way? Is it performance? Who cares if journal
recovery takes 1.4 seconds rather than 1.2 seconds?

Bob




* [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal
From: Andreas Gruenbacher @ 2021-01-19 20:14 UTC
  To: cluster-devel.redhat.com

On Tue, Jan 19, 2021 at 7:18 PM Bob Peterson <rpeterso@redhat.com> wrote:
> [...]
> > Well, sdp->sd_log_bio obviously needs to be moved to a per-journal context.
>
> I tried that and it didn't end well. If we keep multiple bio pointers, each
> recovery worker still needs to make sure all the other bios are submitted
> before allocating a new one. Sure, it could make sure _its_ previous bio was
> submitted, and the others would be serialized, but there are cases in which
> they can run out of bios. Yes, I saw this.

This doesn't make sense. If you've seen starvation, it must have been
for another reason.

> This can happen, for example,
> when you have 60 gfs2 mounts times 5 nodes, with lots of workers requesting
> lots of bios at the same time. Unless, of course, we allocate unique bio_sets
> that get their own slabs, etc. We can introduce spinlock locking or something
> to manage this, but when I tried it, I found multiple scenarios that deadlock.
> It gets ugly really fast.

As long as each worker makes sure it doesn't allocate another bio
before submitting its previous bio, it doesn't matter how many workers
there are or what state they're in. They will still make progress as
long as they can allocate at least one bio overall.

> In practice, when multiple nodes in a cluster go down, their journals are
> recovered by several of the remaining cluster nodes, which means they happen
> simultaneously anyway, and pretty quickly. In my case, I've got 5 nodes and
> 2 of them get shot, so the remaining 3 nodes do the journal recovery, and
> I've never seen them conflict with one another. Their glocks seem to distribute
> the work well.
>
> The only time you're really going to see multiple journals recovered by the
> same node (for the same file systems anyway) is when the cluster loses quorum.
> Then when quorum is regained, there is often a burst of requests to recover
> multiple journals on the same few nodes. Then the same node often tries to
> recover several journals for several file systems.
>
> So the circumstances are unusual to begin with, but they are also very easy to recreate.
>
> What's wrong with a single worker that handles them all? What's your actual
> concern with doing it this way? Is it performance? Who cares if journal
> recovery takes 1.4 seconds rather than 1.2 seconds?

It was Steve who questioned whether serializing recovery in that way was a
reasonable change. I don't know if recovering multiple journals on the
same node in parallel is very useful. But I also don't buy your bio
starvation argument.

Thanks,
Andreas


