From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 29/45] xfs:_introduce xlog_write_partial()
Date: Wed, 19 May 2021 14:49:03 +1000
Message-ID: <20210519044903.GN2893@dread.disaster.area>
In-Reply-To: <YFNUALXWnRFFF8J7@bfoster>

On Thu, Mar 18, 2021 at 09:22:08AM -0400, Brian Foster wrote:
> On Fri, Mar 05, 2021 at 04:11:27PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Handle writing of a logvec chain into an iclog that doesn't have
> > enough space to fit it all. The iclog has already been changed to
> > WANT_SYNC by xlog_get_iclog_space(), so the entire remaining space
> > in the iclog is exclusively owned by this logvec chain.
> > 
> > The difference between the single and partial cases is that
> > we end up with partial iovec writes in the iclog and have to split
> > a log vec's regions across two iclogs. The state handling for this is
> > currently awful and so we're building up the pieces needed to
> > handle this more cleanly one at a time.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> FWIW, git's --patience diff mode generates a more readable diff for this
> patch than the default algorithm does. I'm referring to that locally and
> will try to leave feedback at the appropriate points here.
> 
> >  fs/xfs/xfs_log.c | 525 ++++++++++++++++++++++-------------------------
> >  1 file changed, 251 insertions(+), 274 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 590c1e6db475..10916b99bf0f 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -2099,166 +2099,250 @@ xlog_print_trans(
> >  	}
> >  }
> >  
> > -static xlog_op_header_t *
> > -xlog_write_setup_ophdr(
> > -	struct xlog_op_header	*ophdr,
> > -	struct xlog_ticket	*ticket)
> > -{
> > -	ophdr->oh_clientid = XFS_TRANSACTION;
> > -	ophdr->oh_res2 = 0;
> > -	ophdr->oh_flags = 0;
> > -	return ophdr;
> > -}
> > -
> >  /*
> > - * Set up the parameters of the region copy into the log. This has
> > - * to handle region write split across multiple log buffers - this
> > - * state is kept external to this function so that this code can
> > - * be written in an obvious, self documenting manner.
> > + * Write whole log vectors into a single iclog which is guaranteed to have
> > + * either sufficient space for the entire log vector chain to be written or
> > + * exclusive access to the remaining space in the iclog.
> > + *
> > + * Return the number of iovecs and data written into the iclog, as well as
> > + * a pointer to the logvec that doesn't fit in the log (or NULL if we hit the
> > + * end of the chain).
> >   */
> > -static int
> > -xlog_write_setup_copy(
> > +static struct xfs_log_vec *
> > +xlog_write_single(
> > +	struct xfs_log_vec	*log_vector,
> 
> So xlog_write_single() was initially for single CIL xlog_write() calls
> and now it appears to be slightly different in that it writes as many
> full log vectors that fit in the current iclog and cycles through
> xlog_write_partial() (and back) to process log vectors that span iclogs
> differently from those that don't.

Yes, that is what it does, but no, you've got the process and
meaning backwards. I wrote xlog_write_single() as it appears in
this patch first, then split it out backwards to ease review. IOWs,
"single" means "write everything that fits within this single
iclog", not "only call this function if the entire lv chain fits
inside a single iclog".

The latter is what I split out to make it simpler to review, but it
was not the reason it was called xlog_write_single()....
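To make that reading of "single" concrete, here is a minimal sketch. It is not the XFS code. It writes whole regions into one buffer for as long as each one fits, then reports where the caller would need to switch to a partial-write path. `struct region`, `struct logbuf` and `write_whole_regions()` are hypothetical stand-ins for log iovecs, an iclog and xlog_write_single(). The sketch works at region rather than log vector granularity and omits op headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for a log iovec and an in-memory log buffer. */
struct region {
	const void	*addr;
	uint32_t	len;
};

struct logbuf {
	char		*data;
	uint32_t	size;	/* total bytes in the buffer */
	uint32_t	off;	/* next free byte */
};

/*
 * Copy whole regions into @buf for as long as each one fits completely.
 * Returns the index of the first region that does not fit, or @nr if
 * every region was written. @written accumulates the bytes copied.
 */
static size_t
write_whole_regions(struct logbuf *buf, const struct region *regs,
		    size_t nr, uint32_t *written)
{
	size_t	i;

	assert(buf->off <= buf->size);
	for (i = 0; i < nr; i++) {
		if (regs[i].len > buf->size - buf->off)
			break;		/* caller takes the partial path */
		memcpy(buf->data + buf->off, regs[i].addr, regs[i].len);
		buf->off += regs[i].len;
		*written += regs[i].len;
	}
	return i;
}
```

The returned index plays the role of the "logvec that doesn't fit" pointer that the patch's comment describes for xlog_write_single().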

> > +		do {
> > +			/*
> > +			 * Account for the continuation opheader before we get
> > +			 * a new iclog. This is necessary so that we reserve
> > +			 * space in the iclog for it.
> > +			 */
> > +			if (ophdr->oh_flags & XLOG_CONTINUE_TRANS) {
> 
> (Is this ever not true here?)

It is always true now, but it wasn't always. Fixed.

> 
> > +				*len += sizeof(struct xlog_op_header);
> > +				ticket->t_curr_res -= sizeof(struct xlog_op_header);
> > +			}
> > +			error = xlog_write_get_more_iclog_space(log, ticket,
> > +					&iclog, log_offset, *len, record_cnt,
> > +					data_cnt, contwr);
> > +			if (error)
> > +				return ERR_PTR(error);
> > +			ptr = iclog->ic_datap + *log_offset;
> > +
> > +			ophdr = ptr;
> >  			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> > -			ophdr->oh_len = cpu_to_be32(reg->i_len -
> > +			ophdr->oh_clientid = XFS_TRANSACTION;
> > +			ophdr->oh_res2 = 0;
> > +			ophdr->oh_flags = XLOG_WAS_CONT_TRANS;
> > +
> > +			xlog_write_adv_cnt(&ptr, len, log_offset,
> >  						sizeof(struct xlog_op_header));
> > -			memcpy(ptr, reg->i_addr, reg->i_len);
> > -			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
> > -			record_cnt++;
> > -		}
> > +			*data_cnt += sizeof(struct xlog_op_header);
> > +
> 
> ... which switches to the next iclog, writes the continuation header...
> 
> > +			/*
> > +			 * If rlen fits in the iclog, then end the region
> > +			 * continuation. Otherwise we're going around again.
> > +			 */
> > +			reg_offset += rlen;
> > +			rlen = reg->i_len - reg_offset;
> > +			if (rlen <= iclog->ic_size - *log_offset)
> > +				ophdr->oh_flags |= XLOG_END_TRANS;
> > +			else
> > +				ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
> > +
> > +			rlen = min_t(uint32_t, rlen, iclog->ic_size - *log_offset);
> > +			ophdr->oh_len = cpu_to_be32(rlen);
> > +
> > +			xlog_verify_dest_ptr(log, ptr);
> > +			memcpy(ptr, reg->i_addr + reg_offset, rlen);
> > +			xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
> > +			(*record_cnt)++;
> > +			*data_cnt += rlen;
> > +
> > +		} while (ophdr->oh_flags & XLOG_CONTINUE_TRANS);
> 
> ... writes more of the region (iclog space permitting), and then
> determines whether we need further continuations (and partial writes of
> the same region) or can move onto the next region, until we're done with
> the lv.

Yup.
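The continuation flow described above has four steps: switch buffers, stamp a WAS_CONT header, copy as much as fits, then mark the chunk either END or CONTINUE. It can be modelled in isolation. This is a minimal sketch under stated assumptions, not the patch's implementation:

- Plain byte buffers stand in for iclogs.
- "Getting more space" just moves to the next array element.
- The OP_* flags are hypothetical stand-ins for XLOG_CONTINUE_TRANS, XLOG_WAS_CONT_TRANS and XLOG_END_TRANS.
- Ticket reservation accounting for the extra continuation headers is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical flag values. They only mirror the roles that
 * XLOG_CONTINUE_TRANS, XLOG_WAS_CONT_TRANS and XLOG_END_TRANS play in
 * the patch; they are not the on-disk values.
 */
#define OP_CONTINUE	0x1u	/* region continues in the next buffer */
#define OP_WAS_CONT	0x2u	/* this chunk continues an earlier chunk */
#define OP_END		0x4u	/* this continuation chunk ends the region */

struct op_hdr {
	uint32_t	len;	/* payload bytes following this header */
	uint32_t	flags;
};

struct logbuf {
	char		*data;
	uint32_t	size;
	uint32_t	off;
};

/*
 * Write one region that may not fit in bufs[0], stamping an op header
 * in front of every chunk. Returns the number of buffers used, or -1
 * if the buffers run out before the region is complete.
 */
static int
write_split_region(struct logbuf *bufs, int nbufs, const char *addr,
		   uint32_t len)
{
	const uint32_t	hlen = sizeof(struct op_hdr);
	uint32_t	done = 0;
	uint32_t	flags = 0;
	int		i = 0;

	for (;;) {
		struct logbuf	*b = &bufs[i];
		struct op_hdr	hdr;
		uint32_t	rlen = len - done;
		uint32_t	space;

		/* each buffer needs room for a header and some payload */
		assert(b->size > b->off + hlen);
		space = b->size - b->off - hlen;

		if (rlen > space) {
			rlen = space;
			flags |= OP_CONTINUE;
		} else if (flags & OP_WAS_CONT) {
			flags |= OP_END;
		}

		hdr.len = rlen;
		hdr.flags = flags;
		memcpy(b->data + b->off, &hdr, hlen);
		memcpy(b->data + b->off + hlen, addr + done, rlen);
		b->off += hlen + rlen;
		done += rlen;

		if (!(flags & OP_CONTINUE))
			return i + 1;
		if (++i == nbufs)
			return -1;
		/* the next chunk is a continuation of this region */
		flags = OP_WAS_CONT;
	}
}
```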

> I think I follow the high level flow and it seems reasonable from a
> functional standpoint, but this also seems like quite a bit of churn for
> not much reduction in overall complexity. The higher level loop is much
> more simple and I think the per lv/vector iteration is an improvement,
> but we also seem to have duplicate functionality throughout the updated
> code and have introduced new forms of complexity around the state
> expectations for the transitions between the different write modes and
> between each write mode and the higher level loop.

Just untangling the code to get it to this point has been hard
enough. I've held off on further factoring and changes so I can
actually test it and find any bugs I might have left in it.

Yes, it can be further improved by factoring the region copying
stuff, but that's secondary to the major work of refactoring this
code in the first place. The fact that you actually understood this
fairly easily indicates just how much better this code already is
compared to what is currently upstream....
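Here is one possible shape for that region-copy factoring, offered only as an illustrative sketch. `copy_region_chunk()` and the types are hypothetical. A single helper copies as much of a region as fits, starting from a given offset, and advances both offsets. A whole-region copy and a partial copy then differ only in whether the region offset reaches the region length afterwards.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for a log iovec and a log buffer. */
struct region {
	const char	*addr;
	uint32_t	len;
};

struct logbuf {
	char		*data;
	uint32_t	size;
	uint32_t	off;
};

/*
 * Copy as much of @reg as fits in @b, starting at *@reg_off.
 * Advances both the buffer offset and *@reg_off and returns the number
 * of bytes copied. The region is complete when *@reg_off == reg->len.
 */
static uint32_t
copy_region_chunk(struct logbuf *b, const struct region *reg,
		  uint32_t *reg_off)
{
	uint32_t	want, space, n;

	assert(*reg_off <= reg->len && b->off <= b->size);
	want = reg->len - *reg_off;
	space = b->size - b->off;
	n = want < space ? want : space;
	memcpy(b->data + b->off, reg->addr + *reg_off, n);
	b->off += n;
	*reg_off += n;
	return n;
}
```

A whole-region caller would check that the helper copied `reg->len` bytes. A partial-write caller would loop, fetching a new buffer between calls, until `*reg_off == reg->len`.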

> I.e., xlog_write_single() implements a straightforward loop to write out
> full log vectors. That seems fine, but the outer loop of
> xlog_write_partial() reimplements nearly the same per-region
> functionality with some added flexibility to handle op header flags and
> the special iclog processing associated with the continuation case. The
> inner loop factors out the continuation iclog management bits and op
> header injection, which I think is an improvement, but then duplicates
> region copying (yet again) pretty much only to implement partial copies,
> which really just involves offset management (i.e., fairly trivial
> relative to the broader complexity of the function).
> 
> I dunno. I'd certainly need to stare more at this to cover all of the
> details, but given the amount of swizzling going on in a single patch
> I'm kind of wondering if/why we couldn't land on a single iterator in
> the spirit of xlog_write_partial() in that it primarily iterates on
> regions and factors out the grotty reservation and continuation
> management bits, but doesn't unroll as much and leave so much duplicate
> functionality around.
> 
> For example, it looks to me like xlog_write_partial() already nearly
> supports a high level algorithm along the lines of the following
> (pseudocode):
> 
> xlog_write(len)
> {
> 	get_iclog_space(len)
> 
> 	for_each_lv() {
> 		for_each_reg() {
> 			reg_offset = 0;
> cont_write:
> 			/* write as much as will fit in the iclog, return count,
> 			 * and set ophdr cont flag based on write result */
> 			reg_offset += write_region(reg, &len, &reg_offset, ophdr, ...);
> 
> 			/* handle continuation writes */
> 			if (reg_offset != reg->i_len) {
> 				get_more_iclog_space(len);
> 				/* stamp a WAS_CONT op hdr, set END if rlen fits
> 				 * into new space, then continue with the same region */
> 				stamp_cont_op_hdr();
> 				goto cont_write;
> 			}
> 
> 			if (need_more_iclog_space(len))
> 				get_more_iclog_space(len);
> 		}
> 	}
> }
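For comparison, the quoted pseudocode can be rendered as a small runnable C sketch. Everything here is hypothetical:

- An array of byte buffers stands in for iclogs.
- `get_more_space()` stands in for get_more_iclog_space().
- Op header stamping is reduced to a counter.
- Log vectors are flattened into one region array.

One detail in the pseudocode: it passes `&reg_offset` to write_region() and also adds the return value to `reg_offset`. That would double-advance the offset if write_region() updates it, so the sketch advances the offset in one place only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct region {
	const char	*addr;
	uint32_t	len;
};

struct logbuf {
	char		*data;
	uint32_t	size;
	uint32_t	off;
};

/* Hypothetical supply of buffers standing in for iclogs. */
struct bufset {
	struct logbuf	*bufs;
	int		nbufs;
	int		cur;		/* buffer currently being filled */
	int		nr_cont;	/* continuation headers we would stamp */
};

/* Stand-in for get_more_iclog_space(): move to the next buffer. */
static struct logbuf *
get_more_space(struct bufset *s)
{
	if (s->cur + 1 >= s->nbufs)
		return NULL;
	return &s->bufs[++s->cur];
}

/*
 * One walk over all regions. Partial copies are pure offset
 * management; reg_off is advanced in exactly one place.
 * Returns 0 on success, -1 if the buffers run out.
 */
static int
write_all_regions(struct bufset *s, const struct region *regs, size_t nr)
{
	struct logbuf	*b = &s->bufs[s->cur];
	size_t		i;

	for (i = 0; i < nr; i++) {
		uint32_t	reg_off = 0;

		for (;;) {
			uint32_t	n = regs[i].len - reg_off;

			assert(b->off <= b->size);
			if (n > b->size - b->off)
				n = b->size - b->off;
			memcpy(b->data + b->off, regs[i].addr + reg_off, n);
			b->off += n;
			reg_off += n;
			if (reg_off == regs[i].len)
				break;
			/* continuation: new buffer, same region again */
			b = get_more_space(s);
			if (!b)
				return -1;
			s->nr_cont++;
		}
		/* buffer exactly full: get a new one before the next region */
		if (b->off == b->size && i + 1 < nr) {
			b = get_more_space(s);
			if (!b)
				return -1;
		}
	}
	return 0;
}
```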

Yeah, na. That is exactly the mess that I've just untangled.

I don't want to rewrite this code again, and I don't want it more
tightly tied to iclogs than it already is - I'm trying to move the
code towards a common, simple fast path that knows nothing about
iclogs and a slow path that handles the partial regions and
obtaining a new buffer to write into. I want the two cases to use
completely separate logic, because that makes both cases simpler to
modify and reason about.

Indeed, I want xlog_write to move away from iclogs because I want to
use this code with direct mapped pmem regions, not just fixed memory
buffers held in iclogs.
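Below is a minimal sketch of the fast path/slow path split described above. It assumes a write target that is just a flat byte range plus a hook for obtaining the next one. `struct log_target`, its `get_next` callback and `struct chunk_pool` are all hypothetical.

The fast path is a bounded copy that knows nothing about where the bytes live. Only the slow path splits a region and fetches a new target. That separation is what would, in principle, let the same code write into an iclog data area or a directly mapped range. Op headers are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical write target: any flat byte range (an iclog data area,
 * a mapped region, ...) plus a hook to obtain the next one.
 */
struct log_target {
	char	*base;
	size_t	size;
	size_t	off;
	int	(*get_next)(struct log_target *t, void *priv);
	void	*priv;
};

/* Fast path: a pure bounded copy. Returns 0 if it did not fit. */
static int
target_write_fast(struct log_target *t, const void *src, size_t len)
{
	assert(t->off <= t->size);
	if (len > t->size - t->off)
		return 0;
	memcpy(t->base + t->off, src, len);
	t->off += len;
	return 1;
}

/* Slow path: fill the remaining space, switch targets, repeat. */
static int
target_write_slow(struct log_target *t, const char *src, size_t len)
{
	while (len) {
		size_t	n = t->size - t->off;

		if (n > len)
			n = len;
		memcpy(t->base + t->off, src, n);
		t->off += n;
		src += n;
		len -= n;
		if (len && t->get_next(t, t->priv))
			return -1;
	}
	return 0;
}

static int
target_write(struct log_target *t, const void *src, size_t len)
{
	if (target_write_fast(t, src, len))
		return 0;
	return target_write_slow(t, src, len);
}

/* Example hook: hand out fixed-size chunks of one backing array. */
struct chunk_pool {
	char	*mem;
	size_t	chunk;
	size_t	nchunks;
	size_t	next;
};

static int
pool_get_next(struct log_target *t, void *priv)
{
	struct chunk_pool	*p = priv;

	if (p->next >= p->nchunks)
		return -1;
	t->base = p->mem + p->next * p->chunk;
	t->size = p->chunk;
	t->off = 0;
	p->next++;
	return 0;
}
```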

IOWs, the code as it stands is a beginning, not an end. And even as
a beginning, it works, it is much better and faster than the current
code, and it has been tested for some time now. It can also be
factored further to make it simpler and easier to understand, and to
provide infrastructure for new features.


> That puts the whole thing back into a single high level walk and thus
> reintroduces the need for some of the continuation vs. non-continuation
> tracking wrt the op header and iclog, but ISTM that complexity can be
> managed by the continuation abstraction you've already started to
> introduce (as opposed to the current scheme of conditionally
> accumulating data_cnt). It might even be fine to dump some of the
> requisite state into a context struct to carry between iclog reservation
> and copy finish processing rather than pass around so many independent
> and poorly named variables like the current upstream implementation
> does, but that's probably getting too deep into the weeds.
> 
> FWIW, I can also see an approach of moving from the implementation in
> this patch toward something like the above, but I'm not sure I'd want to
> subject the upstream code to that process...
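The context-struct idea quoted above could look something like the following sketch. The struct and helpers are hypothetical. The fields group the log_offset, len, record_cnt and data_cnt counters that the patch passes by pointer between its helpers. The accounting mirrors the quoted diff: header bytes count towards data_cnt, and a continuation header adds to the remaining length.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical grouping of the counters the patch passes by pointer. */
struct write_ctx {
	uint32_t	log_offset;	/* next free byte in the current buffer */
	uint32_t	len;		/* bytes still to be written */
	uint32_t	record_cnt;	/* op headers written to this buffer */
	uint32_t	data_cnt;	/* bytes (headers included) in this buffer */
};

/* Account for one op header plus @rlen bytes of payload. */
static void
ctx_advance(struct write_ctx *ctx, uint32_t hdr_len, uint32_t rlen)
{
	assert(ctx->len >= hdr_len + rlen);
	ctx->log_offset += hdr_len + rlen;
	ctx->len -= hdr_len + rlen;
	ctx->data_cnt += hdr_len + rlen;
	ctx->record_cnt++;
}

/*
 * A continuation header was not part of the original length, so it is
 * added to the remaining length before switching buffers.
 */
static void
ctx_reserve_cont_hdr(struct write_ctx *ctx, uint32_t hdr_len)
{
	ctx->len += hdr_len;
}

/*
 * On switching buffers, hand back the per-buffer counts for the old
 * buffer and reset them for the new one.
 */
static void
ctx_switch_buffer(struct write_ctx *ctx, uint32_t *records, uint32_t *bytes)
{
	*records = ctx->record_cnt;
	*bytes = ctx->data_cnt;
	ctx->log_offset = 0;
	ctx->record_cnt = 0;
	ctx->data_cnt = 0;
}
```

Passing one `struct write_ctx *` replaces four separate pointer arguments, and the helpers give each accounting step a name.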

This is exactly what upstream is for - iterative improvement via
small steps. This is the first step of many, and what you propose
takes the code in the wrong direction for the steps I've already
taken and are planning to take.

Perfect is the enemy of good, and if upstream is not the place to
make iterative improvements like this that build towards a bigger
picture goal, then where the hell are we supposed to do them?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 145+ messages
2021-03-05  5:10 [PATCH 00/45 v3] xfs: consolidated log and optimisation changes Dave Chinner
2021-03-05  5:10 ` [PATCH 01/45] xfs: initialise attr fork on inode create Dave Chinner
2021-03-08 22:20   ` Darrick J. Wong
2021-03-16  8:35   ` Christoph Hellwig
2021-03-05  5:11 ` [PATCH 02/45] xfs: log stripe roundoff is a property of the log Dave Chinner
2021-03-05  5:11 ` [PATCH 03/45] xfs: separate CIL commit record IO Dave Chinner
2021-03-08  8:34   ` Chandan Babu R
2021-03-15 14:40   ` Brian Foster
2021-03-16  8:40   ` Christoph Hellwig
2021-03-05  5:11 ` [PATCH 04/45] xfs: remove xfs_blkdev_issue_flush Dave Chinner
2021-03-08  9:31   ` Chandan Babu R
2021-03-08 22:21   ` Darrick J. Wong
2021-03-15 14:40   ` Brian Foster
2021-03-16  8:41   ` Christoph Hellwig
2021-03-05  5:11 ` [PATCH 05/45] xfs: async blkdev cache flush Dave Chinner
2021-03-08  9:48   ` Chandan Babu R
2021-03-08 22:24     ` Darrick J. Wong
2021-03-15 14:41       ` Brian Foster
2021-03-15 16:32         ` Darrick J. Wong
2021-03-16  8:43           ` Christoph Hellwig
2021-03-08 22:26   ` Darrick J. Wong
2021-03-15 14:42   ` Brian Foster
2021-03-05  5:11 ` [PATCH 06/45] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
2021-03-15 14:43   ` Brian Foster
2021-03-16  8:47   ` Christoph Hellwig
2021-03-05  5:11 ` [PATCH 07/45] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
2021-03-15 14:45   ` Brian Foster
2021-03-16 14:15   ` Christoph Hellwig
2021-03-05  5:11 ` [PATCH 08/45] xfs: journal IO cache flush reductions Dave Chinner
2021-03-08 10:49   ` Chandan Babu R
2021-03-08 12:25   ` Brian Foster
2021-03-09  1:13     ` Dave Chinner
2021-03-10 20:49       ` Brian Foster
2021-03-10 21:28         ` Dave Chinner
2021-03-05  5:11 ` [PATCH 09/45] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
2021-03-05  5:11 ` [PATCH 10/45] xfs: reduce buffer log item shadow allocations Dave Chinner
2021-03-15 14:52   ` Brian Foster
2021-03-05  5:11 ` [PATCH 11/45] xfs: xfs_buf_item_size_segment() needs to pass segment offset Dave Chinner
2021-03-05  5:11 ` [PATCH 12/45] xfs: optimise xfs_buf_item_size/format for contiguous regions Dave Chinner
2021-03-05  5:11 ` [PATCH 13/45] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
2021-03-08 22:53   ` Darrick J. Wong
2021-03-11  0:26     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 14/45] xfs: AIL needs asynchronous CIL forcing Dave Chinner
2021-03-08 23:45   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 15/45] xfs: CIL work is serialised, not pipelined Dave Chinner
2021-03-08 23:14   ` Darrick J. Wong
2021-03-08 23:38     ` Dave Chinner
2021-03-09  1:55       ` Darrick J. Wong
2021-03-09 22:35         ` Andi Kleen
2021-03-10  6:11           ` Dave Chinner
2021-03-05  5:11 ` [PATCH 16/45] xfs: type verification is expensive Dave Chinner
2021-03-05  5:11 ` [PATCH 17/45] xfs: No need for inode number error injection in __xfs_dir3_data_check Dave Chinner
2021-03-05  5:11 ` [PATCH 18/45] xfs: reduce debug overhead of dir leaf/node checks Dave Chinner
2021-03-05  5:11 ` [PATCH 19/45] xfs: factor out the CIL transaction header building Dave Chinner
2021-03-08 23:47   ` Darrick J. Wong
2021-03-16 14:50   ` Brian Foster
2021-03-05  5:11 ` [PATCH 20/45] xfs: only CIL pushes require a start record Dave Chinner
2021-03-09  0:07   ` Darrick J. Wong
2021-03-16 14:51   ` Brian Foster
2021-03-05  5:11 ` [PATCH 21/45] xfs: embed the xlog_op_header in the unmount record Dave Chinner
2021-03-09  0:15   ` Darrick J. Wong
2021-03-11  2:54     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 22/45] xfs: embed the xlog_op_header in the commit record Dave Chinner
2021-03-09  0:17   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 23/45] xfs: log tickets don't need log client id Dave Chinner
2021-03-09  0:21   ` Darrick J. Wong
2021-03-09  1:19     ` Dave Chinner
2021-03-09  1:48       ` Darrick J. Wong
2021-03-11  3:01         ` Dave Chinner
2021-03-16 14:51   ` Brian Foster
2021-03-05  5:11 ` [PATCH 24/45] xfs: move log iovec alignment to preparation function Dave Chinner
2021-03-09  2:14   ` Darrick J. Wong
2021-03-16 14:51   ` Brian Foster
2021-03-05  5:11 ` [PATCH 25/45] xfs: reserve space and initialise xlog_op_header in item formatting Dave Chinner
2021-03-09  2:21   ` Darrick J. Wong
2021-03-11  3:29     ` Dave Chinner
2021-03-11  3:41       ` Darrick J. Wong
2021-03-16 14:54         ` Brian Foster
2021-03-16 14:53   ` Brian Foster
2021-05-19  3:18     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 26/45] xfs: log ticket region debug is largely useless Dave Chinner
2021-03-09  2:31   ` Darrick J. Wong
2021-03-16 14:55   ` Brian Foster
2021-05-19  3:27     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 27/45] xfs: pass lv chain length into xlog_write() Dave Chinner
2021-03-09  2:36   ` Darrick J. Wong
2021-03-11  3:37     ` Dave Chinner
2021-03-16 18:38   ` Brian Foster
2021-03-05  5:11 ` [PATCH 28/45] xfs: introduce xlog_write_single() Dave Chinner
2021-03-09  2:39   ` Darrick J. Wong
2021-03-11  4:19     ` Dave Chinner
2021-03-16 18:39   ` Brian Foster
2021-05-19  3:44     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 29/45] xfs:_introduce xlog_write_partial() Dave Chinner
2021-03-09  2:59   ` Darrick J. Wong
2021-03-11  4:33     ` Dave Chinner
2021-03-18 13:22   ` Brian Foster
2021-05-19  4:49     ` Dave Chinner [this message]
2021-05-20 12:33       ` Brian Foster
2021-05-27 18:03         ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 30/45] xfs: xlog_write() no longer needs contwr state Dave Chinner
2021-03-09  3:01   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 31/45] xfs: CIL context doesn't need to count iovecs Dave Chinner
2021-03-09  3:16   ` Darrick J. Wong
2021-03-11  5:03     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 32/45] xfs: use the CIL space used counter for emptiness checks Dave Chinner
2021-03-10 23:01   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 33/45] xfs: lift init CIL reservation out of xc_cil_lock Dave Chinner
2021-03-10 23:25   ` Darrick J. Wong
2021-03-11  5:42     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 34/45] xfs: rework per-iclog header CIL reservation Dave Chinner
2021-03-11  0:03   ` Darrick J. Wong
2021-03-11  6:03     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 35/45] xfs: introduce per-cpu CIL tracking sructure Dave Chinner
2021-03-11  0:11   ` Darrick J. Wong
2021-03-11  6:33     ` Dave Chinner
2021-03-11  6:42       ` Dave Chinner
2021-03-05  5:11 ` [PATCH 36/45] xfs: implement percpu cil space used calculation Dave Chinner
2021-03-11  0:20   ` Darrick J. Wong
2021-03-11  6:51     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 37/45] xfs: track CIL ticket reservation in percpu structure Dave Chinner
2021-03-11  0:26   ` Darrick J. Wong
2021-03-12  0:47     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 38/45] xfs: convert CIL busy extents to per-cpu Dave Chinner
2021-03-11  0:36   ` Darrick J. Wong
2021-03-12  1:15     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 39/45] xfs: Add order IDs to log items in CIL Dave Chinner
2021-03-11  1:00   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 40/45] xfs: convert CIL to unordered per cpu lists Dave Chinner
2021-03-11  1:15   ` Darrick J. Wong
2021-03-12  2:18     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 41/45] xfs: move CIL ordering to the logvec chain Dave Chinner
2021-03-11  1:34   ` Darrick J. Wong
2021-03-12  2:29     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 42/45] xfs: __percpu_counter_compare() inode count debug too expensive Dave Chinner
2021-03-11  1:36   ` Darrick J. Wong
2021-03-05  5:11 ` [PATCH 43/45] xfs: avoid cil push lock if possible Dave Chinner
2021-03-11  1:47   ` Darrick J. Wong
2021-03-12  2:36     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 44/45] xfs: xlog_sync() manually adjusts grant head space Dave Chinner
2021-03-11  2:00   ` Darrick J. Wong
2021-03-16  3:04     ` Dave Chinner
2021-03-05  5:11 ` [PATCH 45/45] xfs: expanding delayed logging design with background material Dave Chinner
2021-03-11  2:30   ` Darrick J. Wong
2021-03-16  3:28     ` Dave Chinner
