linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Allison Henderson <allison.henderson@oracle.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH v24 00/11] Log Attribute Replay
Date: Wed, 1 Sep 2021 16:29:43 +1000	[thread overview]
Message-ID: <20210901062943.GX3657114@dread.disaster.area> (raw)
In-Reply-To: <2b6b0478-b0df-7e35-b0ac-f02298ccf727@oracle.com>

On Tue, Aug 31, 2021 at 11:27:48AM -0700, Allison Henderson wrote:
> 
> 
> On 8/30/21 5:20 PM, Dave Chinner wrote:
> > On Tue, Aug 24, 2021 at 03:44:23PM -0700, Allison Henderson wrote:
> > > Hi all,
> > > 
> > > This set is a subset of a larger series parent pointers. Delayed attributes allow
> > > attribute operations (set and remove) to be logged and committed in the same
> > > way that other delayed operations do. This allows more complex operations (like
> > > parent pointers) to be broken up into multiple smaller transactions. To do
> > > this, the existing attr operations must be modified to operate as a delayed
> > > operation.  This means that they cannot roll, commit, or finish transactions.
> > > Instead, they return -EAGAIN to allow the calling function to handle the
> > > transaction.  In this series, we focus on only the delayed attribute portion.
> > > We will introduce parent pointers in a later set.
> > > 
> > > The set as a whole is a bit much to digest at once, so I usually send out the
> > > smaller sub series to reduce reviewer burn out.  But the entire extended series
> > > is visible through the included github links.
> 
> Some of this we worked out in the chat last night, but I will echo it here
> for the archives
> 
> > 
> > Ok, so like I did with Darrick's deferred inactivation series, the
> > first thing I'm doing here is throwing this patchset at
> > scalability/performance worklaods and finding out what is different.
> > 
> > I've merged this series with 5.14 + xfs/for-next + xfs-cil-scale and
> > then run some tests on it. First up is fsmark creating zero length
> > files w/ 64 byte xattrs. This should stress only shortform attribute
> > manipulations.
> > 
> > I have not enabled delayed attributes yet (i.e.
> > /sys/fs/xfs/debug/larp = 0)
> > 
> > First thing I notice is the transaction commit rate during create is
> > up around 900k/s, so we are doing 3 transactions per inode - 1 for
> > create, 2 for attributes. That looks like a regression - existing
> > shortform attribute creation only takes a single transaction commit,
> > so this workload prior to this patchset only ran at 600k commits/s.
> > 
> > Note that hte only reason I'm getting 900k transactions/s is the
> > CIL scalability patchset - without that the system tops out at ~800k
> > transactions/s and so this would be a significant performance
> > regression (20%) vs the current xfs/for-next code.
> > 
> > Essentially, this looks like we are doing an extra transaction
> > commit to defer the creation of the attribute, then doing another
> > transaction to actually modify the attribute. i.e.:
> > 
> >   - 11.04% xfs_attr_set
> >      - 8.70% xfs_trans_commit
> >         - 8.69% __xfs_trans_commit
> > 	  - 5.10% xfs_defer_finish_noroll
> > 	     - 3.74% xfs_defer_trans_roll
> > 		- 3.57% xfs_trans_roll
> > 		   - 3.13% __xfs_trans_commit
> > 		      - 3.01% xlog_cil_commit
> > 			   0.66% down_read
> > 			   0.63% xfs_log_ticket_regrant
> > 	     - 1.16% xfs_attr_finish_item
> > 		- 1.06% xfs_trans_attr_finish_update
> > 		   - 1.03% xfs_attr_set_iter
> > 		      - 1.01% xfs_attr_sf_addname
> > 			 - 0.99% xfs_attr_try_sf_addname
> > 			    - 0.61% xfs_attr_shortform_addname
> > 				 0.55% xfs_attr_shortform_add
> > 
> > 
> > AFAICT, for non-delayed attributes, this first transaction commit
> > logs the inode but does not create intent or intent done items
> > (returns NULL for both operations), so just rolls and runs the
> > ->finish_item. So it would seem that the first transaction just
> > changes the inode timestamps and does nothing else.
> > 
> > Firstly, this means the inode timestamp change is not atomic w.r.t.
> > the attribute change the timestamp change relates to and it's
> > essentially new overhead for the non-delayed path.
> > 
> > Looking at the unlink path, I see the same thing - there's an extra
> > transaction for the attr remove path, the same as the attr set path.
> > This drives the unlink path to 1.1 million transaction commits/sec
> > instead of 800k/s, so it's likely that there's a substantial
> > performance regression here on a kernel without the CIL scalability
> > patchset.
> > 
> > IOWs, there's significant behavioural changes with the non-delayed
> > logging version of this patchset, both in terms of performance and
> > the atomicity of changes that appear in the journal and hence
> > recovery behaviour.
> > 
> > At this point I have to ask: why are we trying to retain the "old"
> > way of doing things (even for testing) if it is substantially
> > changing behaviour and on-disk journal contents for attribute
> > modifications?
> Per the chat discussion, we have to keep both methods since sb v4 would not
> use the new log entries.

Yup, but we likely have better reasons than this. Tests I've run
today have indicated that as the xattr size goes up, the performance
regression of the delayed attrs gets worse. At max sized xattrs,
performance is only ~25% of the existing attr implementation.

Largely the problem is the number of copies of the xattr value we
end up in memory when we are using delalyed attrs. In the current
case, we have this all in a single syscall context:

	- xattr value allocated by VFS
	- attached to xfs_da_args
	- xfs_da_args passed to xfs_attr_rmtval_set_value() where we
	  allocate buffer(s) to hold the encoded name.
	- buffer is written to disk.
	- xattr VFS value freed.

And the xfs_buf is freed when memory pressure occurs.

So, essentially we have two copies of the xattr value and two
memcpy()s of it to get it to disk.

Now, with delayed attrs we have this in the syscall context:

	- xattr value allocated by VFS
	- attached to xfs_da_args
	- attached to xfs_attri_item
	- attri item intent gets created, allocates a 64k+ shadow
	  buffer and memcpy()s the value into it.
	- intent gets committed and attached to the CIL
	- xfs_da_args passed to xfs_attr_rmtval_set_value() where we
	  allocate buffer(s) to hold the encoded name.
	- buffer is written to disk.
	- xattr VFS value freed.

Now userspace runs more xattr creation syscalls, all queuing up
shadow buffers on the CIL. memory footprint goes up massively.

Some time later, the CIL commits and we:
	- shadow buffer attached to CIL commit
	- memcpy() the value from the shadow buffer to the iclog
	- commit the CIL checkpoint
	- on CIL checkpoint completion, shadow buffer is freed.

And the xfs_bufs is freed when memory pressure occurs.

The result is that with delayed attrs, large attr creation hammers
the page allocator really badly - more than 50% of the CPU time is
spent trying to allocate pages and compact memory into contiguous
regions and so performance goes way down. e.g.

- 51.90% xlog_cil_commit
 - 36.96% kvmalloc_node
    - 36.51% __kmalloc_node
       - 36.32% kmalloc_large_node
	  - 36.24% __alloc_pages
	     - 34.53% __alloc_pages_slowpath.constprop.0
		- 33.80% __alloc_pages_direct_compact
		   - 33.62% try_to_compact_pages
		      - compact_zone_order
			 - 30.58% compact_zone
			      14.72% PageHuge

I suspect we need to look at our use of kvmalloc() and the flags we
pass it because I think if we can't get contiguous ranges
immediately, we should go straight to vmalloc rather than burn a
dozen CPUs trying to compact memory...

The amount we log also goes way up.  At 16 threads and 4kB xattrs,
the log alone is sustaining more than 1GB/s throughput at 50k xattrs
creates/s.  There is over 2GB/s going to disk.

With 64k xattrs, there is over 2.5GB/s being written to the log and
just over 3GB/s being written to disk. Performance is about 9000
xattr creates/s.

In comparison, with delayed attrs turned off for 64k xattrs, log
throughput is roughly 500MB/s, disk throughput is between 3.5-4GB/s
and the create rate is around 45,000 xattrs/s. So you can see that
logging large xattrs really hurts performance.

We want delayed attrs for modifying attributes atomically
with other inode modifications (i.e. parent pointers, security
labels, ACLs, etc) and none of these use cases use large xattrs -
they are all relatively small and so the logging and memory overhead
of delayed attrs isn't a huge deal for them. For anything else,
small or large, a 10-15% regression is a deal breaker (think ceph,
samba, etc).

So I think we're going to need to select attribute modification
behaviour at the call site. e.g. from xattr syscalls we don't get
any real integrity benefit from logging xattrs, so maybe this path
always uses the unlogged path. The we can change the internal create
path to use delayed attrs for parent pointers and security attrs.
As Darrick said, this is likely what the current larp debug knob
should turn into - caller selected behaviour.

> > Ok, there's the first failure.
> > 
> > This looks like it's a problem with xfs_attri_item_{size,format} in
> > calculating the number of bytes to log. They use ATTR_NVEC_SIZE() to
> > calculate the number of bytes of copy from the attribute item which
> > rounds up the length to copy to 4 byte aligned values. I'm not sure
> > what this function is calculating:
> > 
> > /* iovec length must be 32-bit aligned */
> > static inline size_t ATTR_NVEC_SIZE(size_t size)
> > {
> >          return size == sizeof(int32_t) ? size :
> > 	               sizeof(int32_t) + round_up(size, sizeof(int32_t));
> > }
> > 
> > It appears to be saying if the size == 4, then return 4, otherwise
> > return 4 + roundup(size)... which leads me to struct
> > xfs_attri_log_format:
> > 
> > struct xfs_attri_log_format {
> >          uint16_t        alfi_type;      /* attri log item type */
> >          uint16_t        alfi_size;      /* size of this item */
> >          uint32_t        __pad;          /* pad to 64 bit aligned */
> >          uint64_t        alfi_id;        /* attri identifier */
> >          uint64_t        alfi_ino;       /* the inode for this attr operation */
> >          uint32_t        alfi_op_flags;  /* marks the op as a set or remove */
> >          uint32_t        alfi_name_len;  /* attr name length */
> >          uint32_t        alfi_value_len; /* attr value length */
> >          uint32_t        alfi_attr_flags;/* attr flags */
> > };
> > 
> > I don't see where the extra 4 bytes for the attribute vector size
> > comes from. It's not needed to store the length, so this could
> > oversize the amount of data to be copied from the source
> > buffer by up to 7 bytes?
> > 
> > I can see that it might need rounding with the existing
> > log code (because the formatter is responsible for 32 bit alignment
> > of log vectors), but that goes away with the CIL scalability
> > patchset that always aligns iovecs to 4 byte alignment so the
> > formatters do not need to do that.
> 
> I think we figured this out last night, initially this was here for an
> assertion check in the log code, but I think just the round up will suffice
> for the check.

I've got a cleaner solution in the works - we should be hiding the
alignment behind the log iovec API, not force callers to know about
it and always get it right.

> > Hiding it in a "macro" is not necessary, either - look at how
> > xfs_inode_item_{data,attr}_fork_size handle the rounding up of the
> > local format fork size. They round up the fork byte count to 4
> > directly, and the format code copies those bytes because
> > xfs_idata_realloc() allocates those bytes.
> > 
> > However, for the attribute buffers, this isn't guaranteed. Look at
> > xfs_xattr_set():
> > 
> >          struct xfs_da_args      args = {
> >                  .dp             = XFS_I(inode),
> >                  .attr_filter    = handler->flags,
> >                  .attr_flags     = flags,
> >                  .name           = name,
> >                  .namelen        = strlen(name),
> >                  .value          = (void *)value,
> >                  .valuelen       = size,
> >          };
> > 
> > There is no rounding up of the name or value lengths, and these end
> > up directly referenced by the deferred logging via xfs_attr_log_item()
> > and attrip->da_args->...
> > 
> >          attrip->attri_name = (void *)attr->xattri_da_args->name;
> >          attrip->attri_value = attr->xattri_da_args->value;
> >          attrip->attri_name_len = attr->xattri_da_args->namelen;
> >          attrip->attri_value_len = attr->xattri_da_args->valuelen;
> > 
> > We then pass those pointers directly to xlog_iovec_copy() but with a
> > rounded up length that is longer than the source buffer:
> > 
> >          xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_NAME,
> >                          attrip->attri_name,
> >                          ATTR_NVEC_SIZE(attrip->attri_name_len));
> >          if (attrip->attri_value_len > 0)
> >                  xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_VALUE,
> >                                  attrip->attri_value,
> >                                  ATTR_NVEC_SIZE(attrip->attri_value_len));
> > 
> > So while this might not be the source of the above crash, it's
> > certainly a bug that needs fixing.
> > 
> > At this point, I'm just going to hack on the code to make it work,
> > and we can go from there...
> 
> I think when we left off last night, we are more concerned with the extra
> transaction overhead in the perf captures.  I will see if I can replicate
> what you are seeing with perf and maybe we can work out some short cuts.
> Thank for your help here!

Yup, there's a few issues. I've hacked out the initial transaction
for non-delayed attrs and that brings the commit count down as
expected. it might be acceptible, but we'll see.

I've also added intent whiteouts to try to avoid having to log
intents that are completed before the intent has been written to the
log. With that in place, the above large xattr creation workloads
have relatively consistent log throughput of around 600-700MB/s, and
performance for 4k and 64kB xattrs roughly doubles. It's still way
behind the current implementation, though, because it hammers the
page allocator even harder now. However, intent whiteouts are
generic, so I'll write this up into a separate patchset that
addresses all the other types of intents we have as well.

There's some other things we can potentially do to minimise the
memory footprint and the number of memcpy()s of the logged values,
so I'll look into these tomorrow.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2021-09-01  6:29 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-24 22:44 Allison Henderson
2021-08-24 22:44 ` [PATCH v24 01/11] xfs: Return from xfs_attr_set_iter if there are no more rmtblks to process Allison Henderson
2021-08-24 22:44 ` [PATCH v24 02/11] xfs: Capture buffers for delayed ops Allison Henderson
2021-08-27 11:33   ` Chandan Babu R
2021-08-31 18:03     ` Allison Henderson
2021-08-30 17:44   ` Darrick J. Wong
2021-08-31 17:50     ` Allison Henderson
2021-09-18  0:02       ` Darrick J. Wong
2021-08-24 22:44 ` [PATCH v24 03/11] xfs: Set up infrastructure for log atrribute replay Allison Henderson
2021-08-27 14:17   ` Chandan Babu R
2021-08-31 18:03     ` Allison Henderson
2021-08-31  0:48   ` Dave Chinner
2021-08-31 19:03     ` Allison Henderson
2021-08-31 21:52   ` Dave Chinner
2021-09-01  5:34     ` Allison Henderson
2021-08-24 22:44 ` [PATCH v24 04/11] xfs: Implement attr logging and replay Allison Henderson
2021-08-30  7:47   ` Chandan Babu R
2021-08-31 18:11     ` Allison Henderson
2021-08-24 22:44 ` [PATCH v24 05/11] RFC xfs: Skip flip flags for delayed attrs Allison Henderson
2021-08-30 10:15   ` Chandan Babu R
2021-08-31 18:11     ` Allison Henderson
2021-08-24 22:44 ` [PATCH v24 06/11] xfs: Add xfs_attr_set_deferred and xfs_attr_remove_deferred Allison Henderson
2021-08-30 10:27   ` Chandan Babu R
2021-08-31 18:12     ` Allison Henderson
2021-09-01  3:47   ` Dave Chinner
2021-09-01  5:34     ` Allison Henderson
2021-08-24 22:44 ` [PATCH v24 07/11] xfs: Remove unused xfs_attr_*_args Allison Henderson
2021-08-24 22:44 ` [PATCH v24 08/11] xfs: Add log attribute error tag Allison Henderson
2021-08-24 22:44 ` [PATCH v24 09/11] xfs: Add larp debug option Allison Henderson
2021-08-30 12:03   ` Chandan Babu R
2021-08-31 18:12     ` Allison Henderson
2021-08-24 22:44 ` [PATCH v24 10/11] xfs: Merge xfs_delattr_context into xfs_attr_item Allison Henderson
2021-08-24 22:44 ` [PATCH v24 11/11] xfs: Add helper function xfs_attr_leaf_addname Allison Henderson
2021-08-30 14:17   ` Chandan Babu R
2021-08-31 18:13     ` Allison Henderson
2021-08-31  0:20 ` [PATCH v24 00/11] Log Attribute Replay Dave Chinner
2021-08-31 18:27   ` Allison Henderson
2021-09-01  6:29     ` Dave Chinner [this message]
2021-09-01  7:30 ` [PATCH 0/5] xfs: various logged attribute fixes Dave Chinner
2021-09-01  7:30   ` [PATCH 1/5] xfs: fix fallthrough annotations in xfs_attr_set_iter() Dave Chinner
2021-09-01  7:30   ` [PATCH 2/5] xfs: fix flags passed to kvmalloc() by xfs_attri_init() Dave Chinner
2021-09-01  7:30   ` [PATCH 3/5] xfs: hide log iovec alignment constraints Dave Chinner
2021-09-01  7:30   ` [PATCH 4/5] xfs: fix attribute log iovec sizing Dave Chinner
2021-09-01  7:30   ` [PATCH 5/5] [RFC] xfs: don't commit the first deferred transaction without intents Dave Chinner
2021-09-01 21:43   ` [PATCH 0/5] xfs: various logged attribute fixes Allison Henderson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210901062943.GX3657114@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=allison.henderson@oracle.com \
    --cc=linux-xfs@vger.kernel.org \
    --subject='Re: [PATCH v24 00/11] Log Attribute Replay' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).