From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 8/8] xfs: cache unlinked pointers in an rhashtable
Date: Fri, 1 Feb 2019 20:31:59 -0800	[thread overview]
Message-ID: <20190202043159.GN5761@magnolia> (raw)
In-Reply-To: <20190201235900.GX6173@dastard>

On Sat, Feb 02, 2019 at 10:59:00AM +1100, Dave Chinner wrote:
> On Fri, Feb 01, 2019 at 12:03:43AM -0800, Christoph Hellwig wrote:
> > On Thu, Jan 31, 2019 at 03:18:40PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Use a rhashtable to cache the unlinked list incore.  This should speed
> > > up unlinked processing considerably when there are a lot of inodes on
> > > the unlinked list because iunlink_remove no longer has to traverse an
> > > entire bucket list to find which inode points to the one being removed.
> > 
> > This sounds pretty reasonable and a real review will follow.  But can
> > you quantify the considerably speedups for real life workloads?
> 
> When you have more than a few hundred open-but-unlinked inodes in an
> AG, the unlinked list walking starts to take a significant amount of
> time when the final reference to an inode goes away. It's
> essentially an O(N) buffer walk, so it takes a lot of CPU time when a
> list gets more than a few inodes long.
> 
> With darrick's background inactivation, the lists grow to tens of
> thousands of inodes very quickly. I was seeing >95% of CPU time
> being spent in unlinked list walking on large scale 'rm -rf'
> workloads and performance absolutely dropped off a cliff. My initial
> code (from years and years ago) used a doubly linked list, and when
> I forward ported that and ran it up,
> the CPU overhead of unlink list removal was cut to almost nothing and
> all the performance regressions with background inactivation went
> away. i.e. Unlink list removal went from an O(N) overhead to always
> being O(1), so the length of the list didn't affect performance any
> more.
> 
> Darrick's code replaces the list_head in each inode I used with a
> rhashtable - it removes the 16 bytes per-inode memory footprint my
> implementation required, but otherwise is identical.  Hence I expect
> that it'll show exactly the same performance characteristics as the
> existing code.
> 
> IOWs, it's really required for backgrounding and parallelising inode
> inactivation, but otherwise it won't make any noticeable/measurable
> difference to most workloads because the unlinked lists almost never
> grow more than one inode in length and so O(N) ~= O(1) almost all
> the time...

I wrote a silly program to open as many O_TMPFILE files as it can and
then close them all.  I wrapped that in a script to crank up ulimit -n
to fs.file-max (which on this VM was about 194000 files) and came up
with this on a 5.0-rc4 kernel with deferred inactivation and iunlink
backref caching:

+ /d/t/tmpfile/tmpfile
Opened 193519 files in 6.44s.
Closed 193519 files in 1.37s

real    0m7.818s
user    0m0.083s
sys     0m7.373s
+ cd /
+ umount /mnt

real    0m5.681s
user    0m0.008s
sys     0m0.000s

I then repeated the experiment with a vanilla 5.0-rc2 kernel I had lying
around and saw much slower results:

+ /d/t/tmpfile/tmpfile
Opened 193588 files in 6.35s.
Closed 193588 files in 751.61s

real    12m38.853s
user    0m0.084s
sys     12m34.470s
+ cd /
+ umount /mnt

real    0m0.086s
user    0m0.000s
sys     0m0.060s

The VM had 2G of RAM and a 3GB SCSI disk backed by a tmpfs file.

--D

Wrapper script:

#!/bin/bash -x

dir="$(dirname "$0")"

umount /mnt
rmmod xfs
mkfs.xfs -f /dev/sda
mount /dev/sda /mnt

ulimit -n $(cat /proc/sys/fs/file-max)

cd /mnt
time $dir/tmpfile
cd /
time umount /mnt

and tmpfile.c:

/* Open files and leak them. */
#define _GNU_SOURCE
#include <time.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>

static int min_fd = -1;
static int max_fd = -1;
static unsigned int nr_opened = 0;
static float start_time;

void clock_time(float *time)
{
	static clockid_t clkid = CLOCK_MONOTONIC;
	struct timespec ts;
	int ret;

retry:
	ret = clock_gettime(clkid, &ts);
	if (ret) {
		if (clkid == CLOCK_MONOTONIC) {
			clkid = CLOCK_REALTIME;
			goto retry;
		}
		perror("clock_gettime");
		exit(2);
	}
	*time = ts.tv_sec + ((float)ts.tv_nsec / 1000000000);
}

/*
 * Exit the program due to an error.
 *
 * If we've exhausted all the file descriptors, close all the open fds
 * in the order we received them, to exploit a quirk of ext4 and xfs
 * where the oldest unlinked inodes are at the /end/ of the unlinked
 * lists; this makes removing the unlinked files maximally painful.
 *
 * If it's some other error, just die and let the kernel sort it out.
 */
void die(void)
{
	float end_time;
	int fd;

	switch (errno) {
	case EMFILE:
	case ENFILE:
	case ENOSPC:
		clock_time(&end_time);
		printf("Opened %u files in %.2fs.\n", nr_opened,
				end_time - start_time);
		clock_time(&start_time);

		for (fd = min_fd; fd <= max_fd; fd++)
			close(fd);
		clock_time(&end_time);
		printf("Closed %u files in %.2fs\n", nr_opened,
				end_time - start_time);
		exit(0);
		break;
	default:
		perror("open?");
		exit(2);
		break;
	}
}

/* Remember how many files we open and all that. */
void remember_fd(int fd)
{
	if (min_fd == -1 || min_fd > fd)
		min_fd = fd;
	if (max_fd == -1 || max_fd < fd)
		max_fd = fd;
	nr_opened++;
}

/* Put an opened file on the unlinked list and leak the fd. */
void leak_tmpfile(void)
{
	int fd = -1;
	int ret;

	/* Try to create an O_TMPFILE and leak the fd. */
#ifdef O_TMPFILE
	fd = open(".", O_TMPFILE | O_RDWR, 0644);
	if (fd >= 0) {
		remember_fd(fd);
		return;
	}
	/* Fall back only if the kernel or fs lacks O_TMPFILE support. */
	if (errno != EOPNOTSUPP && errno != EISDIR)
		die();
#endif

	/* Oh well, create a new file, unlink it, and leak the fd. */
	fd = open("./moo", O_CREAT | O_RDWR, 0644);
	if (fd < 0)
		die();
	ret = unlink("./moo");
	if (ret)
		die();
	remember_fd(fd);
}

/* Put as many files as we can on the unlinked list, then close them all. */
int main(int argc, char *argv[])
{
	clock_time(&start_time);
	while (1)
		leak_tmpfile();
	return 0;
}

