linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Miklos Szeredi <miklos@szeredi.hu>
To: Ian Kent <raven@themaw.net>
Cc: Lennart Poettering <mzxreary@0pointer.de>,
	David Howells <dhowells@redhat.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	dray@redhat.com, Karel Zak <kzak@redhat.com>,
	Miklos Szeredi <mszeredi@redhat.com>,
	Steven Whitehouse <swhiteho@redhat.com>,
	Jeff Layton <jlayton@redhat.com>,
	andres@anarazel.de, keyrings@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Aleksa Sarai <cyphar@cyphar.com>
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()
Date: Tue, 7 Apr 2020 15:59:10 +0200	[thread overview]
Message-ID: <CAJfpegvYGB01i9eqCH-95Ynqy0P=CuxPCSAbSpBPa-TV8iXN0Q@mail.gmail.com> (raw)
In-Reply-To: <a4b5828d73ff097794f63f5f9d0fd1532067941c.camel@themaw.net>

[-- Attachment #1: Type: text/plain, Size: 2797 bytes --]

On Tue, Apr 7, 2020 at 4:22 AM Ian Kent <raven@themaw.net> wrote:
> > Right now, when you have n mounts, and any mount changes, or one is
> > added or removed then we have to parse the whole mount table again,
> > asynchronously, processing all n entries again, every frickin
> > time. This means the work to process n mounts popping up at boot is
> > O(n²). That sucks, it should be obvious to anyone. Now if we get that
> > fixed, by some mount API that can send us minimal notifications about
> > what happened and where, then this becomes O(n), which is totally OK.

Something's not right with the above statement.  Hint: if there are
lots of events in quick succession, you can batch them quite easily to
prevent overloading the system.

Wrote a pair of utilities to check out the capabilities of the current
API.   The first one just creates N mounts, optionally sleeping
between each.  The second one watches /proc/self/mountinfo and
generates individual (add/del/change) events based on POLLPRI and
comparing contents with previous instance.

First use case: create 10,000 mounts, then start the watcher and
create 1000 mounts with a 50ms sleep between them.  Total time (user +
system) consumed by the watcher: 25s.  This is indeed pretty dismal,
and a per-mount query will help tremendously.  But it's still "just"
25ms per mount, so if the mounts are far apart (which is what this
test is about), this won't thrash the system.  Note, how this is self
regulating: if the load is high, it will automatically batch more
requests, preventing overload.  It is also prone to lose pairs of add
+ remove in these case (and so is the ring buffer based one from
David).

Second use case: start the watcher and create 50,000 mounts with no
sleep between them.   Total time consumed by the watcher: 0.154s or
3.08us/event.    Note, the same test case adds about 5ms for the
50,000 umount events, which is 0.1us/event.

Real life will probably be between these extremes, but it's clear that
there's room for improvement in userspace as well as kernel
interfaces.  The current kernel interface is very efficient in
retrieving a lot of state in one go.  It is not efficient in handling
small differences.

> > Anyway, I have the suspicion this discussion has stopped being
> > useful. I think you are trying to fix problems that userspce actually
> > doesn't have. I can just tell you what we understand the problems
> > are,
> > but if you are out trying to fix other percieved ones, then great,
> > but
> > I mostly lost interest.

I was, and still am, trying to see the big picture.

Whatever.   I think it's your turn to show some numbers about how the
new API improves performance of systemd with a large number of mounts.

Thanks,
Miklos

[-- Attachment #2: many-mounts.c --]
[-- Type: text/x-csrc, Size: 1155 bytes --]

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <err.h>
#include <sys/stat.h>
#include <sys/mount.h>

int main(int argc, char *argv[])
{
	char *base_path = argv[1];
	char name[4096];
	int nr_mounts, i, sleep_ms = 0;

	if (argc < 3 || argc > 4)
		errx(1, "usage: %s base_path nr_mounts [sleep_ms]", argv[0]);

	nr_mounts = atoi(argv[2]);
	if (argc > 3)
		sleep_ms = atoi(argv[3]);

	fprintf(stderr, "Mounting...\n");
	if (mount("none", base_path, "tmpfs", 0, NULL) == -1)
		err(1, "mount/tmpfs");
	if (mount("none", base_path, NULL, MS_PRIVATE, NULL) == -1)
		err(1, "mount/MS_PRIVATE");
	for (i = 0; i < nr_mounts; i++) {
		sprintf(name, "%s/%d", base_path, i);
		if (mkdir(name, 0755) == -1)
			err(1, "mkdir");
		if (mount("none", name, "tmpfs", 0, NULL) == -1)
			err(1, "mount/tmpfs");
		if (mount("none", name, NULL, MS_PRIVATE, NULL) == -1)
			err(1, "mount/MS_PRIVATE");
		if (sleep_ms)
			usleep(sleep_ms * 1000);
	}
	fprintf(stderr, "Press ENTER\n");
	getchar();

	fprintf(stderr, "Unmounting...\n");
	if (umount2(base_path, MNT_DETACH) == -1)
		err(1, "umount");

	fprintf(stderr, "Done\n");

	return 0;
}

[-- Attachment #3: watch_mounts.c --]
[-- Type: text/x-csrc, Size: 3380 bytes --]

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <poll.h>
#include <err.h>

struct index {
	struct index *next;
	struct index *prev;
	const char *line;
};

struct state {
	size_t bufsize;
	char *buf;
	size_t index_size;
	struct index *index;
	struct index head;
};

static void read_mountinfo(struct pollfd *pfd, char *buf, size_t bufsize)
{
	int readcnt, backoff = 0, retry = 0;
	size_t len;
	ssize_t res;

retry:
	if (lseek(pfd->fd, 0, SEEK_SET) == (off_t) -1)
		err(1, "lseek");
	len = 0;
	readcnt = 0;
	do {
		if (len >= bufsize - 4096)
			errx(1, "buffer overrun");
		res = read(pfd->fd, buf + len, bufsize - len);
		if (res == -1)
			err(1, "read");
		len += res;

		if (!res || !(++readcnt % 16)) {
			if (poll(pfd, 1, 0) == -1)
				err(1, "poll/0");
			if (pfd->revents & POLLPRI) {
				if (!backoff) {
					backoff++;
					goto retry;
				}
				if (!retry) {
					fprintf(stderr, "retry.");
					retry = 1;
				}
				do {
					usleep(backoff * 1000);
					if (backoff < 128)
						backoff *= 2;
					if (poll(pfd, 1, 0) == -1)
						err(1, "poll/0");
				} while (pfd->revents & POLLPRI);
				goto retry;
			}
		}
	} while (res);
	buf[len] = '\0';

	if (retry) {
		fprintf(stderr, "..\n");
		retry = 0;
	}
}

static void add_index(struct state *s, struct index *this, const char *line)
{
	struct index *prev = s->head.prev, *next = &s->head;

	if (this->line)
		errx(1, "index corruption");

	this->line = line;
	this->next = next;
	this->prev = prev;
	prev->next = next->prev = this;
}

static void del_index(struct index *this)
{
	struct index *prev = this->prev, *next = this->next;

	this->line = NULL;
	prev->next = next;
	next->prev = prev;
}

static void diff_mountinfo(struct state *old, struct state *cur)
{
	char *line, *end;
	struct index *this;
	int mntid;

	cur->head.next = cur->head.prev = &cur->head;
	for (line = cur->buf; line[0]; line = end + 1) {
		end = strchr(line, '\n');
		if (!end)
			errx(1, "parsing (1)");
		*end = '\0';
		if (sscanf(line, "%i", &mntid) != 1)
			errx(1, "parsing (2)");
		if (mntid < 0 || (size_t) mntid >= cur->index_size)
			errx(1, "index overflow");
		add_index(cur, &cur->index[mntid], line);

		this = &old->index[mntid];
		if (this->line) {
			if (strcmp(this->line, line))
				printf("* %s\n", line);
			del_index(this);
		} else {
			printf("+ %s\n", line);
		}
	}
	while (old->head.next != &old->head) {
		this = old->head.next;
		printf("- %s\n", this->line);
		del_index(this);
	}
	fflush(stdout);
}

int main(void)
{
	struct state state[2], *old = &state[0], *cur = &state[1], *tmp;
	struct pollfd pfd = { .events = POLLPRI };

	old->index_size = cur->index_size = 131072;
	old->bufsize = cur->bufsize = cur->index_size * 128;
	old->index = calloc(old->index_size, sizeof(struct index));
	cur->index = calloc(cur->index_size, sizeof(struct index));
	old->buf = malloc(old->bufsize);
	cur->buf = malloc(cur->bufsize);
	if (!old->index || !cur->index || !old->buf || !cur->buf)
		err(1, "allocating buffers");

	old->buf[0] = '\0';
	old->head.prev = old->head.next = &old->head;

	pfd.fd = open("/proc/self/mountinfo", O_RDONLY);
	if (pfd.fd == -1)
		err(1, "open");

	while (1) {
		read_mountinfo(&pfd, cur->buf, cur->bufsize);
		diff_mountinfo(old, cur);

		tmp = old;
		old = cur;
		cur = tmp;

		if (poll(&pfd, 1, -1) == -1)
			err(1, "poll/inf");
	}
}

  reply	other threads:[~2020-04-07 13:59 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-30 13:58 Upcoming: Notifications, FS notifications and fsinfo() David Howells
2020-03-30 14:31 ` [GIT PULL] General notification queue and key notifications David Howells
2020-03-31  6:51   ` Stephen Rothwell
2020-03-30 14:36 ` [GIT PULL] Mount and superblock notifications David Howells
2020-04-04 21:13   ` Linus Torvalds
2020-04-05 22:52     ` Andres Freund
2020-03-30 14:43 ` [GIT PULL] fsinfo: Filesystem information query David Howells
2020-03-30 20:28 ` Upcoming: Notifications, FS notifications and fsinfo() Miklos Szeredi
2020-03-31  9:21   ` Karel Zak
2020-03-30 21:17 ` Christian Brauner
2020-03-31  5:11   ` Miklos Szeredi
2020-03-31  8:15     ` Christian Brauner
2020-03-31  8:34       ` Miklos Szeredi
2020-03-31  8:34     ` Karel Zak
2020-03-31  8:56       ` Miklos Szeredi
2020-03-31  9:49         ` Karel Zak
2020-03-31 12:25         ` Lennart Poettering
2020-03-31 15:10           ` Miklos Szeredi
2020-03-31 15:24             ` Lennart Poettering
2020-03-31 21:56         ` David Howells
2020-03-31 21:54     ` David Howells
2020-04-01  8:43       ` Karel Zak
2020-03-31  7:22   ` Lennart Poettering
2020-03-31 17:31 ` David Howells
2020-03-31 19:42   ` Miklos Szeredi
2020-03-31 19:47   ` David Howells
2020-03-31 21:14   ` David Howells
2020-03-31 21:23   ` David Howells
2020-03-31 21:52 ` David Howells
2020-04-01  9:04   ` Karel Zak
2020-04-01 13:34     ` Miklos Szeredi
2020-04-01 13:55     ` David Howells
2020-04-01 13:58     ` David Howells
2020-04-01 15:25       ` Miklos Szeredi
2020-04-03  9:11         ` Karel Zak
2020-04-01 16:01       ` David Howells
2020-04-01 16:30         ` Miklos Szeredi
2020-04-02 15:22         ` David Howells
2020-04-02 15:24           ` Miklos Szeredi
2020-04-02 15:42           ` David Howells
2020-04-02 15:24         ` David Howells
2020-04-01 14:41   ` Lennart Poettering
2020-04-01 15:33     ` Miklos Szeredi
2020-04-01 16:06     ` David Howells
2020-04-01 16:40       ` Miklos Szeredi
2020-04-02  2:52         ` Ian Kent
2020-04-02 13:52           ` Miklos Szeredi
2020-04-02 14:36             ` Lennart Poettering
2020-04-02 15:22               ` Miklos Szeredi
2020-04-02 15:28                 ` Lennart Poettering
2020-04-02 15:35                   ` Miklos Szeredi
2020-04-02 15:50                     ` Lennart Poettering
2020-04-02 17:20                       ` Miklos Szeredi
2020-04-03 11:08                         ` Lennart Poettering
2020-04-03 11:48                           ` Miklos Szeredi
2020-04-03 15:01                             ` Lennart Poettering
2020-04-06  9:22                               ` Miklos Szeredi
2020-04-06 17:29                                 ` Lennart Poettering
2020-04-07  2:21                                   ` Ian Kent
2020-04-07 13:59                                     ` Miklos Szeredi [this message]
2020-04-07 15:53                                       ` Lennart Poettering
2020-04-07 16:06                                         ` Miklos Szeredi
2020-04-02 15:51                 ` David Howells
2020-04-02 15:56                 ` David Howells
2020-04-03  1:44             ` Ian Kent
2020-04-03 11:11               ` Lennart Poettering
2020-04-03 11:38                 ` Miklos Szeredi
2020-04-03 12:05                   ` Richard Weinberger
2020-04-03 15:12                   ` Lennart Poettering
2020-04-03 20:30                     ` J. Bruce Fields
2020-04-06  8:35                       ` Miklos Szeredi
2020-04-06 16:07                         ` J. Bruce Fields
2020-04-06  9:17                       ` Karel Zak
2020-04-06 16:34                         ` Linus Torvalds
2020-04-06 18:46                           ` J. Bruce Fields
2020-04-06 18:48                           ` Lennart Poettering
2020-04-08  3:36                             ` Linus Torvalds
2020-04-03 15:36                   ` David Howells
2020-04-03 15:41                     ` Lennart Poettering

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJfpegvYGB01i9eqCH-95Ynqy0P=CuxPCSAbSpBPa-TV8iXN0Q@mail.gmail.com' \
    --to=miklos@szeredi.hu \
    --cc=andres@anarazel.de \
    --cc=christian.brauner@ubuntu.com \
    --cc=cyphar@cyphar.com \
    --cc=dhowells@redhat.com \
    --cc=dray@redhat.com \
    --cc=jlayton@redhat.com \
    --cc=keyrings@vger.kernel.org \
    --cc=kzak@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mszeredi@redhat.com \
    --cc=mzxreary@0pointer.de \
    --cc=raven@themaw.net \
    --cc=swhiteho@redhat.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).