some more information on multi-hour data rollbacks, fsync, and superblock updates or lack thereof

* some more information on multi-hour data rollbacks, fsync, and superblock updates or lack thereof
@ 2020-02-12  4:01 Zygo Blaxell
  2020-02-12  4:23 ` Zygo Blaxell
  0 siblings, 1 reply; 2+ messages in thread
From: Zygo Blaxell @ 2020-02-12  4:01 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 7908 bytes --]

While trying to understand how btrfs rolls back data several hours on a
crash, I deployed some observation tools:  one to measure fsync latency,
and one to measure the interval between updates to the filesystem
superblock's generation (transid) field.  I wanted to see if this
phenomenon can be detected so I can figure out how to replicate it.
I put these tools on some test and production machines.

The results so far are somewhat disturbing:

5.0 can do multi-hour transactions too, in fact it's even easier to do
on 5.0 than on 5.4.  Not that I've figured out how to replicate this
behavior on demand yet, just that the data I've collected so far has more
and longer long transactions occurring "naturally" on 5.0 than on 5.4.

There is no obvious correlation between the running times of fsync,
superblock updates, and other metadata operations.  Without looking
directly at the superblock, there's no obvious sign of a problem:
applications are reading and writing and even calling fsync() normally,
but the superblock's generation field stubbornly does not update.
A machine that I previously thought had been healthy and performing
excellently turned out to be my example case of a trivially reproduced,
worst-case scenario.

Everything seems fine as long as a host _never_ runs fsync.  I had one
host that was up for days without incident with transactions updating
at reasonable intervals, until I started running the fsync test, and had
an immediate 45-minute transaction stall.  The 5-hour data loss event I
reported a few days ago and the 14-hour transaction event before that
also started almost immediately following fsync() calls.  Maybe not
a coincidence?

fsync returns ... well, not immediately, but almost always within 10
minutes or less, even at times when the filesystem's superblock generation
number is not updated for an hour.

The good news is that fsync does work.  I caught a machine in the middle
of one of these multi-hour transactions, made two files, fsynced one,
and forced a crash, with the result:

	# cat /proc/version
	Linux version 5.0.21-zb64-079f865c6d4d+ (root@waya) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Wed Jan 22 12:06:32 EST 2020
	# cp /boot/vmlinuz test-no-fsync
	# cp /boot/vmlinuz test-fsync
	# sync test-fsync
	# ls -l
	total 16812
	-rw-r--r-- 1 root root  102458 Feb 11 22:09 fsync.log
	-rw------- 1 root root 8550048 Feb 11 22:09 test-fsync
	-rw------- 1 root root 8550048 Feb 11 22:09 test-no-fsync
	-rw-r--r-- 1 root root    3710 Feb 11 21:22 transid.log
	# echo b > /proc/sysrq-trigger

One reboot later:

	# ls -l
	total 8448
	-rw-r--r-- 1 root root   92256 Feb 11 20:56 fsync.log
	-rw------- 1 root root 8550048 Feb 11 22:09 test-fsync
	-rw-r--r-- 1 root root    3652 Feb 11 19:57 transid.log
	# cmp test-fsync /boot/vmlinuz
	#

So the file that was fsynced is complete and correct, but the file
that was not fsynced--and 73 minutes of other writes throughout the
filesystem--neatly disappeared.

This was the third of three consecutive hour-plus transactions on
this host:

        end of transaction timestamp   seconds with same transid
        2020-02-11-19-57-39 1581469059 4315.585493582
        2020-02-11-21-22-06 1581474126 5067.863739094
        [third transaction aborted by reboot 47 minutes later]

The timestamps don't quite line up here--there is data on the filesystem
after the last superblock update (at 21:22), but still far from current
(73 minutes lost).  Maybe btrfs is updating one superblock at a time,
so I only see one of every 3 transid updates with this method?  But the
transid increment is always 1, and I'd expect to see increments 3 at a
time if I was missing two thirds of them.

If you would like to play at home, the fsync latency testing tool is:

	#define _GNU_SOURCE 1

	#include <errno.h>
	#include <fcntl.h>
	#include <math.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/random.h>
	#include <time.h>
	#include <unistd.h>

	static char buf[128 * 1024 * 1024];

	void
	fillrandom(void *buf, size_t size, unsigned flags)
	{
		while (size) {
			const ssize_t rv = getrandom(buf, size, flags);
			if (rv < 0) {
				perror("getrandom");
				exit(1);
			}
			if ((size_t)rv > size) {
				fprintf(stderr, "WTF: rv %zd > size %zu\n", rv, size);
				exit(1);
			}
			buf += rv;
			size -= rv;
		}
	}

	int
	main(int argc, char **argv)
	{
		double sleep_time = 1;
		if (argc == 2) {
			sleep_time = atof(argv[1]);
		}
		const int fd = open(".", O_TMPFILE | O_RDWR, 0666);
		if (fd < 0) {
			perror("open");
			exit(1);
		}
		const struct timespec sleep_ts = {
			.tv_sec = sleep_time,
			.tv_nsec = 1000000000 * (sleep_time - floor(sleep_time)),
		};

		fprintf(stderr, "Waiting for random...");
		fflush(stderr);
		fillrandom(buf, sizeof(buf), 0);
		fprintf(stderr, "\n");
		fflush(stderr);

		struct timespec next_time;
		clock_gettime(CLOCK_MONOTONIC, &next_time);

		while (true) {

			struct timespec start;
			clock_gettime(CLOCK_MONOTONIC, &start);

			write(2, "w\b", 2);
			unsigned wr_pos = 0, wr_len = 256 * 1024;
			fillrandom(&wr_pos, sizeof(wr_pos), 0);
			fillrandom(&wr_len, sizeof(wr_len), 0);
			wr_pos %= sizeof(buf) * 8;
			wr_len %= sizeof(buf);
			const int rvw = pwrite(fd, buf, wr_len, wr_pos);
			if (rvw < 0) {
				perror("pwrite");
				exit(1);
			}

			write(2, "f\b", 2);
			const int rvf = fsync(fd);
			if (rvf) {
				perror("fsync");
				exit(1);
			}
			write(2, "s\b", 2);

			struct timespec end;
			clock_gettime(CLOCK_MONOTONIC, &end);
			char ft[1024];
			struct timespec now;
			clock_gettime(CLOCK_REALTIME, &now);
			const struct tm *const tm = localtime(&now.tv_sec);
			strftime(ft, sizeof(ft), "%Y-%m-%d-%H-%M-%S", tm);
			const double start_ts = start.tv_sec + (start.tv_nsec / 1000000000.0);
			const double end_ts = end.tv_sec + (end.tv_nsec / 1000000000.0);
			printf("%s %ld.%09ld %.3g\n", ft, now.tv_sec, now.tv_nsec, end_ts - start_ts);
			fflush(stdout);

			next_time.tv_sec += sleep_ts.tv_sec;
			next_time.tv_nsec += sleep_ts.tv_nsec;
			if (next_time.tv_nsec >= 1000000000) {
				next_time.tv_nsec -= 1000000000;
				++next_time.tv_sec;
			}
			clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next_time, NULL);
		}
	}

Run it in any directory on the target filesystem, it will create a
temporary file, write data to it, fsync, repeat.  The first argument is
the number of seconds between write/fsync, e.g. to fsync every 1.5 seconds:

	fsync-log 1.5

The superblock generation monitoring tool is:

	#!/bin/sh
	devPath="$1"
	shift

	getTransid () {
		btrfs ins dump-super "$devPath" | sed -n 's/^generation *//p'
	}

	now () {
		date +%s.%N
	}

	lastTransTime="$(now)"
	lastTransid="$(getTransid)"

	while sleep 1; do
		nextTransid="$(getTransid)"
		nextTransTime="$(now)"
		if [ "$nextTransid" != "$lastTransid" ]; then
			echo "$(date +%Y-%m-%d-%H-%M-%S) $nextTransTime $(echo $nextTransTime - $lastTransTime | bc -l) $((nextTransid-lastTransid))"
			lastTransid="$nextTransid"
			lastTransTime="$nextTransTime"
		fi
	done

This one needs the name of a device in the btrfs filesystem as an argument,
and will call btrfs-progs to extract the generation field.  e.g.

	transid-log /dev/mapper/testfs

Both write a log like this (without the headers):

        timestamp           seconds-since-epoch  latency in seconds     number of txns
        2020-02-11-18-38-17 1581464297.795520161 53.726077219 1
        2020-02-11-18-39-11 1581464351.590758110 53.795237949 1
        2020-02-11-18-40-15 1581464415.425477144 63.834719034 1
        2020-02-11-18-45-43 1581464743.422229067 327.996751923 1
        2020-02-11-19-57-39 1581469059.007722649 4315.585493582 1
        2020-02-11-21-22-06 1581474126.871461743 5067.863739094 1

which can be fed to your favorite plotting program to make a horrifying
graph on which the area under the curve is the amount of data you'll
lose in a crash or power failure.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread