* PROBLEM: relay - stale data copied to user space
From: Martin Peschke @ 2009-03-18 15:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: Tom Zanussi, linux-s390

Hi,

Tracing block I/Os using blktrace and blkiomon results in "bad trace"
messages, meaning that user space has come across a trace which is
broken:

bad trace magic 100
bad trace magic 10e7
bad trace magic 1
bad trace magic 1
bad trace magic 374
bad trace magic 7000010
bad trace magic 11000010
bad trace magic 1000
bad trace magic 10e3
bad trace magic 0
bad trace magic 1000
bad trace magic 10de
bad trace magic 8b3
bad trace magic 0

Blktrace cannot recover from this condition, because it has lost track
of where the next valid trace starts.
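
For reference, each event in the stream starts with a struct blk_io_trace
header, and the check that produces these messages looks roughly like the
sketch below (a simplified illustration based on the blktrace ABI;
verify_trace() here is a stand-in, not the exact blkparse code). Once a
header is overwritten, its pdu_len can no longer be trusted, so the parser
cannot find the start of the next event:

#include <stdint.h>
#include <stdio.h>

#define BLK_IO_TRACE_MAGIC   0x65617400	/* ASCII "eat" plus a version byte */
#define BLK_IO_TRACE_VERSION 0x07

struct blk_io_trace {
	uint32_t magic;		/* BLK_IO_TRACE_MAGIC | version */
	uint32_t sequence;	/* event number */
	uint64_t time;		/* in nanoseconds */
	uint64_t sector;	/* disk offset */
	uint32_t bytes;		/* transfer length */
	uint32_t action;	/* what happened */
	uint32_t pid;		/* who did it */
	uint32_t device;	/* device identifier */
	uint32_t cpu;		/* on what cpu did it happen */
	uint16_t error;		/* completion error */
	uint16_t pdu_len;	/* length of data after this trace */
};

/* simplified: reject an event whose magic field does not match */
static int verify_trace(const struct blk_io_trace *t)
{
	if ((t->magic & 0xffffff00) != BLK_IO_TRACE_MAGIC) {
		fprintf(stderr, "bad trace magic %x\n", t->magic);
		return 1;
	}
	return 0;
}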

I gathered hexdumps of blktrace data in user space. Looking at the data
as read by blktrace from relay files, I found fragments of old traces
partially overlaying other traces:

(In the hexdumped trace below, I have added "|"s as delimiters between
traces. The corrupted part of the trace - a fragment containing the
sequence number 0x4d1d3 - is in parentheses.)

12d8ca0 |6561 7407 0005 6440 0000 00fb ef9d 6496  -> sequence 0x56440
12d8cb0  0000 0000 000f 4790 0002 0000 4001 0011
12d8cc0  0000 1445 0080 0020 0000 0001 0000 0018
12d8cd0  0000 0001 0001 007d 0000 0000 0001 3e40
12d8ce0  0000 0000 001b badc|6561 7407 0005 6441  -> sequence 0x56441
12d8cf0  0000 00fb ef9d 79e3 0000 0000 000f 4790
12d8d00  0002 0000 0181 0008 0000 1445 0080 0020
12d8d10  0000 0001 0000 0000|6561 7407 0005 6442  -> sequence 0x56442
12d8d20 (0000 1444 0080 0020 0000 0001 0000 0000
12d8d30 |6561 7407 0004 d1d3 0000 00e3 4a8b 1544  -> sequence 0x4d1d3
12d8d40  0000 0000 0015 5728 0003 0000 4001 0011     old trace!!
12d8d50  0000 1447 0080 0020 0000 0001 0000 0018)
12d8d60 |6561 7407 0005 6443 0000 00fb f04d c32a  -> sequence 0x56443
12d8d70  0000 0000 000f 4928 0000 8000 4001 0011
12d8d80  0000 1445 0080 0020 0000 0001 0000 0018
12d8d90  0000 0001 0001 007f 0000 0000 0000 55d8
12d8da0  0000 0000 0008 1070|6561 7407 0005 6444  -> sequence 0x56444

The same fragment containing sequence 0x4d1d3 originally appeared in
this context:

10d8d70  0000 0000 0013 970e|6561 7407 0004 d1d2  -> sequence 0x4d1d2
10d8d80  0000 00e3 4a82 8773 0000 0000 0039 4d20
10d8d90  0001 0000 0181 0008(0000 1444 0080 0020
10d8da0  0000 0001 0000 0000|6561 7407 0004 d1d3  -> sequence 0x4d1d3
10d8db0  0000 00e3 4a8b 1544 0000 0000 0015 5728
10d8dc0  0003 0000 4001 0011 0000 1447 0080 0020
10d8dd0  0000 0001 0000 0018)0000 0001 0001 0080
10d8de0  0000 0000 0001 156c 0000 0000 001a 10c8
10d8df0 |6561 7407 0004 d1d4 0000 00e3 4a8b 239c  -> sequence 0x4d1d4
10d8e00  0000 0000 0015 5728 0003 0000 0181 0008
10d8e10  0000 1447 0080 0020 0000 0001 0000 0000

Blktrace data is handled by relay, which uses per-CPU ring buffers that
are divided into sub-buffers.
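
For context, the kernel writer side works roughly like the sketch below
(heavily simplified; this is not the actual code in block/blktrace.c, and
the helper name is made up): a slot is first reserved in the current
per-CPU sub-buffer and only filled in afterwards, so until the last store
completes the slot still holds stale data from the previous trip around
the ring buffer.

#include <linux/relay.h>
#include <linux/blktrace_api.h>
#include <linux/irqflags.h>
#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/string.h>
#include <linux/types.h>

/* sketch only: reserve a slot in the relay buffer, then fill it in place */
static void add_trace_sketch(struct rchan *rchan, u64 sector, u32 bytes,
			     u32 action, u32 dev, u16 pdu_len,
			     const void *pdu_data)
{
	struct blk_io_trace *t;
	unsigned long flags;

	local_irq_save(flags);		/* writer cannot be preempted here */

	t = relay_reserve(rchan, sizeof(*t) + pdu_len);
	if (t) {
		/* between relay_reserve() and these stores, the slot still
		 * contains whatever was written in the previous cycle */
		t->magic    = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
		t->sequence = 0;	/* a per-cpu sequence in the real code */
		t->time     = sched_clock();
		t->sector   = sector;
		t->bytes    = bytes;
		t->action   = action;
		t->pid      = current->pid;
		t->device   = dev;
		t->cpu      = smp_processor_id();
		t->error    = 0;
		t->pdu_len  = pdu_len;
		if (pdu_len)
			memcpy((void *)t + sizeof(*t), pdu_data, pdu_len);
	}

	local_irq_restore(flags);
}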

I decided to use the following debug patch:

debug patch: poison various kinds of unused relay buffers
---
 kernel/relay.c |    5 +++++
 1 file changed, 5 insertions(+)

--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -735,6 +735,8 @@ size_t relay_switch_subbuf(struct rchan_
 		old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs;
 		buf->padding[old_subbuf] = buf->prev_padding;
 		buf->subbufs_produced++;
+		memset((char *)buf->data + buf->offset, 0x66,
+		       buf->prev_padding);
 		if (buf->dentry)
 			buf->dentry->d_inode->i_size +=
 				buf->chan->subbuf_size -
@@ -767,6 +769,8 @@ size_t relay_switch_subbuf(struct rchan_
 	if (unlikely(length + buf->offset > buf->chan->subbuf_size))
 		goto toobig;
 
+	memset(buf->data, 0x55, buf->chan->subbuf_size);
+
 	return length;
 
 toobig:
@@ -1112,6 +1116,7 @@ static int subbuf_read_actor(size_t read
 		desc->error = -EFAULT;
 		ret = 0;
 	}
+	memset(from, 0x77, ret);
 	desc->arg.data += ret;
 	desc->written += ret;
 	desc->count -= ret;


I found out that the trace data is already corrupted when it is being
copied to user space by copy_to_user() in kernel/relay.c. My 0x55
pattern, which marks sub-buffers whose old contents have been
invalidated, showed up in user space:

magic    0x65617407
sequence 0x00003baf
time     0x5555555555555555
sector   0x5555555555555555
bytes    0x55555555
action   0x55555555
pid      0x55555555
device   0x55555555
cpu      0x55555555
error    0x5555
pdu_len  0x5555

65617407 00003baf 55555555 55555555 55555555 55555555 55555555 55555555
55555555 55555555 55555555 55555555


Next, I added another debug patch to detect poisoned buffers, that is,
stale data, right before copy_to_user().


debug patch: detect relay buffer poisoning when copying data to user
space
---
 kernel/relay.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -1098,6 +1098,45 @@ static size_t relay_file_read_end_pos(st
        return end_pos;
 }

+static inline void hexdump(char *buf, int len, char *descr)
+{
+       int i;
+
+       printk("<----- %s, buffer at %p, length %d, first <=4096 bytes:" ,
+              descr, buf, len);
+       for (i = 0; i < min(len, 4096); i++) {
+               if (i % 4 == 0)
+                       printk(" ");
+               if (i % 32 == 0)
+                       printk("\n");
+               printk("%02x", buf[i]);
+       }
+       printk("----->\n\n");
+}
+
+static inline void * memsearch(char *buf, char c, size_t buf_len, size_t c_len)
+{
+       char *p = buf;
+       int i;
+
+       do {
+               p = memchr(p, c, buf_len - (p - buf));
+               if (!p)
+                       return NULL;
+               for (i = 0; i < c_len - 1; i++, p++) {
+                       if (p >= buf + buf_len)
+                               return NULL;
+                       if (*p != c)
+                               break;
+               }
+               if (i == c_len - 1) {
+                       hexdump(p - (c_len + 32), c_len + 64, "memsearch");
+                       return (p - c_len);
+               }
+       } while (p < buf + buf_len);
+       return NULL;
+}
+
 /*
  *     subbuf_read_actor - read up to one subbuf's worth of data
  */
@@ -1112,6 +1151,12 @@ static int subbuf_read_actor(size_t read

        from = buf->start + read_start;
        ret = avail;
+
+       if (memsearch(from, 0x55, avail, 16)/* ||
+           memsearch(from, 0x66, avail, 8) ||
+           memsearch(from, 0x77, avail, 8)*/)
+               hexdump(from, avail, "relay: found buffer poisoning");
+
        if (copy_to_user(desc->arg.buf, from, avail)) {
                desc->error = -EFAULT;
                ret = 0;
        }



After some tests with and without poison detection in kernel space, I
suspect that the extra time my debug patch spends scanning for buffer
poison reduces the chance of hitting bad traces in user space.

Another observation: when scanning buffers that are about to be copied
to user space, hexdumps of poisoned buffers no longer show the poison
where it had been found just a split second before. A typical example:

77777777 77777777 77777777 77777777 77777777 77777777 77656174 070001c3
37000001 26a13312 80000000 00003bf9 d0000080 00014100 0700000f 46008000
10000000 00000000 00555555 55555555

The second line should contain a 0x55 pattern. It's gone.

What else do we see? Scanning backward, there are bytes already
(recently?) copied to user space (0x77 pattern). Scanning forward, there
is room for more traces (0x55 pattern).


This is my theory:
Timing matters. It's a race caused by improper protection of critical
sections in a producer-consumer scenario. A bug in the bookkeeping
allows a reader to read at a position that is just being written to.

I have no real evidence. I don't yet understand how reader-writer
serialisation is implemented in relay.


I use FCP disks attached to a Linux guest on a System z and an internal
tool for I/O workload generation. I have stripped my environment down to
1 disk with 4 paths. Data is written to and read from a device mapper
device. Blktrace is run for the 4 underlying SCSI devices:

blktrace -a issue -a complete -o - /dev/sda /dev/sdu /dev/sdao \
/dev/sdbi | blkiomon -I 100 -b blkiomon.out

I run a recent git kernel and a recent git blktrace. The problem has
also been seen with older code, though.


Any comment is welcome.

Please keep me on to: or cc: since I am not subscribed to linux-kernel.

Thanks,
Martin



* Re: PROBLEM: relay - stale data copied to user space
From: Tom Zanussi @ 2009-03-19  4:19 UTC (permalink / raw)
  To: Martin Peschke; +Cc: linux-kernel, linux-s390

Hi,

On Wed, 2009-03-18 at 16:07 +0100, Martin Peschke wrote:

> 
> This is my theory:
> Timing matters. It's a race caused by improper protection of critical
> sections in a producer-consumer scenario. A bug in the bookkeeping
> allows a reader to read at a position that is just being written to.
> 

It does look consistent with a reader reading an event that's been
reserved but not yet written, or partially written e.g. if an event
being written on one cpu was read by another before the first one
finished.  Can you see if the below patch to blktrace userspace helps?

Or failing that, explicitly using gettid() in place of getpid() in
sched_setaffinity().  Or, failing that, you had mentioned previously
that you would try to reproduce the problem on your laptop - were you
able to do that?  If so, it would help in debugging it further...

Tom

diff --git a/blktrace.c b/blktrace.c
index 26b3afd..656ab7a 100644
--- a/blktrace.c
+++ b/blktrace.c
@@ -610,7 +610,7 @@ static int lock_on_cpu(int cpu)
 
 	CPU_ZERO(&cpu_mask);
 	CPU_SET(cpu, &cpu_mask);
-	if (sched_setaffinity(getpid(), sizeof(cpu_mask), &cpu_mask) < 0)
+	if (sched_setaffinity(0, sizeof(cpu_mask), &cpu_mask) < 0)
 		return errno;
 
 	return 0;
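
For completeness, the gettid() alternative mentioned above would look
roughly like the sketch below; glibc had no gettid() wrapper at the time,
so a raw syscall is assumed, and the function names are made up. Passing 0,
as in the patch, already means "the calling thread"; getpid() from a
secondary thread names the main thread instead, which is why only the main
thread ended up pinned.

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

static pid_t my_gettid(void)
{
	return (pid_t) syscall(SYS_gettid);
}

/* same effect as passing 0: pin the calling thread, not the main thread */
static int lock_on_cpu_tid(int cpu)
{
	cpu_set_t cpu_mask;

	CPU_ZERO(&cpu_mask);
	CPU_SET(cpu, &cpu_mask);
	if (sched_setaffinity(my_gettid(), sizeof(cpu_mask), &cpu_mask) < 0)
		return errno;

	return 0;
}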




* Re: PROBLEM: relay - stale data copied to user space
From: Martin Peschke @ 2009-03-19 17:50 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: linux-kernel, linux-s390


On Wed, 2009-03-18 at 23:19 -0500, Tom Zanussi wrote:
> On Wed, 2009-03-18 at 16:07 +0100, Martin Peschke wrote
> > This is my theory:
> > Timing matters. It's a race caused by improper protection of critical
> > sections in a producer-consumer scenario. A bug in the bookkeeping
> > allows a reader to read at a position that is just being written to.
> > 
> 
> It does look consistent with a reader reading an event that's been
> reserved but not yet written, or partially written e.g. if an event
> being written on one cpu was read by another before the first one
> finished.

So this is part of relay's design, and it's up to user space to make
sure that reader and writer are on the same CPU?

> Can you see if the below patch to blktrace userspace helps?

It appears to fix it. I will give it more testing in a larger
environment.

> Or failing that, explicitly using gettid() in place of getpid() in
> sched_setaffinity().  Or, failing that, you had mentioned previously
> that you would try to reproduce the problem on your laptop - were you
> able to do that?  If so, it would help in debugging it further...

This didn't work out. But then, it's a single-CPU machine.

Thanks,
Martin




* Re: PROBLEM: relay - stale data copied to user space
From: Tom Zanussi @ 2009-03-20  5:21 UTC (permalink / raw)
  To: Martin Peschke; +Cc: linux-kernel, linux-s390

Hi,

On Thu, 2009-03-19 at 18:50 +0100, Martin Peschke wrote:
> On Wed, 2009-03-18 at 23:19 -0500, Tom Zanussi wrote:
> > On Wed, 2009-03-18 at 16:07 +0100, Martin Peschke wrote
> > > This is my theory:
> > > Timing matters. It's a race caused by improper protection of critical
> > > sections in a producer-consumer scenario. A bug in the bookkeeping
> > > allows a reader to read at a position that is just being written to.
> > > 
> > 
> > It does look consistent with a reader reading an event that's been
> > reserved but not yet written, or partially written e.g. if an event
> > being written on one cpu was read by another before the first one
> > finished.
> 
> So this is part of relay's design, and it's up to user space to make
> sure that reader and writer are on the same CPU?
> 

No, that wasn't originally part of relay's design, but for all practical
purposes that's what you need to do if you're using the read()
interface.

The original design was meant to be used with mmap() and poll() on
complete sub-buffers, i.e. you grab complete sub-buffers from the mmapped
buffer only when poll() says they're ready.  If used this way, it
doesn't matter which cpu is doing the reading or writing.  relay still
works this way if you use mmap() as in e.g. the klog example.

The read() was added later, on top of something that wasn't designed
with it in mind, the main difference being that a read() can be done at
any time and it expects whatever's available to be returned, regardless
of whether it's reading from a complete sub-buffer or not (if you could
only read from complete sub-buffers, you might as well use the mmap()
method).

So if a reader can read from a sub-buffer at any time while a writer may
simultaneously be active in that sub-buffer, and the two are on different
cpus, you can get what you were seeing.  One way to prevent that is to
keep the reader on the same cpu, and disable preemption for the writer,
so when the reader on that cpu finally reads, it can be sure there are
no in-progress events up to the endpoint of what it's reading.
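
A minimal sketch of that reader side, assuming the usual per-CPU relay
files under debugfs (the path, buffer size and function name below are
illustrative, not blktrace's actual code): one reader thread per CPU,
pinned to that CPU, consuming only that CPU's file.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* one of these per CPU, started e.g. via pthread_create() */
static void *per_cpu_reader(void *arg)
{
	int cpu = (int)(long)arg;
	cpu_set_t mask;
	char path[256];
	char buf[64 * 1024];	/* illustrative read size */
	ssize_t n;
	int fd;

	/* pin this thread, so reads happen on the same cpu as the writer */
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0)	/* 0 == this thread */
		return NULL;

	/* illustrative path; blktrace derives it from the traced device */
	snprintf(path, sizeof(path), "/sys/kernel/debug/block/sda/trace%d", cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;

	/* a 0-byte read just means no data was available at that moment;
	 * a real reader would also exit on a stop signal */
	while ((n = read(fd, buf, sizeof(buf))) >= 0) {
		if (n == 0) {
			usleep(10000);
			continue;
		}
		/* hand n bytes off to the output / parsing stage here */
	}

	close(fd);
	return NULL;
}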

Hope that helps,

Tom

> > Can you see if the below patch to blktrace userspace helps?
> 
> It appears to fix it. I will give it more testing in a larger
> environment.
> 
> > Or failing that, explicitly using gettid() in place of getpid() in
> > sched_setaffinity().  Or, failing that, you had mentioned previously
> > that you would try to reproduce the problem on your laptop - were you
> > able to do that?  If so, it would help in debugging it further...
> 
> This didn't work out. But then, it's a single-CPU machine.
> 
> Thanks,
> Martin
> 
> 

