linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GIT PULL] tracing: user_event fix for 6.4
@ 2023-06-23 19:29 Steven Rostedt
  2023-06-23 23:21 ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Rostedt @ 2023-06-23 19:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Masami Hiramatsu, Mark Rutland, Beau Belgrave,
	Andrew Morton, sunliming


Linus,

Tracing fix to user events return value on write

Before user events become an ABI, fix the return value of the write
operation when tracing is disabled. It should not return an error, but
simply report it wrote zero bytes. Just like any other write operation
that doesn't write data but does not "fail".

This also includes test cases for this use case.


Please pull the latest trace-v6.4-rc7 tree, which can be found at:


  git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
trace-v6.4-rc7

Tag SHA1: ec925ce3a8456d2a527572e940ed52123a6083cd
Head SHA1: 131966049074b8025a8cba0e99d3745311fcef5b


sunliming (3):
      tracing/user_events: Fix incorrect return value for writing operation when events are disabled
      selftests/user_events: Enable the event before write_fault test in ftrace self-test
      selftests/user_events: Add test cases when event is disabled

----
 kernel/trace/trace_events_user.c                  | 3 ++-
 tools/testing/selftests/user_events/ftrace_test.c | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)
---------------------------
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 8df0550415e7..902cae005df4 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -2096,7 +2096,8 @@ static ssize_t user_events_write_core(struct file *file, struct iov_iter *i)
 
 		if (unlikely(faulted))
 			return -EFAULT;
-	}
+	} else
+		return 0;
 
 	return ret;
 }
diff --git a/tools/testing/selftests/user_events/ftrace_test.c b/tools/testing/selftests/user_events/ftrace_test.c
index eb6904d89f14..f42a20520175 100644
--- a/tools/testing/selftests/user_events/ftrace_test.c
+++ b/tools/testing/selftests/user_events/ftrace_test.c
@@ -324,6 +324,9 @@ TEST_F(user, write_events) {
 	io[0].iov_base = &reg.write_index;
 	io[0].iov_len = sizeof(reg.write_index);
 
+	/* Write should return zero when event is not enabled */
+	ASSERT_EQ(0, writev(self->data_fd, (const struct iovec *)io, 3));
+
 	/* Enable event */
 	self->enable_fd = open(enable_file, O_RDWR);
 	ASSERT_NE(-1, write(self->enable_fd, "1", sizeof("1")))
@@ -400,6 +403,10 @@ TEST_F(user, write_fault) {
 	ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSREG, &reg));
 	ASSERT_EQ(0, reg.write_index);
 
+	/* Enable event */
+	self->enable_fd = open(enable_file, O_RDWR);
+	ASSERT_NE(-1, write(self->enable_fd, "1", sizeof("1")))
+
 	/* Write should work normally */
 	ASSERT_NE(-1, writev(self->data_fd, (const struct iovec *)io, 2));
 

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] tracing: user_event fix for 6.4
  2023-06-23 19:29 [GIT PULL] tracing: user_event fix for 6.4 Steven Rostedt
@ 2023-06-23 23:21 ` Linus Torvalds
  2023-06-24  2:09   ` Steven Rostedt
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2023-06-23 23:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Masami Hiramatsu, Mark Rutland, Beau Belgrave,
	Andrew Morton, sunliming

On Fri, 23 Jun 2023 at 12:29, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Before user events become an ABI, fix the return value of the write
> operation when tracing is disabled. It should not return an error, but
> simply report it wrote zero bytes. Just like any other write operation
> that doesn't write data but does not "fail".

This makes no sense.

A write() returning 0 means "Disk full". It's most definitely an
error, and a failure.

But zero is not a particularly _good_ failure code. At least not
unless your tracing disk is full. Is it?

If tracing is disabled, and you write something to it, I would expect
to get a valid and reasonable error code back. Something like EINVAL
or EIO or something to show "you did something wrong".

I do not at all understand the sentence

   "When user_events are disabled, its write operation should return zero"

as an "explanation" for this, and my immediate reaction is "Really?
Why? That makes no sense".

                  Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] tracing: user_event fix for 6.4
  2023-06-23 23:21 ` Linus Torvalds
@ 2023-06-24  2:09   ` Steven Rostedt
  2023-06-24 17:50     ` Steven Rostedt
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Rostedt @ 2023-06-24  2:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Masami Hiramatsu, Mark Rutland, Beau Belgrave,
	Andrew Morton, sunliming

On Fri, 23 Jun 2023 16:21:36 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 23 Jun 2023 at 12:29, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Before user events become an ABI, fix the return value of the write
> > operation when tracing is disabled. It should not return an error, but
> > simply report it wrote zero bytes. Just like any other write operation
> > that doesn't write data but does not "fail".  
> 
> This makes no sense.
> 
> A write() returning 0 means "Disk full". It's most definitely an
> error, and a failure.
> 

OK, I shouldn't have said that it was like other disk write options. I
was thinking more of writing to something that uses a buffer like
snprintf() which returns the number of bytes written.

> But zero is not a particularly _good_ failure code. At least not
> unless your tracing disk is full. Is it?

It's not a failure. It means that the buffer is disabled. This is more
in line to what the kernel trace events do when tracing is disabled,
the writes just return zero. That is, you do not want to retry.

> 
> If tracing is disabled, and you write something to it, I would expect
> to get a valid and reasonable error code back. Something like EINVAL
> or EIO or something to show "you did something wrong".
> 
> I do not at all understand the sentence
> 
>    "When user_events are disabled, its write operation should return zero"
> 
> as an "explanation" for this, and my immediate reaction is "Really?
> Why? That makes no sense".


So basically the way tracing is done is that something can enable the
event (that will then do the write) but if the trace buffer writes are
disabled, the writes should just be dropped. This is how it currently
works in the kernel, and we were trying to make it the same in user
space.

The code that does the write is basically something being monitored by
something else that will tell it to start writing. But that something
else could have the buffer disabled for writes. The use case for this
is to disable the buffer, enable all the trace events you care about,
and then enabled the buffer which is the same as enabling all events at
the same time to get a more coherent trace.

But we don't want the writes to return errors when the buffer is
disabled. The writes should just be dropped. An error code means
something isn't connected properly, but a zero return just means
everything is connected properly, but it is not ready to accept content
yet.

-- Steve

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] tracing: user_event fix for 6.4
  2023-06-24  2:09   ` Steven Rostedt
@ 2023-06-24 17:50     ` Steven Rostedt
  2023-06-25 23:04       ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Rostedt @ 2023-06-24 17:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Masami Hiramatsu, Mark Rutland, Beau Belgrave,
	Andrew Morton, sunliming


Linus,

On Fri, 23 Jun 2023 22:09:59 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> The code that does the write is basically something being monitored by
> something else that will tell it to start writing. But that something
> else could have the buffer disabled for writes. The use case for this
> is to disable the buffer, enable all the trace events you care about,
> and then enabled the buffer which is the same as enabling all events at
> the same time to get a more coherent trace.

Are you fine with the above explanation? The thing is, writes that are
dropped due to the buffer being disabled shouldn't be a concern of the
producer. It's the consumer that controls that. This is the same
behavior that is used by internal kernel trace events, and we wanted to
keep it consistent.

Would you like me to resend the pull request with an updated change log?

-- Steve

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] tracing: user_event fix for 6.4
  2023-06-24 17:50     ` Steven Rostedt
@ 2023-06-25 23:04       ` Linus Torvalds
  2023-06-26  6:07         ` Steven Rostedt
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2023-06-25 23:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Masami Hiramatsu, Mark Rutland, Beau Belgrave,
	Andrew Morton, sunliming

On Sat, 24 Jun 2023 at 10:51, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 23 Jun 2023 22:09:59 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > The code that does the write is basically something being monitored by
> > something else that will tell it to start writing. But that something
> > else could have the buffer disabled for writes. The use case for this
> > is to disable the buffer, enable all the trace events you care about,
> > and then enabled the buffer which is the same as enabling all events at
> > the same time to get a more coherent trace.
>
> Are you fine with the above explanation?

Honestly, I still think that returning zero is completely crazy.

It's exactly the wrong thing to do.

"write()" returning 0 is only normal if the passed-in length was zero
to begin with, and then it can actually have special meaning (ie
zero-sized writes have historically been used for STREAMS etc to do
EOF signalling to the other side, and/or are used to detect pending
errors).

There is one "return zero" case I know about, and it's the "disk full"
signalling (kind of EOF) for block devices.

If you are reading the man-pages, and your eyes found this:

   "In the absence of errors, or if error detection is not performed,
    the write() function shall return zero and have no other results"

I'd like to point out that that is exactly the "pass in a zero length
to write" case. That verbiage is all about "write(fd, buf, 0)"
possibly returning zero _or_ an error if an error was pending (and was
detected).

So returning zero from a write is basically insanity. It's not a valid
error case.

And it's a *dangerous* return value, because if you have some write
wrapper that is trying to deal with partial writes (think pipes etc),
I would not be surprised if there are cases of that just looping
forever on a zero return.

In fact, I went to Debian code search, and looked for "= write(" as a pattern.

The VERY FIRST hit on Debian Code search was this from LibreOffice:

    bool safeWrite(int fd, void* data, std::size_t dataSize)
    {
        auto nToWrite = dataSize;
        unsigned char* dataToWrite = static_cast<unsigned char *>(data);

        while ( nToWrite ) {
            auto nWritten = write(fd, dataToWrite, cap_ssize_t(nToWrite));
            if ( nWritten < 0 ) {
                if ( errno == EINTR )
                    continue;

                return false;

            }

            assert(nWritten > 0);
            nToWrite -= nWritten;
            dataToWrite += nWritten;
        }

        return true;
    }

which loops forever if NDEBUG is set, otherwise it will print an error
message and abort.

See why I think returning zero is such a bad bad idea?

There is NO EXPLANATION that makes it ok. Really. It's a huge bug, and
wrong on all possible levels.

The fact that your explanation is "we have that bug in other places"
does *not* make me any happier.

              Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] tracing: user_event fix for 6.4
  2023-06-25 23:04       ` Linus Torvalds
@ 2023-06-26  6:07         ` Steven Rostedt
  0 siblings, 0 replies; 6+ messages in thread
From: Steven Rostedt @ 2023-06-26  6:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Masami Hiramatsu, Mark Rutland, Beau Belgrave,
	Andrew Morton, sunliming

On Sun, 25 Jun 2023 16:04:32 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The fact that your explanation is "we have that bug in other places"
> does *not* make me any happier.

Actually it's just the behavior inside the kernel. I just wrote a quick
program to see the behavior of /sys/kernel/tracing/trace_marker, and
yeah, it actually returns an error when writing if the ring buffer is
disabled.

I was trying to imitate the internal behavior in the kernel, but I
shouldn't have, as we control all the instances of that.

Now testing the other user space interface, it does behave the way you
described (something I implemented years ago!). So I will drop this
version of the patch.

From the code for trace_marker:

        if (unlikely(!event))
                /* Ring buffer disabled, return as if not open for write */
                return -EBADF;


sunliming's original patch had it return an error. I'll have him redo
that one, because the current behavior is still wrong as it returns as
if it wrote something when it did not.

 https://lore.kernel.org/all/20230608011554.1181097-2-sunliming@kylinos.cn/

We were the ones to tell him not to return an error, so I take the
blame on that. But if we are going to return an error, I rather follow
what trace_marker does and return -EBADF and not -EFAULT.

sunliming,

Care to send a v3 that returns -EBADF instead?

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-06-26  6:07 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-23 19:29 [GIT PULL] tracing: user_event fix for 6.4 Steven Rostedt
2023-06-23 23:21 ` Linus Torvalds
2023-06-24  2:09   ` Steven Rostedt
2023-06-24 17:50     ` Steven Rostedt
2023-06-25 23:04       ` Linus Torvalds
2023-06-26  6:07         ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).