linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* broken userland ABI in configfs binary attributes
@ 2019-08-26  2:48 Al Viro
  2019-08-26 16:29 ` [RFC] " Al Viro
  0 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2019-08-26  2:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, Octavian Purdila, Pantelis Antoniou

	In commit 03607ace807b (configfs: implement binary attributes)
we have serious trouble:

+* Binary attributes, which are somewhat similar to sysfs binary attributes,
+but with a few slight changes to semantics.  The PAGE_SIZE limitation does not
+apply, but the whole binary item must fit in single kernel vmalloc'ed buffer.
+The write(2) calls from user space are buffered, and the attributes'
+write_bin_attribute method will be invoked on the final close, therefore it is
+imperative for user-space to check the return code of close(2) in order to
+verify that the operation finished successfully.

	This is completely broken.  ->release() is too late to return any errors -
they won't reach the caller of close(2).  ->flush() _is_ called early enough to
pass return value to userland, but it's called every time descriptor is removed
from descriptor table.  IOW, if userland e.g. python code from hell has written
some data to the attribute in question, then called a function that has ended
up calling something in some misbegotten library that spawned a child, had it
run and waited for it to exit, your ->flush() will be called twice.  Which is
fine for something like NFS sending the dirty data to server and checking that
there's nothing left, but not for this kind of "gather data, then commit the
entire thing at once" kind of interfaces.

	AFAICS, there's only one user in the tree right now (acpi/table/*/aml);
no idea what drives the userland side.

	We might be able to paper over that mess by doing what /dev/st does -
checking that file_count(file) == 1 in ->flush() instance and doing commit
there in such case.  It's not entirely reliable, though, and it's definitely
not something I'd like to see spreading.

	Folks, please don't do that kind of userland ABIs; that kind of
implicit commit on the final close is OK only if there's no error to
report (e.g. if all checks can be done at write() time).  Otherwise it's
an invitation for trouble.

	And *ANYTHING* that tries to return an error from ->release() is
very suspicious, no matter what.  Again, in ->release() it's too late
to return an error.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-26  2:48 broken userland ABI in configfs binary attributes Al Viro
@ 2019-08-26 16:29 ` Al Viro
  2019-08-26 18:20   ` Matthew Wilcox
  2019-08-26 18:34   ` "Kai Mäkisara (Kolumbus)"
  0 siblings, 2 replies; 22+ messages in thread
From: Al Viro @ 2019-08-26 16:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, Octavian Purdila, Pantelis Antoniou,
	Linus Torvalds, Kai Mäkisara, linux-scsi

On Mon, Aug 26, 2019 at 03:48:38AM +0100, Al Viro wrote:

> 	We might be able to paper over that mess by doing what /dev/st does -
> checking that file_count(file) == 1 in ->flush() instance and doing commit
> there in such case.  It's not entirely reliable, though, and it's definitely
> not something I'd like to see spreading.

	This "not entirely reliable" turns out to be an understatement.
If you have /proc/*/fdinfo/* being read from at the time of final close(2),
you'll get file_count(file) > 1 the last time ->flush() is called.  In other
words, we'd get the data not committed at all.

	And that problem is shared with /dev/st*, unfortunately ;-/
We could somewhat mitigate that by having fs/proc/fd.c:seq_show() call
->flush() before fput(), but that would still hide errors from close(2)
(and still have close(2) return before the data is flushed).

	read() on /proc/*/fdinfo/* does the following:

find the task_struct
grab its descriptor table, drop task_struct
lock the table, pick struct file out of it
bump struct file refcount, unlock the table
        seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\n",
                   (long long)file->f_pos, f_flags,
                   real_mount(file->f_path.mnt)->mnt_id);
        show_fd_locks(m, file, files);
        if (seq_has_overflowed(m))
                goto out;
        if (file->f_op->show_fdinfo)
                file->f_op->show_fdinfo(m, file);
drop the file reference (with fput()).

	Before "procfs: Convert /proc/pid/fdinfo/ handling routines to
seq-file v2" (in 2012), we did just snprintf() while under the lock on
descriptor table.  That commit moved the printf part from under the lock,
at the cost of grabbing and dropping file reference.  Shortly after that
"procfs: add ability to plug in auxiliary fdinfo providers" has added
->show_fdinfo() there, making it impossible to call under the descriptor
table lock - that method can block (and does so for eventpoll, idiotify,
etc.)

	We really want ->show_fdinfo() to happen before __fput() gets
anywhere near ->release().  And even the non-blocking cases can be too
costly to do under the descriptor table lock.  OTOH, it can very well be
done after or during ->flush(); the only problematic case right now
is /dev/st* that has its ->flush() do nothing in case if file_count(file)
is greater than 1.

	One kludgy way to handle that would be to have something like
FMODE_SUCKY_FLUSH that would have fs/proc/fd.c:seq_show() just do
the damn thing still under descriptor table lock and skip the rest
of it - /dev/st* has nothing in ->show_fdinfo(), and show_fd_locks()
is not too terribly costly.  Still best avoided in default case,
but...

	Another possibility is to have a secondary counter, with
__fput() waiting for it to go down to zero and fdinfo reads bumping
(and then dropping) that instead of the primary counter.  Not sure
which approach is better - adding extra logics in __fput() for the
sake of one (and not terribly common) device is not nice, but another
variant really is an ugly kludge ;-/  OTOH, this kind of "take
a secondary reference, ->release() will block until you drop it"
interface can breed deadlocks; procfs situation, AFAICS, allows to
use it safely, but it's begging to be abused...

	Ideas?  I don't like either approach, to put it very mildly,
so any cleaner suggestions would be very welcome.

PS: just dropping the check in st_flush() is probably a bad idea -
as it is, it can't overlap with st_write() and after such change it
will...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-26 16:29 ` [RFC] " Al Viro
@ 2019-08-26 18:20   ` Matthew Wilcox
  2019-08-26 19:28     ` Al Viro
  2019-08-26 18:34   ` "Kai Mäkisara (Kolumbus)"
  1 sibling, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2019-08-26 18:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, Kai Mäkisara, linux-scsi

On Mon, Aug 26, 2019 at 05:29:49PM +0100, Al Viro wrote:
> On Mon, Aug 26, 2019 at 03:48:38AM +0100, Al Viro wrote:
> 
> > 	We might be able to paper over that mess by doing what /dev/st does -
> > checking that file_count(file) == 1 in ->flush() instance and doing commit
> > there in such case.  It's not entirely reliable, though, and it's definitely
> > not something I'd like to see spreading.
> 
> 	This "not entirely reliable" turns out to be an understatement.
> If you have /proc/*/fdinfo/* being read from at the time of final close(2),
> you'll get file_count(file) > 1 the last time ->flush() is called.  In other
> words, we'd get the data not committed at all.

How about always doing the write in ->flush instead of ->release?
Yes, that means that calling close(dup(fd)) is going to flush the
write, but you shouldn't be doing that.  I think there'll also be
extra flushes done if you fork() during one of these writes ... but,
again, don't do that.  It's not like these are common things.

Why does the prototype of file_operations::release suggest that it can
return an int?  __fput doesn't pay any attention to the return value.
Changing that to return void might help some future programmers avoid
this mistake.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-26 16:29 ` [RFC] " Al Viro
  2019-08-26 18:20   ` Matthew Wilcox
@ 2019-08-26 18:34   ` "Kai Mäkisara (Kolumbus)"
  2019-08-26 19:32     ` Al Viro
  1 sibling, 1 reply; 22+ messages in thread
From: "Kai Mäkisara (Kolumbus)" @ 2019-08-26 18:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, linux-scsi



> On 26 Aug 2019, at 19.29, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> On Mon, Aug 26, 2019 at 03:48:38AM +0100, Al Viro wrote:
> 
>> 	We might be able to paper over that mess by doing what /dev/st does -
>> checking that file_count(file) == 1 in ->flush() instance and doing commit
>> there in such case.  It's not entirely reliable, though, and it's definitely
>> not something I'd like to see spreading.
> 
> 	This "not entirely reliable" turns out to be an understatement.
> If you have /proc/*/fdinfo/* being read from at the time of final close(2),
> you'll get file_count(file) > 1 the last time ->flush() is called.  In other
> words, we'd get the data not committed at all.
> 
...
> PS: just dropping the check in st_flush() is probably a bad idea -
> as it is, it can't overlap with st_write() and after such change it
> will…
Yes, don’t just drop it. The tape semantics require that a file mark is written when the last opener closes this sequential device. This is why the check is there. Of course, it is good if someone finds a better solution for this.

Thanks,
Kai


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-26 18:20   ` Matthew Wilcox
@ 2019-08-26 19:28     ` Al Viro
  2019-08-27  8:51       ` Miklos Szeredi
  2019-08-31  8:32       ` Christoph Hellwig
  0 siblings, 2 replies; 22+ messages in thread
From: Al Viro @ 2019-08-26 19:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, Kai Mäkisara, linux-scsi

On Mon, Aug 26, 2019 at 11:20:17AM -0700, Matthew Wilcox wrote:
> On Mon, Aug 26, 2019 at 05:29:49PM +0100, Al Viro wrote:
> > On Mon, Aug 26, 2019 at 03:48:38AM +0100, Al Viro wrote:
> > 
> > > 	We might be able to paper over that mess by doing what /dev/st does -
> > > checking that file_count(file) == 1 in ->flush() instance and doing commit
> > > there in such case.  It's not entirely reliable, though, and it's definitely
> > > not something I'd like to see spreading.
> > 
> > 	This "not entirely reliable" turns out to be an understatement.
> > If you have /proc/*/fdinfo/* being read from at the time of final close(2),
> > you'll get file_count(file) > 1 the last time ->flush() is called.  In other
> > words, we'd get the data not committed at all.
> 
> How about always doing the write in ->flush instead of ->release?
> Yes, that means that calling close(dup(fd)) is going to flush the
> write, but you shouldn't be doing that.  I think there'll also be
> extra flushes done if you fork() during one of these writes ... but,
> again, don't do that.  It's not like these are common things.

For configfs bin_attr it won't work, simply because it wants the entire
thing to be present - callback parses the data.  For SCSI tape...  Maybe,
but you'll need to take care of the overlaps with ->write().  Right now
it can't happen (the last reference, about to be dropped right after
st_flush() returns); if we do that on each ->flush(), we will have to
cope with that fun and we'll need to keep an error (if any) for the
next call of st_flush() to pick and return.  I'm not saying it can't
be done, but that's really a question for SCSI folks.

> Why does the prototype of file_operations::release suggest that it can
> return an int?  __fput doesn't pay any attention to the return value.
> Changing that to return void might help some future programmers avoid
> this mistake.

Hysterical raisins.  It's doable, the main question is how much do we
aim for and whether it's worth the amount of churn.

It has been discussed (last time about 6 years ago), didn't go anywhere.
Boggled down in discussing how much churn which cleanups are worth;
I wanted to make them
	void (*some_sane_name)(struct file *)
(except that the name I'd used hadn't been sane).  Linus wanted
	void (*release)(struct file *, struct inode *)
and suggested to do a big change replacing int with void, basically,
then followups fixing the resulting warnings.  

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-26 18:34   ` "Kai Mäkisara (Kolumbus)"
@ 2019-08-26 19:32     ` Al Viro
  2019-08-27 15:01       ` Boaz Harrosh
  0 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2019-08-26 19:32 UTC (permalink / raw)
  To: "Kai Mäkisara (Kolumbus)"
  Cc: Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, linux-scsi

On Mon, Aug 26, 2019 at 09:34:37PM +0300, "Kai Mäkisara (Kolumbus)" wrote:
> 
> 
> > On 26 Aug 2019, at 19.29, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > 
> > On Mon, Aug 26, 2019 at 03:48:38AM +0100, Al Viro wrote:
> > 
> >> 	We might be able to paper over that mess by doing what /dev/st does -
> >> checking that file_count(file) == 1 in ->flush() instance and doing commit
> >> there in such case.  It's not entirely reliable, though, and it's definitely
> >> not something I'd like to see spreading.
> > 
> > 	This "not entirely reliable" turns out to be an understatement.
> > If you have /proc/*/fdinfo/* being read from at the time of final close(2),
> > you'll get file_count(file) > 1 the last time ->flush() is called.  In other
> > words, we'd get the data not committed at all.
> > 
> ...
> > PS: just dropping the check in st_flush() is probably a bad idea -
> > as it is, it can't overlap with st_write() and after such change it
> > will…
> Yes, don’t just drop it. The tape semantics require that a file mark is written when the last opener closes this sequential device. This is why the check is there. Of course, it is good if someone finds a better solution for this.

D'oh...  OK, that settles it; exclusion with st_write() would've been
painful, but playing with the next st_write() on the same struct file
rewinding the damn thing to overwrite what st_flush() had spewed is
an obvious no-go.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-26 19:28     ` Al Viro
@ 2019-08-27  8:51       ` Miklos Szeredi
  2019-08-27 11:58         ` Al Viro
  2019-08-31  8:32       ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: Miklos Szeredi @ 2019-08-27  8:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Matthew Wilcox, Christoph Hellwig, linux-fsdevel,
	Octavian Purdila, Pantelis Antoniou, Linus Torvalds,
	Kai Mäkisara, linux-scsi

On Mon, Aug 26, 2019 at 08:28:19PM +0100, Al Viro wrote:
> On Mon, Aug 26, 2019 at 11:20:17AM -0700, Matthew Wilcox wrote:
> > On Mon, Aug 26, 2019 at 05:29:49PM +0100, Al Viro wrote:
> > > On Mon, Aug 26, 2019 at 03:48:38AM +0100, Al Viro wrote:
> > > 
> > > > 	We might be able to paper over that mess by doing what /dev/st does -
> > > > checking that file_count(file) == 1 in ->flush() instance and doing commit
> > > > there in such case.  It's not entirely reliable, though, and it's definitely
> > > > not something I'd like to see spreading.
> > > 
> > > 	This "not entirely reliable" turns out to be an understatement.
> > > If you have /proc/*/fdinfo/* being read from at the time of final close(2),
> > > you'll get file_count(file) > 1 the last time ->flush() is called.  In other
> > > words, we'd get the data not committed at all.

How about something like this:

#if BITS_PER_LONG == 32
#define F_COUNT_SHORTTERM ((1UL << 24) + 1)
#else
#define F_COUNT_SHORTTERM ((1UL << 48) + 1)
#endif

static inline void get_file_shortterm(struct file *f)
{
	atomic_long_add(F_COUNT_SHORTTERM, &f->f_count);
}

static inline void put_file_shortterm(struct file *f)
{
	fput_many(f, F_COUNT_SHORTTERM);
}

static inline bool file_is_last_longterm(struct file *f)
{
	return atomic_long_read(&f->f_count) % F_COUNT_SHORTTERM == 1;
}

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-27  8:51       ` Miklos Szeredi
@ 2019-08-27 11:58         ` Al Viro
  2019-08-27 12:21           ` Miklos Szeredi
  0 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2019-08-27 11:58 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Matthew Wilcox, Christoph Hellwig, linux-fsdevel,
	Octavian Purdila, Pantelis Antoniou, Linus Torvalds,
	Kai Mäkisara, linux-scsi

On Tue, Aug 27, 2019 at 10:51:44AM +0200, Miklos Szeredi wrote:
 
> How about something like this:
> 
> #if BITS_PER_LONG == 32
> #define F_COUNT_SHORTTERM ((1UL << 24) + 1)
> #else
> #define F_COUNT_SHORTTERM ((1UL << 48) + 1)
> #endif
> 
> static inline void get_file_shortterm(struct file *f)
> {
> 	atomic_long_add(F_COUNT_SHORTTERM, &f->f_count);
> }
> 
> static inline void put_file_shortterm(struct file *f)
> {
> 	fput_many(f, F_COUNT_SHORTTERM);
> }
> 
> static inline bool file_is_last_longterm(struct file *f)
> {
> 	return atomic_long_read(&f->f_count) % F_COUNT_SHORTTERM == 1;
> }

So 256 threads boinking on the same fdinfo at the same time
and struct file can be freed right under them?  Or a bit over
million of dup(), then forking 15 more children, for that matter...

Seriously, it might be OK on 64bit (with something like "no more
than one reference held by a thread", otherwise you'll run
into overflows even there - 65536 of your shortterm references
aren't that much).  On 32bit it's a non-starter - too easy to
overflow.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-27 11:58         ` Al Viro
@ 2019-08-27 12:21           ` Miklos Szeredi
  2019-08-27 12:53             ` Al Viro
  0 siblings, 1 reply; 22+ messages in thread
From: Miklos Szeredi @ 2019-08-27 12:21 UTC (permalink / raw)
  To: Al Viro
  Cc: Matthew Wilcox, Christoph Hellwig, linux-fsdevel,
	Octavian Purdila, Pantelis Antoniou, Linus Torvalds,
	Kai Mäkisara, linux-scsi

On Tue, Aug 27, 2019 at 1:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Aug 27, 2019 at 10:51:44AM +0200, Miklos Szeredi wrote:
>
> > How about something like this:
> >
> > #if BITS_PER_LONG == 32
> > #define F_COUNT_SHORTTERM ((1UL << 24) + 1)
> > #else
> > #define F_COUNT_SHORTTERM ((1UL << 48) + 1)
> > #endif
> >
> > static inline void get_file_shortterm(struct file *f)
> > {
> >       atomic_long_add(F_COUNT_SHORTTERM, &f->f_count);
> > }
> >
> > static inline void put_file_shortterm(struct file *f)
> > {
> >       fput_many(f, F_COUNT_SHORTTERM);
> > }
> >
> > static inline bool file_is_last_longterm(struct file *f)
> > {
> >       return atomic_long_read(&f->f_count) % F_COUNT_SHORTTERM == 1;
> > }
>
> So 256 threads boinking on the same fdinfo at the same time
> and struct file can be freed right under them?

Nope, 256 threads booking short term refs will result in f_count = 256
(note the +1 in .F_COUNT_SHORTTERM).  Which can result in false
negative returned by file_is_last_longterm() but no false freeing.

>  Or a bit over
> million of dup(), then forking 15 more children, for that matter...

Can give false positive for file_is_last_longterm() but no false freeing.

255 short term refs + ~16M long term refs together can result in false
freeing, true.

>
> Seriously, it might be OK on 64bit (with something like "no more
> than one reference held by a thread", otherwise you'll run
> into overflows even there - 65536 of your shortterm references
> aren't that much).  On 32bit it's a non-starter - too easy to
> overflow.

No, 64bit would be impossible to overflow.  But if we have to special
case 32bit then it's not worth it...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-27 12:21           ` Miklos Szeredi
@ 2019-08-27 12:53             ` Al Viro
  0 siblings, 0 replies; 22+ messages in thread
From: Al Viro @ 2019-08-27 12:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Matthew Wilcox, Christoph Hellwig, linux-fsdevel,
	Octavian Purdila, Pantelis Antoniou, Linus Torvalds,
	Kai Mäkisara, linux-scsi

On Tue, Aug 27, 2019 at 02:21:50PM +0200, Miklos Szeredi wrote:
> On Tue, Aug 27, 2019 at 1:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Tue, Aug 27, 2019 at 10:51:44AM +0200, Miklos Szeredi wrote:
> >
> > > How about something like this:
> > >
> > > #if BITS_PER_LONG == 32
> > > #define F_COUNT_SHORTTERM ((1UL << 24) + 1)
> > > #else
> > > #define F_COUNT_SHORTTERM ((1UL << 48) + 1)
> > > #endif
> > >
> > > static inline void get_file_shortterm(struct file *f)
> > > {
> > >       atomic_long_add(F_COUNT_SHORTTERM, &f->f_count);
> > > }
> > >
> > > static inline void put_file_shortterm(struct file *f)
> > > {
> > >       fput_many(f, F_COUNT_SHORTTERM);
> > > }
> > >
> > > static inline bool file_is_last_longterm(struct file *f)
> > > {
> > >       return atomic_long_read(&f->f_count) % F_COUNT_SHORTTERM == 1;
> > > }
> >
> > So 256 threads boinking on the same fdinfo at the same time
> > and struct file can be freed right under them?
> 
> Nope, 256 threads booking short term refs will result in f_count = 256
> (note the +1 in .F_COUNT_SHORTTERM).  Which can result in false
> negative returned by file_is_last_longterm() but no false freeing.

Point (sorry, should've grabbed some coffee to wake up properly before replying)

> >  Or a bit over
> > million of dup(), then forking 15 more children, for that matter...
> 
> Can give false positive for file_is_last_longterm() but no false freeing.
> 
> 255 short term refs + ~16M long term refs together can result in false
> freeing, true.

Yes.  No matter how you slice it, the main problem with f_count
overflows (and the reason for atomic_long_t for f_count) is that
we *can* have a lot of references to struct file, held just by
descriptor tables.  Those are almost pure arrays of pointers (well,
that and a couple of bitmaps), so "it would be impossible to fit
into RAM" is not that much of a limitation.  512M references to
the same struct file are theoretically doable; 256M *are* doable,
and the (32bit) hardware doesn't have to be all that beefy.

So you need to distinguish 2^28 possible states on the long-term
references alone.  Which leaves you 4 bits for anything else,
no matter how you encode that.  And that's obviously too little.
 
> > Seriously, it might be OK on 64bit (with something like "no more
> > than one reference held by a thread", otherwise you'll run
> > into overflows even there - 65536 of your shortterm references
> > aren't that much).  On 32bit it's a non-starter - too easy to
> > overflow.
> 
> No, 64bit would be impossible to overflow.  But if we have to special
> case 32bit then it's not worth it...

Agreed and agreed.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-26 19:32     ` Al Viro
@ 2019-08-27 15:01       ` Boaz Harrosh
  2019-08-27 17:27         ` Al Viro
  0 siblings, 1 reply; 22+ messages in thread
From: Boaz Harrosh @ 2019-08-27 15:01 UTC (permalink / raw)
  To: Al Viro, Kai Mäkisara (Kolumbus)
  Cc: Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, linux-scsi

On 26/08/2019 22:32, Al Viro wrote:
<>
> D'oh...  OK, that settles it; exclusion with st_write() would've been
> painful, but playing with the next st_write() on the same struct file
> rewinding the damn thing to overwrite what st_flush() had spewed is
> an obvious no-go.
> 

So what are the kind of errors current ->release implementation is trying to return?
Is it actual access the HW errors or its more of a resource allocations errors?
If the later then maybe the allocations can be done before hand, say at ->flush but
are not redone on redundant flushes?

If the former then yes looks like troubles. That said I believe the 
->last_close_with_error() is a very common needed pattern which few use exactly
because it does not work. But which I wanted/needed many times before.

So I would break some eggs which ever is the most elegant way, and perhaps add a
new parameter to ->flush(bool last) or some other easy API.
[Which is BTW the worst name ever, while at it lets rename it to ->close() which
 is what it is. "flush" is used elsewhere to mean sync.
]

So yes please lets fix VFS API with drivers so to have an easy and safe way
to execute very last close, all the while still being able to report errors to
close(2).

My $0.017 Thanks
Boaz

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-27 15:01       ` Boaz Harrosh
@ 2019-08-27 17:27         ` Al Viro
  2019-08-27 17:59           ` Boaz Harrosh
  2019-08-29 22:22           ` Al Viro
  0 siblings, 2 replies; 22+ messages in thread
From: Al Viro @ 2019-08-27 17:27 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Kai Mäkisara (Kolumbus),
	Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, linux-scsi

On Tue, Aug 27, 2019 at 06:01:27PM +0300, Boaz Harrosh wrote:
> On 26/08/2019 22:32, Al Viro wrote:
> <>
> > D'oh...  OK, that settles it; exclusion with st_write() would've been
> > painful, but playing with the next st_write() on the same struct file
> > rewinding the damn thing to overwrite what st_flush() had spewed is
> > an obvious no-go.
> > 
> 
> So what are the kind of errors current ->release implementation is trying to return?
> Is it actual access the HW errors or its more of a resource allocations errors?
> If the later then maybe the allocations can be done before hand, say at ->flush but
> are not redone on redundant flushes?

Most of them are actually pure bollocks - "it can never happen, but if it does,
let's return -EWHATEVER to feel better".  Some are crap like -EINTR, which is
also bollocks - for one thing, process might've been closing files precisely
because it's been hit by SIGKILL.  For another, it's a destructor.  It won't
be retried by the caller - there's nothing called for that object afterwards.
What you don't do in it won't be done at all.

And some are "commit on final close" kind of thing, both with the hardware
errors and parsing errors.

> If the former then yes looks like troubles. That said I believe the 
> ->last_close_with_error() is a very common needed pattern which few use exactly
> because it does not work. But which I wanted/needed many times before.
> 
> So I would break some eggs which ever is the most elegant way, and perhaps add a
> new parameter to ->flush(bool last) or some other easy API.
> [Which is BTW the worst name ever, while at it lets rename it to ->close() which
>  is what it is. "flush" is used elsewhere to mean sync.
> ]

It *is* flush.  And nothing else, really - it makes sure that dirty data is
pushed to destination, with any errors reported.

> So yes please lets fix VFS API with drivers so to have an easy and safe way
> to execute very last close, all the while still being able to report errors to
> close(2).

What the hell is "very last close()"?  Note, BTW, that you can have one thread
call close() in the middle of write() by another thread.  And that will succeed,
with actual ->release() happening no earlier than write() is done; however,
descriptor will be gone when close(2) is done.

You are assuming the model that doesn't match the basic userland ABI for
descriptors.  And no, close(2) is not going to wait for write(2) to finish.
Neither it is going to interrupt said write(2).

Note, BTW, that an error returned by close(2) does *NOT* leave the descriptor
in place - success or failure, it's gone.

If you want to express something like "data packet formed; now you can commit
it and tell me if there'd been any errors", use something explicit.  close()
simply isn't suitable for that.  writev() for datagram-like semantics might
be; fsync() or fdatasync() could serve for "commit now".

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-27 17:27         ` Al Viro
@ 2019-08-27 17:59           ` Boaz Harrosh
  2019-08-29 22:22           ` Al Viro
  1 sibling, 0 replies; 22+ messages in thread
From: Boaz Harrosh @ 2019-08-27 17:59 UTC (permalink / raw)
  To: Al Viro, Boaz Harrosh
  Cc: Kai Mäkisara (Kolumbus),
	Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, linux-scsi

On 27/08/2019 20:27, Al Viro wrote:
<>
> If you want to express something like "data packet formed; now you can commit
> it and tell me if there'd been any errors", use something explicit.  close()
> simply isn't suitable for that.  writev() for datagram-like semantics might
> be; fsync() or fdatasync() could serve for "commit now".
> 

Yes! I change my mind you are right. close() should stay with void semantics.
I always thought the IO error reporting on close was a bad POSIX decision and
fsync should be the final resting bed, and if you do not call fsync then you
don't care about the error.

Sigh, looks like the error was for ever ignored from day one. Maybe the Kernel
guys felt the errors were important. But application users of configfs, did any
actually care and check? Is there really a regression here? maybe the current imp
needs to just be documented.
(Or the more blasphemous, change the ABI and force people to call fsync or something)

I feel the frustration too
Boaz

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-27 17:27         ` Al Viro
  2019-08-27 17:59           ` Boaz Harrosh
@ 2019-08-29 22:22           ` Al Viro
  2019-08-29 23:32             ` Al Viro
  2019-08-30  4:10             ` Dave Chinner
  1 sibling, 2 replies; 22+ messages in thread
From: Al Viro @ 2019-08-29 22:22 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Kai Mäkisara (Kolumbus),
	Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, linux-scsi

On Tue, Aug 27, 2019 at 06:27:35PM +0100, Al Viro wrote:

> Most of them are actually pure bollocks - "it can never happen, but if it does,
> let's return -EWHATEVER to feel better".  Some are crap like -EINTR, which is
> also bollocks - for one thing, process might've been closing files precisely
> because it's been hit by SIGKILL.  For another, it's a destructor.  It won't
> be retried by the caller - there's nothing called for that object afterwards.
> What you don't do in it won't be done at all.
> 
> And some are "commit on final close" kind of thing, both with the hardware
> errors and parsing errors.

FWIW, here's the picture for fs/*: 6 instances.

afs_release():
	 calls vfs_fsync() if file had been opened for write, tries to pass
	the return value to caller.  Job for ->flush(), AFAICS.

coda_psdev_release():
	returns -1 in situation impossible after successful ->open().
	Can't happen without memory corruption.

configfs_release_bin_file():
	discussed upthread

dlm device_close():
	returns -ENOENT if dlm_find_lockspace_local(proc->lockspace) fails.
No idea if that can happen.

reiserfs_file_release():
	tries to return an error if it can't free preallocated blocks.

xfs_release():
	similar to the previous case.

In kernel/*: ftrace_graph_release() might return -ENOMEM.  No idea whether it's
actually possible.

In net/*: none

In sound/*: 4 instances.

snd_pcm_oss_release():
        if (snd_BUG_ON(!substream))
                return -ENXIO;
	IOW, whine in impossible case.

snd_pcm_release():
	ditto.

sq_release():
        if (file->f_mode & FMODE_WRITE) {
                if (write_sq.busy)
                        rc = sq_fsync();
	subsequently returns rc; sq_fsync() can return an error, both on timeout
	and in case of interrupted wait.

snd_hwdep_release():
	passes the return value of hwdep ->release() method; two of those
	can return an error.   snd_asihpi_hpi_release() is, AFAICS, impossible,
	unless you manage to flip this module_param(enable_hpi_hwdep, bool, 0644);
	off after opening a file.  And snd_usX2Y_hwdep_pcm_release() calls
	usX2Y_pcms_busy_check() and passes its return value out.  No idea
	whether that can be triggered.


In other words, the real mess is hidden in drivers/*...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-29 22:22           ` Al Viro
@ 2019-08-29 23:32             ` Al Viro
  2019-08-30  4:10             ` Dave Chinner
  1 sibling, 0 replies; 22+ messages in thread
From: Al Viro @ 2019-08-29 23:32 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Kai Mäkisara (Kolumbus),
	Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, linux-scsi

On Thu, Aug 29, 2019 at 11:22:58PM +0100, Al Viro wrote:
> On Tue, Aug 27, 2019 at 06:27:35PM +0100, Al Viro wrote:
> 
> > Most of them are actually pure bollocks - "it can never happen, but if it does,
> > let's return -EWHATEVER to feel better".  Some are crap like -EINTR, which is
> > also bollocks - for one thing, process might've been closing files precisely
> > because it's been hit by SIGKILL.  For another, it's a destructor.  It won't
> > be retried by the caller - there's nothing called for that object afterwards.
> > What you don't do in it won't be done at all.
> > 
> > And some are "commit on final close" kind of thing, both with the hardware
> > errors and parsing errors.

[snip]

> In other words, the real mess is hidden in drivers/*...

BTW, here's a live example - v4l stuff.  There we have
static int v4l2_release(struct inode *inode, struct file *filp)
{
...
        if (vdev->fops->release) {
                if (v4l2_device_supports_requests(vdev->v4l2_dev)) {
                        mutex_lock(&vdev->v4l2_dev->mdev->req_queue_mutex);
                        ret = vdev->fops->release(filp);
                        mutex_unlock(&vdev->v4l2_dev->mdev->req_queue_mutex);
                } else {
                        ret = vdev->fops->release(filp);
                }
        }
...
        return ret;
}

	OK, so we have a secondary method, also called "release".  It lives in
struct v4l2_file_operations and its return value is passed to caller of
v4l2_release() (and discarded by it).  There is a sodding plenty of instances,
most of them explicitly initialized (for values of "sodding plenty" well over
a hundred).  Quite a few of those have .release initialized with vb2_fop_release
or v4l2_fh_release, so it's not _that_ horrible.  And these two helpers are
returning 0 in all cases.  Many instances return 0, vb2_fop_release(...),
or v4l2_fh_release(...), so they are also fine.  However, there are exceptions.

1) drivers/media/radio/wl128x/fmdrv_v4l2.c:fm_v4l2_fops_release()
...
        ret = fmc_set_mode(fmdev, FM_MODE_OFF);  
        if (ret < 0) {
                fmerr("Unable to turn off the chip\n");
                goto release_unlock;
        }
...
release_unlock:
        mutex_unlock(&fmdev->mutex);
        return ret;
}

2) drivers/media/platform/omap/omap_vout.c:omap_vout_release()
...
        /* Turn off the pipeline */
        ret = omapvid_apply_changes(vout);
        if (ret)   
                v4l2_warn(&vout->vid_dev->v4l2_dev,
                                "Unable to apply changes\n");
...
        return ret;
}

3) drivers/media/platform/pxa_camera.c:pxac_fops_camera_release()
...
        if (fh_singular)
                ret = pxac_sensor_set_power(pcdev, 0);
        mutex_unlock(&pcdev->mlock);

        return ret;
}

4) drivers/media/platform/sti/bdisp/bdisp-v4l2.c:bdisp_release()
...
        if (mutex_lock_interruptible(&bdisp->lock))
                return -ERESTARTSYS;

5) drivers/media/radio/radio-wl1273.c:wl1273_fm_fops_release()
...
                        if (mutex_lock_interruptible(&core->lock))
                                return -EINTR;

                        radio->irq_flags &= ~WL1273_RDS_EVENT;

                        if (core->mode == WL1273_MODE_RX) {
                                r = core->write(core,
                                                WL1273_INT_MASK_SET,
                                                radio->irq_flags);
                                if (r) {
                                        mutex_unlock(&core->lock);
                                        goto out;
                                }
...
out:
        return r;
}

... and then it gets even better: in drivers/media/pci/ttpci/av7110_v4l.c
we have struct v4l2_file_operations embedded into struct saa7146_ext_vv,
with
        .vbi_fops.open  = av7110_vbi_reset,
        .vbi_fops.release = av7110_vbi_reset,
in initializers.  Yes, the same instance for ->open() and ->release().
Both currently take struct file * and return int.  And it certainly
can return an error.

So we have
	* 3 simple cases of returning an error that gets seen by nobody.
	* 2 outright bugs - ERESTARTSYS is particularly cute, seeing that
restarting close(2) would not do the right thing even if it had been passed
to userland.  As it is, it simply leaks.  IMO they should switch to plain
mutex_lock; bdisp becomes regular, wl1273 becomes another "reporting error
that goes nowhere" case.
	* av7110, where we get basically the same "lost error, maybe we
care, maybe not" with a twist - we can't just convert the ->release() to
return void, since the same function is used for both ->open and ->release.
Not hard to produce a wrapper, of course...

I've no idea if v4l userland would want to see those errors; currently
they are simply lost.

All of the above is from grep; build coverage is not going to be great,
seeing how much in there is embedded stuff...  Granted, v4l is probably
the most hairy subsystem in that respect, but there's enough isolated
fun cases where file_operations ->release() in drivers/* is trying to
return errors, without any calls of subsystem methods...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-29 22:22           ` Al Viro
  2019-08-29 23:32             ` Al Viro
@ 2019-08-30  4:10             ` Dave Chinner
  2019-08-30  4:44               ` Al Viro
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2019-08-30  4:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Boaz Harrosh, Kai Mäkisara (Kolumbus),
	Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, linux-scsi

On Thu, Aug 29, 2019 at 11:22:58PM +0100, Al Viro wrote:
> On Tue, Aug 27, 2019 at 06:27:35PM +0100, Al Viro wrote:
> 
> > Most of them are actually pure bollocks - "it can never happen, but if it does,
> > let's return -EWHATEVER to feel better".  Some are crap like -EINTR, which is
> > also bollocks - for one thing, process might've been closing files precisely
> > because it's been hit by SIGKILL.  For another, it's a destructor.  It won't
> > be retried by the caller - there's nothing called for that object afterwards.
> > What you don't do in it won't be done at all.
> > 
> > And some are "commit on final close" kind of thing, both with the hardware
> > errors and parsing errors.
> 
> FWIW, here's the picture for fs/*: 6 instances.
> 
> afs_release():
> 	 calls vfs_fsync() if file had been opened for write, tries to pass
> 	the return value to caller.  Job for ->flush(), AFAICS.
> 
> coda_psdev_release():
> 	returns -1 in situation impossible after successful ->open().
> 	Can't happen without memory corruption.
> 
> configfs_release_bin_file():
> 	discussed upthread
> 
> dlm device_close():
> 	returns -ENOENT if dlm_find_lockspace_local(proc->lockspace) fails.
> No idea if that can happen.
> 
> reiserfs_file_release():
> 	tries to return an error if it can't free preallocated blocks.
> 
> xfs_release():
> 	similar to the previous case.

Not quite right. XFS only returns an error if there is data
writeback failure or filesystem corruption or shutdown detected
during whatever operation it is performing.

We don't really care what is done with the error that we return;
we're just returning an error because that's what the function
prototype indicates we should do...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-30  4:10             ` Dave Chinner
@ 2019-08-30  4:44               ` Al Viro
  2019-08-31  8:28                 ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2019-08-30  4:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Boaz Harrosh, Kai Mäkisara (Kolumbus),
	Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, linux-scsi

On Fri, Aug 30, 2019 at 02:10:42PM +1000, Dave Chinner wrote:

> > reiserfs_file_release():
> > 	tries to return an error if it can't free preallocated blocks.
> > 
> > xfs_release():
> > 	similar to the previous case.
> 
> Not quite right. XFS only returns an error if there is data
> writeback failure or filesystem corruption or shutdown detected
> during whatever operation it is performing.
> 
> We don't really care what is done with the error that we return;
> we're just returning an error because that's what the function
> prototype indicates we should do...

I thought that xfs_release() and friends followed the prototypes
you had on IRIX, while xfs_file_release() et.al. were the
impedance-matching layer for Linux.  Oh, well...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-30  4:44               ` Al Viro
@ 2019-08-31  8:28                 ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2019-08-31  8:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, Boaz Harrosh, Kai Mäkisara (Kolumbus),
	Christoph Hellwig, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, linux-scsi

On Fri, Aug 30, 2019 at 05:44:39AM +0100, Al Viro wrote:
> > Not quite right. XFS only returns an error if there is data
> > writeback failure or filesystem corruption or shutdown detected
> > during whatever operation it is performing.
> > 
> > We don't really care what is done with the error that we return;
> > we're just returning an error because that's what the function
> > prototype indicates we should do...
> 
> I thought that xfs_release() and friends followed the prototypes
> you had on IRIX, while xfs_file_release() et.al. were the
> impedance-matching layer for Linux.  Oh, well...

That is how it started out originally.  Since then a lot changed,
including the prototypes.  We could easily do something like the
patch below.  Additionally the read-only check probably should
check FMODE_WRITE instead, but that should be left for a separate
patch:

---
From fbdf4e24ad1505129fb4db38644d11fa9b7e11f0 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Sat, 31 Aug 2019 10:24:55 +0200
Subject: xfs: remove xfs_release

We can just move the code directly to xfs_file_release.  Additionally
remove the pointless i_mode verification, and the error returns that
are entirely ignored by the calller of ->release.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c  | 63 ++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_inode.c | 80 ----------------------------------------------
 fs/xfs/xfs_inode.h |  1 -
 3 files changed, 60 insertions(+), 84 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d952d5962e93..cbaba0cc1fa4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1060,10 +1060,67 @@ xfs_dir_open(
 
 STATIC int
 xfs_file_release(
-	struct inode	*inode,
-	struct file	*filp)
+	struct inode		*inode,
+	struct file		*file)
 {
-	return xfs_release(XFS_I(inode));
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if ((mp->m_flags & XFS_MOUNT_RDONLY) || XFS_FORCED_SHUTDOWN(mp))
+		return 0;
+
+	/*
+	 * If we previously truncated this file and removed old data in the
+	 * process, we want to initiate "early" writeout on the last close.
+	 * This is an attempt to combat the notorious NULL files problem which
+	 * is particularly noticeable from a truncate down, buffered (re-)write
+	 * (delalloc), followed by a crash.  What we are effectively doing here
+	 * is significantly reducing the time window where we'd otherwise be
+	 * exposed to that problem.
+	 */
+	if (xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED)) {
+		xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
+		if (ip->i_delayed_blks > 0)
+			filemap_flush(inode->i_mapping);
+		return 0;
+	}
+
+	if (inode->i_nlink == 0 || !xfs_can_free_eofblocks(ip, false))
+		return 0;
+
+	/*
+	 * Check if the inode is being opened, written and closed frequently and
+	 * we have delayed allocation blocks outstanding (e.g. streaming writes
+	 * from the NFS server), truncating the blocks past EOF will cause
+	 * fragmentation to occur.
+	 *
+	 * In this case don't do the truncation, but we have to be careful how
+	 * we detect this case. Blocks beyond EOF show up as i_delayed_blks even
+	 * when the inode is clean, so we need to truncate them away first
+	 * before checking for a dirty release.  Hence on the first dirty close
+	 * we will still remove the speculative allocation, but after that we
+	 * will leave it in place.
+	 */
+	if (xfs_iflags_test(ip, XFS_IDIRTY_RELEASE))
+		return 0;
+
+	/*
+	 * If we can't get the iolock just skip truncating the blocks past EOF
+	 * because we could deadlock with the mmap_sem otherwise.  We'll get
+	 * another chance to drop them once the last reference to the inode is
+	 * dropped, so we'll never leak blocks permanently.
+	 */
+	if (xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
+		xfs_free_eofblocks(ip);
+		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+	}
+
+	/*
+	 * Delalloc blocks after truncation means it really is dirty.
+	 */
+	if (ip->i_delayed_blks)
+		xfs_iflags_set(ip, XFS_IDIRTY_RELEASE);
+	return 0;
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index cdb97fa027fa..b0e85b7b8dc3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1590,86 +1590,6 @@ xfs_itruncate_extents_flags(
 	return error;
 }
 
-int
-xfs_release(
-	xfs_inode_t	*ip)
-{
-	xfs_mount_t	*mp = ip->i_mount;
-	int		error;
-
-	if (!S_ISREG(VFS_I(ip)->i_mode) || (VFS_I(ip)->i_mode == 0))
-		return 0;
-
-	/* If this is a read-only mount, don't do this (would generate I/O) */
-	if (mp->m_flags & XFS_MOUNT_RDONLY)
-		return 0;
-
-	if (!XFS_FORCED_SHUTDOWN(mp)) {
-		int truncated;
-
-		/*
-		 * If we previously truncated this file and removed old data
-		 * in the process, we want to initiate "early" writeout on
-		 * the last close.  This is an attempt to combat the notorious
-		 * NULL files problem which is particularly noticeable from a
-		 * truncate down, buffered (re-)write (delalloc), followed by
-		 * a crash.  What we are effectively doing here is
-		 * significantly reducing the time window where we'd otherwise
-		 * be exposed to that problem.
-		 */
-		truncated = xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED);
-		if (truncated) {
-			xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
-			if (ip->i_delayed_blks > 0) {
-				error = filemap_flush(VFS_I(ip)->i_mapping);
-				if (error)
-					return error;
-			}
-		}
-	}
-
-	if (VFS_I(ip)->i_nlink == 0)
-		return 0;
-
-	if (xfs_can_free_eofblocks(ip, false)) {
-
-		/*
-		 * Check if the inode is being opened, written and closed
-		 * frequently and we have delayed allocation blocks outstanding
-		 * (e.g. streaming writes from the NFS server), truncating the
-		 * blocks past EOF will cause fragmentation to occur.
-		 *
-		 * In this case don't do the truncation, but we have to be
-		 * careful how we detect this case. Blocks beyond EOF show up as
-		 * i_delayed_blks even when the inode is clean, so we need to
-		 * truncate them away first before checking for a dirty release.
-		 * Hence on the first dirty close we will still remove the
-		 * speculative allocation, but after that we will leave it in
-		 * place.
-		 */
-		if (xfs_iflags_test(ip, XFS_IDIRTY_RELEASE))
-			return 0;
-		/*
-		 * If we can't get the iolock just skip truncating the blocks
-		 * past EOF because we could deadlock with the mmap_sem
-		 * otherwise. We'll get another chance to drop them once the
-		 * last reference to the inode is dropped, so we'll never leak
-		 * blocks permanently.
-		 */
-		if (xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
-			error = xfs_free_eofblocks(ip);
-			xfs_iunlock(ip, XFS_IOLOCK_EXCL);
-			if (error)
-				return error;
-		}
-
-		/* delalloc blocks after truncation means it really is dirty */
-		if (ip->i_delayed_blks)
-			xfs_iflags_set(ip, XFS_IDIRTY_RELEASE);
-	}
-	return 0;
-}
-
 /*
  * xfs_inactive_truncate
  *
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 558173f95a03..4299905135b2 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -410,7 +410,6 @@ enum layout_break_reason {
 	(((pip)->i_mount->m_flags & XFS_MOUNT_GRPID) || \
 	 (VFS_I(pip)->i_mode & S_ISGID))
 
-int		xfs_release(struct xfs_inode *ip);
 void		xfs_inactive(struct xfs_inode *ip);
 int		xfs_lookup(struct xfs_inode *dp, struct xfs_name *name,
 			   struct xfs_inode **ipp, struct xfs_name *ci_name);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-26 19:28     ` Al Viro
  2019-08-27  8:51       ` Miklos Szeredi
@ 2019-08-31  8:32       ` Christoph Hellwig
  2019-08-31 13:35         ` Al Viro
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2019-08-31  8:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Matthew Wilcox, Christoph Hellwig, linux-fsdevel,
	Octavian Purdila, Pantelis Antoniou, Linus Torvalds,
	Kai Mäkisara, linux-scsi

On Mon, Aug 26, 2019 at 08:28:19PM +0100, Al Viro wrote:
> For configfs bin_attr it won't work, simply because it wants the entire
> thing to be present - callback parses the data.  For SCSI tape...  Maybe,
> but you'll need to take care of the overlaps with ->write().  Right now
> it can't happen (the last reference, about to be dropped right after
> st_flush() returns); if we do that on each ->flush(), we will have to
> cope with that fun and we'll need to keep an error (if any) for the
> next call of st_flush() to pick and return.  I'm not saying it can't
> be done, but that's really a question for SCSI folks.

So for the one real life example of the configfs attribute life
actually is simpler.  acpi_table_aml_write verifies early on that
the size matches what it expects.  So if we document that any future
instance needs to be able to do that as well we should be able to
get away with just writing it from ->flush.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-31  8:32       ` Christoph Hellwig
@ 2019-08-31 13:35         ` Al Viro
  2019-08-31 14:44           ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2019-08-31 13:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, Kai Mäkisara, linux-scsi

On Sat, Aug 31, 2019 at 10:32:41AM +0200, Christoph Hellwig wrote:
> On Mon, Aug 26, 2019 at 08:28:19PM +0100, Al Viro wrote:
> > For configfs bin_attr it won't work, simply because it wants the entire
> > thing to be present - callback parses the data.  For SCSI tape...  Maybe,
> > but you'll need to take care of the overlaps with ->write().  Right now
> > it can't happen (the last reference, about to be dropped right after
> > st_flush() returns); if we do that on each ->flush(), we will have to
> > cope with that fun and we'll need to keep an error (if any) for the
> > next call of st_flush() to pick and return.  I'm not saying it can't
> > be done, but that's really a question for SCSI folks.
> 
> So for the one real life example of the configfs attribute life
> actually is simpler.  acpi_table_aml_write verifies early on that
> the size matches what it expects.  So if we document that any future
> instance needs to be able to do that as well we should be able to
> get away with just writing it from ->flush.

I'm not sure I understand what you mean...  Do you want them to recognize
incomplete data and quietly bugger off when called on too early ->flush()?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-31 13:35         ` Al Viro
@ 2019-08-31 14:44           ` Christoph Hellwig
  2019-08-31 15:58             ` Al Viro
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2019-08-31 14:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Matthew Wilcox, linux-fsdevel,
	Octavian Purdila, Pantelis Antoniou, Linus Torvalds,
	Kai Mäkisara, linux-scsi

On Sat, Aug 31, 2019 at 02:35:37PM +0100, Al Viro wrote:
> > So for the one real life example of the configfs attribute life
> > actually is simpler.  acpi_table_aml_write verifies early on that
> > the size matches what it expects.  So if we document that any future
> > instance needs to be able to do that as well we should be able to
> > get away with just writing it from ->flush.
> 
> I'm not sure I understand what you mean...  Do you want them to recognize
> incomplete data and quietly bugger off when called on too early ->flush()?

That is what the only user does anyway, take a look at
acpi_table_aml_write.  So yes, change the documentation to say it
gets written on every close, and the implementation has to deal with
incomplete data by either returning an error or ignoring it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Re: broken userland ABI in configfs binary attributes
  2019-08-31 14:44           ` Christoph Hellwig
@ 2019-08-31 15:58             ` Al Viro
  0 siblings, 0 replies; 22+ messages in thread
From: Al Viro @ 2019-08-31 15:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, linux-fsdevel, Octavian Purdila,
	Pantelis Antoniou, Linus Torvalds, Kai Mäkisara, linux-scsi

On Sat, Aug 31, 2019 at 04:44:48PM +0200, Christoph Hellwig wrote:
> On Sat, Aug 31, 2019 at 02:35:37PM +0100, Al Viro wrote:
> > > So for the one real life example of the configfs attribute life
> > > actually is simpler.  acpi_table_aml_write verifies early on that
> > > the size matches what it expects.  So if we document that any future
> > > instance needs to be able to do that as well we should be able to
> > > get away with just writing it from ->flush.
> > 
> > I'm not sure I understand what you mean...  Do you want them to recognize
> > incomplete data and quietly bugger off when called on too early ->flush()?
> 
> That is what the only user does anyway, take a look at
> acpi_table_aml_write.  So yes, change the documentation to say it
> gets written on every close, and the implementation has to deal with
> incomplete data by either returning an error or ignoring it.

That, or we could just let it have the sucker called on each write().
As in "for such files writes are accumulated in a buffer, until the
method finally decides it has enough to process"...

In that case method should return something recognizable for "need more
input", as opposed to "stop, it's already an unacceptable garbage".

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-08-31 15:58 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-26  2:48 broken userland ABI in configfs binary attributes Al Viro
2019-08-26 16:29 ` [RFC] " Al Viro
2019-08-26 18:20   ` Matthew Wilcox
2019-08-26 19:28     ` Al Viro
2019-08-27  8:51       ` Miklos Szeredi
2019-08-27 11:58         ` Al Viro
2019-08-27 12:21           ` Miklos Szeredi
2019-08-27 12:53             ` Al Viro
2019-08-31  8:32       ` Christoph Hellwig
2019-08-31 13:35         ` Al Viro
2019-08-31 14:44           ` Christoph Hellwig
2019-08-31 15:58             ` Al Viro
2019-08-26 18:34   ` "Kai Mäkisara (Kolumbus)"
2019-08-26 19:32     ` Al Viro
2019-08-27 15:01       ` Boaz Harrosh
2019-08-27 17:27         ` Al Viro
2019-08-27 17:59           ` Boaz Harrosh
2019-08-29 22:22           ` Al Viro
2019-08-29 23:32             ` Al Viro
2019-08-30  4:10             ` Dave Chinner
2019-08-30  4:44               ` Al Viro
2019-08-31  8:28                 ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).