[Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
@ 2010-04-08  7:47 Li Dongyang
  2010-04-08 18:41 ` Sunil Mushran
  2010-04-14  1:58 ` Joel Becker
  0 siblings, 2 replies; 27+ messages in thread
From: Li Dongyang @ 2010-04-08  7:47 UTC (permalink / raw)
  To: ocfs2-devel

when we fall back to buffered write from direct write, we call
__generic_file_aio_write but that will end up doing direct write
even we are only prepared to do buffered write because the file
has O_DIRECT flag set. This is a fix for
https://bugzilla.novell.com/show_bug.cgi?id=591039


Signed-off-by: Li Dongyang <lidongyang@novell.com>
---
 fs/ocfs2/file.c |   27 +++++++++++++++++----------
 1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index de059f4..707f2a2 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1973,18 +1973,24 @@ relock:
 	/* communicate with ocfs2_dio_end_io */
 	ocfs2_iocb_set_rw_locked(iocb, rw_level);
 
-	if (direct_io) {
-		ret = generic_segment_checks(iov, &nr_segs, &ocount,
-					     VERIFY_READ);
-		if (ret)
-			goto out_dio;
+	ret = generic_segment_checks(iov, &nr_segs, &ocount,
+				     VERIFY_READ);
+	if (ret)
+		goto out_dio;
 
-		count = ocount;
-		ret = generic_write_checks(file, ppos, &count,
+	count = ocount;
+	ret = generic_write_checks(file, ppos, &count,
 					   S_ISBLK(inode->i_mode));
-		if (ret)
-			goto out_dio;
+	if (ret)
+		goto out_dio;
+
+	ret = file_remove_suid(file);
+	if (ret)
+		goto out_dio;
 
+	file_update_time(file);
+
+	if (direct_io) {
 		written = generic_file_direct_write(iocb, iov, &nr_segs, *ppos,
 						    ppos, count, ocount);
 		if (written < 0) {
@@ -1999,7 +2005,8 @@ relock:
 			goto out_dio;
 		}
 	} else {
-		written = __generic_file_aio_write(iocb, iov, nr_segs, ppos);
+		written = generic_file_buffered_write(iocb, iov, nr_segs,
+				*ppos, ppos, count, 0);
 	}
 
 out_dio:
-- 
1.7.0.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-08  7:47 [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered Li Dongyang
@ 2010-04-08 18:41 ` Sunil Mushran
  2010-04-09  2:27   ` Li Dongyang
  2010-04-09  7:58   ` Coly Li
  2010-04-14  1:58 ` Joel Becker
  1 sibling, 2 replies; 27+ messages in thread
From: Sunil Mushran @ 2010-04-08 18:41 UTC (permalink / raw)
  To: ocfs2-devel

I cannot read the bugzilla. Now it maybe that that bz
cannot be made public. That's ok. But if that's the case,
can you explain the problem encountered. I am not qs
the fix... rather trying to understand why this has not
been reported before.

Li Dongyang wrote:
> when we fall back to buffered write from direct write, we call
> __generic_file_aio_write but that will end up doing direct write
> even we are only prepared to do buffered write because the file
> has O_DIRECT flag set. This is a fix for
> https://bugzilla.novell.com/show_bug.cgi?id=591039
>
>
> Signed-off-by: Li Dongyang <lidongyang@novell.com>
> ---
>  fs/ocfs2/file.c |   27 +++++++++++++++++----------
>  1 files changed, 17 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index de059f4..707f2a2 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1973,18 +1973,24 @@ relock:
>  	/* communicate with ocfs2_dio_end_io */
>  	ocfs2_iocb_set_rw_locked(iocb, rw_level);
>  
> -	if (direct_io) {
> -		ret = generic_segment_checks(iov, &nr_segs, &ocount,
> -					     VERIFY_READ);
> -		if (ret)
> -			goto out_dio;
> +	ret = generic_segment_checks(iov, &nr_segs, &ocount,
> +				     VERIFY_READ);
> +	if (ret)
> +		goto out_dio;
>  
> -		count = ocount;
> -		ret = generic_write_checks(file, ppos, &count,
> +	count = ocount;
> +	ret = generic_write_checks(file, ppos, &count,
>  					   S_ISBLK(inode->i_mode));
> -		if (ret)
> -			goto out_dio;
> +	if (ret)
> +		goto out_dio;
> +
> +	ret = file_remove_suid(file);
> +	if (ret)
> +		goto out_dio;
>  
> +	file_update_time(file);
> +
> +	if (direct_io) {
>  		written = generic_file_direct_write(iocb, iov, &nr_segs, *ppos,
>  						    ppos, count, ocount);
>  		if (written < 0) {
> @@ -1999,7 +2005,8 @@ relock:
>  			goto out_dio;
>  		}
>  	} else {
> -		written = __generic_file_aio_write(iocb, iov, nr_segs, ppos);
> +		written = generic_file_buffered_write(iocb, iov, nr_segs,
> +				*ppos, ppos, count, 0);
>  	}
>  
>  out_dio:
>   

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-08 18:41 ` Sunil Mushran
@ 2010-04-09  2:27   ` Li Dongyang
  2010-04-09  2:38     ` Tao Ma
  2010-04-09  7:58   ` Coly Li
  1 sibling, 1 reply; 27+ messages in thread
From: Li Dongyang @ 2010-04-09  2:27 UTC (permalink / raw)
  To: ocfs2-devel

This is because ocfs2_file_aio_write calls ocfs2_prepare_inode_for_write which 
sets direct_io to 0 if it finds out that direct IO would extend the file. But 
later we call __generic_file_aio_write which end's up calling 
generic_file_direct_write because the file has O_DIRECT flag.So every time we 
do a direct write extending the file, the inode->i_size gets inconsistent with 
the i_size on disk because we call generic_file_direct_write, and if we do a 
truncate after this, we will meet a bug in ocfs2_truncate_file.

On Friday 09 April 2010 02:41:26 Sunil Mushran wrote:
> I cannot read the bugzilla. Now it maybe that that bz
> cannot be made public. That's ok. But if that's the case,
> can you explain the problem encountered. I am not qs
> the fix... rather trying to understand why this has not
> been reported before.
> 
> Li Dongyang wrote:
> > when we fall back to buffered write from direct write, we call
> > __generic_file_aio_write but that will end up doing direct write
> > even we are only prepared to do buffered write because the file
> > has O_DIRECT flag set. This is a fix for
> > https://bugzilla.novell.com/show_bug.cgi?id=591039
> >
> >
> > Signed-off-by: Li Dongyang <lidongyang@novell.com>
> > ---
> >  fs/ocfs2/file.c |   27 +++++++++++++++++----------
> >  1 files changed, 17 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> > index de059f4..707f2a2 100644
> > --- a/fs/ocfs2/file.c
> > +++ b/fs/ocfs2/file.c
> > @@ -1973,18 +1973,24 @@ relock:
> >  	/* communicate with ocfs2_dio_end_io */
> >  	ocfs2_iocb_set_rw_locked(iocb, rw_level);
> >
> > -	if (direct_io) {
> > -		ret = generic_segment_checks(iov, &nr_segs, &ocount,
> > -					     VERIFY_READ);
> > -		if (ret)
> > -			goto out_dio;
> > +	ret = generic_segment_checks(iov, &nr_segs, &ocount,
> > +				     VERIFY_READ);
> > +	if (ret)
> > +		goto out_dio;
> >
> > -		count = ocount;
> > -		ret = generic_write_checks(file, ppos, &count,
> > +	count = ocount;
> > +	ret = generic_write_checks(file, ppos, &count,
> >  					   S_ISBLK(inode->i_mode));
> > -		if (ret)
> > -			goto out_dio;
> > +	if (ret)
> > +		goto out_dio;
> > +
> > +	ret = file_remove_suid(file);
> > +	if (ret)
> > +		goto out_dio;
> >
> > +	file_update_time(file);
> > +
> > +	if (direct_io) {
> >  		written = generic_file_direct_write(iocb, iov, &nr_segs, *ppos,
> >  						    ppos, count, ocount);
> >  		if (written < 0) {
> > @@ -1999,7 +2005,8 @@ relock:
> >  			goto out_dio;
> >  		}
> >  	} else {
> > -		written = __generic_file_aio_write(iocb, iov, nr_segs, ppos);
> > +		written = generic_file_buffered_write(iocb, iov, nr_segs,
> > +				*ppos, ppos, count, 0);
> >  	}
> >
> >  out_dio:
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-09  2:27   ` Li Dongyang
@ 2010-04-09  2:38     ` Tao Ma
  2010-04-09  3:00       ` Li Dongyang
  0 siblings, 1 reply; 27+ messages in thread
From: Tao Ma @ 2010-04-09  2:38 UTC (permalink / raw)
  To: ocfs2-devel

Hi Dongyang,

Li Dongyang wrote:
> This is because ocfs2_file_aio_write calls ocfs2_prepare_inode_for_write which 
> sets direct_io to 0 if it finds out that direct IO would extend the file. But 
> later we call __generic_file_aio_write which end's up calling 
> generic_file_direct_write because the file has O_DIRECT flag.So every time we 
> do a direct write extending the file, the inode->i_size gets inconsistent with 
> the i_size on disk because we call generic_file_direct_write, and if we do a 
> truncate after this, we will meet a bug in ocfs2_truncate_file.
yes we have O_DIRECT flag set and in __generic_file_aio_write it will 
call generic_file_direct_write first and then trigger to 
ocfs2_direct_IO. In this function we will check again and return 0. And 
_generic_file_aio_write will fall back to buffered write if the directIO 
can't write. Am I wrong somehow?

Regards,
Tao
> 
> On Friday 09 April 2010 02:41:26 Sunil Mushran wrote:
>> I cannot read the bugzilla. Now it maybe that that bz
>> cannot be made public. That's ok. But if that's the case,
>> can you explain the problem encountered. I am not qs
>> the fix... rather trying to understand why this has not
>> been reported before.
>>
>> Li Dongyang wrote:
>>> when we fall back to buffered write from direct write, we call
>>> __generic_file_aio_write but that will end up doing direct write
>>> even we are only prepared to do buffered write because the file
>>> has O_DIRECT flag set. This is a fix for
>>> https://bugzilla.novell.com/show_bug.cgi?id=591039
>>>
>>>
>>> Signed-off-by: Li Dongyang <lidongyang@novell.com>
>>> ---
>>>  fs/ocfs2/file.c |   27 +++++++++++++++++----------
>>>  1 files changed, 17 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>>> index de059f4..707f2a2 100644
>>> --- a/fs/ocfs2/file.c
>>> +++ b/fs/ocfs2/file.c
>>> @@ -1973,18 +1973,24 @@ relock:
>>>  	/* communicate with ocfs2_dio_end_io */
>>>  	ocfs2_iocb_set_rw_locked(iocb, rw_level);
>>>
>>> -	if (direct_io) {
>>> -		ret = generic_segment_checks(iov, &nr_segs, &ocount,
>>> -					     VERIFY_READ);
>>> -		if (ret)
>>> -			goto out_dio;
>>> +	ret = generic_segment_checks(iov, &nr_segs, &ocount,
>>> +				     VERIFY_READ);
>>> +	if (ret)
>>> +		goto out_dio;
>>>
>>> -		count = ocount;
>>> -		ret = generic_write_checks(file, ppos, &count,
>>> +	count = ocount;
>>> +	ret = generic_write_checks(file, ppos, &count,
>>>  					   S_ISBLK(inode->i_mode));
>>> -		if (ret)
>>> -			goto out_dio;
>>> +	if (ret)
>>> +		goto out_dio;
>>> +
>>> +	ret = file_remove_suid(file);
>>> +	if (ret)
>>> +		goto out_dio;
>>>
>>> +	file_update_time(file);
>>> +
>>> +	if (direct_io) {
>>>  		written = generic_file_direct_write(iocb, iov, &nr_segs, *ppos,
>>>  						    ppos, count, ocount);
>>>  		if (written < 0) {
>>> @@ -1999,7 +2005,8 @@ relock:
>>>  			goto out_dio;
>>>  		}
>>>  	} else {
>>> -		written = __generic_file_aio_write(iocb, iov, nr_segs, ppos);
>>> +		written = generic_file_buffered_write(iocb, iov, nr_segs,
>>> +				*ppos, ppos, count, 0);
>>>  	}
>>>
>>>  out_dio:
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-09  2:38     ` Tao Ma
@ 2010-04-09  3:00       ` Li Dongyang
  2010-04-09  3:32         ` Tao Ma
  0 siblings, 1 reply; 27+ messages in thread
From: Li Dongyang @ 2010-04-09  3:00 UTC (permalink / raw)
  To: ocfs2-devel

Hi, Tao,
On Friday 09 April 2010 10:38:33 Tao Ma wrote:
> Hi Dongyang,
> 
> Li Dongyang wrote:
> > This is because ocfs2_file_aio_write calls ocfs2_prepare_inode_for_write
> > which sets direct_io to 0 if it finds out that direct IO would extend the
> > file. But later we call __generic_file_aio_write which end's up calling
> > generic_file_direct_write because the file has O_DIRECT flag.So every
> > time we do a direct write extending the file, the inode->i_size gets
> > inconsistent with the i_size on disk because we call
> > generic_file_direct_write, and if we do a truncate after this, we will
> > meet a bug in ocfs2_truncate_file.
> 
> yes we have O_DIRECT flag set and in __generic_file_aio_write it will
> call generic_file_direct_write first and then trigger to
> ocfs2_direct_IO. In this function we will check again and return 0. And
> _generic_file_aio_write will fall back to buffered write if the directIO
> can't write. Am I wrong somehow?
> 
yes ocfs2_direct_IO has some check, but it just check if we are appending(the 
i_size <= offset), if the offset < i_size and offset + count > i_size, it will 
do direct io anyway. seems we also can fix this by adding a check to 
ocfs2_direct_IO.
Br,
Li Dongyang
> Regards,
> Tao
> 
> > On Friday 09 April 2010 02:41:26 Sunil Mushran wrote:
> >> I cannot read the bugzilla. Now it maybe that that bz
> >> cannot be made public. That's ok. But if that's the case,
> >> can you explain the problem encountered. I am not qs
> >> the fix... rather trying to understand why this has not
> >> been reported before.
> >>
> >> Li Dongyang wrote:
> >>> when we fall back to buffered write from direct write, we call
> >>> __generic_file_aio_write but that will end up doing direct write
> >>> even we are only prepared to do buffered write because the file
> >>> has O_DIRECT flag set. This is a fix for
> >>> https://bugzilla.novell.com/show_bug.cgi?id=591039
> >>>
> >>>
> >>> Signed-off-by: Li Dongyang <lidongyang@novell.com>
> >>> ---
> >>>  fs/ocfs2/file.c |   27 +++++++++++++++++----------
> >>>  1 files changed, 17 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> >>> index de059f4..707f2a2 100644
> >>> --- a/fs/ocfs2/file.c
> >>> +++ b/fs/ocfs2/file.c
> >>> @@ -1973,18 +1973,24 @@ relock:
> >>>  	/* communicate with ocfs2_dio_end_io */
> >>>  	ocfs2_iocb_set_rw_locked(iocb, rw_level);
> >>>
> >>> -	if (direct_io) {
> >>> -		ret = generic_segment_checks(iov, &nr_segs, &ocount,
> >>> -					     VERIFY_READ);
> >>> -		if (ret)
> >>> -			goto out_dio;
> >>> +	ret = generic_segment_checks(iov, &nr_segs, &ocount,
> >>> +				     VERIFY_READ);
> >>> +	if (ret)
> >>> +		goto out_dio;
> >>>
> >>> -		count = ocount;
> >>> -		ret = generic_write_checks(file, ppos, &count,
> >>> +	count = ocount;
> >>> +	ret = generic_write_checks(file, ppos, &count,
> >>>  					   S_ISBLK(inode->i_mode));
> >>> -		if (ret)
> >>> -			goto out_dio;
> >>> +	if (ret)
> >>> +		goto out_dio;
> >>> +
> >>> +	ret = file_remove_suid(file);
> >>> +	if (ret)
> >>> +		goto out_dio;
> >>>
> >>> +	file_update_time(file);
> >>> +
> >>> +	if (direct_io) {
> >>>  		written = generic_file_direct_write(iocb, iov, &nr_segs, *ppos,
> >>>  						    ppos, count, ocount);
> >>>  		if (written < 0) {
> >>> @@ -1999,7 +2005,8 @@ relock:
> >>>  			goto out_dio;
> >>>  		}
> >>>  	} else {
> >>> -		written = __generic_file_aio_write(iocb, iov, nr_segs, ppos);
> >>> +		written = generic_file_buffered_write(iocb, iov, nr_segs,
> >>> +				*ppos, ppos, count, 0);
> >>>  	}
> >>>
> >>>  out_dio:
> >
> > _______________________________________________
> > Ocfs2-devel mailing list
> > Ocfs2-devel at oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-09  3:00       ` Li Dongyang
@ 2010-04-09  3:32         ` Tao Ma
  2010-04-09  9:20           ` Li Dongyang
  0 siblings, 1 reply; 27+ messages in thread
From: Tao Ma @ 2010-04-09  3:32 UTC (permalink / raw)
  To: ocfs2-devel

Hi Dongyang,

Li Dongyang wrote:
> Hi, Tao,
> On Friday 09 April 2010 10:38:33 Tao Ma wrote:
>> Hi Dongyang,
>>
>> Li Dongyang wrote:
>>> This is because ocfs2_file_aio_write calls ocfs2_prepare_inode_for_write
>>> which sets direct_io to 0 if it finds out that direct IO would extend the
>>> file. But later we call __generic_file_aio_write which end's up calling
>>> generic_file_direct_write because the file has O_DIRECT flag.So every
>>> time we do a direct write extending the file, the inode->i_size gets
>>> inconsistent with the i_size on disk because we call
>>> generic_file_direct_write, and if we do a truncate after this, we will
>>> meet a bug in ocfs2_truncate_file.
>> yes we have O_DIRECT flag set and in __generic_file_aio_write it will
>> call generic_file_direct_write first and then trigger to
>> ocfs2_direct_IO. In this function we will check again and return 0. And
>> _generic_file_aio_write will fall back to buffered write if the directIO
>> can't write. Am I wrong somehow?
>>
> yes ocfs2_direct_IO has some check, but it just check if we are appending(the 
> i_size <= offset), if the offset < i_size and offset + count > i_size, it will 
> do direct io anyway. seems we also can fix this by adding a check to 
> ocfs2_direct_IO.
It is done by ocfs2_direct_IO_get_blocks. Just debug the kernel and you 
will get what I mean. ;)

Regards,
Tao
> Br,
> Li Dongyang
>> Regards,
>> Tao
>>
>>> On Friday 09 April 2010 02:41:26 Sunil Mushran wrote:
>>>> I cannot read the bugzilla. Now it maybe that that bz
>>>> cannot be made public. That's ok. But if that's the case,
>>>> can you explain the problem encountered. I am not qs
>>>> the fix... rather trying to understand why this has not
>>>> been reported before.
>>>>
>>>> Li Dongyang wrote:
>>>>> when we fall back to buffered write from direct write, we call
>>>>> __generic_file_aio_write but that will end up doing direct write
>>>>> even we are only prepared to do buffered write because the file
>>>>> has O_DIRECT flag set. This is a fix for
>>>>> https://bugzilla.novell.com/show_bug.cgi?id=591039
>>>>>
>>>>>
>>>>> Signed-off-by: Li Dongyang <lidongyang@novell.com>
>>>>> ---
>>>>>  fs/ocfs2/file.c |   27 +++++++++++++++++----------
>>>>>  1 files changed, 17 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>>>>> index de059f4..707f2a2 100644
>>>>> --- a/fs/ocfs2/file.c
>>>>> +++ b/fs/ocfs2/file.c
>>>>> @@ -1973,18 +1973,24 @@ relock:
>>>>>  	/* communicate with ocfs2_dio_end_io */
>>>>>  	ocfs2_iocb_set_rw_locked(iocb, rw_level);
>>>>>
>>>>> -	if (direct_io) {
>>>>> -		ret = generic_segment_checks(iov, &nr_segs, &ocount,
>>>>> -					     VERIFY_READ);
>>>>> -		if (ret)
>>>>> -			goto out_dio;
>>>>> +	ret = generic_segment_checks(iov, &nr_segs, &ocount,
>>>>> +				     VERIFY_READ);
>>>>> +	if (ret)
>>>>> +		goto out_dio;
>>>>>
>>>>> -		count = ocount;
>>>>> -		ret = generic_write_checks(file, ppos, &count,
>>>>> +	count = ocount;
>>>>> +	ret = generic_write_checks(file, ppos, &count,
>>>>>  					   S_ISBLK(inode->i_mode));
>>>>> -		if (ret)
>>>>> -			goto out_dio;
>>>>> +	if (ret)
>>>>> +		goto out_dio;
>>>>> +
>>>>> +	ret = file_remove_suid(file);
>>>>> +	if (ret)
>>>>> +		goto out_dio;
>>>>>
>>>>> +	file_update_time(file);
>>>>> +
>>>>> +	if (direct_io) {
>>>>>  		written = generic_file_direct_write(iocb, iov, &nr_segs, *ppos,
>>>>>  						    ppos, count, ocount);
>>>>>  		if (written < 0) {
>>>>> @@ -1999,7 +2005,8 @@ relock:
>>>>>  			goto out_dio;
>>>>>  		}
>>>>>  	} else {
>>>>> -		written = __generic_file_aio_write(iocb, iov, nr_segs, ppos);
>>>>> +		written = generic_file_buffered_write(iocb, iov, nr_segs,
>>>>> +				*ppos, ppos, count, 0);
>>>>>  	}
>>>>>
>>>>>  out_dio:
>>> _______________________________________________
>>> Ocfs2-devel mailing list
>>> Ocfs2-devel at oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-09  7:58   ` Coly Li
@ 2010-04-09  7:56     ` Tao Ma
  0 siblings, 0 replies; 27+ messages in thread
From: Tao Ma @ 2010-04-09  7:56 UTC (permalink / raw)
  To: ocfs2-devel

Hi coly,

Coly Li wrote:
> 
> On 04/09/2010 02:41 AM, Sunil Mushran Wrote:
>> I cannot read the bugzilla. Now it maybe that that bz
>> cannot be made public. That's ok. But if that's the case,
>> can you explain the problem encountered. I am not qs
>> the fix... rather trying to understand why this has not
>> been reported before.
>>
> 
> Hi Sunil,
> 
> This issue was reported by Jiaju Zhang, another Novell ocfs2/dlm developer. When he did I/O pressure test (fsstress from
> ltp package), the following dmesg was observed,
> 
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717421] (11411,2):ocfs2_truncate_file:465 ERROR: bug expression:
> le64_to_cpu(fe->i_size) != i_size_read(inode)
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717437] (11411,2):ocfs2_truncate_file:465 ERROR: Inode 241893, inode i_size =
> 1540096 != di i_size = 1535498, i_flags = 0x1
Why do you guys think this is caused by the directIO fall back?
IMHO, we should update inode->i_size and fe->i_size simultaneously. So 
do you find a place where we don't sync them? I guess that should be the 
root cause.

Regards,
Tao
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717462] ------------[ cut here]------------
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717465] kernel BUG at /usr/src/packages/BUILD/ocfs2-1.4/xen/ocfs2/file.c:465!
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717468] invalid opcode: 0000 [#2] SMP
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717471] last sysfs file: /sys/kernel/uevent_seqnum
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717474] Modules linked in: ocfs2 jbd2 ocfs2_nodemanager quota_tree
> ocfs2_stack_user ocfs2_stackglue dlm configfs sg sd_mod crc_t10dif crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad
> ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_mod af_packet microcode softdog fuse loop
> dm_mod rtc_core rtc_lib joydev xennet ext3 mbcache jbd processor thermal_sys hwmon xenblk cdrom
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717516] Supported: Yes
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717518]
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717521] Pid: 11411, comm: fsstress Tainted: G      D      (2.6.32.9-0.5-xen #1)
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717525] EIP: 0061:[<d24701ba>] EFLAGS: 00010296 CPU: 2
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717538] EIP is at ocfs2_setattr+0xc1a/0x1d10 [ocfs2]
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717542] EAX: 00000089 EBX: cd8e25f0 ECX: c056c0ec EDX: 00000000
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717545] ESI: cc4c2000 EDI: cae4e908 EBP: 00068f02 ESP: c0a43e54
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717548]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717552] Process fsstress (pid: 11411, ti=c0a42000 task=cd8e25f0  task.ti=c0a42000)
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717555] Stack:
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717557]  d24cfc30 00002c93 00000002 d24c809c 000001d1 0003b0e5 00000000 00178000
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717564] <0> 00000000 00176e0a 00000000 00000001 00110f02 00000000 00000000 00000000
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717572] <0> 00000000 00000000 00000000 00110f02 d24628e9 00008282 c0a43f44 ca5c4000
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717582] Call Trace:
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717606]  [<c00d6191>] notify_change+0x141/0x320
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717614]  [<c00bf1a8>] do_truncate+0x68/0xa0
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717619]  [<c00bf547>] do_sys_truncate+0x177/0x220
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717624]  [<c000666d>] syscall_call+0x7/0xb
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717629]  [<f57fe424>] 0xf57fe424
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717631] Code: 69 e8 ed f6 05 82 9f d7
> d1 80 75 09 f6 05 84 9f d7 d1 01 74 16 f6 05 8a 9f d7 d1 80 75 0d f6 05 8c 9f
> d7 d1 01 0f 84 48 06 00 00 <0f> 0b eb fe 66 90 8b 44 24 68 31 c9 e8 b5 2f c9 ed
> 31 c9 89 44
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717675] EIP: [<d24701ba>] ocfs2_setattr+0xc1a/0x1d10 [ocfs2] SS:ESP 0069:c0a43e54
> Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717688] ---[ end trace cce1004f6a64f124 ]---
> 
> 
> The above error can be reproduced by Jiaju, Dongyang, and me. Dongyang also reproduced this issue on vanilla kernel. We
> find these steps is easier to reproduce the error: 1) fill the ocfs2 volume to 97%-98% full (dd a big file on ocfs2
> volume)  2) then ran fsstress
> 
> Jan Kara also helps to review Dongyang's patch, no objection from him.
> 
> Hope the explanation is informative.
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-08 18:41 ` Sunil Mushran
  2010-04-09  2:27   ` Li Dongyang
@ 2010-04-09  7:58   ` Coly Li
  2010-04-09  7:56     ` Tao Ma
  1 sibling, 1 reply; 27+ messages in thread
From: Coly Li @ 2010-04-09  7:58 UTC (permalink / raw)
  To: ocfs2-devel



On 04/09/2010 02:41 AM, Sunil Mushran Wrote:
> I cannot read the bugzilla. Now it maybe that that bz
> cannot be made public. That's ok. But if that's the case,
> can you explain the problem encountered. I am not qs
> the fix... rather trying to understand why this has not
> been reported before.
> 

Hi Sunil,

This issue was reported by Jiaju Zhang, another Novell ocfs2/dlm developer. When he did I/O pressure test (fsstress from
ltp package), the following dmesg was observed,

Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717421] (11411,2):ocfs2_truncate_file:465 ERROR: bug expression:
le64_to_cpu(fe->i_size) != i_size_read(inode)
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717437] (11411,2):ocfs2_truncate_file:465 ERROR: Inode 241893, inode i_size =
1540096 != di i_size = 1535498, i_flags = 0x1
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717462] ------------[ cut here]------------
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717465] kernel BUG at /usr/src/packages/BUILD/ocfs2-1.4/xen/ocfs2/file.c:465!
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717468] invalid opcode: 0000 [#2] SMP
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717471] last sysfs file: /sys/kernel/uevent_seqnum
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717474] Modules linked in: ocfs2 jbd2 ocfs2_nodemanager quota_tree
ocfs2_stack_user ocfs2_stackglue dlm configfs sg sd_mod crc_t10dif crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad
ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_mod af_packet microcode softdog fuse loop
dm_mod rtc_core rtc_lib joydev xennet ext3 mbcache jbd processor thermal_sys hwmon xenblk cdrom
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717516] Supported: Yes
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717518]
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717521] Pid: 11411, comm: fsstress Tainted: G      D      (2.6.32.9-0.5-xen #1)
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717525] EIP: 0061:[<d24701ba>] EFLAGS: 00010296 CPU: 2
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717538] EIP is at ocfs2_setattr+0xc1a/0x1d10 [ocfs2]
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717542] EAX: 00000089 EBX: cd8e25f0 ECX: c056c0ec EDX: 00000000
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717545] ESI: cc4c2000 EDI: cae4e908 EBP: 00068f02 ESP: c0a43e54
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717548]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717552] Process fsstress (pid: 11411, ti=c0a42000 task=cd8e25f0  task.ti=c0a42000)
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717555] Stack:
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717557]  d24cfc30 00002c93 00000002 d24c809c 000001d1 0003b0e5 00000000 00178000
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717564] <0> 00000000 00176e0a 00000000 00000001 00110f02 00000000 00000000 00000000
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717572] <0> 00000000 00000000 00000000 00110f02 d24628e9 00008282 c0a43f44 ca5c4000
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717582] Call Trace:
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717606]  [<c00d6191>] notify_change+0x141/0x320
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717614]  [<c00bf1a8>] do_truncate+0x68/0xa0
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717619]  [<c00bf547>] do_sys_truncate+0x177/0x220
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717624]  [<c000666d>] syscall_call+0x7/0xb
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717629]  [<f57fe424>] 0xf57fe424
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717631] Code: 69 e8 ed f6 05 82 9f d7
d1 80 75 09 f6 05 84 9f d7 d1 01 74 16 f6 05 8a 9f d7 d1 80 75 0d f6 05 8c 9f
d7 d1 01 0f 84 48 06 00 00 <0f> 0b eb fe 66 90 8b 44 24 68 31 c9 e8 b5 2f c9 ed
31 c9 89 44
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717675] EIP: [<d24701ba>] ocfs2_setattr+0xc1a/0x1d10 [ocfs2] SS:ESP 0069:c0a43e54
Mar 24 15:59:07 xen-sp1-2 kernel: [ 2309.717688] ---[ end trace cce1004f6a64f124 ]---


The above error can be reproduced by Jiaju, Dongyang, and me. Dongyang also reproduced this issue on vanilla kernel. We
find these steps is easier to reproduce the error: 1) fill the ocfs2 volume to 97%-98% full (dd a big file on ocfs2
volume)  2) then ran fsstress

Jan Kara also helps to review Dongyang's patch, no objection from him.

Hope the explanation is informative.

-- 
Coly Li
SuSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-09  3:32         ` Tao Ma
@ 2010-04-09  9:20           ` Li Dongyang
  2010-04-09 17:36             ` Sunil Mushran
  0 siblings, 1 reply; 27+ messages in thread
From: Li Dongyang @ 2010-04-09  9:20 UTC (permalink / raw)
  To: ocfs2-devel

On Friday 09 April 2010 11:32:10 Tao Ma wrote:
> Hi Dongyang,
> 
> Li Dongyang wrote:
> > Hi, Tao,
> >
> > On Friday 09 April 2010 10:38:33 Tao Ma wrote:
> >> Hi Dongyang,
> >>
> >> Li Dongyang wrote:
> >>> This is because ocfs2_file_aio_write calls
> >>> ocfs2_prepare_inode_for_write which sets direct_io to 0 if it finds out
> >>> that direct IO would extend the file. But later we call
> >>> __generic_file_aio_write which end's up calling
> >>> generic_file_direct_write because the file has O_DIRECT flag.So every
> >>> time we do a direct write extending the file, the inode->i_size gets
> >>> inconsistent with the i_size on disk because we call
> >>> generic_file_direct_write, and if we do a truncate after this, we will
> >>> meet a bug in ocfs2_truncate_file.
> >>
> >> yes we have O_DIRECT flag set and in __generic_file_aio_write it will
> >> call generic_file_direct_write first and then trigger to
> >> ocfs2_direct_IO. In this function we will check again and return 0. And
> >> _generic_file_aio_write will fall back to buffered write if the directIO
> >> can't write. Am I wrong somehow?
> >
> > yes ocfs2_direct_IO has some check, but it just check if we are
> > appending(the i_size <= offset), if the offset < i_size and offset +
> > count > i_size, it will do direct io anyway. seems we also can fix this
> > by adding a check to ocfs2_direct_IO.
> 
> It is done by ocfs2_direct_IO_get_blocks. Just debug the kernel and you
> will get what I mean. ;)
> 
Do you mean this section in ocfs2_direct_IO_get_blocks:?
/*
 * Any write past EOF is not allowed because we'd be extending.
 */
if (create && (iblock + max_blocks) > inode_blocks) {
	ret = -EIO;
	goto bail;
}

I was using the linus tree 
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
and we don't have that check, but I can find this in the 
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git, introduced by 
commit 564f8a3228879d6962edb3432d01bcd7499a67ec

and now with this check I got what you mean, you are right, but I wonder why 
the linus tree doesn't have this check? and are we suppose to do with this?
IMHO we can just push this commit to linus tree.

Br,
Li Dongyang
> Regards,
> Tao
> 
> > Br,
> > Li Dongyang
> >
> >> Regards,
> >> Tao
> >>
> >>> On Friday 09 April 2010 02:41:26 Sunil Mushran wrote:
> >>>> I cannot read the bugzilla. Now it maybe that that bz
> >>>> cannot be made public. That's ok. But if that's the case,
> >>>> can you explain the problem encountered. I am not qs
> >>>> the fix... rather trying to understand why this has not
> >>>> been reported before.
> >>>>
> >>>> Li Dongyang wrote:
> >>>>> when we fall back to buffered write from direct write, we call
> >>>>> __generic_file_aio_write but that will end up doing direct write
> >>>>> even we are only prepared to do buffered write because the file
> >>>>> has O_DIRECT flag set. This is a fix for
> >>>>> https://bugzilla.novell.com/show_bug.cgi?id=591039
> >>>>>
> >>>>>
> >>>>> Signed-off-by: Li Dongyang <lidongyang@novell.com>
> >>>>> ---
> >>>>>  fs/ocfs2/file.c |   27 +++++++++++++++++----------
> >>>>>  1 files changed, 17 insertions(+), 10 deletions(-)
> >>>>>
> >>>>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> >>>>> index de059f4..707f2a2 100644
> >>>>> --- a/fs/ocfs2/file.c
> >>>>> +++ b/fs/ocfs2/file.c
> >>>>> @@ -1973,18 +1973,24 @@ relock:
> >>>>>  	/* communicate with ocfs2_dio_end_io */
> >>>>>  	ocfs2_iocb_set_rw_locked(iocb, rw_level);
> >>>>>
> >>>>> -	if (direct_io) {
> >>>>> -		ret = generic_segment_checks(iov, &nr_segs, &ocount,
> >>>>> -					     VERIFY_READ);
> >>>>> -		if (ret)
> >>>>> -			goto out_dio;
> >>>>> +	ret = generic_segment_checks(iov, &nr_segs, &ocount,
> >>>>> +				     VERIFY_READ);
> >>>>> +	if (ret)
> >>>>> +		goto out_dio;
> >>>>>
> >>>>> -		count = ocount;
> >>>>> -		ret = generic_write_checks(file, ppos, &count,
> >>>>> +	count = ocount;
> >>>>> +	ret = generic_write_checks(file, ppos, &count,
> >>>>>  					   S_ISBLK(inode->i_mode));
> >>>>> -		if (ret)
> >>>>> -			goto out_dio;
> >>>>> +	if (ret)
> >>>>> +		goto out_dio;
> >>>>> +
> >>>>> +	ret = file_remove_suid(file);
> >>>>> +	if (ret)
> >>>>> +		goto out_dio;
> >>>>>
> >>>>> +	file_update_time(file);
> >>>>> +
> >>>>> +	if (direct_io) {
> >>>>>  		written = generic_file_direct_write(iocb, iov, &nr_segs, *ppos,
> >>>>>  						    ppos, count, ocount);
> >>>>>  		if (written < 0) {
> >>>>> @@ -1999,7 +2005,8 @@ relock:
> >>>>>  			goto out_dio;
> >>>>>  		}
> >>>>>  	} else {
> >>>>> -		written = __generic_file_aio_write(iocb, iov, nr_segs, ppos);
> >>>>> +		written = generic_file_buffered_write(iocb, iov, nr_segs,
> >>>>> +				*ppos, ppos, count, 0);
> >>>>>  	}
> >>>>>
> >>>>>  out_dio:
> >>>
> >>> _______________________________________________
> >>> Ocfs2-devel mailing list
> >>> Ocfs2-devel at oss.oracle.com
> >>> http://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-09  9:20           ` Li Dongyang
@ 2010-04-09 17:36             ` Sunil Mushran
  0 siblings, 0 replies; 27+ messages in thread
From: Sunil Mushran @ 2010-04-09 17:36 UTC (permalink / raw)
  To: ocfs2-devel

  Li Dongyang wrote:
> On Friday 09 April 2010 11:32:10 Tao Ma wrote:
>> Hi Dongyang,
>>
>> Li Dongyang wrote:
>>> Hi, Tao,
>>>
>>> On Friday 09 April 2010 10:38:33 Tao Ma wrote:
>>>> Hi Dongyang,
>>>>
>>>> Li Dongyang wrote:
>>>>> This is because ocfs2_file_aio_write calls
>>>>> ocfs2_prepare_inode_for_write which sets direct_io to 0 if it finds out
>>>>> that direct IO would extend the file. But later we call
>>>>> __generic_file_aio_write which end's up calling
>>>>> generic_file_direct_write because the file has O_DIRECT flag.So every
>>>>> time we do a direct write extending the file, the inode->i_size gets
>>>>> inconsistent with the i_size on disk because we call
>>>>> generic_file_direct_write, and if we do a truncate after this, we will
>>>>> meet a bug in ocfs2_truncate_file.
>>>> yes we have O_DIRECT flag set and in __generic_file_aio_write it will
>>>> call generic_file_direct_write first and then trigger to
>>>> ocfs2_direct_IO. In this function we will check again and return 0. And
>>>> _generic_file_aio_write will fall back to buffered write if the directIO
>>>> can't write. Am I wrong somehow?
>>> yes ocfs2_direct_IO has some check, but it just check if we are
>>> appending(the i_size <= offset), if the offset < i_size and offset +
>>> count > i_size, it will do direct io anyway. seems we also can fix this
>>> by adding a check to ocfs2_direct_IO.
>> It is done by ocfs2_direct_IO_get_blocks. Just debug the kernel and you
>> will get what I mean. ;)
> Do you mean this section in ocfs2_direct_IO_get_blocks:?
> /*
>  * Any write past EOF is not allowed because we'd be extending.
>  */
> if (create && (iblock + max_blocks) > inode_blocks) {
> 	ret = -EIO;
> 	goto bail;
> }
>
> I was using the linus tree 
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> and we don't have that check, but I can find this in the 
> git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git, introduced by 
> commit 564f8a3228879d6962edb3432d01bcd7499a67ec
>
> and now with this check I got what you mean, you are right, but I wonder why 
> the linus tree doesn't have this check? and are we suppose to do with this?
> IMHO we can just push this commit to linus tree.

commit 5fe878ae7f82fbf0830dbfaee4c5ca18f3aee442
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Dec 15 16:47:50 2009 -0800

    direct-io: cleanup blockdev_direct_IO locking

This check was removed recently by the above patch.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-08  7:47 [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered Li Dongyang
  2010-04-08 18:41 ` Sunil Mushran
@ 2010-04-14  1:58 ` Joel Becker
  2010-04-14  7:42   ` Li Dongyang
  1 sibling, 1 reply; 27+ messages in thread
From: Joel Becker @ 2010-04-14  1:58 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Apr 08, 2010 at 03:47:24PM +0800, Li Dongyang wrote:
> when we fall back to buffered write from direct write, we call
> __generic_file_aio_write but that will end up doing direct write
> even we are only prepared to do buffered write because the file
> has O_DIRECT flag set. This is a fix for
> https://bugzilla.novell.com/show_bug.cgi?id=591039

	We need to evaluate what __g_f_a_w() is doing and make sure
we're matching it appropriately for ocfs2.

> +	ret = file_remove_suid(file);
> +	if (ret)
> +		goto out_dio;

	NAK.  We do suid checks in ocfs2_prepare_inode_for_write().
Calling file_remove_suid() is outside of ocfs2's locking.  It calls
->setattr() which has its own rules in ocfs2.

> +	file_update_time(file);

	We have special behaviors regarding time updates for ocfs2
direct I/O.  This might want to live right next to the call to
generic_file_buffered_write().  But maybe not.  It needs to be checked.

Joel

-- 

To spot the expert, pick the one who predicts the job will take the
longest and cost the most.

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-14  1:58 ` Joel Becker
@ 2010-04-14  7:42   ` Li Dongyang
  0 siblings, 0 replies; 27+ messages in thread
From: Li Dongyang @ 2010-04-14  7:42 UTC (permalink / raw)
  To: ocfs2-devel

On Wednesday 14 April 2010 09:58:19 Joel Becker wrote:
> On Thu, Apr 08, 2010 at 03:47:24PM +0800, Li Dongyang wrote:
> > when we fall back to buffered write from direct write, we call
> > __generic_file_aio_write but that will end up doing direct write
> > even we are only prepared to do buffered write because the file
> > has O_DIRECT flag set. This is a fix for
> > https://bugzilla.novell.com/show_bug.cgi?id=591039
> 
> 	We need to evaluate what __g_f_a_w() is doing and make sure
> we're matching it appropriately for ocfs2.
> 
Thanks for reviewing, that's just the two uncertain points to me with the patch.
> > +	ret = file_remove_suid(file);
> > +	if (ret)
> > +		goto out_dio;
> 
> 	NAK.  We do suid checks in ocfs2_prepare_inode_for_write().
> Calling file_remove_suid() is outside of ocfs2's locking.  It calls
> ->setattr() which has its own rules in ocfs2.
you are right, I did that because we are calling __g_f_a_w() in o_f_a_w()
which will call file_remove_suid() before.
> 
> > +	file_update_time(file);
> 
> 	We have special behaviors regarding time updates for ocfs2
> direct I/O.  This might want to live right next to the call to
> generic_file_buffered_write().  But maybe not.  It needs to be checked.
I don't think we need that, as we will update the time in ocfs2_write_end_nolock()
called by ocfs2_write_end().
and could I have more details on how we handle time updates for ocfs2 direct io?
Br,
Li Dongyang
> 
> Joel
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-22 14:13         ` Li Dongyang
@ 2010-04-23 20:06           ` Joel Becker
  0 siblings, 0 replies; 27+ messages in thread
From: Joel Becker @ 2010-04-23 20:06 UTC (permalink / raw)
  To: ocfs2-devel

On Thu, Apr 22, 2010 at 10:13:51PM +0800, Li Dongyang wrote:
> another question: why do we only take PR on the rw_lock and do not allow a 
> direct write extending the i_size?

	The rw_lock is the cluster equivalent of the alloc_sem.  We take
a PR when we intend to use the existing allocation but will not modify
the allocation.  When we want to change the allocation (filling holes,
changing unwritten extents, growing i_size) we must take an EX.

Joel

-- 

Life's Little Instruction Book #30

	"Never buy a house without a fireplace."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-14 19:20       ` Joel Becker
@ 2010-04-22 14:13         ` Li Dongyang
  2010-04-23 20:06           ` Joel Becker
  0 siblings, 1 reply; 27+ messages in thread
From: Li Dongyang @ 2010-04-22 14:13 UTC (permalink / raw)
  To: ocfs2-devel

On Thursday 15 April 2010 03:20:11 Joel Becker wrote:
> On Wed, Apr 14, 2010 at 01:58:20PM +0800, Li Dongyang wrote:
> > On Wednesday 14 April 2010 07:54:35 Joel Becker wrote:
> > > 	I think Sunil and I have found the real culprit.
> > > 	If a file is opened for O_DIRECT, and there are no holes,
> > > refcounts or anything, we are doing direct I/O.  ocfs2_file_aio_write()
> > > (o_f_a_w() from now on) locks things down like so:  lock(i_mutex),
> > > down_read(ip_alloc_sem), PR(rw_lock).  We have ip_alloc_sem preventing
> > > size changes on the local node and rw_lock preventing size changes on
> > > other nodes.  We call generic_file_direct_write() ourselves.
> > > 	If a file is not opened with O_DIRECT, we are doing regular
> > > buffered writes.  o_f_a_w() locks like so: lock(i_mutex),
> > > EX(rw_lock).  It is protecting against other nodes, but it does not
> > > touch ip_alloc_sem.  Why?  Because we call __generic_file_aio_write(),
> > > which will call ->write_begin().  ip_alloc_sem will be taken inside
> > > ->write_begin().  That's where we protect against other local
> > > processes. You may already see where I'm going with this.  If we are
> > > open with O_DIRECT, but we have to fall back to buffered, we will do
> > > this locking:  lock(i_mutex), down_read(ip_alloc_sem), PR(rw_lock),
> > > NL(rw_lock), up_read(ip_alloc_sem), EX(rw_lock).  That is, we start
> > > with the direct I/O locking, then back off and do the buffered locking.
> > >  But when we get into __g_f_a_w(), it will try the direct I/O again. 
> > > If the leading portion of the I/O is capable of direct I/O, it will go
> > > into direct mode *without ever taking ip_alloc_sem*.  Once it gets to
> > > the portion of the I/O that cannot be done direct, it will fall back to
> > > buffered for the rest of the I/O and will call ->write_begin() as
> > > expected.
> > > 	So this I/O that extends i_size to the end of the allocation
> > > will proceed as a direct I/O but will not have ip_alloc_sem.  Thus
> > > truncate (and any other allocation change) can race on the local
> > > machine.
> > > 	I think some form of Dong Yang's patch is going to be necessary.
> >
> > Thanks for the great explanation and analysis, but I only see we down
> > write the OCFS2_I(inode)->ip_alloc_sem in ->write_begin() and we are
> > taking inode->i_alloc_sem in o_f_a_w() when we try to do a direct write,
> > not the ip_alloc_sem. Am I missing something?
> 
> 	You're right, we use i_alloc_sem in the direct case and
> ip_alloc_sem in the buffered case.  It is, however, for the same reason.
> i_alloc_sem is about competing with the VFS (eg, vs vfs_truncate()).
> ip_alloc_sem is about competing with ourselves (ocfs2_truncate(),
> ocfs2_readpage(), etc).
> 	While I should be saying i_alloc_sem above for the direct I/O
> case, the rest of the analysis is still correct.  We need to be holding
> i_alloc_sem if we're going to be issuing direct I/Os, and we are not
> holding it in the fallback to buffered case.
> 
> Joel
> 
another question: why do we only take PR on the rw_lock and do not allow a 
direct write extending the i_size?
Br
Li Dongyang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-14  5:58     ` Li Dongyang
@ 2010-04-14 19:20       ` Joel Becker
  2010-04-22 14:13         ` Li Dongyang
  0 siblings, 1 reply; 27+ messages in thread
From: Joel Becker @ 2010-04-14 19:20 UTC (permalink / raw)
  To: ocfs2-devel

On Wed, Apr 14, 2010 at 01:58:20PM +0800, Li Dongyang wrote:
> On Wednesday 14 April 2010 07:54:35 Joel Becker wrote:
> > 	I think Sunil and I have found the real culprit.
> > 	If a file is opened for O_DIRECT, and there are no holes,
> > refcounts or anything, we are doing direct I/O.  ocfs2_file_aio_write()
> > (o_f_a_w() from now on) locks things down like so:  lock(i_mutex),
> > down_read(ip_alloc_sem), PR(rw_lock).  We have ip_alloc_sem preventing
> > size changes on the local node and rw_lock preventing size changes on
> > other nodes.  We call generic_file_direct_write() ourselves.
> > 	If a file is not opened with O_DIRECT, we are doing regular
> > buffered writes.  o_f_a_w() locks like so: lock(i_mutex),
> > EX(rw_lock).  It is protecting against other nodes, but it does not
> > touch ip_alloc_sem.  Why?  Because we call __generic_file_aio_write(),
> > which will call ->write_begin().  ip_alloc_sem will be taken inside
> > ->write_begin().  That's where we protect against other local processes.
> > 	You may already see where I'm going with this.  If we are open
> > with O_DIRECT, but we have to fall back to buffered, we will do this
> > locking:  lock(i_mutex), down_read(ip_alloc_sem), PR(rw_lock),
> > NL(rw_lock), up_read(ip_alloc_sem), EX(rw_lock).  That is, we start with
> > the direct I/O locking, then back off and do the buffered locking.  But
> > when we get into __g_f_a_w(), it will try the direct I/O again.  If the
> > leading portion of the I/O is capable of direct I/O, it will go into
> > direct mode *without ever taking ip_alloc_sem*.  Once it gets to the
> > portion of the I/O that cannot be done direct, it will fall back to
> > buffered for the rest of the I/O and will call ->write_begin() as
> > expected.
> > 	So this I/O that extends i_size to the end of the allocation
> > will proceed as a direct I/O but will not have ip_alloc_sem.  Thus
> > truncate (and any other allocation change) can race on the local
> > machine.
> > 	I think some form of Dong Yang's patch is going to be necessary.
> > 
> Thanks for the great explanation and analysis, but I only see we down write the
> OCFS2_I(inode)->ip_alloc_sem in ->write_begin() and we are taking
> inode->i_alloc_sem in o_f_a_w() when we try to do a direct write, not the ip_alloc_sem.
> Am I missing something?

	You're right, we use i_alloc_sem in the direct case and
ip_alloc_sem in the buffered case.  It is, however, for the same reason.
i_alloc_sem is about competing with the VFS (eg, vs vfs_truncate()).
ip_alloc_sem is about competing with ourselves (ocfs2_truncate(),
ocfs2_readpage(), etc).
	While I should be saying i_alloc_sem above for the direct I/O
case, the rest of the analysis is still correct.  We need to be holding
i_alloc_sem if we're going to be issuing direct I/Os, and we are not
holding it in the fallback to buffered case.

Joel

-- 

"Depend on the rabbit's foot if you will, but remember, it didn't
 help the rabbit."
	- R. E. Shay

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-14  5:47         ` Li Dongyang
@ 2010-04-14  6:08           ` Tao Ma
  0 siblings, 0 replies; 27+ messages in thread
From: Tao Ma @ 2010-04-14  6:08 UTC (permalink / raw)
  To: ocfs2-devel

Hi Dongyang,

Li Dongyang wrote:
> Hi, Tao
> On Wednesday 14 April 2010 10:44:24 Tao Ma wrote:
>> Hi Dongyang,
>>
>> Tao Ma wrote:
>>> Li Dongyang wrote:
>>>> Hi, Tao
>>>>
>>>> On Monday 12 April 2010 13:16:43 Tao Ma wrote:
>>>>> Hi dong yang,
>>>>>
>>>>> Dong Yang Li wrote:
>>>>>> I still get a bug with this check and without my patch:
>>>>> yes, the check doesn't work actually in this case.
>>>>>
>>>>>> [16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug
>>>>>> expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
>>>>>> [16179.955157]
>>>>>> (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size =
>>>>>> 811008 != di i_size = 809011, i_flags = 0x1 the call trace is the
>>>>>> same.
>>>>>>
>>>>>>
>>>>>> the problem is this check in ocfs2_direct_IO_get_blocks just check if
>>>>>> we are going beyond the blocks right now, so if a direct write won't
>>>>>> play with new blocks but extending the i_size still get a pass, like
>>>>>> the error above said, di->i_size is 809011, using 198 blocks and the
>>>>>> direct write end up with i_size 811008, just same 198 blocks.
>>>>> yeah, you are right.
>>>> Thanks for the script,
>>>> and a stupid question: why we still try to call __generic_file_aio_write
>>>> and let it try direct write first in ocfs2_file_aio_write even we
>>>> decided we could not do the direct write?
> yes, I also concerned about the i_alloc_sem, that's why I asked the question above.
> and I think we can remove the check in ocfs2_direct_IO_get_blocks, as it does not work.
> and your suggestion sb->s_blocksize * (iblocks+contig_blocks)>inode->i_size will give -EIO
> to those good direct writes which are not going beyond i_size but also played with
> the last partial block. e.g. an inode allocated with 4 blocks and i_size is 3 * 4096 + 2000
> and we wanna do a direct io with pos=0 and length=3 * 4096 + 1000, as we are at block level in
> o_d_I_g_b().
No, I don't mean to return -EIO. My old thought is that we can just 
clear_buffer_mapped like a hole and return 0, and then let buffer_write 
to do it. Maybe I said it somehow misleading.

Regards,
Tao

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-13 23:54   ` Joel Becker
  2010-04-14  0:13     ` Tao Ma
@ 2010-04-14  5:58     ` Li Dongyang
  2010-04-14 19:20       ` Joel Becker
  1 sibling, 1 reply; 27+ messages in thread
From: Li Dongyang @ 2010-04-14  5:58 UTC (permalink / raw)
  To: ocfs2-devel

Hi, Joel
On Wednesday 14 April 2010 07:54:35 Joel Becker wrote:
> On Mon, Apr 12, 2010 at 01:16:43PM +0800, Tao Ma wrote:
> > Dong Yang Li wrote:
> > > I still get a bug with this check and without my patch:
> >
> > yes, the check doesn't work actually in this case.
> >
> > > [16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug expression:
> > > le64_to_cpu(fe->i_size) != i_size_read(inode) [16179.955157]
> > > (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size =
> > > 811008 != di i_size = 809011, i_flags = 0x1 the call trace is the same.
> > >
> > >
> > > the problem is this check in ocfs2_direct_IO_get_blocks just check if
> > > we are going beyond the blocks right now, so if a direct write won't
> > > play with new blocks but extending the i_size still get a pass, like
> > > the error above said, di->i_size is 809011, using 198 blocks and the
> > > direct write end up with i_size 811008, just same 198 blocks.
> >
> > yeah, you are right.
> 
> 	I think Sunil and I have found the real culprit.
> 	If a file is opened for O_DIRECT, and there are no holes,
> refcounts or anything, we are doing direct I/O.  ocfs2_file_aio_write()
> (o_f_a_w() from now on) locks things down like so:  lock(i_mutex),
> down_read(ip_alloc_sem), PR(rw_lock).  We have ip_alloc_sem preventing
> size changes on the local node and rw_lock preventing size changes on
> other nodes.  We call generic_file_direct_write() ourselves.
> 	If a file is not opened with O_DIRECT, we are doing regular
> buffered writes.  o_f_a_w() locks like so: lock(i_mutex),
> EX(rw_lock).  It is protecting against other nodes, but it does not
> touch ip_alloc_sem.  Why?  Because we call __generic_file_aio_write(),
> which will call ->write_begin().  ip_alloc_sem will be taken inside
> ->write_begin().  That's where we protect against other local processes.
> 	You may already see where I'm going with this.  If we are open
> with O_DIRECT, but we have to fall back to buffered, we will do this
> locking:  lock(i_mutex), down_read(ip_alloc_sem), PR(rw_lock),
> NL(rw_lock), up_read(ip_alloc_sem), EX(rw_lock).  That is, we start with
> the direct I/O locking, then back off and do the buffered locking.  But
> when we get into __g_f_a_w(), it will try the direct I/O again.  If the
> leading portion of the I/O is capable of direct I/O, it will go into
> direct mode *without ever taking ip_alloc_sem*.  Once it gets to the
> portion of the I/O that cannot be done direct, it will fall back to
> buffered for the rest of the I/O and will call ->write_begin() as
> expected.
> 	So this I/O that extends i_size to the end of the allocation
> will proceed as a direct I/O but will not have ip_alloc_sem.  Thus
> truncate (and any other allocation change) can race on the local
> machine.
> 	I think some form of Dong Yang's patch is going to be necessary.
> 
Thanks for the great explanation and analysis, but I only see we down write the
OCFS2_I(inode)->ip_alloc_sem in ->write_begin() and we are taking
inode->i_alloc_sem in o_f_a_w() when we try to do a direct write, not the ip_alloc_sem.
Am I missing something?

Br,
Li Dongyang
> Joel
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-14  2:44       ` Tao Ma
@ 2010-04-14  5:47         ` Li Dongyang
  2010-04-14  6:08           ` Tao Ma
  0 siblings, 1 reply; 27+ messages in thread
From: Li Dongyang @ 2010-04-14  5:47 UTC (permalink / raw)
  To: ocfs2-devel

Hi, Tao
On Wednesday 14 April 2010 10:44:24 Tao Ma wrote:
> Hi Dongyang,
> 
> Tao Ma wrote:
> > Li Dongyang wrote:
> >> Hi, Tao
> >>
> >> On Monday 12 April 2010 13:16:43 Tao Ma wrote:
> >>> Hi dong yang,
> >>>
> >>> Dong Yang Li wrote:
> >>>> I still get a bug with this check and without my patch:
> >>>
> >>> yes, the check doesn't work actually in this case.
> >>>
> >>>> [16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug
> >>>> expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
> >>>> [16179.955157]
> >>>> (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size =
> >>>> 811008 != di i_size = 809011, i_flags = 0x1 the call trace is the
> >>>> same.
> >>>>
> >>>>
> >>>> the problem is this check in ocfs2_direct_IO_get_blocks just check if
> >>>> we are going beyond the blocks right now, so if a direct write won't
> >>>> play with new blocks but extending the i_size still get a pass, like
> >>>> the error above said, di->i_size is 809011, using 198 blocks and the
> >>>> direct write end up with i_size 811008, just same 198 blocks.
> >>>
> >>> yeah, you are right.
> >>
> >> Thanks for the script,
> >> and a stupid question: why we still try to call __generic_file_aio_write
> >> and let it try direct write first in ocfs2_file_aio_write even we
> >> decided we could not do the direct write?
yes, I also concerned about the i_alloc_sem, that's why I asked the question above.
and I think we can remove the check in ocfs2_direct_IO_get_blocks, as it does not work.
and your suggestion sb->s_blocksize * (iblocks+contig_blocks)>inode->i_size will give -EIO
to those good direct writes which are not going beyond i_size but also played with
the last partial block. e.g. an inode allocated with 4 blocks and i_size is 3 * 4096 + 2000
and we wanna do a direct io with pos=0 and length=3 * 4096 + 1000, as we are at block level in
o_d_I_g_b().
in that case, we will fall back to buffered io and the i_alloc_sem have already
down read in ocfs2_file_aio_write(), I wonder if that will cause a problem?
> >>
> >>>> IMHO, we can add this check back and fix this check, or we don't try
> >>>> to do direct write if we decided we can't in ocfs2_file_aio_write,
> >>>> after calling ocfs2_prepare_inode_for_write as my patch said.
> >>>
> >>> I think we only need to check this condition in get_blocks. So would
> >>> you mind providing a patch? You old method is too aggressive actually.
> >>
> >> what about add this check in ocfs2_direct_IO? if we see we are extending
> >> just return 0. right now we only check if we are appending.
> >
> > As for the 2 questions, I just want to do buffered write as small as
> > possible since it has to lock inode, create pages and then sync pages
> > etc(you can check ocfs2_write_begin/end for details. ;) ). So say this
> > question, actually only the last block needed to be buffered ioed and
> > i_size get updated accordingly.
> >
> > I just checked ext4_direct_IO and actually it updated the disk size at
> > the end of direct_IO. So maybe we can work like that also.
> 
> sorry, I mislead you.
> Joel pointed out that except the problem my little script exposed, there
> is another problem about ip_alloc_sem locking. So we have to fall back
> to buffer write from the very beginning. I just saw that Joel has
> commented your original patch, so do please revise it.
working on that,
Br,
Li Dongyang
> 
> Regards,
> Tao
> 
> > Regards,
> > Tao
> >
> >>> btw, I have created a small test script which will expose this bug
> >>> easily. So you don't need to use the time-consuming fsstress test now.
> >>> Just use it to test your fix.
> >>>
> >>> echo 'y'|mkfs.ocfs2 --fs-features=local,noinline-data -b 4K -C 4K
> >>> $DEVICE 1000000
> >>> mount -t ocfs2 $DEVICE $MNT_DIR
> >>> echo "foo" > $MNT_DIR/foo
> >>> dd if=/dev/zero of=$MNT_DIR/foo bs=4K count=1 conv=notrunc oflag=direct
> >>> echo "foo" > $MNT_DIR/foo
> >>> # The kernel should panic here.
> >>>
> >>> Regards,
> >>> Tao
> >>>
> >>>> Comments? ;-)
> >>>>
> >>>>
> >>>> Br,
> >>>> Li Dongyang
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-12  6:24     ` Tao Ma
@ 2010-04-14  2:44       ` Tao Ma
  2010-04-14  5:47         ` Li Dongyang
  0 siblings, 1 reply; 27+ messages in thread
From: Tao Ma @ 2010-04-14  2:44 UTC (permalink / raw)
  To: ocfs2-devel

Hi Dongyang,

Tao Ma wrote:
> 
> Li Dongyang wrote:
>> Hi, Tao
>> On Monday 12 April 2010 13:16:43 Tao Ma wrote:
>>> Hi dong yang,
>>>
>>> Dong Yang Li wrote:
>>>> I still get a bug with this check and without my patch:
>>> yes, the check doesn't work actually in this case.
>>>
>>>> [16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug expression:
>>>> le64_to_cpu(fe->i_size) != i_size_read(inode) [16179.955157]
>>>> (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size =
>>>> 811008 != di i_size = 809011, i_flags = 0x1 the call trace is the same.
>>>>
>>>>
>>>> the problem is this check in ocfs2_direct_IO_get_blocks just check if we
>>>> are going beyond the blocks right now, so if a direct write won't play
>>>> with new blocks but extending the i_size still get a pass, like the error
>>>> above said, di->i_size is 809011, using 198 blocks and the direct write
>>>> end up with i_size 811008, just same 198 blocks.
>>> yeah, you are right.
>>>
>> Thanks for the script,
>> and a stupid question: why we still try to call __generic_file_aio_write and 
>> let it try direct write first in ocfs2_file_aio_write even we decided we could 
>> not do the direct write?
>>>> IMHO, we can add this check back and fix this check, or we don't try to
>>>> do direct write if we decided we can't in ocfs2_file_aio_write, after
>>>> calling ocfs2_prepare_inode_for_write as my patch said.
>>> I think we only need to check this condition in get_blocks. So would you
>>> mind providing a patch? You old method is too aggressive actually.
>>>
>> what about add this check in ocfs2_direct_IO? if we see we are extending just 
>> return 0. right now we only check if we are appending.
> As for the 2 questions, I just want to do buffered write as small as 
> possible since it has to lock inode, create pages and then sync pages 
> etc(you can check ocfs2_write_begin/end for details. ;) ). So say this 
> question, actually only the last block needed to be buffered ioed and 
> i_size get updated accordingly.
> 
> I just checked ext4_direct_IO and actually it updated the disk size at 
> the end of direct_IO. So maybe we can work like that also.
sorry, I mislead you.
Joel pointed out that except the problem my little script exposed, there 
is another problem about ip_alloc_sem locking. So we have to fall back 
to buffer write from the very beginning. I just saw that Joel has 
commented your original patch, so do please revise it.

Regards,
Tao
> 
> Regards,
> Tao
>>> btw, I have created a small test script which will expose this bug
>>> easily. So you don't need to use the time-consuming fsstress test now.
>>> Just use it to test your fix.
>>>
>>> echo 'y'|mkfs.ocfs2 --fs-features=local,noinline-data -b 4K -C 4K
>>> $DEVICE 1000000
>>> mount -t ocfs2 $DEVICE $MNT_DIR
>>> echo "foo" > $MNT_DIR/foo
>>> dd if=/dev/zero of=$MNT_DIR/foo bs=4K count=1 conv=notrunc oflag=direct
>>> echo "foo" > $MNT_DIR/foo
>>> # The kernel should panic here.
>>>
>>> Regards,
>>> Tao
>>>
>>>> Comments? ;-)
>>>>
>>>>
>>>> Br,
>>>> Li Dongyang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-13 23:54   ` Joel Becker
@ 2010-04-14  0:13     ` Tao Ma
  2010-04-14  5:58     ` Li Dongyang
  1 sibling, 0 replies; 27+ messages in thread
From: Tao Ma @ 2010-04-14  0:13 UTC (permalink / raw)
  To: ocfs2-devel

Joel Becker wrote:
> On Mon, Apr 12, 2010 at 01:16:43PM +0800, Tao Ma wrote:
>   
>> Dong Yang Li wrote:
>>     
>>> I still get a bug with this check and without my patch:
>>>       
>> yes, the check doesn't work actually in this case.
>>     
>>> [16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
>>> [16179.955157] (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size = 811008 != di i_size = 809011, i_flags = 0x1
>>> the call trace is the same.
>>>
>>>
>>> the problem is this check in ocfs2_direct_IO_get_blocks just check if we are going beyond the blocks right now,
>>> so if a direct write won't play with new blocks but extending the i_size still get a pass, like the error above said, di->i_size is 809011, using 198 blocks and the direct write end up with i_size 811008, just same 198 blocks.
>>>       
>> yeah, you are right.
>>     
>
> 	I think Sunil and I have found the real culprit.
> 	If a file is opened for O_DIRECT, and there are no holes,
> refcounts or anything, we are doing direct I/O.  ocfs2_file_aio_write()
> (o_f_a_w() from now on) locks things down like so:  lock(i_mutex),
> down_read(ip_alloc_sem), PR(rw_lock).  We have ip_alloc_sem preventing
> size changes on the local node and rw_lock preventing size changes on
> other nodes.  We call generic_file_direct_write() ourselves.
> 	If a file is not opened with O_DIRECT, we are doing regular
> buffered writes.  o_f_a_w() locks like so: lock(i_mutex),
> EX(rw_lock).  It is protecting against other nodes, but it does not
> touch ip_alloc_sem.  Why?  Because we call __generic_file_aio_write(),
> which will call ->write_begin().  ip_alloc_sem will be taken inside
> ->write_begin().  That's where we protect against other local processes.  
> 	You may already see where I'm going with this.  If we are open
> with O_DIRECT, but we have to fall back to buffered, we will do this
> locking:  lock(i_mutex), down_read(ip_alloc_sem), PR(rw_lock),
> NL(rw_lock), up_read(ip_alloc_sem), EX(rw_lock).  That is, we start with
> the direct I/O locking, then back off and do the buffered locking.  But
> when we get into __g_f_a_w(), it will try the direct I/O again.  If the
> leading portion of the I/O is capable of direct I/O, it will go into
> direct mode *without ever taking ip_alloc_sem*.  Once it gets to the
> portion of the I/O that cannot be done direct, it will fall back to
> buffered for the rest of the I/O and will call ->write_begin() as
> expected.
> 	So this I/O that extends i_size to the end of the allocation
> will proceed as a direct I/O but will not have ip_alloc_sem.  Thus
> truncate (and any other allocation change) can race on the local
> machine.
> 	I think some form of Dong Yang's patch is going to be necessary.
>   
oh, yes, your analysis make sense.
But that doesn't prove that my get_block suggestion doesn't work in this 
case.
If we can find this situation in ocfs2_direct_IO_get_blocks and 
clear_buffer_mapped. It should fall
back to buffer_write for the last block and update i_size properly.
Actually, the check should be easy.
sb->s_blocksize * (iblocks+contig_blocks)>inode->i_size.

In this way, we should have to fall to buffer write only necessarily.

Regards,
Tao

Regards,
Tao
> Joel
>
>   

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-12  5:16 ` Tao Ma
  2010-04-12  5:31   ` Li Dongyang
@ 2010-04-13 23:54   ` Joel Becker
  2010-04-14  0:13     ` Tao Ma
  2010-04-14  5:58     ` Li Dongyang
  1 sibling, 2 replies; 27+ messages in thread
From: Joel Becker @ 2010-04-13 23:54 UTC (permalink / raw)
  To: ocfs2-devel

On Mon, Apr 12, 2010 at 01:16:43PM +0800, Tao Ma wrote:
> Dong Yang Li wrote:
> > I still get a bug with this check and without my patch:
> yes, the check doesn't work actually in this case.
> > 
> > 
> > [16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
> > [16179.955157] (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size = 811008 != di i_size = 809011, i_flags = 0x1
> > the call trace is the same.
> > 
> > 
> > the problem is this check in ocfs2_direct_IO_get_blocks just check if we are going beyond the blocks right now,
> > so if a direct write won't play with new blocks but extending the i_size still get a pass, like the error above said, di->i_size is 809011, using 198 blocks and the direct write end up with i_size 811008, just same 198 blocks.
> yeah, you are right.

	I think Sunil and I have found the real culprit.
	If a file is opened for O_DIRECT, and there are no holes,
refcounts or anything, we are doing direct I/O.  ocfs2_file_aio_write()
(o_f_a_w() from now on) locks things down like so:  lock(i_mutex),
down_read(ip_alloc_sem), PR(rw_lock).  We have ip_alloc_sem preventing
size changes on the local node and rw_lock preventing size changes on
other nodes.  We call generic_file_direct_write() ourselves.
	If a file is not opened with O_DIRECT, we are doing regular
buffered writes.  o_f_a_w() locks like so: lock(i_mutex),
EX(rw_lock).  It is protecting against other nodes, but it does not
touch ip_alloc_sem.  Why?  Because we call __generic_file_aio_write(),
which will call ->write_begin().  ip_alloc_sem will be taken inside
->write_begin().  That's where we protect against other local processes.  
	You may already see where I'm going with this.  If we are open
with O_DIRECT, but we have to fall back to buffered, we will do this
locking:  lock(i_mutex), down_read(ip_alloc_sem), PR(rw_lock),
NL(rw_lock), up_read(ip_alloc_sem), EX(rw_lock).  That is, we start with
the direct I/O locking, then back off and do the buffered locking.  But
when we get into __g_f_a_w(), it will try the direct I/O again.  If the
leading portion of the I/O is capable of direct I/O, it will go into
direct mode *without ever taking ip_alloc_sem*.  Once it gets to the
portion of the I/O that cannot be done direct, it will fall back to
buffered for the rest of the I/O and will call ->write_begin() as
expected.
	So this I/O that extends i_size to the end of the allocation
will proceed as a direct I/O but will not have ip_alloc_sem.  Thus
truncate (and any other allocation change) can race on the local
machine.
	I think some form of Dong Yang's patch is going to be necessary.

Joel

-- 

Life's Little Instruction Book #306

	"Take a nap on Sunday afternoons."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-12  5:31   ` Li Dongyang
@ 2010-04-12  6:24     ` Tao Ma
  2010-04-14  2:44       ` Tao Ma
  0 siblings, 1 reply; 27+ messages in thread
From: Tao Ma @ 2010-04-12  6:24 UTC (permalink / raw)
  To: ocfs2-devel



Li Dongyang wrote:
> Hi, Tao
> On Monday 12 April 2010 13:16:43 Tao Ma wrote:
>> Hi dong yang,
>>
>> Dong Yang Li wrote:
>>> I still get a bug with this check and without my patch:
>> yes, the check doesn't work actually in this case.
>>
>>> [16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug expression:
>>> le64_to_cpu(fe->i_size) != i_size_read(inode) [16179.955157]
>>> (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size =
>>> 811008 != di i_size = 809011, i_flags = 0x1 the call trace is the same.
>>>
>>>
>>> the problem is this check in ocfs2_direct_IO_get_blocks just check if we
>>> are going beyond the blocks right now, so if a direct write won't play
>>> with new blocks but extending the i_size still get a pass, like the error
>>> above said, di->i_size is 809011, using 198 blocks and the direct write
>>> end up with i_size 811008, just same 198 blocks.
>> yeah, you are right.
>>
> Thanks for the script,
> and a stupid question: why we still try to call __generic_file_aio_write and 
> let it try direct write first in ocfs2_file_aio_write even we decided we could 
> not do the direct write?
>>> IMHO, we can add this check back and fix this check, or we don't try to
>>> do direct write if we decided we can't in ocfs2_file_aio_write, after
>>> calling ocfs2_prepare_inode_for_write as my patch said.
>> I think we only need to check this condition in get_blocks. So would you
>> mind providing a patch? You old method is too aggressive actually.
>>
> what about add this check in ocfs2_direct_IO? if we see we are extending just 
> return 0. right now we only check if we are appending.
As for the 2 questions, I just want to do buffered write as small as 
possible since it has to lock inode, create pages and then sync pages 
etc(you can check ocfs2_write_begin/end for details. ;) ). So say this 
question, actually only the last block needed to be buffered ioed and 
i_size get updated accordingly.

I just checked ext4_direct_IO and actually it updated the disk size at 
the end of direct_IO. So maybe we can work like that also.

Regards,
Tao
>> btw, I have created a small test script which will expose this bug
>> easily. So you don't need to use the time-consuming fsstress test now.
>> Just use it to test your fix.
>>
>> echo 'y'|mkfs.ocfs2 --fs-features=local,noinline-data -b 4K -C 4K
>> $DEVICE 1000000
>> mount -t ocfs2 $DEVICE $MNT_DIR
>> echo "foo" > $MNT_DIR/foo
>> dd if=/dev/zero of=$MNT_DIR/foo bs=4K count=1 conv=notrunc oflag=direct
>> echo "foo" > $MNT_DIR/foo
>> # The kernel should panic here.
>>
>> Regards,
>> Tao
>>
>>> Comments? ;-)
>>>
>>>
>>> Br,
>>> Li Dongyang
>>>
>>>>>> Sunil Mushran  04/10/10 1:42 AM >>>
>>>   Li Dongyang wrote:
>>>> On Friday 09 April 2010 11:32:10 Tao Ma wrote:
>>>>> Hi Dongyang,
>>>>>
>>>>> Li Dongyang wrote:
>>>>>> Hi, Tao,
>>>>>>
>>>>>> On Friday 09 April 2010 10:38:33 Tao Ma wrote:
>>>>>>> Hi Dongyang,
>>>>>>>
>>>>>>> Li Dongyang wrote:
>>>>>>>> This is because ocfs2_file_aio_write calls
>>>>>>>> ocfs2_prepare_inode_for_write which sets direct_io to 0 if it finds
>>>>>>>> out that direct IO would extend the file. But later we call
>>>>>>>> __generic_file_aio_write which end's up calling
>>>>>>>> generic_file_direct_write because the file has O_DIRECT flag.So
>>>>>>>> every time we do a direct write extending the file, the
>>>>>>>> inode->i_size gets inconsistent with the i_size on disk because we
>>>>>>>> call
>>>>>>>> generic_file_direct_write, and if we do a truncate after this, we
>>>>>>>> will meet a bug in ocfs2_truncate_file.
>>>>>>> yes we have O_DIRECT flag set and in __generic_file_aio_write it will
>>>>>>> call generic_file_direct_write first and then trigger to
>>>>>>> ocfs2_direct_IO. In this function we will check again and return 0.
>>>>>>> And _generic_file_aio_write will fall back to buffered write if the
>>>>>>> directIO can't write. Am I wrong somehow?
>>>>>> yes ocfs2_direct_IO has some check, but it just check if we are
>>>>>> appending(the i_size <= offset), if the offset < i_size and offset +
>>>>>> count > i_size, it will do direct io anyway. seems we also can fix
>>>>>> this by adding a check to ocfs2_direct_IO.
>>>>> It is done by ocfs2_direct_IO_get_blocks. Just debug the kernel and you
>>>>> will get what I mean. ;)
>>>> Do you mean this section in ocfs2_direct_IO_get_blocks:?
>>>> /*
>>>>  * Any write past EOF is not allowed because we'd be extending.
>>>>  */
>>>> if (create && (iblock + max_blocks) > inode_blocks) {
>>>>     ret = -EIO;
>>>>     goto bail;
>>>> }
>>>>
>>>> I was using the linus tree
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>>>> and we don't have that check, but I can find this in the
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git,
>>>> introduced by commit 564f8a3228879d6962edb3432d01bcd7499a67ec
>>>>
>>>> and now with this check I got what you mean, you are right, but I wonder
>>>> why the linus tree doesn't have this check? and are we suppose to do
>>>> with this? IMHO we can just push this commit to linus tree.
>>> commit 5fe878ae7f82fbf0830dbfaee4c5ca18f3aee442
>>> Author: Christoph Hellwig
>>> Date:   Tue Dec 15 16:47:50 2009 -0800
>>>
>>>     direct-io: cleanup blockdev_direct_IO locking
>>>
>>> This check was removed recently by the above patch.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-12  5:16 ` Tao Ma
@ 2010-04-12  5:31   ` Li Dongyang
  2010-04-12  6:24     ` Tao Ma
  2010-04-13 23:54   ` Joel Becker
  1 sibling, 1 reply; 27+ messages in thread
From: Li Dongyang @ 2010-04-12  5:31 UTC (permalink / raw)
  To: ocfs2-devel

Hi, Tao
On Monday 12 April 2010 13:16:43 Tao Ma wrote:
> Hi dong yang,
> 
> Dong Yang Li wrote:
> > I still get a bug with this check and without my patch:
> 
> yes, the check doesn't work actually in this case.
> 
> > [16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug expression:
> > le64_to_cpu(fe->i_size) != i_size_read(inode) [16179.955157]
> > (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size =
> > 811008 != di i_size = 809011, i_flags = 0x1 the call trace is the same.
> >
> >
> > the problem is this check in ocfs2_direct_IO_get_blocks just check if we
> > are going beyond the blocks right now, so if a direct write won't play
> > with new blocks but extending the i_size still get a pass, like the error
> > above said, di->i_size is 809011, using 198 blocks and the direct write
> > end up with i_size 811008, just same 198 blocks.
> 
> yeah, you are right.
> 
Thanks for the script,
and a stupid question: why we still try to call __generic_file_aio_write and 
let it try direct write first in ocfs2_file_aio_write even we decided we could 
not do the direct write?
> > IMHO, we can add this check back and fix this check, or we don't try to
> > do direct write if we decided we can't in ocfs2_file_aio_write, after
> > calling ocfs2_prepare_inode_for_write as my patch said.
> 
> I think we only need to check this condition in get_blocks. So would you
> mind providing a patch? You old method is too aggressive actually.
> 
what about add this check in ocfs2_direct_IO? if we see we are extending just 
return 0. right now we only check if we are appending.
> btw, I have created a small test script which will expose this bug
> easily. So you don't need to use the time-consuming fsstress test now.
> Just use it to test your fix.
> 
> echo 'y'|mkfs.ocfs2 --fs-features=local,noinline-data -b 4K -C 4K
> $DEVICE 1000000
> mount -t ocfs2 $DEVICE $MNT_DIR
> echo "foo" > $MNT_DIR/foo
> dd if=/dev/zero of=$MNT_DIR/foo bs=4K count=1 conv=notrunc oflag=direct
> echo "foo" > $MNT_DIR/foo
> # The kernel should panic here.
> 
> Regards,
> Tao
> 
> > Comments? ;-)
> >
> >
> > Br,
> > Li Dongyang
> >
> >>>> Sunil Mushran  04/10/10 1:42 AM >>>
> >
> >   Li Dongyang wrote:
> >> On Friday 09 April 2010 11:32:10 Tao Ma wrote:
> >>> Hi Dongyang,
> >>>
> >>> Li Dongyang wrote:
> >>>> Hi, Tao,
> >>>>
> >>>> On Friday 09 April 2010 10:38:33 Tao Ma wrote:
> >>>>> Hi Dongyang,
> >>>>>
> >>>>> Li Dongyang wrote:
> >>>>>> This is because ocfs2_file_aio_write calls
> >>>>>> ocfs2_prepare_inode_for_write which sets direct_io to 0 if it finds
> >>>>>> out that direct IO would extend the file. But later we call
> >>>>>> __generic_file_aio_write which end's up calling
> >>>>>> generic_file_direct_write because the file has O_DIRECT flag.So
> >>>>>> every time we do a direct write extending the file, the
> >>>>>> inode->i_size gets inconsistent with the i_size on disk because we
> >>>>>> call
> >>>>>> generic_file_direct_write, and if we do a truncate after this, we
> >>>>>> will meet a bug in ocfs2_truncate_file.
> >>>>>
> >>>>> yes we have O_DIRECT flag set and in __generic_file_aio_write it will
> >>>>> call generic_file_direct_write first and then trigger to
> >>>>> ocfs2_direct_IO. In this function we will check again and return 0.
> >>>>> And _generic_file_aio_write will fall back to buffered write if the
> >>>>> directIO can't write. Am I wrong somehow?
> >>>>
> >>>> yes ocfs2_direct_IO has some check, but it just check if we are
> >>>> appending(the i_size <= offset), if the offset < i_size and offset +
> >>>> count > i_size, it will do direct io anyway. seems we also can fix
> >>>> this by adding a check to ocfs2_direct_IO.
> >>>
> >>> It is done by ocfs2_direct_IO_get_blocks. Just debug the kernel and you
> >>> will get what I mean. ;)
> >>
> >> Do you mean this section in ocfs2_direct_IO_get_blocks:?
> >> /*
> >>  * Any write past EOF is not allowed because we'd be extending.
> >>  */
> >> if (create && (iblock + max_blocks) > inode_blocks) {
> >>     ret = -EIO;
> >>     goto bail;
> >> }
> >>
> >> I was using the linus tree
> >> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> >> and we don't have that check, but I can find this in the
> >> git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git,
> >> introduced by commit 564f8a3228879d6962edb3432d01bcd7499a67ec
> >>
> >> and now with this check I got what you mean, you are right, but I wonder
> >> why the linus tree doesn't have this check? and are we suppose to do
> >> with this? IMHO we can just push this commit to linus tree.
> >
> > commit 5fe878ae7f82fbf0830dbfaee4c5ca18f3aee442
> > Author: Christoph Hellwig
> > Date:   Tue Dec 15 16:47:50 2009 -0800
> >
> >     direct-io: cleanup blockdev_direct_IO locking
> >
> > This check was removed recently by the above patch.
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-10  7:37 Dong Yang Li
  2010-04-10  9:37 ` Joel Becker
@ 2010-04-12  5:16 ` Tao Ma
  2010-04-12  5:31   ` Li Dongyang
  2010-04-13 23:54   ` Joel Becker
  1 sibling, 2 replies; 27+ messages in thread
From: Tao Ma @ 2010-04-12  5:16 UTC (permalink / raw)
  To: ocfs2-devel

Hi dong yang,

Dong Yang Li wrote:
> I still get a bug with this check and without my patch:
yes, the check doesn't work actually in this case.
> 
> 
> [16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
> [16179.955157] (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size = 811008 != di i_size = 809011, i_flags = 0x1
> the call trace is the same.
> 
> 
> the problem is this check in ocfs2_direct_IO_get_blocks just check if we are going beyond the blocks right now,
> so if a direct write won't play with new blocks but extending the i_size still get a pass, like the error above said, di->i_size is 809011, using 198 blocks and the direct write end up with i_size 811008, just same 198 blocks.
yeah, you are right.
> 
> 
> IMHO, we can add this check back and fix this check, or we don't try to do direct write if we decided we can't in ocfs2_file_aio_write, after calling ocfs2_prepare_inode_for_write as my patch said.
I think we only need to check this condition in get_blocks. So would you 
mind providing a patch? You old method is too aggressive actually.

btw, I have created a small test script which will expose this bug 
easily. So you don't need to use the time-consuming fsstress test now. 
Just use it to test your fix.

echo 'y'|mkfs.ocfs2 --fs-features=local,noinline-data -b 4K -C 4K 
$DEVICE 1000000
mount -t ocfs2 $DEVICE $MNT_DIR
echo "foo" > $MNT_DIR/foo
dd if=/dev/zero of=$MNT_DIR/foo bs=4K count=1 conv=notrunc oflag=direct
echo "foo" > $MNT_DIR/foo
# The kernel should panic here.

Regards,
Tao
> 
> 
> Comments? ;-)
> 
> 
> Br,
> Li Dongyang
>>>> Sunil Mushran  04/10/10 1:42 AM >>>
>   Li Dongyang wrote:
>> On Friday 09 April 2010 11:32:10 Tao Ma wrote:
>>> Hi Dongyang,
>>>
>>> Li Dongyang wrote:
>>>> Hi, Tao,
>>>>
>>>> On Friday 09 April 2010 10:38:33 Tao Ma wrote:
>>>>> Hi Dongyang,
>>>>>
>>>>> Li Dongyang wrote:
>>>>>> This is because ocfs2_file_aio_write calls
>>>>>> ocfs2_prepare_inode_for_write which sets direct_io to 0 if it finds out
>>>>>> that direct IO would extend the file. But later we call
>>>>>> __generic_file_aio_write which end's up calling
>>>>>> generic_file_direct_write because the file has O_DIRECT flag.So every
>>>>>> time we do a direct write extending the file, the inode->i_size gets
>>>>>> inconsistent with the i_size on disk because we call
>>>>>> generic_file_direct_write, and if we do a truncate after this, we will
>>>>>> meet a bug in ocfs2_truncate_file.
>>>>> yes we have O_DIRECT flag set and in __generic_file_aio_write it will
>>>>> call generic_file_direct_write first and then trigger to
>>>>> ocfs2_direct_IO. In this function we will check again and return 0. And
>>>>> _generic_file_aio_write will fall back to buffered write if the directIO
>>>>> can't write. Am I wrong somehow?
>>>> yes ocfs2_direct_IO has some check, but it just check if we are
>>>> appending(the i_size <= offset), if the offset < i_size and offset +
>>>> count > i_size, it will do direct io anyway. seems we also can fix this
>>>> by adding a check to ocfs2_direct_IO.
>>> It is done by ocfs2_direct_IO_get_blocks. Just debug the kernel and you
>>> will get what I mean. ;)
>> Do you mean this section in ocfs2_direct_IO_get_blocks:?
>> /*
>>  * Any write past EOF is not allowed because we'd be extending.
>>  */
>> if (create && (iblock + max_blocks) > inode_blocks) {
>>     ret = -EIO;
>>     goto bail;
>> }
>>
>> I was using the linus tree 
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>> and we don't have that check, but I can find this in the 
>> git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git, introduced by 
>> commit 564f8a3228879d6962edb3432d01bcd7499a67ec
>>
>> and now with this check I got what you mean, you are right, but I wonder why 
>> the linus tree doesn't have this check? and are we suppose to do with this?
>> IMHO we can just push this commit to linus tree.
> 
> commit 5fe878ae7f82fbf0830dbfaee4c5ca18f3aee442
> Author: Christoph Hellwig 
> Date:   Tue Dec 15 16:47:50 2009 -0800
> 
>     direct-io: cleanup blockdev_direct_IO locking
> 
> This check was removed recently by the above patch.
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-10  9:37 ` Joel Becker
@ 2010-04-10  9:48   ` Li Dongyang
  0 siblings, 0 replies; 27+ messages in thread
From: Li Dongyang @ 2010-04-10  9:48 UTC (permalink / raw)
  To: ocfs2-devel

On Saturday 10 April 2010 17:37:18 Joel Becker wrote:
> On Sat, Apr 10, 2010 at 01:37:58AM -0600, Dong Yang Li wrote:
> > I still get a bug with this check and without my patch:
> >
> >
> > [16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug expression:
> > le64_to_cpu(fe->i_size) != i_size_read(inode) [16179.955157]
> > (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size =
> > 811008 != di i_size = 809011, i_flags = 0x1 the call trace is the same.
> 
> 	What's the testcase?
> 
> Joel
> 
Just fire up "fsstress -d <ocfs2 mount point> -l 1000 -p 1000 -n 1000", 
fsstress is a tool from ltp package.
JJ first found it on a 3-nodes cluster with user stack, I also can reproduce 
it on a 2-nodes cluster with o2cb.

Li Dongyang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered
  2010-04-10  7:37 Dong Yang Li
@ 2010-04-10  9:37 ` Joel Becker
  2010-04-10  9:48   ` Li Dongyang
  2010-04-12  5:16 ` Tao Ma
  1 sibling, 1 reply; 27+ messages in thread
From: Joel Becker @ 2010-04-10  9:37 UTC (permalink / raw)
  To: ocfs2-devel

On Sat, Apr 10, 2010 at 01:37:58AM -0600, Dong Yang Li wrote:
> I still get a bug with this check and without my patch:
> 
> 
> [16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
> [16179.955157] (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size = 811008 != di i_size = 809011, i_flags = 0x1
> the call trace is the same.

	What's the testcase?

Joel

-- 

	f/8 and be there.

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall  back to buffered
@ 2010-04-10  7:37 Dong Yang Li
  2010-04-10  9:37 ` Joel Becker
  2010-04-12  5:16 ` Tao Ma
  0 siblings, 2 replies; 27+ messages in thread
From: Dong Yang Li @ 2010-04-10  7:37 UTC (permalink / raw)
  To: ocfs2-devel

I still get a bug with this check and without my patch:


[16179.955148] (13400,1):ocfs2_truncate_file:465 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
[16179.955157] (13400,1):ocfs2_truncate_file:465 ERROR: Inode 254789, inode i_size = 811008 != di i_size = 809011, i_flags = 0x1
the call trace is the same.


the problem is this check in ocfs2_direct_IO_get_blocks just check if we are going beyond the blocks right now,
so if a direct write won't play with new blocks but extending the i_size still get a pass, like the error above said, di->i_size is 809011, using 198 blocks and the direct write end up with i_size 811008, just same 198 blocks.


IMHO, we can add this check back and fix this check, or we don't try to do direct write if we decided we can't in ocfs2_file_aio_write, after calling ocfs2_prepare_inode_for_write as my patch said.


Comments? ;-)


Br,
Li Dongyang
>>> Sunil Mushran  04/10/10 1:42 AM >>>
  Li Dongyang wrote:
> On Friday 09 April 2010 11:32:10 Tao Ma wrote:
>> Hi Dongyang,
>>
>> Li Dongyang wrote:
>>> Hi, Tao,
>>>
>>> On Friday 09 April 2010 10:38:33 Tao Ma wrote:
>>>> Hi Dongyang,
>>>>
>>>> Li Dongyang wrote:
>>>>> This is because ocfs2_file_aio_write calls
>>>>> ocfs2_prepare_inode_for_write which sets direct_io to 0 if it finds out
>>>>> that direct IO would extend the file. But later we call
>>>>> __generic_file_aio_write which end's up calling
>>>>> generic_file_direct_write because the file has O_DIRECT flag.So every
>>>>> time we do a direct write extending the file, the inode->i_size gets
>>>>> inconsistent with the i_size on disk because we call
>>>>> generic_file_direct_write, and if we do a truncate after this, we will
>>>>> meet a bug in ocfs2_truncate_file.
>>>> yes we have O_DIRECT flag set and in __generic_file_aio_write it will
>>>> call generic_file_direct_write first and then trigger to
>>>> ocfs2_direct_IO. In this function we will check again and return 0. And
>>>> _generic_file_aio_write will fall back to buffered write if the directIO
>>>> can't write. Am I wrong somehow?
>>> yes ocfs2_direct_IO has some check, but it just check if we are
>>> appending(the i_size <= offset), if the offset < i_size and offset +
>>> count > i_size, it will do direct io anyway. seems we also can fix this
>>> by adding a check to ocfs2_direct_IO.
>> It is done by ocfs2_direct_IO_get_blocks. Just debug the kernel and you
>> will get what I mean. ;)
> Do you mean this section in ocfs2_direct_IO_get_blocks:?
> /*
>  * Any write past EOF is not allowed because we'd be extending.
>  */
> if (create && (iblock + max_blocks) > inode_blocks) {
>     ret = -EIO;
>     goto bail;
> }
>
> I was using the linus tree 
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> and we don't have that check, but I can find this in the 
> git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git, introduced by 
> commit 564f8a3228879d6962edb3432d01bcd7499a67ec
>
> and now with this check I got what you mean, you are right, but I wonder why 
> the linus tree doesn't have this check? and are we suppose to do with this?
> IMHO we can just push this commit to linus tree.

commit 5fe878ae7f82fbf0830dbfaee4c5ca18f3aee442
Author: Christoph Hellwig 
Date:   Tue Dec 15 16:47:50 2009 -0800

    direct-io: cleanup blockdev_direct_IO locking

This check was removed recently by the above patch.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2010-04-23 20:06 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-08  7:47 [Ocfs2-devel] [PATCH] ocfs2: avoid direct write if we fall back to buffered Li Dongyang
2010-04-08 18:41 ` Sunil Mushran
2010-04-09  2:27   ` Li Dongyang
2010-04-09  2:38     ` Tao Ma
2010-04-09  3:00       ` Li Dongyang
2010-04-09  3:32         ` Tao Ma
2010-04-09  9:20           ` Li Dongyang
2010-04-09 17:36             ` Sunil Mushran
2010-04-09  7:58   ` Coly Li
2010-04-09  7:56     ` Tao Ma
2010-04-14  1:58 ` Joel Becker
2010-04-14  7:42   ` Li Dongyang
2010-04-10  7:37 Dong Yang Li
2010-04-10  9:37 ` Joel Becker
2010-04-10  9:48   ` Li Dongyang
2010-04-12  5:16 ` Tao Ma
2010-04-12  5:31   ` Li Dongyang
2010-04-12  6:24     ` Tao Ma
2010-04-14  2:44       ` Tao Ma
2010-04-14  5:47         ` Li Dongyang
2010-04-14  6:08           ` Tao Ma
2010-04-13 23:54   ` Joel Becker
2010-04-14  0:13     ` Tao Ma
2010-04-14  5:58     ` Li Dongyang
2010-04-14 19:20       ` Joel Becker
2010-04-22 14:13         ` Li Dongyang
2010-04-23 20:06           ` Joel Becker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.