* Tracking actual disk write sources instead of flush thread
@ 2014-04-16  2:01 Phillip Susi
  2014-04-16 14:01 ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: Phillip Susi @ 2014-04-16  2:01 UTC (permalink / raw)
  To: linux-fsdevel

A lot of disk writes, especially when they are small individual files
being written by several different processes, are hidden behind the
flush thread.  Is there no way to properly track the process actually
responsible for the IO, even when it is the flush thread that
initiates the writeout?


* Re: Tracking actual disk write sources instead of flush thread
  2014-04-16  2:01 Tracking actual disk write sources instead of flush thread Phillip Susi
@ 2014-04-16 14:01 ` Matthew Wilcox
  2014-04-16 15:15   ` Phillip Susi
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2014-04-16 14:01 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-fsdevel

On Tue, Apr 15, 2014 at 10:01:27PM -0400, Phillip Susi wrote:
> A lot of disk writes, especially when they are small individual files
> being written by several different processes, are hidden behind the
> flush thread.  Is there no way to properly track the process actually
> responsible for the IO, even when it is the flush thread that
> initiates the writeout?

Correct.


* Re: Tracking actual disk write sources instead of flush thread
  2014-04-16 14:01 ` Matthew Wilcox
@ 2014-04-16 15:15   ` Phillip Susi
  2014-04-16 16:42     ` Andreas Dilger
                       ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Phillip Susi @ 2014-04-16 15:15 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel

On 4/16/2014 10:01 AM, Matthew Wilcox wrote:
> On Tue, Apr 15, 2014 at 10:01:27PM -0400, Phillip Susi wrote:
>> A lot of disk writes, especially when they are small individual
>> files being written by several different processes, are hidden
>> behind the flush thread.  Is there no way to properly track the
>> process actually responsible for the IO, even when it is the
>> flush thread that initiates the writeout?
> 
> Correct.

Wow.  If I understand things correctly, this also means that if
process A dirties a ton of cache pages, then process B tries to write
a relatively small amount, it can end up blocking in the synchronous
flush path, and so it will appear that process B and flush are doing
all of the writes, and not process A.

That seems like a severe defect.  How can such a defect be tolerated
in this day and age?  Why does the io accounting not track how many
pages the process dirties rather than how many it actually initiates
the writeout for?



* Re: Tracking actual disk write sources instead of flush thread
  2014-04-16 15:15   ` Phillip Susi
@ 2014-04-16 16:42     ` Andreas Dilger
  2014-04-16 17:44     ` Zuckerman, Boris
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Andreas Dilger @ 2014-04-16 16:42 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Matthew Wilcox, linux-fsdevel

On Apr 16, 2014, at 9:15 AM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 4/16/2014 10:01 AM, Matthew Wilcox wrote:
> > On Tue, Apr 15, 2014 at 10:01:27PM -0400, Phillip Susi wrote:
> >> A lot of disk writes, especially when they are small individual
> >> files being written by several different processes, are hidden
> >> behind the flush thread.  Is there no way to properly track the
> >> process actually responsible for the IO, even when it is the
> >> flush thread that initiates the writeout?
> >
> > Correct.
> 
> Wow.  If I understand things correctly, this also means that if
> process A dirties a ton of cache pages, then process B tries to write
> a relatively small amount, it can end up blocking in the synchronous
> flush path, and so it will appear that process B and flush are doing
> all of the writes, and not process A.
> 
> That seems like a severe defect.  How can such a defect be tolerated
> in this day and age?  Why does the io accounting not track how many
> pages the process dirties rather than how many it actually initiates
> the writeout for?

For Lustre (which has the added difficulty that the thread doing the
actual writeout is on a remote server) we track the last process that
dirtied the inode in the Lustre-private part of the inode itself, and
account the writes against that process.  By default, we store the
process name + UID in a 32-byte string in the inode, but this can also
be changed to store a cluster-unique job identifier (jobid).

For us, recording the PID isn't very useful, since the PID will be
different on each node in the cluster, but since it is just a string it
would be possible to store any identifier that fits.
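
For reference, a minimal sketch of that scheme in kernel-style C, with
hypothetical names throughout (the real Lustre structures differ):

/* Hypothetical per-inode "last dirtier" tracking, modeled on the
 * scheme described above; names are illustrative, not Lustre code. */
#define FS_JOBID_SIZE 32

struct fs_inode_info {				/* fs-private part of the inode */
	char fii_jobid[FS_JOBID_SIZE];		/* "comm.uid" or cluster jobid */
};

/* Called from the page-dirtying path to remember who last wrote. */
static void fs_record_dirtier(struct fs_inode_info *fii)
{
	snprintf(fii->fii_jobid, sizeof(fii->fii_jobid), "%s.%u",
		 current->comm,
		 from_kuid(&init_user_ns, current_fsuid()));
}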

Cheers, Andreas

* RE: Tracking actual disk write sources instead of flush thread
  2014-04-16 15:15   ` Phillip Susi
  2014-04-16 16:42     ` Andreas Dilger
@ 2014-04-16 17:44     ` Zuckerman, Boris
  2014-04-16 18:18       ` Phillip Susi
  2014-04-16 19:33     ` Andi Kleen
  2014-04-24 19:33     ` Jan Kara
  3 siblings, 1 reply; 17+ messages in thread
From: Zuckerman, Boris @ 2014-04-16 17:44 UTC (permalink / raw)
  To: Phillip Susi, Matthew Wilcox; +Cc: linux-fsdevel

> 
> That seems like a severe defect.  How can such a defect be tolerated in this day and
> age?  Why does the io accounting not track how many pages the process dirties rather
> than how many it actually initiates the writeout for?
> 
> 


Typically, file system services do not offer transactional isolation
semantics.  Attempts to add that have been made and were rejected.
Therefore, we are speaking only about some potential performance
penalty, right?


* Re: Tracking actual disk write sources instead of flush thread
  2014-04-16 17:44     ` Zuckerman, Boris
@ 2014-04-16 18:18       ` Phillip Susi
  2014-04-16 18:28         ` Zuckerman, Boris
  0 siblings, 1 reply; 17+ messages in thread
From: Phillip Susi @ 2014-04-16 18:18 UTC (permalink / raw)
  To: Zuckerman, Boris, Matthew Wilcox; +Cc: linux-fsdevel

On 4/16/2014 1:44 PM, Zuckerman, Boris wrote:
> Typically File System services do not offer semantics of 
> transactional isolation. Attempts to add that took place and were 
> rejected. Therefore, we are speaking only about some potential 
> performance penalty, right?

Not at all.  This has nothing to do with transactions; it is simply a
question of being able to identify which process is causing all of the
disk IO, e.g. via iotop.  Counting writes against the process that
initiates the writeout of dirty pages, rather than the process that
dirtied them, renders the write IO tracking largely inaccurate, at
times to the point of uselessness, since the process dirtying the
pages, even via an explicit write() system call, is often not the
process that initiates the writeout.
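
For reference, the task IO accounting helper essentially boils down to
this (simplified from include/linux/task_io_accounting_ops.h):

/* Bytes are charged to *current*, i.e. whichever task is executing
 * the accounting call when it fires.  Per the behaviour described in
 * this thread, deferred writeback runs in the flush thread's context,
 * so the flush thread is who gets charged. */
static inline void task_io_account_write(size_t bytes)
{
	current->ioac.write_bytes += bytes;
}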


* RE: Tracking actual disk write sources instead of flush thread
  2014-04-16 18:18       ` Phillip Susi
@ 2014-04-16 18:28         ` Zuckerman, Boris
  2014-04-16 19:27           ` Phillip Susi
  0 siblings, 1 reply; 17+ messages in thread
From: Zuckerman, Boris @ 2014-04-16 18:28 UTC (permalink / raw)
  To: Phillip Susi, Matthew Wilcox; +Cc: linux-fsdevel

> 
> On 4/16/2014 1:44 PM, Zuckerman, Boris wrote:
> > Typically File System services do not offer semantics of transactional
> > isolation. Attempts to add that took place and were rejected.
> > Therefore, we are speaking only about some potential performance
> > penalty, right?
> 
> Not at all.  This has nothing to do with transactions; it is simply a question of being
> able to identify what process is causing all of the disk IO, like via iotop.  By counting
> writes in the process that initiates the writeout of dirty pages, rather than the process
> that dirties the page, it renders the write io tracking largely inaccurate to the point of
> being useless at times, since often the process dirtying the pages, even with an actual
> write() system call, is not the process that initiates the writeout.
> 
[bz:] 
In such a case I'd rather have a lighter implementation of the caching
layer than the ability to track disk IOs per process.  IOs per process
can be tracked by other tools...



* Re: Tracking actual disk write sources instead of flush thread
  2014-04-16 18:28         ` Zuckerman, Boris
@ 2014-04-16 19:27           ` Phillip Susi
  2014-04-23 13:48             ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: Phillip Susi @ 2014-04-16 19:27 UTC (permalink / raw)
  To: Zuckerman, Boris, Matthew Wilcox; +Cc: linux-fsdevel

On 4/16/2014 2:28 PM, Zuckerman, Boris wrote:
> [bz:] In such case I'd rather have lighter implementation of
> caching layer, than ability to track disk IOs per process. IOs per
> process can be tracked by other tools...

What other tools?

I can envision blktrace being extended to hook into the mm layer to
track the real source of writes, but without any overhead when the
trace is not running.
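
Kernel tracepoints already work that way: when no tracer is attached,
the callsite is a patched-out static-key branch and costs essentially
nothing.  A hypothetical sketch of such a hook (the event name and
fields are illustrative; no such tracepoint exists today):

#include <linux/tracepoint.h>

/* Hypothetical tracepoint fired where a task dirties a page, so a
 * blktrace-style tool could attribute the eventual writeout. */
TRACE_EVENT(mm_page_dirtied,
	TP_PROTO(struct inode *inode, struct task_struct *task),
	TP_ARGS(inode, task),
	TP_STRUCT__entry(
		__field(ino_t, ino)
		__field(pid_t, pid)
		__array(char, comm, TASK_COMM_LEN)
	),
	TP_fast_assign(
		__entry->ino = inode->i_ino;
		__entry->pid = task->pid;
		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
	),
	TP_printk("ino=%lu pid=%d comm=%s",
		  (unsigned long)__entry->ino, __entry->pid, __entry->comm)
);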



* Re: Tracking actual disk write sources instead of flush thread
  2014-04-16 15:15   ` Phillip Susi
  2014-04-16 16:42     ` Andreas Dilger
  2014-04-16 17:44     ` Zuckerman, Boris
@ 2014-04-16 19:33     ` Andi Kleen
  2014-04-24 19:33     ` Jan Kara
  3 siblings, 0 replies; 17+ messages in thread
From: Andi Kleen @ 2014-04-16 19:33 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Matthew Wilcox, linux-fsdevel

Phillip Susi <psusi@ubuntu.com> writes:
>
> That seems like a severe defect.  How can such a defect be tolerated
> in this day and age?  Why does the io accounting not track how many
> pages the process dirties rather than how many it actually initiates
> the writeout for?

There is no per-process dirty counter anyway.

The existing process write counter is legacy and generally useless.

If you want to track syscalls, there are already plenty of ways to do
that per process.  There's no way to track background mmap flushes,
and it may be meaningless in this case anyway.


-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: Tracking actual disk write sources instead of flush thread
  2014-04-16 19:27           ` Phillip Susi
@ 2014-04-23 13:48             ` Matthew Wilcox
  2014-04-23 19:39               ` Phillip Susi
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2014-04-23 13:48 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Zuckerman, Boris, linux-fsdevel

On Wed, Apr 16, 2014 at 03:27:52PM -0400, Phillip Susi wrote:
> On 4/16/2014 2:28 PM, Zuckerman, Boris wrote:
> > [bz:] In such case I'd rather have lighter implementation of
> > caching layer, than ability to track disk IOs per process. IOs per
> > process can be tracked by other tools...
> 
> What other tools?
> 
> I can envision blktrace being extended to hook into the mm layer to
> track the real source of writes, but without any overhead when the
> trace is not running.

How do you 'envision' this zero-overhead trace, exactly?

I don't understand your high-level goal, which makes suggesting
low-overhead solutions hard.  Can you tolerate a certain amount of
ambiguity, for example?  Do you really only want to track back to the
UID that is causing the I/O?  With shared mmaps, are you OK attributing
the I/O to one of the processes that has written to it, or do you need to
attribute the write to all the processes that have written to that page?

You're coming off kind of condescending, which isn't a great approach
when you're asking for a new feature to be implemented.


* Re: Tracking actual disk write sources instead of flush thread
  2014-04-23 13:48             ` Matthew Wilcox
@ 2014-04-23 19:39               ` Phillip Susi
  2014-04-23 23:00                 ` Theodore Ts'o
  2014-04-23 23:19                 ` Andreas Dilger
  0 siblings, 2 replies; 17+ messages in thread
From: Phillip Susi @ 2014-04-23 19:39 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Zuckerman, Boris, linux-fsdevel

On 4/23/2014 9:48 AM, Matthew Wilcox wrote:
> How do you 'envision' this zero-overhead trace, exactly?

The same way blktrace currently works: the accounting code is not run
until enabled.  When disabled, nothing more runs than does now.
Obviously while actively tracing, there would be overhead.

> I don't understand your high-level goal, which makes suggesting 
> low-overhead solutions hard.  Can you tolerate a certain amount of 
> ambiguity, for example?  Do you really only want to track back to
> the UID that is causing the I/O?  With shared mmaps, are you OK
> attributing the I/O to one of the processes that has written to it,
> or do you need to attribute the write to all the processes that
> have written to that page?

I suppose the first process that dirties the page would be fine.  It
isn't very often that more than one process is writing to the same
data at the same time.

> You're coming off kind of condescending, which isn't a great
> approach when you're asking for a new feature to be implemented.

I'm just a little flabbergasted that this regression (I'm sure there
was a time when the current IO accounting mechanism did work, probably
before the buffer_head -> page cache transition way, *way* back) went
unfixed for so long, especially since it is a vital tool for a
sysadmin trying to figure out why a system is slow.  I think every
sysadmin out there takes it for granted that running iotop will show
which process or processes are the source of all the IO, so I almost
can't believe that it doesn't really work.



* Re: Tracking actual disk write sources instead of flush thread
  2014-04-23 19:39               ` Phillip Susi
@ 2014-04-23 23:00                 ` Theodore Ts'o
  2014-04-24  1:20                   ` Phillip Susi
  2014-04-23 23:19                 ` Andreas Dilger
  1 sibling, 1 reply; 17+ messages in thread
From: Theodore Ts'o @ 2014-04-23 23:00 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Matthew Wilcox, Zuckerman, Boris, linux-fsdevel

On Wed, Apr 23, 2014 at 03:39:19PM -0400, Phillip Susi wrote:
> I'm just a little flabbergasted that this regression ( I'm sure there
> was a time when the current io accounting mechanism did work, probably
> before the buffer_head -> page cache transition way, *way* back when )

Using what tool?  Task-level io accounting is new compared to the
buffer -> page cache transition, as far as I can remember.

						- Ted


* Re: Tracking actual disk write sources instead of flush thread
  2014-04-23 19:39               ` Phillip Susi
  2014-04-23 23:00                 ` Theodore Ts'o
@ 2014-04-23 23:19                 ` Andreas Dilger
  2014-04-24  1:39                   ` Phillip Susi
  2014-04-28  3:27                   ` Andi Kleen
  1 sibling, 2 replies; 17+ messages in thread
From: Andreas Dilger @ 2014-04-23 23:19 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Matthew Wilcox, Zuckerman, Boris, linux-fsdevel


On Apr 23, 2014, at 1:39 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 4/23/2014 9:48 AM, Matthew Wilcox wrote:
> > I don't understand your high-level goal, which makes suggesting
> > low-overhead solutions hard.  Can you tolerate a certain amount of
> > ambiguity, for example?  Do you really only want to track back to
> > the UID that is causing the I/O?  With shared mmaps, are you OK
> > attributing the I/O to one of the processes that has written to it,
> > or do you need to attribute the write to all the processes that
> > have written to that page?
> 
> I suppose the first process that dirties the page would be fine.  It
> isn't very often that more than one process is writing to the same
> data at the same time.

I think that adding a pointer or integer per page would meet resistance,
but I think it is pretty reasonable to track this on a per-inode basis.
It is fairly uncommon to have multiple threads writing to the same file,
and I would guess it is vanishingly rare that different applications are
writing to the same file at one time.

Storing {current->comm}.{pid} would take 20 bytes of space per inode, but
would be much more useful than just storing {pid}, since a process may be
long gone by the time that the blocks are even submitted to disk due to
delayed allocation and such.

It would be possible to store a refcounted struct with this info pointed
to from the inode, since it would only be useful on inodes being written,
but that has to be balanced against the complexity of maintaining that
struct and the potential of saving 12 bytes per inode (since there would
still need to be a pointer in the inode).

There are potentially a number of other fields in struct inode that are
only used during writes (i_size_seqcount, dirtied_when (how did that avoid
getting an i_ prefix?), i_wb_list, i_writecount) that might also be moved
to a separate struct allocated only for files being written (add an
8-byte pointer, subtract 32 bytes of fields).  That would have the benefit
of slimming down the majority of inodes, which are not currently being
written, and would make the added "write source" information less costly.

Cheers, Andreas

* Re: Tracking actual disk write sources instead of flush thread
  2014-04-23 23:00                 ` Theodore Ts'o
@ 2014-04-24  1:20                   ` Phillip Susi
  0 siblings, 0 replies; 17+ messages in thread
From: Phillip Susi @ 2014-04-24  1:20 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Matthew Wilcox, Zuckerman, Boris, linux-fsdevel

On 04/23/2014 07:00 PM, Theodore Ts'o wrote:
> On Wed, Apr 23, 2014 at 03:39:19PM -0400, Phillip Susi wrote:
>> I'm just a little flabbergasted that this regression ( I'm sure
>> there was a time when the current io accounting mechanism did
>> work, probably before the buffer_head -> page cache transition
>> way, *way* back when )
> 
> Using what tool?  Task-level io accounting is new compared to the 
> buffer -> page cache transition, as far as I can remember.

Using iotop.  I thought that task-level IO accounting went all the way
back to BSD?



* Re: Tracking actual disk write sources instead of flush thread
  2014-04-23 23:19                 ` Andreas Dilger
@ 2014-04-24  1:39                   ` Phillip Susi
  2014-04-28  3:27                   ` Andi Kleen
  1 sibling, 0 replies; 17+ messages in thread
From: Phillip Susi @ 2014-04-24  1:39 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Matthew Wilcox, Zuckerman, Boris, linux-fsdevel

On 04/23/2014 07:19 PM, Andreas Dilger wrote:
> I think that adding a pointer or integer per page would meet
> resistance, but I think it is pretty reasonable to track this on a
> per-inode basis. It is fairly uncommon to have multiple threads
> writing to the same file, and I would guess it is vanishingly rare
> that different applications are writing to the same file at one
> time.

Wherever the current counter lives should be fine; the question is when
it gets updated.  Rather than being updated on sync, it just needs to
be updated on the actual write() / page dirty.

> Storing {current->comm}.{pid} would take 20 bytes of space per
> inode, but would be much more useful than just storing {pid}, since
> a process may be long gone by the time that the blocks are even
> submitted to disk due to delayed allocation and such.
> 
> It would be possible to store a refcounted struct with this info
> pointed to from the inode, since it would only be useful on inodes
> being written, but that has to be balanced against the complexity
> of maintaining that struct and the potential of saving 12 bytes per
> inode (since there would still need to be a pointer in the inode).

That might be a bit overkill.  I believe the current interface just
has a counter of bytes of IO hung off the task, so you can't catch
very short-lived processes.  That's probably not too bad.  I don't
think it needs to be tracked on a per-inode basis; we just need to be
able to sort out that this process is doing the writing, rather than
some other random process or the flush task that actually ends up
submitting the bio.
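
A sketch of where such an update could live, with hypothetical naming
(the hook placement is illustrative, not an actual patch):

/* Charge the write counter when a page is dirtied, not when the
 * writeout is submitted; "current" here is the task calling write()
 * or storing through an mmap, i.e. the real originator. */
static void account_page_dirty_charge(void)
{
	current->ioac.write_bytes += PAGE_SIZE;
}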



* Re: Tracking actual disk write sources instead of flush thread
  2014-04-16 15:15   ` Phillip Susi
                       ` (2 preceding siblings ...)
  2014-04-16 19:33     ` Andi Kleen
@ 2014-04-24 19:33     ` Jan Kara
  3 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2014-04-24 19:33 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Matthew Wilcox, linux-fsdevel

On Wed 16-04-14 11:15:41, Phillip Susi wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 4/16/2014 10:01 AM, Matthew Wilcox wrote:
> > On Tue, Apr 15, 2014 at 10:01:27PM -0400, Phillip Susi wrote:
> >> A lot of disk writes, especially when they are small individual
> >> files being written by several different processes, are hidden
> >> behind the flush thread.  Is there no way to properly track the
> >> process actually responsible for the IO, even when it is the
> >> flush thread that initiates the writeout?
> > 
> > Correct.
> 
> Wow.  If I understand things correctly, this also means that if
> process A dirties a ton of cache pages, then process B tries to write
> a relatively small amount, it can end up blocking in the synchronous
> flush path, and so it will appear that process B and flush are doing
> all of the writes, and not process A.
  There is nothing like a synchronous flush path anymore (since 3.1 or
something like that).  All flushing is done by the flusher workers these
days (and by sync(2) and fsync(2), obviously).  And you are right that
we have no way to attribute particular dirty pages in the page cache to
a process.  However, we do track how many pages each process has dirtied
and use this information to throttle processes in balance_dirty_pages()
(the process waits in that function for a time proportional to the
number of pages it has dirtied since it last entered that function).

However, that dirty counter isn't very useful for administrative
purposes because it gets zeroed out in balance_dirty_pages().  But as
others have pointed out, for tracking the originators of dirty pages we
would like something which survives longer than a process anyway.
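
A toy model of that throttling (the real balance_dirty_pages() in
mm/page-writeback.c is far more involved, and pause_per_page is a
made-up constant):

/* The task sleeps for a time proportional to the pages it dirtied
 * since it last entered here, then the counter is zeroed, which is
 * why the counter is useless for long-term accounting. */
static void balance_dirty_pages_model(void)
{
	if (current->nr_dirtied >= current->nr_dirtied_pause) {
		long pause = current->nr_dirtied * pause_per_page;

		schedule_timeout_interruptible(pause);
		current->nr_dirtied = 0;	/* history lost here */
	}
}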

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Tracking actual disk write sources instead of flush thread
  2014-04-23 23:19                 ` Andreas Dilger
  2014-04-24  1:39                   ` Phillip Susi
@ 2014-04-28  3:27                   ` Andi Kleen
  1 sibling, 0 replies; 17+ messages in thread
From: Andi Kleen @ 2014-04-28  3:27 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Phillip Susi, Matthew Wilcox, Zuckerman, Boris, linux-fsdevel

Andreas Dilger <adilger@dilger.ca> writes:

> I think that adding a pointer or integer per page would meet resistance,
> but I think it is pretty reasonable to track this on a per-inode
> basis.

That's a really bad hot cache line in the making.

> It is fairly uncommon to have multiple threads writing to the same file,
> and I would guess it is vanishingly rare that different applications are
> writing to the same file at one time.

Sounds like a bad assumption to me.
Likely would be a scalability disaster.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

