* fiologparser
@ 2016-07-14 15:35 Karl Cronburg
  2016-07-14 16:29 ` fiologparser Mark Nelson
  0 siblings, 1 reply; 6+ messages in thread
From: Karl Cronburg @ 2016-07-14 15:35 UTC (permalink / raw)
  To: fio

Hello all,

I have written a new version of fiologparser and could use some feedback
from anyone who uses the tool regularly. It is on the fiologparser branch
of my fork of fio:

https://github.com/cronburg/fio/blob/fiologparser/tools/fiologparser.py

It uses numpy and cython to speed up the data parsing and percentile
calculations, while decreasing the memory footprint and making it
scalable.

-Karl Cronburg-


* Re: fiologparser
  2016-07-14 15:35 fiologparser Karl Cronburg
@ 2016-07-14 16:29 ` Mark Nelson
  2016-07-14 18:48   ` fiologparser Karl Cronburg
  0 siblings, 1 reply; 6+ messages in thread
From: Mark Nelson @ 2016-07-14 16:29 UTC (permalink / raw)
  To: Karl Cronburg, fio



On 07/14/2016 10:35 AM, Karl Cronburg wrote:
> Hello all,
>
> I have written a new version of fiologparser and could use some feedback
> from anyone who uses the tool regularly. It is on branch fiologparser of my fork
> of fio:
>
> https://github.com/cronburg/fio/blob/fiologparser/tools/fiologparser.py
>
> It uses numpy and cython to speed up the data parsing and percentile
> calculations,
> while decreasing the memory footprint and making it scalable.

I was really trying to avoid the extra dependencies. :/  That's why 
Ben's last commit was reworked so we could avoid adding numpy as a 
dependency in the fio distribution (not that fio itself would need it, 
but it would be really swell if none of the tools needed anything other 
than stock python).

How much improvement are you seeing, and where is the speedup coming 
from?  I suspect we can get similar gains without needing to bring it 
in.  For now, though, I really think we need to worry about correctness 
rather than speed.  As such, does this version use a similar technique 
to what's in wip-interval?

https://github.com/markhpc/fio/commits/wip-interval

If so, we should probably merge that first and add test cases for it to 
make sure the idea is sound.  If this is something new, it would be good 
to showcase that it's as good as or better than wip-interval.  I could 
easily be convinced that it is, but let's demonstrate it.

Mark


>
> -Karl Cronburg-
> --
> To unsubscribe from this list: send the line "unsubscribe fio" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* Re: fiologparser
  2016-07-14 16:29 ` fiologparser Mark Nelson
@ 2016-07-14 18:48   ` Karl Cronburg
  2016-07-14 19:05     ` fiologparser Mark Nelson
  0 siblings, 1 reply; 6+ messages in thread
From: Karl Cronburg @ 2016-07-14 18:48 UTC (permalink / raw)
  To: Mark Nelson; +Cc: fio

On Thu, Jul 14, 2016 at 12:29 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
>
>
> On 07/14/2016 10:35 AM, Karl Cronburg wrote:
>>
>> Hello all,
>>
>> I have written a new version of fiologparser and could use some feedback
>> from anyone who uses the tool regularly. It is on branch fiologparser of
>> my fork
>> of fio:
>>
>> https://github.com/cronburg/fio/blob/fiologparser/tools/fiologparser.py
>>
>> It uses numpy and cython to speed up the data parsing and percentile
>> calculations,
>> while decreasing the memory footprint and making it scalable.
>
>
> I was really trying to avoid the extra dependencies. :/  That's why Ben's
> last commit was reworked so we could avoid adding numpy as a dependency in
> the fio distribution (not that fio itself would need it, but it would be
> really swell if none of the tools needed anything other than stock python).
>
> How much improvement are you seeing and where is the speedup coming from?  I

All the speedup is coming from using dedicated parsing libraries (pandas),
numpy arrays (with vector math) instead of python lists, and streaming smaller
portions of the data at a time into memory.
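
To give a flavor of it, here's a rough sketch of the approach (not the
actual code in the branch - column names, chunk size, and the bucketing
are illustrative, and it ignores samples straddling chunk / interval
boundaries):

    import numpy as np
    import pandas as pd

    # fio latency logs are CSV rows of: end time (msec), latency (usec),
    # data direction, block size.
    COLS = ["end_time", "latency", "ddir", "bs"]

    def percentiles_by_interval(path, interval=1000, pcts=(50, 90, 95, 99)):
        # Stream the file in chunks instead of loading it all into memory.
        for chunk in pd.read_csv(path, names=COLS, chunksize=10**6):
            ends = chunk["end_time"].values
            lats = chunk["latency"].values
            # Vectorized bucketing of samples into reporting intervals.
            buckets = ends // interval
            for b in np.unique(buckets):
                yield int(b) * interval, np.percentile(lats[buckets == b], pcts)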

On a single 39 MB latency log file it takes roughly a tenth of a second
to calculate the percentiles for each interval, whereas the one in
wip-interval (with a slight modification to stop a NoneType error) takes
5+ seconds per interval.

> suspect we can get similar gains without needing to bring it in.  For now
> though I really think we need to worry about correctness rather than speed.

I would think someone running python on a large data set would be more
than willing to install something as widely used and supported as numpy.
If anything, numpy gives us better correctness guarantees and less
technical debt.

Who else is using / wants the percentiles fiologparser is calculating? If
enough users are put off by the dependencies I understand, but the data
scientist in me is really sad that the exact thing numpy was built for is
being re-engineered.

> As such, does this version use a similar technique as what's in
> wip-interval?

Yes - weighted percentiles are calculated using the fraction of a sample
falling in the given interval.
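
Roughly (a sketch only - names are illustrative, not from the branch):

    def overlap_weight(sample_start, sample_end, int_start, int_end):
        # Fraction of the sample's duration falling inside the interval
        # [int_start, int_end); used as the sample's weight for that interval.
        duration = sample_end - sample_start
        if duration <= 0:
            return 0.0
        overlap = min(sample_end, int_end) - max(sample_start, int_start)
        return max(0.0, float(overlap) / duration)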

>
> https://github.com/markhpc/fio/commits/wip-interval
>
> If so, we should probably merge that first and make sure we have test cases
> for it to make sure the idea is sound.  If this is something new, it would
> be good to showcase that it's as good or better than wip-interval.  I could
> easily be convinced that it is, but let's demonstrate it.

Will do - I'll make some test cases to see where wip-interval differs
from this implementation. If anything, I figure it's good to maintain both
for now to see where our thinking differs - just looking at your code I
wasn't even sure which numbers should be calculated, but now that I've
implemented weighted latency percentiles myself I have a better idea of
what we want.

-Karl-

>
> Mark
>
>
>>
>> -Karl Cronburg-
>> --
>> To unsubscribe from this list: send the line "unsubscribe fio" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>


* Re: fiologparser
  2016-07-14 18:48   ` fiologparser Karl Cronburg
@ 2016-07-14 19:05     ` Mark Nelson
  2016-07-15 13:02       ` fiologparser Karl Cronburg
  0 siblings, 1 reply; 6+ messages in thread
From: Mark Nelson @ 2016-07-14 19:05 UTC (permalink / raw)
  To: Karl Cronburg; +Cc: fio



On 07/14/2016 01:48 PM, Karl Cronburg wrote:
> On Thu, Jul 14, 2016 at 12:29 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
>>
>>
>> On 07/14/2016 10:35 AM, Karl Cronburg wrote:
>>>
>>> Hello all,
>>>
>>> I have written a new version of fiologparser and could use some feedback
>>> from anyone who uses the tool regularly. It is on branch fiologparser of
>>> my fork
>>> of fio:
>>>
>>> https://github.com/cronburg/fio/blob/fiologparser/tools/fiologparser.py
>>>
>>> It uses numpy and cython to speed up the data parsing and percentile
>>> calculations,
>>> while decreasing the memory footprint and making it scalable.
>>
>>
>> I was really trying to avoid the extra dependencies. :/  That's why Ben's
>> last commit was reworked so we could avoid adding numpy as a dependency in
>> the fio distribution (not that fio itself would need it, but it would be
>> really swell if none of the tools needed anything other than stock python).
>>
>> How much improvement are you seeing and where is the speedup coming from?  I
>
> All the speedup is coming from using dedicated parsing libraries (pandas),
> numpy arrays (with vector math) instead of python lists, and streaming smaller
> portions of the data at a time into memory.
>
> On a single 39 MB latency log file it takes ~ a tenth of a second per
> interval to calculate
> percentiles for each interval whereas the one in wip-interval (with
> slight modification
> to stop a NoneType error) takes 5+ seconds per interval.

FWIW no optimizations were done at all in wip-interval.  We should be 
able to speed it up quite a bit since it's doing work it probably 
doesn't need to do every loop, though yours might still be faster.  I 
guess the question is how fast we need it to be and whether it's worth 
bringing in the extra dependency.

>
>> suspect we can get similar gains without needing to bring it in.  For now
>> though I really think we need to worry about correctness rather than speed.
>
> I would think someone running python on a large data set would be more than
> willing to install something as widely used and supported as numpy. If anything
> numpy gives us better correctness guarantees and less technical debt.

If it does something that we can't easily do ourselves I could be 
convinced, but it's a tradeoff I'd personally not want to make lightly. 
Ultimately it should be Jens' call, though.

>
> Who else is using / wants the percentiles fiologparser is calculating? If enough
> users are put-off by the dependencies I understand, the data scientist
> in me is just
> really sad that the exact thing numpy was built for is being re-engineered.
>
>> As such, does this version use a similar technique as what's in
>> wip-interval?
>
> Yes - weighted percentiles are calculated using the fraction of a
> sample falling in
> the given interval.
>
>>
>> https://github.com/markhpc/fio/commits/wip-interval
>>
>> If so, we should probably merge that first and make sure we have test cases
>> for it to make sure the idea is sound.  If this is something new, it would
>> be good to showcase that it's as good or better than wip-interval.  I could
>> easily be convinced that it is, but let's demonstrate it.
>
> Will do - I'll make some test cases to see where wip-interval differs
> with this implementation.
> If anything I figure it's good to maintain both for now to see where
> our thinking differs - just looking
> at your code I wasn't even sure what numbers should be calculated, but
> now that I've implemented
> weighted latency percentiles myself I have a better idea what we want.

There's no guarantee that mine is correct either! :)  Hence why I didn't 
want to submit a PR for it yet.  Having another set of eyes looking at 
it is good.

One thing I suspect may be contentious: when you are looking at sample 
points that are averages from fio (over, say, 100ms) and you have two 
samples that are more than 100ms apart, does that mean the interval is 
longer than 100ms, or that the interval is 100ms with a gap?  I assumed 
a longer-than-100ms interval, but this may be wrong.  We probably won't 
know without reading the fio code.

Mark

>
> -Karl-
>
>>
>> Mark
>>
>>
>>>
>>> -Karl Cronburg-
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe fio" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>


* Re: fiologparser
  2016-07-14 19:05     ` fiologparser Mark Nelson
@ 2016-07-15 13:02       ` Karl Cronburg
  2016-07-25 15:00         ` fiologparser Karl Cronburg
  0 siblings, 1 reply; 6+ messages in thread
From: Karl Cronburg @ 2016-07-15 13:02 UTC (permalink / raw)
  To: Mark Nelson; +Cc: fio

On Thu, Jul 14, 2016 at 3:05 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
>
>
> On 07/14/2016 01:48 PM, Karl Cronburg wrote:
>>
>> On Thu, Jul 14, 2016 at 12:29 PM, Mark Nelson <mark.a.nelson@gmail.com>
>> wrote:
>>>
>>>
>>>
>>> On 07/14/2016 10:35 AM, Karl Cronburg wrote:
>>>>
>>>>
>>>> Hello all,
>>>>
>>>> I have written a new version of fiologparser and could use some feedback
>>>> from anyone who uses the tool regularly. It is on branch fiologparser of
>>>> my fork
>>>> of fio:
>>>>
>>>> https://github.com/cronburg/fio/blob/fiologparser/tools/fiologparser.py
>>>>
>>>> It uses numpy and cython to speed up the data parsing and percentile
>>>> calculations,
>>>> while decreasing the memory footprint and making it scalable.
>>>
>>>
>>>
>>> I was really trying to avoid the extra dependencies. :/  That's why Ben's
>>> last commit was reworked so we could avoid adding numpy as a dependency
>>> in
>>> the fio distribution (not that fio itself would need it, but it would be
>>> really swell if none of the tools needed anything other than stock
>>> python).
>>>
>>> How much improvement are you seeing and where is the speedup coming from?
>>> I
>>
>>
>> All the speedup is coming from using dedicated parsing libraries (pandas),
>> numpy arrays (with vector math) instead of python lists, and streaming
>> smaller
>> portions of the data at a time into memory.
>>
>> On a single 39 MB latency log file it takes ~ a tenth of a second per
>> interval to calculate
>> percentiles for each interval whereas the one in wip-interval (with
>> slight modification
>> to stop a NoneType error) takes 5+ seconds per interval.
>
>
> FWIW no optimizations were done at all in wip-interval.  We should be able
> to speed it up quite a bit since it's doing quite a bit of work it probably
> doesn't need to do every loop, though yours might still be faster.  I guess
> the question is how fast do we need it and is it worth bringing in the extra
> dependency.
>
>>
>>> suspect we can get similar gains without needing to bring it in.  For now
>>> though I really think we need to worry about correctness rather than
>>> speed.
>>
>>
>> I would think someone running python on a large data set would be more
>> than
>> willing to install something as widely used and supported as numpy. If
>> anything
>> numpy gives us better correctness guarantees and less technical debt.
>
>
> If it does something that we can't easily do ourselves I could be convinced,
> but it's a tradeoff I'd personally not want to make lightly. Ultimately it
> should Jen's call though.
>
>>
>> Who else is using / wants the percentiles fiologparser is calculating? If
>> enough
>> users are put-off by the dependencies I understand, the data scientist
>> in me is just
>> really sad that the exact thing numpy was built for is being
>> re-engineered.
>>
>>> As such, does this version use a similar technique as what's in
>>> wip-interval?
>>
>>
>> Yes - weighted percentiles are calculated using the fraction of a
>> sample falling in
>> the given interval.
>>
>>>
>>> https://github.com/markhpc/fio/commits/wip-interval
>>>
>>> If so, we should probably merge that first and make sure we have test
>>> cases
>>> for it to make sure the idea is sound.  If this is something new, it
>>> would
>>> be good to showcase that it's as good or better than wip-interval.  I
>>> could
>>> easily be convinced that it is, but let's demonstrate it.
>>
>>
>> Will do - I'll make some test cases to see where wip-interval differs
>> with this implementation.
>> If anything I figure it's good to maintain both for now to see where
>> our thinking differs - just looking
>> at your code I wasn't even sure what numbers should be calculated, but
>> now that I've implemented
>> weighted latency percentiles myself I have a better idea what we want.
>
>
> There's no guarantee that mine is correct either! :)  Hence why I didn't
> want to submit a PR for it yet.  Having another set of eyes looking at it is
> good.
>
> One thing I suspect may be contentious is when you are looking at sample
> points that are averages from fio (like 100ms) and you have two samples that
> are farther than 100ms apart, does that mean the interval is longer than
> 100ms or that the interval is 100ms with a gap?  I assumed a longer than
> 100ms interval, but this may be wrong.  Probably won't know without reading
> the fio code.

From looking at the code that assumption looks good to me. Namely here:

https://github.com/axboe/fio/blob/master/stat.c#L2135

the code checks whether *at least* 100ms have passed since the last time
we logged an average, so you effectively get an average sample as soon as
the first IO request completes after the given time interval. This is also
why the sample end times appear to drift forward in time with averaging,
e.g.:

102, 2293, 0, 0
202, 1688, 0, 0
306, 1797, 0, 0
406, 2056, 0, 0
508, 1227, 0, 0
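
So, under that assumption, the length of each averaging window can be
recovered from consecutive end times - a minimal sketch (not code from
either branch):

    def window_lengths(end_times):
        # Each window is assumed to run from the previously logged time to
        # the current one (i.e. it can be longer than 100ms; there is no gap).
        prev, lengths = 0, []
        for end in end_times:
            lengths.append(end - prev)
            prev = end
        return lengths

    # window_lengths([102, 202, 306, 406, 508]) -> [102, 100, 104, 100, 102]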

-Karl-

>
> Mark
>
>
>>
>> -Karl-
>>
>>>
>>> Mark
>>>
>>>
>>>>
>>>> -Karl Cronburg-
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe fio" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>


* Re: fiologparser
  2016-07-15 13:02       ` fiologparser Karl Cronburg
@ 2016-07-25 15:00         ` Karl Cronburg
  0 siblings, 0 replies; 6+ messages in thread
From: Karl Cronburg @ 2016-07-25 15:00 UTC (permalink / raw)
  To: Mark Nelson; +Cc: fio, Ben England

On Fri, Jul 15, 2016 at 9:02 AM, Karl Cronburg <kcronbur@redhat.com> wrote:
>
> On Thu, Jul 14, 2016 at 3:05 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
> >
> >
> > On 07/14/2016 01:48 PM, Karl Cronburg wrote:
> >>
> >> On Thu, Jul 14, 2016 at 12:29 PM, Mark Nelson <mark.a.nelson@gmail.com>
> >> wrote:
> >>>
> >>>
> >>>
> >>> On 07/14/2016 10:35 AM, Karl Cronburg wrote:
> >>>>
> >>>>
> >>>> Hello all,
> >>>>
> >>>> I have written a new version of fiologparser and could use some feedback
> >>>> from anyone who uses the tool regularly. It is on branch fiologparser of
> >>>> my fork
> >>>> of fio:
> >>>>
> >>>> https://github.com/cronburg/fio/blob/fiologparser/tools/fiologparser.py
> >>>>
> >>>> It uses numpy and cython to speed up the data parsing and percentile
> >>>> calculations,
> >>>> while decreasing the memory footprint and making it scalable.
> >>>
> >>>
> >>>
> >>> I was really trying to avoid the extra dependencies. :/  That's why Ben's
> >>> last commit was reworked so we could avoid adding numpy as a dependency
> >>> in
> >>> the fio distribution (not that fio itself would need it, but it would be
> >>> really swell if none of the tools needed anything other than stock
> >>> python).
> >>>
> >>> How much improvement are you seeing and where is the speedup coming from?
> >>> I
> >>
> >>
> >> All the speedup is coming from using dedicated parsing libraries (pandas),
> >> numpy arrays (with vector math) instead of python lists, and streaming
> >> smaller
> >> portions of the data at a time into memory.
> >>
> >> On a single 39 MB latency log file it takes ~ a tenth of a second per
> >> interval to calculate
> >> percentiles for each interval whereas the one in wip-interval (with
> >> slight modification
> >> to stop a NoneType error) takes 5+ seconds per interval.
> >
> >
> > FWIW no optimizations were done at all in wip-interval.  We should be able
> > to speed it up quite a bit since it's doing quite a bit of work it probably
> > doesn't need to do every loop, though yours might still be faster.  I guess
> > the question is how fast do we need it and is it worth bringing in the extra
> > dependency.
> >
> >>
> >>> suspect we can get similar gains without needing to bring it in.  For now
> >>> though I really think we need to worry about correctness rather than
> >>> speed.
> >>
> >>
> >> I would think someone running python on a large data set would be more
> >> than
> >> willing to install something as widely used and supported as numpy. If
> >> anything
> >> numpy gives us better correctness guarantees and less technical debt.
> >
> >
> > If it does something that we can't easily do ourselves I could be convinced,
> > but it's a tradeoff I'd personally not want to make lightly. Ultimately it
> > should Jen's call though.
> >
> >>
> >> Who else is using / wants the percentiles fiologparser is calculating? If
> >> enough
> >> users are put-off by the dependencies I understand, the data scientist
> >> in me is just
> >> really sad that the exact thing numpy was built for is being
> >> re-engineered.
> >>
> >>> As such, does this version use a similar technique as what's in
> >>> wip-interval?
> >>
> >>
> >> Yes - weighted percentiles are calculated using the fraction of a
> >> sample falling in
> >> the given interval.
> >>
> >>>
> >>> https://github.com/markhpc/fio/commits/wip-interval
> >>>
> >>> If so, we should probably merge that first and make sure we have test
> >>> cases
> >>> for it to make sure the idea is sound.  If this is something new, it
> >>> would
> >>> be good to showcase that it's as good or better than wip-interval.  I
> >>> could
> >>> easily be convinced that it is, but let's demonstrate it.
> >>
> >>
> >> Will do - I'll make some test cases to see where wip-interval differs
> >> with this implementation.
> >> If anything I figure it's good to maintain both for now to see where
> >> our thinking differs - just looking
> >> at your code I wasn't even sure what numbers should be calculated, but
> >> now that I've implemented
> >> weighted latency percentiles myself I have a better idea what we want.
> >
> >
> > There's no guarantee that mine is correct either! :)  Hence why I didn't
> > want to submit a PR for it yet.  Having another set of eyes looking at it is
> > good.
> >
> > One thing I suspect may be contentious is when you are looking at sample
> > points that are averages from fio (like 100ms) and you have two samples that
> > are farther than 100ms apart, does that mean the interval is longer than
> > 100ms or that the interval is 100ms with a gap?  I assumed a longer than
> > 100ms interval, but this may be wrong.  Probably won't know without reading
> > the fio code.
>
> From looking at the code that assumption looks good to me. Namely here:
>
> https://github.com/axboe/fio/blob/master/stat.c#L2135
>
> the code checks to see if *at least* 100ms have passed since the last time we
> logged an average. So you effectively get an average sample as soon as you see
> the first IO request complete after the given time interval. This is
> also why the sample
> end times appear to drift forward in time with averaging, e.g.:
>
> 102, 2293, 0, 0
> 202, 1688, 0, 0
> 306, 1797, 0, 0
> 406, 2056, 0, 0
> 508, 1227, 0, 0
>

Mark,

I now have a pure-python (no new dependencies) implementation of the
weighted percentile method described on Wikipedia
(https://en.wikipedia.org/wiki/Percentile#Weighted_percentile), with test
cases, verified against one of the "Linear Interpolation Between Closest
Ranks" examples using weights of all one - see the bottom of this ipython
notebook:

https://cronburg.com/fio/Weighted-Percentiles.html
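
For reference, the core of that method boils down to something like this
(a minimal sketch, not the exact code going into the PR):

    def weighted_percentile(values, weights, p):
        # Weighted percentile by linear interpolation between closest ranks,
        # per the Wikipedia page above.  With all weights equal to one this
        # reduces to the unweighted closest-ranks example there (e.g. the
        # 40th percentile of [15, 20, 35, 40, 50] comes out to 27.5).
        # Weights are assumed to be > 0.
        pairs = sorted(zip(values, weights))
        total = float(sum(w for _, w in pairs))
        ranks, running = [], 0.0
        for _, w in pairs:
            running += w
            ranks.append(100.0 * (running - w / 2.0) / total)
        if p <= ranks[0]:
            return pairs[0][0]
        if p >= ranks[-1]:
            return pairs[-1][0]
        for i in range(len(ranks) - 1):
            if ranks[i] <= p <= ranks[i + 1]:
                frac = (p - ranks[i]) / (ranks[i + 1] - ranks[i])
                return pairs[i][0] + frac * (pairs[i + 1][0] - pairs[i][0])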

I'll make a PR to your wip-interval branch shortly. One thing I needed to
do was add a command line parameter to specify bandwidth vs latency,
because the units differ and, for latency logs, the start times of the
samples have to be calculated from those units.

Another thing I noticed is that for bandwidth files with both read and
write enabled we get a divide-by-zero error because of the way start / end
times are calculated. We'll need to segregate the data by r/w direction to
account for this. For example, the first few entries in a bw file:

74, 1, 0, 4096
74, 3, 1, 4096
930, 4, 0, 4096
930, 28, 1, 4096
1667, 81, 0, 4096
1667, 124, 1, 4096
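
What I mean by segregating, as a trivial sketch (illustrative only):

    def split_by_direction(rows):
        # rows are (end_time, value, ddir, bs) tuples as in the lines above;
        # ddir 0 = read, 1 = write.  Splitting first means two directions
        # sharing an end time can no longer produce a zero-length interval.
        reads = [r for r in rows if r[2] == 0]
        writes = [r for r in rows if r[2] == 1]
        return reads, writes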

-Karl-


end of thread, other threads:[~2016-07-25 15:00 UTC | newest]

Thread overview: 6+ messages
2016-07-14 15:35 fiologparser Karl Cronburg
2016-07-14 16:29 ` fiologparser Mark Nelson
2016-07-14 18:48   ` fiologparser Karl Cronburg
2016-07-14 19:05     ` fiologparser Mark Nelson
2016-07-15 13:02       ` fiologparser Karl Cronburg
2016-07-25 15:00         ` fiologparser Karl Cronburg
