* Performance degradation over time
@ 2016-03-08 14:09 Matthew Keeler
  2016-03-08 16:32 ` Michal Kazior
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Keeler @ 2016-03-08 14:09 UTC (permalink / raw)
  To: ath10k

 
I have an Arch Linux box running kernel 4.4 with two Airetos AEX-QCA9880-NX cards in it. I have hostapd configured to use one for 2.4 GHz b/g/n and the other for 5 GHz n/ac. After a fresh boot of the box I can get about 50 Mbps over 2.4 GHz using iperf and 350-400 Mbps over 5 GHz. After a couple of days or so the performance of my 2.4 GHz card drops to about 5 Mbps. Has anyone else had similar issues with this kind of performance degradation, or does anyone know a good place to start to figure out why this could be happening?

A couple of things I have checked: nothing in dmesg, nothing abnormal in the hostapd logs, and restarting hostapd doesn't seem to help; only rebooting does.

--  
Matt Keeler



_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k


* Re: Performance degradation over time
  2016-03-08 14:09 Performance degradation over time Matthew Keeler
@ 2016-03-08 16:32 ` Michal Kazior
  2016-03-08 16:38   ` Matthew Keeler
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Kazior @ 2016-03-08 16:32 UTC (permalink / raw)
  To: Matthew Keeler; +Cc: ath10k

On 8 March 2016 at 15:09, Matthew Keeler <mjkeeler7@gmail.com> wrote:
>
> I have an Arch Linux box running kernel 4.4 with two Airetos AEX-QCA9880-NX cards in it. I have hostapd configured to use one for 2.4 GHz b/g/n and the other for 5 GHz n/ac. After a fresh boot of the box I can get about 50 Mbps over 2.4 GHz using iperf and 350-400 Mbps over 5 GHz. After a couple of days or so the performance of my 2.4 GHz card drops to about 5 Mbps. Has anyone else had similar issues with this kind of performance degradation, or does anyone know a good place to start to figure out why this could be happening?
>
> A couple of things I have checked: nothing in dmesg, nothing abnormal in the hostapd logs, and restarting hostapd doesn't seem to help; only rebooting does.

Hi,

You didn't mention which firmware version you're using:

  dmesg | grep ath10k.*firmware

From the looks of it, this seems like a firmware (probably rate control
module) issue, and there isn't much the driver can do about it. At best
you could force a firmware reboot via debugfs to work around the problem:

  echo hw-restart > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash

FWIW I suggest you try out the latest 10.2.4 if you haven't already, e.g.

  https://github.com/kvalo/ath10k-firmware/blob/master/QCA988X/10.2.4/firmware-5.bin_10.2.4.70.22-2

You could also try grabbing fw stats after a fresh boot and again once
the problem is observed, and comparing the two:

  cat /sys/kernel/debug/ieee80211/phy0/ath10k/fw_stats
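Taken together, the steps above can be wrapped in a small script. This is only a sketch: the phy name (phy0) and a debugfs mounted at /sys/kernel/debug are assumptions, so adjust the paths for your system.

```shell
# Sketch of the diagnostic steps suggested above. The phy name and a
# debugfs mounted at /sys/kernel/debug are assumptions.
PHY="${PHY:-phy0}"
DBG="/sys/kernel/debug/ieee80211/${PHY}/ath10k"

fw_version() {          # firmware version as reported at driver probe
    dmesg | grep 'ath10k.*firmware'
}

fw_stats_snapshot() {   # save fw_stats to the file named in $1
    cat "${DBG}/fw_stats" > "$1"
}

fw_restart() {          # force a firmware reboot via debugfs
    echo hw-restart > "${DBG}/simulate_fw_crash"
}

# Usage on the affected box:
#   fw_stats_snapshot /tmp/fw_stats.before
#   ... wait until the slowdown reappears ...
#   fw_stats_snapshot /tmp/fw_stats.after
#   diff -u /tmp/fw_stats.before /tmp/fw_stats.after
```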


Michał



* Re: Performance degradation over time
  2016-03-08 16:32 ` Michal Kazior
@ 2016-03-08 16:38   ` Matthew Keeler
  2016-03-09  2:01     ` Matthew Keeler
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Keeler @ 2016-03-08 16:38 UTC (permalink / raw)
  To: Michal Kazior; +Cc: ath10k

 
I am using firmware version 10.2.4.70.9-2 (from the Arch linux-firmware package), so it looks to be a little out of date. I will try the latest firmware and see if it fixes things.

Thanks for the tips about simulate_fw_crash and fw_stats.

--  
Matt Keeler





* Re: Performance degradation over time
  2016-03-08 16:38   ` Matthew Keeler
@ 2016-03-09  2:01     ` Matthew Keeler
  2016-03-09  6:00       ` Michal Kazior
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Keeler @ 2016-03-09  2:01 UTC (permalink / raw)
  To: Michal Kazior; +Cc: ath10k

 
So with the most recent firmware I am experiencing different performance issues. First, after coming up the latest firmware performs significantly worse than the old one did: where 2.4 GHz used to do ~50 Mbps, it now averages only ~30 Mbps (and the rate is sporadic, anywhere from 10 Mbps to 40 Mbps), and this is about 3 inches from my antennas on an unused channel. Secondly, I have sometimes seen it drop below 0.1 Mbps. I grabbed the fw_stats. One thing that seems drastically different between 2.4 GHz and 5 GHz is that 2.4 GHz has extremely high error counts, whereas 5 GHz is < 400. Could this be a symptom of misconfiguration or more firmware issues?


             ath10k PDEV stats
             =================

           Channel noise floor        -82
              Channel TX power         46
                TX frame count  340181623
                RX frame count 1520199208
                RX clear count 1951746151
                   Cycle count 3237081850
               PHY error count      71162
                 RTS bad count      16735
                RTS good count      27724
                 FCS bad count    1104584
               No beacon count          0
                 MIB int count         25


          ath10k PDEV TX stats
             =================

            HTT cookies queued    1071875
             HTT cookies disp.    1071875
                   MSDU queued    1119755
                   MPDU queued    1119543
                 MSDUs dropped          0
                  Local enqued      47880
                   Local freed      47880
                     HW queued     311696
                  PPDUs reaped     311696
                 Num underruns          3
                 PPDUs cleaned          1
                  MPDUs requed     491558
             Excessive retries      57767
                       HW rate         67
            Sched self tiggers     141832
     Dropped due to SW retries         39
       Illegal rate phy errors          0
        Pdev continuous xretry          0
                    TX timeout          2
                   PDEV resets          5
                  PHY underrun          0
  MPDU is more than txop limit          0

          ath10k PDEV RX stats
             =================

         Mid PPDU route change        256
       Tot. number of statuses    1508637
        Extra frags on rings 0          0
        Extra frags on rings 1        430
        Extra frags on rings 2        548
        Extra frags on rings 3          0
        MSDUs delivered to HTT    1508637
        MPDUs delivered to HTT    1508637
      MSDUs delivered to stack     137106
      MPDUs delivered to stack     137106
               Oversized AMSUs          0
                    PHY errors    4009165
              PHY errors drops      66554
   MPDU errors (FCS, MIC, ENC)     229322

             ath10k VDEV stats (0)
             =================


             ath10k PEER stats (3)
             =================

              Peer MAC address 44:c3:06:00:01:c8
                     Peer RSSI 0
                  Peer TX rate 0
                  Peer RX rate 0

              Peer MAC address 20:c9:d0:43:34:17
                     Peer RSSI 32
                  Peer TX rate 52000
                  Peer RX rate 52000

              Peer MAC address 00:00:00:00:70:3e
                     Peer RSSI 25939
                  Peer TX rate 19
                  Peer RX rate 2000
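For comparing dumps like the one above, the interesting counters can be pulled out with a little awk. A sketch; the field names are copied from the dump:

```shell
# Summarise a saved fw_stats dump: pull out the error counters being
# compared above alongside the RX frame count.
fw_error_summary() {    # $1 = path to a saved fw_stats file
    awk '
        /PHY error count/ { phy = $NF }
        /FCS bad count/   { fcs = $NF }
        /RX frame count/  { rx  = $NF }
        END {
            printf "phy_errors=%s fcs_bad=%s rx_frames=%s\n", phy, fcs, rx
        }' "$1"
}
```

Run it against snapshots taken after a fresh boot and again after the slowdown; the 2.4 GHz dump above would give phy_errors=71162 fcs_bad=1104584 rx_frames=1520199208.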

--  
Matt Keeler





* Re: Performance degradation over time
  2016-03-09  2:01     ` Matthew Keeler
@ 2016-03-09  6:00       ` Michal Kazior
  2016-03-09 12:46         ` Matthew Keeler
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Kazior @ 2016-03-09  6:00 UTC (permalink / raw)
  To: Matthew Keeler; +Cc: ath10k

On 9 March 2016 at 03:01, Matthew Keeler <mjkeeler7@gmail.com> wrote:
>
> So with the most recent firmware I am experiencing different performance issues. First, after coming up the latest firmware performs significantly worse than the old one did: where 2.4 GHz used to do ~50 Mbps, it now averages only ~30 Mbps (and the rate is sporadic, anywhere from 10 Mbps to 40 Mbps), and this is about 3 inches from my antennas on an unused channel. Secondly, I have sometimes seen it drop below 0.1 Mbps. I grabbed the fw_stats. One thing that seems drastically different between 2.4 GHz and 5 GHz is that 2.4 GHz has extremely high error counts, whereas 5 GHz is < 400. Could this be a symptom of misconfiguration or more firmware issues?

The only thing that comes to mind is that this could be related to
(mis)calibration.

Does your card contain calibration in EEPROM or is it out-of-band? Do
note: many cards found in routers have out-of-band cal data. If it's
in EEPROM this could be either a quirk in board.bin or otp.bin (which
is embedded in the firmware blob). You could try experimenting with
different firmware versions (including 10.1.467) to see if it changes
much.
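Swapping firmware builds for such an experiment usually means replacing the blob the driver loads. A sketch: the directory and file name follow the common linux-firmware layout for QCA988X, but verify both (e.g. whether your kernel loads firmware-5.bin) against your own dmesg output.

```shell
# Install a different ath10k firmware blob for testing. The directory
# and file name are assumptions based on the usual linux-firmware
# layout for QCA988X; check dmesg for the path your kernel uses.
FW_DIR="${FW_DIR:-/lib/firmware/ath10k/QCA988X/hw2.0}"

fw_swap() {             # $1 = downloaded firmware blob to try
    cp "${FW_DIR}/firmware-5.bin" "${FW_DIR}/firmware-5.bin.bak" &&
    cp "$1" "${FW_DIR}/firmware-5.bin"
    # Reload the driver (or reboot) so the new blob is picked up:
    #   rmmod ath10k_pci ath10k_core && modprobe ath10k_pci
}
```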


Michał



* Re: Performance degradation over time
  2016-03-09  6:00       ` Michal Kazior
@ 2016-03-09 12:46         ` Matthew Keeler
  2016-03-17 11:21           ` Michal Kazior
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Keeler @ 2016-03-09 12:46 UTC (permalink / raw)
  To: Michal Kazior; +Cc: ath10k

 How can I tell if calibration is done in EEPROM or out-of-band?


--  
Matt Keeler





* Re: Performance degradation over time
  2016-03-09 12:46         ` Matthew Keeler
@ 2016-03-17 11:21           ` Michal Kazior
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Kazior @ 2016-03-17 11:21 UTC (permalink / raw)
  To: Matthew Keeler; +Cc: ath10k

Sorry for the late reply.

Normally, without valid calibration data, one would expect OTP to fail
and thus ath10k to fail to load. Moreover, if you bought the card
standalone it probably has calibration data... or at least I think it
should. Devices found in routers can have calibration data stored on a
generic flash partition instead of a dedicated wifi EEPROM.

One thing you could see with an uncalibrated device is a random MAC on
the wlan interface (e.g. 00:03:7f:xx:xx:xx) or a suspicious-looking one
(e.g. 00:03:7f:11:22:33).
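A quick check along those lines; the interface name in the usage note is hypothetical, and 00:03:7f is just the Atheros prefix from the examples above:

```shell
# Flag a MAC address that looks like the uncalibrated Atheros default
# (00:03:7f:...) described above. Returns success if it matches.
looks_uncalibrated() {  # $1 = MAC address
    case "$1" in
        00:03:7f:*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Usage (interface name wlan0 is an assumption):
#   mac=$(cat /sys/class/net/wlan0/address)
#   looks_uncalibrated "$mac" && echo "suspicious default MAC: $mac"
```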


Michał



On 9 March 2016 at 13:46, Matthew Keeler <mjkeeler7@gmail.com> wrote:
>  How can I tell if calibration is done in EEPROM or out-of-band?



* Re: Performance degradation over time
  2012-10-11  9:15       ` Marcin Deranek
@ 2012-10-14 19:31         ` Peter Grandi
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Grandi @ 2012-10-14 19:31 UTC (permalink / raw)
  To: Linux fs XFS

>>> [ ... ] open() syscalls (open for writing) were taking
>>> significantly more time than they should eg. 15-20ms vs
>>> 100-150us. [ ... ] That means that we create lots of files in
>>> /mountpoint/some/path/.tmp directory, but directory is empty
>>> as they are moved (rename() syscall) shortly after file
>>> creation to a different directory on the same filesystem.

>>> The workaround which I found so far is to remove that
>>> directory (/mountpoint/some/path/.tmp in our case) with its
>>> content and re-create it. After this operation open() syscall
>>> goes down to 100-150us again.

>>> Is this a known problem ?

Indeed, two known (for several decades) problems: using
filesystems as DBMSes and directories as spool queues.

[ ... ]

> After mounting XFS with inode64 I see a performance improvement
> (open() now takes ~3ms vs ~15ms previously) although it's still
> not something I would expect (~150us).

It would be amusing to know why one would ever expect a random metadata
access operation to take 150µs on *average* on a storage system that
seems to have rotating disks with 10-15ms *average* access times.

The metadata operations may have locality, but unsurprisingly
that decreases with time...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance degradation over time
  2012-10-11  8:33     ` Marcin Deranek
@ 2012-10-11  9:15       ` Marcin Deranek
  2012-10-14 19:31         ` Peter Grandi
  0 siblings, 1 reply; 15+ messages in thread
From: Marcin Deranek @ 2012-10-11  9:15 UTC (permalink / raw)
  To: Marcin Deranek; +Cc: Eric Sandeen, stan, xfs

On Thu, 11 Oct 2012 10:33:52 +0200
Marcin Deranek <marcin.deranek@booking.com> wrote:

> I guess next step would be to use inode64..

After mounting XFS with inode64 I see a performance improvement (open()
now takes ~3ms vs ~15ms previously), although it's still not what I
would expect (~150us). On Dave's suggestion I will give CentOS 6.x a
shot, although this needs to be monitored over a longer period of time
to reliably tell whether it makes a difference.
Regards,

Marcin



* Re: Performance degradation over time
  2012-10-10 23:37 ` Dave Chinner
@ 2012-10-11  8:42   ` Marcin Deranek
  0 siblings, 0 replies; 15+ messages in thread
From: Marcin Deranek @ 2012-10-11  8:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, 11 Oct 2012 10:37:04 +1100
Dave Chinner <david@fromorbit.com> wrote:

> Use a more recent distro. I reworked the metadata caching algorithms
> a couple of years ago to avoid these sorts of problems with memory
> reclaim.

I can give CentOS 6.x a shot, although that might take some time...

Marcin



* Re: Performance degradation over time
  2012-10-10 14:31   ` Eric Sandeen
@ 2012-10-11  8:33     ` Marcin Deranek
  2012-10-11  9:15       ` Marcin Deranek
  0 siblings, 1 reply; 15+ messages in thread
From: Marcin Deranek @ 2012-10-11  8:33 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: stan, xfs

Hi Eric,

On Wed, 10 Oct 2012 09:31:16 -0500
Eric Sandeen <sandeen@sandeen.net> wrote:

> Yep.  Ditch that; it overrides the maintained module that comes with
> the kernel itself.  See if that helps, first, I suppose.

I wasn't aware that the stock kernel comes with an xfs module. From my
testing it looks like the stock kernel module is still preferred over
kmod-xfs:

# modinfo xfs
filename:       /lib/modules/2.6.18-308.el5/kernel/fs/xfs/xfs.ko
license:        GPL
description:    SGI XFS with ACLs, security attributes, large block/inode numbers, no debug enabled
author:         Silicon Graphics, Inc.
srcversion:     D37A003AFEE1A42BDD4DD56
depends:        
vermagic:       2.6.18-308.el5 SMP mod_unload gcc-4.1
module_sig:
883f3504f44471c48d0a1fbae482c4c11225a009e3fa1179850eea96ab882c910d750e88743fec5309d1ca09de3d81add6999f9dedc65f84a0d1e21293

Most likely for historical reasons we still install kmod-xfs on our
systems.
To be sure, I removed kmod-xfs, unmounted the filesystem, removed the
kernel module, and then mounted the filesystem again. I am still seeing
the very same behaviour.

> Agreed that it would be good to know whether inode64 is in use.

No, we don't use any special mount options here.

> Let's start there (and with a modern xfs.ko) before we speculate
> further.

I guess the next step would be to use inode64...
Regards,

Marcin



* Re: Performance degradation over time
  2012-10-10  8:51 Marcin Deranek
  2012-10-10 13:17 ` Stan Hoeppner
@ 2012-10-10 23:37 ` Dave Chinner
  2012-10-11  8:42   ` Marcin Deranek
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2012-10-10 23:37 UTC (permalink / raw)
  To: Marcin Deranek; +Cc: xfs

On Wed, Oct 10, 2012 at 10:51:42AM +0200, Marcin Deranek wrote:
> Hi,
> 
> We are running an XFS filesystem on one of our machines, which is a big
> store (~3TB) of different data files (mostly images). Quite recently we
> experienced some performance problems - the machine wasn't able to keep
> up with updates. After some investigation it turned out that open()
> syscalls (open for writing) were taking significantly more time than
> they should, e.g. 15-20ms vs 100-150us.

Which is clearly the difference between IO latency and cache-hit latency.

> Some more info about our workload as I think it's important here:
> our XFS filesystem is exclusively used as data store, so we only
> read and write our data (we mostly write). When new update comes it's
> written to a temporary file eg.
> 
> /mountpoint/some/path/.tmp/file
> 
> When file is completely stored we move it to final location eg.
> 
> /mountpoint/some/path/different/subdir/newname
> 
> That means that we create lots of files in /mountpoint/some/path/.tmp
> directory, but directory is empty as they are moved (rename() syscall)
> shortly after file creation to a different directory on the same
> filesystem.
> The workaround which I found so far is to remove that directory
> (/mountpoint/some/path/.tmp in our case) with its content and re-create
> it. After this operation open() syscall goes down to 100-150us again.
> Is this a known problem ?

By emptying the directory, you are making it smaller and likely
causing it to be cached in memory again as new files are added to
it. Over time, blocks will be removed from the cache due to memory
pressure, and latencies will be seen again.
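A crude way to watch this effect over time is to measure create latency in the directory directly. A sketch: the target directory in the usage note is a placeholder, and `date +%s%N` assumes GNU coreutils.

```shell
# Rough open()-for-write latency probe: create N files in the target
# directory, report average wall time per create in microseconds, and
# clean up the probe files afterwards.
time_creates() {    # $1 = directory, $2 = number of files
    dir=$1 n=$2
    start=$(date +%s%N)
    i=0
    while [ "$i" -lt "$n" ]; do
        : > "${dir}/probe.$i"   # creates (opens for write) the file
        i=$((i + 1))
    done
    end=$(date +%s%N)
    rm -f "${dir}"/probe.*
    echo $(( (end - start) / n / 1000 ))
}

# Usage, against the spool directory discussed above (placeholder path):
#   time_creates /mountpoint/some/path/.tmp 1000
```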

> Information regarding our system:
> CentOS 5.8 / kernel 2.6.18-308.el5 / kmod-xfs-0.4-2

Use a more recent distro. I reworked the metadata caching algorithms
a couple of years ago to avoid these sorts of problems with memory
reclaim.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: Performance degradation over time
  2012-10-10 13:17 ` Stan Hoeppner
@ 2012-10-10 14:31   ` Eric Sandeen
  2012-10-11  8:33     ` Marcin Deranek
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Sandeen @ 2012-10-10 14:31 UTC (permalink / raw)
  To: stan; +Cc: xfs

On 10/10/12 8:17 AM, Stan Hoeppner wrote:
> On 10/10/2012 3:51 AM, Marcin Deranek wrote:
>> Hi,
>>
>> We are running an XFS filesystem on one of our machines, which is a big
>> store (~3TB) of different data files (mostly images). Quite recently we
>> experienced some performance problems - machine wasn't able to keep up
>> with updates. After some investigation it turned out that open()
>> syscalls (open for writing) were taking significantly more time than
>> they should eg. 15-20ms vs 100-150us.
>> Some more info about our workload as I think it's important here:
>> our XFS filesystem is exclusively used as data store, so we only
>> read and write our data (we mostly write). When new update comes it's
>> written to a temporary file eg.
>>
>> /mountpoint/some/path/.tmp/file
>>
>> When file is completely stored we move it to final location eg.
>>
>> /mountpoint/some/path/different/subdir/newname
>>
>> That means that we create lots of files in /mountpoint/some/path/.tmp
>> directory, but directory is empty as they are moved (rename() syscall)
>> shortly after file creation to a different directory on the same
>> filesystem.
>> The workaround which I found so far is to remove that directory
>> (/mountpoint/some/path/.tmp in our case) with its content and re-create
>> it. After this operation open() syscall goes down to 100-150us again.
>> Is this a known problem ?
>> Information regarding our system:
>> CentOS 5.8 / kernel 2.6.18-308.el5 / kmod-xfs-0.4-2
>> Let me know if you need to know anything more.
> 
> Hi Marcin,
> 
> I'll begin where you ended:  kmod-xfs.  DO NOT USE THAT.  Use the kernel
> driver.  Eric Sandeen can point you to the why.  AIUI that XFS module
> hasn't been supported for many many years.

Yep.  Ditch that; it overrides the maintained module that comes with the
kernel itself.  See if that helps, first, I suppose.

I've been asking CentOS for a while to find some way to deprecate that,
but it's like the night of the living dead of xfs modules.

(modinfo xfs will tell you for sure which xfs.ko is getting loaded I suppose).

> Regarding your problem, I can't state some of the following with
> authority, though it might read that way.  I'm making an educated guess
> based on what I do know of XFS and the behavior you're seeing.  Dave
> will clobber and correct me if I'm wrong here. ;)
> 
> XFS filesystems are divided into multiple equal-sized allocation groups
> on the underlying storage device (single disk, RAID, LVM volume, etc.).
> With inode32, each directory that is created has its files stored in only
> one AG, with some exceptions, which you appear to be bumping up against.
> If you're using inode64, the directories, along with their files, go into
> the AGs round-robin.

Agreed that it would be good to know whether inode64 is in use.

Let's start there (and with a modern xfs.ko) before we speculate further.

> Educated guessing:  When you use rename(2) to move the files, the file
> contents are not being moved, only the directory entry, as with EXTx
> etc.  Thus the file data is still in the ".tmp" directory AG, but that
> AG is no longer its home.  Once this temp dir AG gets full of these
> "phantom" file contents (you can only see them with XFS tools), the AG
> spills over.  At that point XFS starts moving the phantom contents of
> the rename(2) files into the AG which owns the directory of the
> rename(2) target.  I believe this is the source of your additional
> latency.  Each time you do an open(2) call to write a new file, XFS is
> moving a file's contents (extents) to its new/correct parent AG, causing
> much additional IO, especially if these are large files.

Nope, don't think so ;) Nothing is going to be moving file contents
behind your back on a rename.

<snip>

-Eric



* Re: Performance degradation over time
  2012-10-10  8:51 Marcin Deranek
@ 2012-10-10 13:17 ` Stan Hoeppner
  2012-10-10 14:31   ` Eric Sandeen
  2012-10-10 23:37 ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Stan Hoeppner @ 2012-10-10 13:17 UTC (permalink / raw)
  To: xfs

On 10/10/2012 3:51 AM, Marcin Deranek wrote:
> Hi,
> 
> We are running an XFS filesystem on one of our machines, which is a big
> store (~3TB) of different data files (mostly images). Quite recently we
> experienced some performance problems - the machine wasn't able to keep
> up with updates. After some investigation it turned out that open()
> syscalls (open for writing) were taking significantly more time than
> they should, e.g. 15-20ms vs 100-150us.
> Some more info about our workload as I think it's important here:
> our XFS filesystem is exclusively used as data store, so we only
> read and write our data (we mostly write). When new update comes it's
> written to a temporary file eg.
> 
> /mountpoint/some/path/.tmp/file
> 
> When file is completely stored we move it to final location eg.
> 
> /mountpoint/some/path/different/subdir/newname
> 
> That means that we create lots of files in /mountpoint/some/path/.tmp
> directory, but directory is empty as they are moved (rename() syscall)
> shortly after file creation to a different directory on the same
> filesystem.
> The workaround which I found so far is to remove that directory
> (/mountpoint/some/path/.tmp in our case) with its content and re-create
> it. After this operation open() syscall goes down to 100-150us again.
> Is this a known problem ?
> Information regarding our system:
> CentOS 5.8 / kernel 2.6.18-308.el5 / kmod-xfs-0.4-2
> Let me know if you need to know anything more.

Hi Marcin,

I'll begin where you ended:  kmod-xfs.  DO NOT USE THAT.  Use the kernel
driver.  Eric Sandeen can point you to the why.  AIUI that XFS module
hasn't been supported for many many years.

Regarding your problem, I can't state some of the following with
authority, though it might read that way.  I'm making an educated guess
based on what I do know of XFS and the behavior you're seeing.  Dave
will clobber and correct me if I'm wrong here. ;)

XFS filesystems are divided into multiple equal sized allocation groups
(AGs) on the underlying storage device (single disk, RAID, LVM volume,
etc).  With inode32, each directory that is created has its files stored
in only one AG, with some exceptions, one of which you appear to be
bumping up against.  If you're using inode64, the directories, along
with their files, go into the AGs round-robin.
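
For reference, the allocator behavior is selected at mount time; a
hypothetical fstab entry (the device, mountpoint, and extra options
below are placeholders, not your actual setup) would look like:

```
# /etc/fstab -- device and mountpoint are placeholders
/dev/sdb1  /mountpoint  xfs  inode64,noatime  0 0
```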

Educated guessing:  When you use rename(2) to move the files, the file
contents are not being moved, only the directory entry, as with EXTx
etc.  Thus the file data is still in the ".tmp" directory AG, but that
AG is no longer its home.  Once this temp dir AG gets full of these
"phantom" file contents (you can only see them with XFS tools), the AG
spills over.  At that point XFS starts moving the phantom contents of
the rename(2) files into the AG which owns the directory of the
rename(2) target.  I believe this is the source of your additional
latency.  Each time you do an open(2) call to write a new file, XFS is
moving a file's contents (extents) to its new/correct parent AG, causing
much additional IO, especially if these are large files.

As you are witnessing, if XFS did the move to the new AG in real time,
the performance of rename(2) would be horrible on the front end.  I'd
guess the developers never imagined that a user would fill an entire AG
using rename(2) calls.  That deleting and re-creating the .tmp
directory fixes the performance seems to be evidence of this.
Each time you delete/create that directory it is put into a different AG
in the filesystem, in a round robin fashion.  If you do this enough
times, you should eventually create the directory in the original AG
that's full of the rename(2) file extents, and performance will suffer
again.

One of the devs probably has some tricks/tools up his sleeve to force
those extents to their new parent AG.  You might be able to run a
nightly script to do this housekeeping.  Or you could always put the
.tmp directory on a different filesystem on a scratch disk.
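
If you go the re-creation route, the nightly housekeeping could be as
small as the sketch below.  The path is made up, and this is only safe
to run while no writer has an open file under the staging directory:

```shell
#!/bin/sh
# Hypothetical nightly housekeeping: re-create the staging directory so
# it lands in a fresh allocation group.  STAGING is a placeholder path;
# only run this while nothing is writing into it.
set -e
STAGING=${STAGING:-$(mktemp -d)/store/.tmp}
mkdir -p "$STAGING"
touch "$STAGING/leftover"   # simulate stale content for the demo
rm -rf "$STAGING"
mkdir -p "$STAGING"
```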

This problem could also be a free space fragmentation issue, but given
that recreating the .tmp directory fixes it, I doubt free space frag is
the problem.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Performance degradation over time
@ 2012-10-10  8:51 Marcin Deranek
  2012-10-10 13:17 ` Stan Hoeppner
  2012-10-10 23:37 ` Dave Chinner
  0 siblings, 2 replies; 15+ messages in thread
From: Marcin Deranek @ 2012-10-10  8:51 UTC (permalink / raw)
  To: xfs

Hi,

We are running an XFS filesystem on one of our machines which is a big
store (~3TB) of different data files (mostly images). Quite recently we
experienced some performance problems - the machine wasn't able to keep
up with updates. After some investigation it turned out that open()
syscalls (open for writing) were taking significantly more time than
they should, e.g. 15-20ms vs 100-150us.
Some more info about our workload, as I think it's important here:
our XFS filesystem is used exclusively as a data store, so we only
read and write our data (we mostly write). When a new update comes in,
it's written to a temporary file, e.g.

/mountpoint/some/path/.tmp/file

When the file is completely stored we move it to its final location, e.g.

/mountpoint/some/path/different/subdir/newname

That means that we create lots of files in the /mountpoint/some/path/.tmp
directory, but the directory stays empty as the files are moved (via
the rename() syscall) shortly after creation to a different directory
on the same filesystem.
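
In shell terms the update path looks roughly like this (the store
location is a stand-in for our real layout):

```shell
#!/bin/sh
# Rough sketch of the staged-write pattern; STORE stands in for
# /mountpoint/some/path and is only a demo location.
set -e
STORE=${STORE:-$(mktemp -d)}
mkdir -p "$STORE/.tmp" "$STORE/different/subdir"

# The new update lands in the staging directory first...
tmpfile=$(mktemp "$STORE/.tmp/update.XXXXXX")
printf 'image data' > "$tmpfile"

# ...then rename() moves only the directory entry -- atomic within one
# filesystem, so readers never see a half-written file.
mv "$tmpfile" "$STORE/different/subdir/newname"
```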
The workaround I have found so far is to remove that directory
(/mountpoint/some/path/.tmp in our case) along with its content and
re-create it. After this operation the open() syscall time goes down to
100-150us again.
Is this a known problem?
Information regarding our system:
CentOS 5.8 / kernel 2.6.18-308.el5 / kmod-xfs-0.4-2
Let me know if you need to know anything more.
Cheers,

Marcin



end of thread, other threads:[~2016-03-17 11:22 UTC | newest]

Thread overview: 15+ messages
2016-03-08 14:09 Performance degradation over time Matthew Keeler
2016-03-08 16:32 ` Michal Kazior
2016-03-08 16:38   ` Matthew Keeler
2016-03-09  2:01     ` Matthew Keeler
2016-03-09  6:00       ` Michal Kazior
2016-03-09 12:46         ` Matthew Keeler
2016-03-17 11:21           ` Michal Kazior
  -- strict thread matches above, loose matches on Subject: below --
2012-10-10  8:51 Marcin Deranek
2012-10-10 13:17 ` Stan Hoeppner
2012-10-10 14:31   ` Eric Sandeen
2012-10-11  8:33     ` Marcin Deranek
2012-10-11  9:15       ` Marcin Deranek
2012-10-14 19:31         ` Peter Grandi
2012-10-10 23:37 ` Dave Chinner
2012-10-11  8:42   ` Marcin Deranek
