Performance Tuning with Pogoplug v4 on Arch Linux ARM

 

 

Updated: August 4, 2015
Please stop by every now and then, even if you already applied some changes, as I will continue to update this page.



Specs:

Architecture: ARMv5te
Processor: Marvell Kirkwood 800MHz 88F6192
RAM: 128MiB DDR2 (@400MHz)
Storage: 128MiB NAND Flash
Network: Gigabit Ethernet
Ports: (1) SATA II (Not Mobile), (2) USB 3.0 (Not Mobile), (1) USB 2.0, (1) SD Card
Other Features: Cryptographic Engine, XOR Engine


 

Introduction

I have always enjoyed tinkering with technology in my spare time.  So recently I decided to purchase three Pogoplug v4 devices because they looked like a good project.  One thing that attracted me to these devices is that they offer Gigabit Ethernet and reasonable specs (even USB 3.0 on one), are fully supported by the Arch Linux ARM community, and cost only $6 for the mobile Pogoplug v4 and $18 for the better version with SATA II and USB 3.0 on Amazon.  My recommendation is to go with the better version for a home server and the cheaper “mobile” version for a backup device, where storage speed is less of a concern.

As a lead systems engineer specializing in performance tuning and deep-dive problem analysis, I thought that it would be fun to address some of the common issues that I have read about with these devices.  Generally they seem to revolve around network i/o, disk i/o, and file-serving performance.  When you consider that these devices have limited memory and relatively weak CPUs, it takes some skill to tune them for optimal performance with such limited resources.

Installation of Arch Linux ARM on this device is easy and I won’t go into it here.  It is well documented and straightforward.  One thing to note is that because this is a fully supported device for Arch Linux ARM, you can update the kernel without any fuss, which makes this a great low-powered Linux platform.

As mentioned earlier, many posts exist on the web about how these devices don’t perform well.  Often, though, the people posting these comments lack the technical knowledge to understand why.  Additionally, few people out there know how to tune properly, so there are a lot of tuning recommendations on the Internet that are just plain wrong.  These devices offer very well balanced performance for a fantastic value.

The first rule of performance tuning is to test.  While there are some tunings that simply cannot be easily tested, most of the tuning changes I have made are the result of extensive testing.  Some changes are not easily understood or explained, but I will make an effort to explain at a high level what is being tuned and why.

So… without further ado, let’s dive into this.

Issue #1:  Network Performance

First off, most reviews and comments seem to indicate poor network performance.  Default tunings, along with certain NIC features being disabled, have a lot to do with this poor performance.  Likewise, some people mistakenly point to poor network performance when other, non-network bottlenecks are responsible.

By default, Arch Linux ARM had an incorrectly tuned network stack.  We will correct this and optimize it for Gigabit Ethernet on a local area network.

DEFAULT TUNINGS

net.core.rmem_default = 163840
net.core.rmem_max = 163840
net.core.wmem_default = 163840
net.core.wmem_max = 163840

net.ipv4.tcp_rmem = 4096        87380   969536
net.ipv4.tcp_wmem = 4096        16384   969536

One may ask… what’s wrong with this?  Well, this requires an understanding of how Linux handles socket memory.  In a nutshell, the tcp memory maximums can be no larger than the core maximums, so the first thing we need to do is fix that.  Likewise, we should recognize that we’re tuning for optimal Gigabit Ethernet performance but have limited memory available to us.  The only thing I am going to adjust here is the maximum values, primarily because of the limited memory.

In my case, I am going to be using this (non-mobile) Pogoplug v4 for a local file server, to provide both CIFS and NFS.  Therefore, I am not planning to have lots of connections (sockets) as one might if it were a web server open to the outside.  This is just an example of a consideration when tuning.

Add the following (under NEW TUNINGS) to /etc/sysctl.d/sysctl.conf:

NEW TUNINGS

net.core.rmem_max = 2801664
net.core.wmem_max = 2097152

net.ipv4.tcp_rmem = 4096        87380   2801664
net.ipv4.tcp_wmem = 4096        16384   2097152

Above we have re-tuned the maximum memory values for both tcp and core to roughly 2MiB.  By default, tcp_rmem dedicates 25% of its memory to the application buffer, so we take (2097152 / 0.75) and round up to a page-aligned 2801664 (memory pages are 4096 bytes).
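
Once these are saved to /etc/sysctl.d/sysctl.conf, you can load and verify them without rebooting.  A quick check (the sysctl utility from procps-ng is assumed, which should already be installed):

# Reload every drop-in under /etc/sysctl.d, then read the values back
sysctl --system
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem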

Additionally, we should make some other changes.  The following may be added to /etc/sysctl.d/sysctl.conf (optmem is not strictly necessary, but you may include it):

NEW TUNINGS

net.ipv4.tcp_timestamps = 0

net.core.optmem_max = 65535

net.core.netdev_max_backlog = 5000

I’m not going to go into much detail here, as you can read the documentation if you wish.  Turning off TCP timestamps may buy us back a little CPU, and CPU is a precious commodity on this device.  On newer, high-performance servers, timestamps should be left on, as you will see no benefit from turning them off.

By default, I see that generic segmentation offload is turned off.  Generic segmentation offload (GSO) is an LSO (large segmentation offload) feature that passes a very large packet (super packet) through the network stack and delays the CPU-hogging segmentation until late in the stack, when the packet is finally broken into multiple pieces to be sent across the network.  Fewer, larger packets carry less CPU overhead than many smaller ones, and in my testing enabling GSO produced a significant performance improvement on this device.  GSO works even in the absence of any LSO features in the network driver.

NOTE:  In recent kernel releases, as noted by one reader, the NIC driver was modified to enable software TSO.  This should be turned off, because it will cause data corruption.  GSO works perfectly.
UPDATED NOTE (8/4/2015): Software TSO data corruption bug may be fixed, but I have not tested this on the latest kernels yet.  See updates to the driver in recent kernels here: mv643xx_eth.  It is also not clear that this offers any improvements over GSO, so leaving it off is fine.

DEFAULT TUNINGS

generic-segmentation-offload: off
tcp-segmentation-offload: on

Output from ifconfig also shows we could increase the transmit queue.

DEFAULT TUNINGS

txqueuelen 1000

Now, create a file called /etc/udev/rules.d/50-eth.rules and add:

NEW TUNINGS

ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth0", RUN+="/usr/bin/ethtool -K %k gso on"
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth0", RUN+="/usr/bin/ethtool -K %k tso off"
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth0", RUN+="/usr/bin/ifconfig eth0 txqueuelen 5000"

If ethtool is not installed, install it:

pacman -Sy ethtool

These entries tell udev to turn generic segmentation offload on, turn TCP segmentation offload off, and set the transmit queue length to 5000 when the device is created.
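
After a reboot, it is worth confirming that the rules actually applied.  A quick check, assuming the interface is eth0:

ethtool -k eth0 | grep -E 'generic-segmentation-offload|tcp-segmentation-offload'
cat /sys/class/net/eth0/tx_queue_len   # should now read 5000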

Another thought on improving performance is to enable jumbo frames.  Jumbo frames (packets with an MTU greater than 1500) come with other considerations and may not be supported by all networking equipment, so you should probably not do this unless you know what you are doing.  I do not have equipment at home to test this, so I cannot say what advantage jumbo frames would offer in throughput or CPU utilization on this device.  It should be noted that using jumbo frames may disable checksum offloading, thereby increasing CPU utilization.

Currently I have filed a bug with kernel.org on another feature that I would like to tune but cannot, due to a bug in v1.4 of the mv643xx_eth driver used by the Ethernet interface.  At the expense of a small amount of extra memory, I would like to be able to tune the network ring buffer slots to a higher value.  Ring buffers are where the data sits before entering into the socket on the way in, and where the data sits after being sent from the transmit socket on the way out.

DEFAULT TUNINGS

Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             128
RX Mini:        0
RX Jumbo:       0
TX:             512

NOTE: In recent kernels, TX descriptors were increased to 512 by default from 256.  Oddly, this is exactly what I was trying to set them to in the bug report.

When I attempt to modify these, which I should be able to do, the system becomes increasingly unresponsive until it drops network connections.  As a consequence, you should not attempt to change them.  Interestingly, I do see the additional memory allocated for the ring buffers, but the driver is clearly doing something bad when dynamically allocating them.
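
For reference, the ring parameters shown above come from ethtool’s query option; the set option is what currently triggers the problem, so treat the rings as read-only for now:

ethtool -g eth0                  # query current and maximum ring parameters (safe)
# ethtool -G eth0 rx 512 tx 512  # setting them is what destabilizes the current driver -- avoid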

The problem with some of these tunings is that they can be difficult to measure on systems like this, which have limited CPU and memory resources.  For example, using tools like iperf to measure network throughput doesn’t work very well, because these tools end up bottlenecking on CPU speed and memory bandwidth constraints related to the type of operations and calls they make.  It is possible that better compiler optimizations may help with this, but ultimately it may come down to how optimized the code is for resource-limited systems.  Modern, high-powered systems just don’t need to worry as much about saving a call here or there, but it does make a difference on this device.

Issue #2: Storage Performance

There are significant improvements that can be realized by properly optimizing the storage i/o performance.  We need to optimize performance while minimizing impact on CPU and memory in order to achieve the best result.

Let’s first look at the types of storage devices/media we will be using.  With the Pogoplug Mobile, you are limited to a single USB 2.0 interface, though a powered USB 2.0 hub could be used to extend this.  The higher-end Pogoplug v4, with USB 3.0 and SATA II, offers faster options that are worthwhile when using it as a home server.

On the higher-end model, which is what I am focusing on for this segment on storage performance, you can only boot from USB 2.0 and SATA due to limitations of the boot loader.  Therefore, I am going to use a USB 2.0/3.0 flash drive plugged into the USB 2.0 port, but only for the operating system itself. USB 3.0 ports will be used for a dual USB 3.0 (Icy Dock) external enclosure, which is where I will store my data.

(Note: There is an alternate boot loader option that does allow booting to USB 3.0, but it isn’t supported by Arch.)

My recommendation is to use a flash drive larger than 2GiB, as it will be challenging to install any additional packages or do updates without the extra space.  While performance is not of great importance for the OS drive, you may want to consider a more modern flash drive with reasonable read/write performance, which will help with package updates and when executing programs.  If write performance is of no concern, I would highly recommend going with a SanDisk Cruzer Fit for the higher-end Pogoplug v4, as they are tiny and allow you to place the cover back on the top.

With any flash media, it is important to optimize/align the layout of partitions and file systems for optimal performance.  Much of this has to do with the erase block and sector sizes.  You can find information out there on how to optimize the geometry and file system of the flash drive for performance, so I’m not going to get into this here.  You should consider doing this for your operating system drive when installing Arch Linux Arm.

It is also important to optimize the layout of partitions and file systems on spinning disks.  Again, there is plenty of information available on the Internet about doing this.  There are additional considerations if you have an Advanced Format drive, which typically has 4kiB physical sectors but may report them as 512 bytes or 4kiB.   RAID stripes also require additional tuning, which is discussed further in this article.

In order to simplify things, we’re going to use parted to create a new partition on our external HDD; it will automatically optimize the partition alignment for us.  Note that you will destroy any data on your external disk when you do this, so please know what you’re doing.  You will see that I have chosen XFS as my file system (discussed later), which is why I am labeling the partition as XFS.

parted -a opt /dev/sdX # Where X is the device letter of your external disk

# You are now at the (parted) command line

mklabel gpt

mkpart primary xfs 0% 100%

quit

Once you exit, the new GPT partition table will be written to the disk.  We will then need to format the partition with a new file system.  XFS was chosen after extensive testing with dd, bonnie++, and iozone.

mkfs.xfs /dev/sdX1 # Where X is the device letter of your external disk (format the partition created above)

XFS was chosen as the best file system for this device because of its low CPU utilization and impressive performance in a variety of circumstances.  Ext3/4 can offer reasonable performance and are certainly reliable, but my focus is on squeezing every last bit of i/o out of the disk as possible.

Note that if you use XFS, you should ensure the file system is unmounted cleanly on reboot, as it is possible the XFS repair utility may require more memory than is available on the Pogoplug to repair it.  In such a case, you may need to unplug the drive and repair the XFS file system on another computer.  There have been a couple of times when the system wouldn’t boot because an XFS file system required repair; just unplugging the drive will allow it to boot normally until you can repair the file system.
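
If you do end up needing a repair, the usual approach is to run xfs_repair against the unmounted partition on a machine with more memory.  A sketch, assuming the data partition shows up there as /dev/sdb1 (a placeholder name):

umount /dev/sdb1      # make sure it is not mounted
xfs_repair /dev/sdb1
# xfs_repair also accepts -m <MiB> to cap its memory usage, which may or may not
# be enough to let it complete on the Pogoplug itself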

Benchmark comparison tests were run against Ext3, Ext4, JFS and XFS, using the above-mentioned tools.  Running any of them without an accurate understanding of the factors involved in designing a test may produce inaccurate results.  Additionally, understanding the results requires some experience and knowledge that is beyond the scope of this article.  You may prefer another file system or find that another works better for your particular purpose.  Each file system has benefits and drawbacks, but I happen to think that the benefits of XFS, as experienced in testing, outweigh any drawbacks.

Modify your /etc/fstab as appropriate, using the following as a guide:

UUID=xxxxxxxx /u01/external/   xfs     defaults,nofail,noatime,nodev       0       1

Some other optimizations, primarily for spinning disks, should be made.  These include the block i/o scheduler, fs read-ahead, and acoustic performance setting.  You may want to consider adjusting the power settings for the drive, which may offer some improvements.

bfq was the default block i/o scheduler, but I chose deadline, which offers significant performance improvements over cfq and bfq.  deadline is also suitable for SSD and flash devices, although you may consider noop for those, as it acts as a simple FIFO with no reordering.  Fair-queuing schedulers are typically good for systems (desktops or general-purpose servers) with many processes competing for i/o, but server workloads may find that deadline or noop (for SSD/flash) outperforms them while being lighter on resources.  As a general rule, noop should not be used on spinning disks.

Create a new udev rule as /etc/udev/rules.d/60-sd.rules:

# set deadline scheduler
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="deadline"
# set read ahead sectors
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/usr/bin/hdparm -a 1024 /dev/%k"
# set maximum performance
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/usr/bin/hdparm -M 254 /dev/%k"

If hdparm is not installed, install it:

pacman -Sy hdparm

Sometimes flash drives report as rotational devices, so you may need to put additional logic in the udev rule to prevent hdparm from being used to change settings on them.  After these changes, you should reboot your device and ensure the changes took effect.
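
A quick way to confirm everything took effect after the reboot, assuming your spinning disk shows up as /dev/sda:

cat /sys/block/sda/queue/scheduler   # the active scheduler is shown in [brackets]
hdparm -a /dev/sda                   # current read-ahead (should report 1024)
hdparm -M /dev/sda                   # current acoustic management (should report 254)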

So… what kind of performance am I now getting versus before?  Darn near triple, in some cases.  My 7200RPM 2TiB Toshiba drive, using DIRECT I/O, is able to read at nearly 100MiB/s and write at nearly 85MiB/s!  My older 7200RPM 500GiB Samsung drive got about 85MiB/s read and 65MiB/s write.

You may ask: Why Direct I/O?  Well, this is because operations that use buffered i/o are VERY expensive on this device and will result in lower throughput and lower performance; in other words, a bottleneck.  Similar to why iperf performed poorly when driving network throughput, the expense of these operations slowed block i/o due to CPU speed and memory bandwidth limitations.  Direct I/O is not the same as synchronous I/O, however.  In a nutshell, Direct I/O uses DMA to read and write directly from storage, bypassing the operating system caches.  Keep this all in mind for later…

Below is a simple test you can perform yourself.  Delete the test file after you are finished.  It writes a 1024MiB file, and then reads it.  The large file size will also help eliminate the potential impact of file system caching on this system, since it only has 128MiB of RAM.

Direct I/O

dd if=/dev/zero of=./bigfile bs=1M count=1024 oflag=direct # Write to "bigfile" in this directory
dd of=/dev/null if=./bigfile bs=1M count=1024 iflag=direct # Read from "bigfile" in this directory

Buffered Async I/O

dd if=/dev/zero of=./bigfile bs=1M count=1024 # Write to "bigfile" in this directory
dd of=/dev/null if=./bigfile bs=1M count=1024 # Read from "bigfile" in this directory

It is not really possible to bypass the file system cache for everything though, so real-world performance is going to depend a lot on whether you have control over how a program performs its reads and writes against the file system.  Some file systems offer a sync option, but this should be avoided, as it is unlikely to provide good throughput.

The Marvell Kirkwood SoC also has a hardware XOR engine to accelerate RAID 5.  I have not done anything with this because I have no need for it.  It may also accelerate iSCSI CRC32 and possibly some memory operations; in the case of the memory operations, I am going to research this more later.

Many people want to consider using some form of RAID (Redundant Array of Inexpensive Disks) to protect their data and/or offer some performance improvement.  Several considerations are necessary in order to determine how this should be best configured.  One of the most important considerations is trying to make sure your Pogoplug v4 isn’t bottlenecking in terms of bandwidth (USB, for example) or CPU.

For example, if you use software RAID 1 (mirror), write operations will double on the USB bus, as each write will need to be written independently to each disk.  Software RAID 5, which is a stripe with distributed parity across all disks (3 or more), requires CPU-intensive calculations when writing data, even with hardware XOR offload.  RAID 0, which is a non-redundant, simple data stripe across all disks (2 or more), may not perform any better than a single-disk configuration if disk performance isn’t the limiting factor.

If you are looking for optimal performance with a simple mirror, my recommendation would be to purchase an external enclosure (if using USB) that does hardware mirroring (RAID 1).  This is because hardware mirroring eliminates the need to create two or more i/o operations for each single write, which will have a positive impact whether using USB 2.0 or 3.0.  If the enclosure controller is intelligent, it should be able to spread read operations across both disks to help improve read performance.

In the case of SATA, such as with the Seagate GoFlex, you’d probably be better off sticking with software RAID 0 or 1, unless your device has USB 3.0 as an option.  Since this article focuses on the Pogoplug v4, that is what I will spend the most time on.

RAID stripes (regardless of RAID 5 or other striped configurations) require special considerations with regard to chunk size and file system.  Large chunk sizes benefit large writes but hurt small writes.  On the other hand, decreasing chunk size can cause files to be broken into more pieces and spread across more spindles, potentially increasing transfer throughput but at the expense of positioning performance.  It is hard to know on the Pogoplug whether the increase in operations related to smaller chunk size would benefit or hurt. That would require testing.

You can figure out the ideal chunk size by looking at the typical i/o size.  If you have an existing file system, you can run ‘iostat -x’ to get your average request size (which may be reported in sectors; check first) since the system was booted.  Then divide that typical i/o size (in kiB) by the number of disks in the RAID stripe, minus parity disks; this would be 3 if you have a 4-disk RAID 5 stripe.  So if you had a typical i/o size of 96kiB, you’d divide 96kiB by 3, which gives 32kiB.  Always round up to the nearest 4kiB (or page size).  This is what you need to properly create the RAID 5 array.
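
As a concrete sketch (the device names are placeholders, not my setup), creating a four-disk RAID 5 array with the 32kiB chunk worked out above might look like this:

# 4-disk RAID 5 with a 32kiB chunk; sdb1 through sde1 are hypothetical partitions
mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=32 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1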

How to optimally lay out the file system on your RAID stripe depends somewhat on the file system and its creation options.  Normally you figure the stride, which is the chunk size (32kiB) divided by the file system block size (usually 4kiB); in this case, that makes a stride of 8.  This would be used when creating your file system on the newly created RAID group.
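
For an ext4 file system, the stride is passed at creation time along with the stripe width (stride multiplied by the number of data disks); XFS expresses the same geometry as a stripe unit and stripe width.  A sketch, again using the hypothetical /dev/md0 from above:

# ext4: stride = 32kiB chunk / 4kiB block = 8; stripe-width = 8 * 3 data disks = 24
mkfs.ext4 -E stride=8,stripe-width=24 /dev/md0
# XFS equivalent: stripe unit of 32k across 3 data disks
mkfs.xfs -d su=32k,sw=3 /dev/md0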

Storage is only useful when you have something to use it for.  This will be the focus of our next section.

Issue #3: SAMBA Performance

This is probably what most people will be using the device for, whether it is for backups or central storage of files on your network.  Many of the SAMBA tuning guides out there will result in poor performance on the Pogoplug v4, because they are not written for devices with limited resources; buffered and async i/o are good examples.  On many more powerful systems, those settings can enhance performance, but they may have the opposite effect here.

Some of the tunings below may be defaults, but I have placed them there so that you know if they conflict with something you placed in the SAMBA configuration file.

Modify your /etc/samba/smb.conf file to include the following:

strict allocate = Yes
read raw = Yes
write raw = Yes
strict locking = No
socket options = TCP_NODELAY IPTOS_LOWDELAY SO_RCVBUF=131072 SO_SNDBUF=131072
min receivefile size = 0
use sendfile = true
aio read size = 0
aio write size = 0
oplocks = yes
max xmit = 65535
max connections = 8
deadtime = 15

You may feel free to experiment with these settings.  Generally speaking, when you ask the device to use CPU and memory resources for better throughput, as is the case with async i/o, it will result in worse performance, not better.  The network socket buffers of 128kiB may be lowered if you want; that could save a small amount of memory without any negative performance impact, but you will need to test.  If you experience disconnects, check whether removing deadtime or raising max connections helps.
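
If you want to measure throughput from a Linux client rather than Windows, one simple approach is to mount the share with CIFS and reuse the dd tests from earlier.  A sketch, with a placeholder share name and user (cifs-utils is assumed to be installed on the client):

mount -t cifs //pogoplug/share /mnt/test -o username=youruser
dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=1024   # write test
dd if=/mnt/test/bigfile of=/dev/null bs=1M               # read test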

End result?  26MiB/s write performance and 42MiB/s read performance, over gigabit LAN, with a Windows 7 client, using the USB 3.0 drive as storage.  At that point, it is pretty clear that the CPU and memory are the bottleneck, and it is why the Pogoplug v2, with its 1200MHz CPU, could probably achieve faster throughput if it had USB 3.0… but it doesn’t, and so the performance on the Pogoplug v4, with its 800MHz CPU, is probably better.  There is also a device called the Seagate GoFlex that seems to have both SATA and a 1200MHz processor, so you would probably get proportionally better throughput on that (>60MiB/s).

Looking at those numbers, that is between 200 and 350 megabits per second of throughput with SAMBA.  Keep in mind there is plenty of protocol overhead, so actual network throughput is more than this.  Pretty impressive, given I spent $18 on the device, minus the storage itself.  In the old days, this amount of throughput would have blown the doors off of some pretty expensive enterprise-class file servers.  But keep in mind, the reason for NAS (Network Attached Storage)  in the home is either to have central storage or to provide backups.  So even if these numbers don’t impress you, they should be more than adequate for your home network.  Additionally, you’ll be unlikely to saturate SAMBA on this device using a wireless client, as those often have real world performance maxing out at about 12MiB/s under good conditions.

Issue #4: (En/De)cryption Throughput

Encryption is expensive for the CPU, and if it weren’t for the hardware encryption acceleration on the Marvell Kirkwood SoC, it would be something we would want to avoid.  Fortunately, the folks at Arch Linux ARM have created a package that enables this feature on the Pogoplug and allows anything that uses OpenSSL to benefit from hardware acceleration.

According to the instructions here, we can follow a few simple steps to enable it.  This doesn’t automatically mean that everything using a supported hardware encryption algorithm will see a benefit; it depends on whether the software uses OpenSSL for this, or whether it can be compiled or configured to use cryptodev directly.  In any case, I’ll provide the steps below, but use the power of Google and look for “cryptodev” in reference to other things, such as SSH, OpenVPN, etc.

NOTE: One reader commented that their system did not boot after installing openssl-cryptodev.  What likely happened is that the copy/paste from the below inserted unicode characters into the udev rule, causing udev to barf.  To ensure that doesn’t happen, please type that line in manually.

# Make sure your package repos are in sync and the system up-to-date.
pacman -Syyu

# Replace the OpenSSL package with the cryptodev-enabled one
pacman -S openssl-cryptodev

# Create udev rule to open up permissions to the cryptodev device
echo 'KERNEL=="crypto", MODE="0666"' > /etc/udev/rules.d/99-cryptodev.rules

# Load the cryptodev kernel module now
modprobe cryptodev

# Load the cryptodev kernel module on boot
echo "cryptodev" > /etc/modules-load.d/cryptodev.conf

Using the architecture document linked at the beginning of this article, you can see which algorithms the Kirkwood supports, and you can experiment with them using the openssl command with the speed argument and the -evp flag, specifying the cipher, to see the acceleration in action.  It will accelerate AES, DES, 3DES, SHA1, and MD5.
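
For example, to compare the EVP (cryptodev-accelerated) path against OpenSSL’s built-in software implementation for AES:

openssl speed -evp aes-128-cbc   # routed through the cryptodev engine when available
openssl speed aes-128-cbc        # software-only baseline for comparison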

Hardware encryption acceleration will help a lot if you can use it, such as with SSL web serving and OpenSSH.  Anything to offload tasks from the CPU or make them more efficient should be considered.  If you’re interested in disk encryption, you may use the hardware acceleration feature for it.  There is currently a problem with mv_cesa in that it doesn’t use DMA and therefore is only marginally faster than software only, but hopefully this will be fixed by Marvell sometime.  Not holding my breath there, though.

Once the openssl-cryptodev package is installed and the cryptodev module loaded with correct permissions on the device, OpenSSH will automatically begin using this when it is restarted.  Just make sure your Ciphers and MACs entries are configured to use a supported algorithm in sshd_config.  You may use lsof to check if sshd has the cryptodev device open.
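
As an illustration (these are example cipher and MAC choices from the accelerated set, not a recommendation), the relevant sshd_config lines and the lsof check might look like this:

# /etc/ssh/sshd_config -- restrict to algorithms the Kirkwood engine accelerates
Ciphers aes128-cbc,aes256-cbc
MACs hmac-sha1

# after restarting sshd, confirm it has the cryptodev device open
lsof /dev/crypto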

Other Optimizations

It may be worthwhile to create a small swap partition on either your fast operating-system flash drive or on the external USB 3.0 drive.  While I have found that it is quite possible to do a lot within the limitations of 128MiB of RAM, the OS will undoubtedly have some inactive memory pages that are better swapped to disk than taking up valuable RAM.

If you do decide to add some swap, consider making the following tuning changes.  One will reduce the tendency to swap unless the system is under considerable memory pressure, while the other will more aggressively reclaim the directory entry and inode caches to try to keep more memory available.

Add to /etc/sysctl.d/sysctl.conf:

vm.swappiness = 10 # Default is 60
vm.vfs_cache_pressure = 1000 # Default is 100
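
If you do add swap, the setup itself is the standard procedure.  A sketch, assuming you carved out a small partition that appears as /dev/sdb2 (a placeholder name):

mkswap /dev/sdb2
swapon /dev/sdb2
# make it persistent with an /etc/fstab entry along these lines:
# /dev/sdb2  none  swap  defaults  0  0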

Additionally, you can conserve memory on the system by not running ntpd as a daemon.  Unless you have a lot running on the system, you are probably fine leaving it enabled, but it does use about 4MiB of RAM on a system with 128MiB.
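
If you do drop the daemon, a periodic one-shot sync is a reasonable substitute.  A sketch (the schedule is arbitrary, and assumes a cron daemon is installed):

systemctl stop ntpd
systemctl disable ntpd
# then sync a few times a day from root's crontab instead, e.g.:
# 0 */6 * * * /usr/bin/ntpd -q -g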

Overall Architecture

The Marvell Kirkwood is really an excellent SoC for NAS applications, which is part of why it is used in so many NAS appliances in the home consumer market.  In my opinion, this makes a more ideal home server than a Raspberry Pi.  Raspberry Pis are also more expensive and do not offer the NAS-oriented feature set seen on the Pogoplug v4.

If you look at the architecture document, linked at the beginning of this article, you will notice that there is no USB 3.0.  When designing the Pogoplug v4, they utilized the PCIe interface for a USB 3.0 chipset.  The PCIe interface sits on the same high-speed bus as the USB 2.0, Gigabit Ethernet, and SATA II ports.  Interestingly enough, the internal NAND flash, which we don’t use, sits on a low-speed bus, as does the flash card reader.  Because the SD card reader is on the slow bus, you should probably avoid using it altogether.

Another interesting thing is that the Pogoplug v4 board has 400MHz DDR2 due to a SoC limitation.  If it supported faster memory, that would likely have helped somewhat with some of the memory operations that seemed to hinder performance.

While the Pogoplug v4 has only 128MiB of RAM, I have not found this to be a problem.  I am able to run OpenSSH with multiple sessions, vsftpd, SAMBA with multiple sessions, and Deluge and Deluge-Web with 200 active connections and multiple torrents, all without running out of memory.  To be sure, it does run slim on memory, but it doesn’t run out.

Final Thoughts

The Pogoplug v4 makes a fantastic home server, especially considering the low cost, official Arch Linux ARM support, and great community.  Even though I had heard of a Pogoplug before, I never looked into getting one until I saw how cheap they were and that they had official support for Arch Linux ARM.  While I do enjoy cross-compiling and hacking together embedded systems from scratch, it is a painfully slow and laborious effort, which is why this device is so great.

If there is positive feedback to this article, I will consider posting more about the Pogoplug v4 in the future.  It certainly is a fun project and I’m sure you’ll enjoy it if you get one.

47 thoughts on “Performance Tuning with Pogoplug v4 on Arch Linux ARM”

  1. Great stuff!
    Thanks a lot for that 🙂
    I have a question: your suggested tweaks have improved the performance of my PP4 a lot when I use the disk which is connected via USB3. However I also have a RAID5 connected to the SATA port (4x2TB) and the effect of this seems to be much smaller, especially when writing. Any suggestion for that?
    And, as I use RAID5, I’ll be very interested in your suggestions about the XOR engine.

    Thanks again
    Guido


    • Hi, Guido —

      Some thoughts…

      First though, I could potentially set up a “fake” RAID 5 and look into whether any additional work is necessary to use the XOR offload (or if it already does), but I’m guessing your issue may just be that it is RAID 5. Even with the XOR offload, my guess is that performance may lag somewhat when compared with other RAID layouts. RAID 5 is typically used for getting the maximum amount of usable space in a fault-tolerant configuration rather than for performance. However, typically RAID 5 offers reasonable read performance and usually only suffers greatly in write performance.

      Normally I might recommend using RAID 10 on a server to avoid the overhead associated with parity calculations with RAID5, even with XOR offload, but in this case, it would likely cause more I/O in a striped mirror configuration. That might be a drawback here, but I’m not sure without testing.

      RAID stripes (regardless of RAID 5 or other striped configurations) require special considerations with regard to chunk size and file system. Large chunk sizes benefit large writes but hurt small writes. On the other hand, decreasing chunk size can cause files to be broken into more pieces and spread across more spindles, potentially increasing transfer throughput but at the expense of positioning performance. Hard to know on the Pogoplug whether the increase in operations related to smaller chunk size would benefit or hurt. That would require testing.

      As a general rule, you can figure out the ideal chunk size by looking at the typical i/o size. If you have an existing file system, you run ‘iostat -x’ to get your average request size (may be in sectors — check first) since the system was booted. Then divide that typical i/o size (in kiB) by the number of disks in the RAID stripe, minus parity disks. In your case, this would be 3, if you have a 4 disk RAID 5 stripe. So if you had a typical i/o size of 96kiB, you’d want to divide 96kB by 3, which is 32kiB. Always round up to the nearest 4kB (or page size). This is what you need to properly create the RAID 5 array.

      In order to optimally lay out the file system on your RAID stripe, this will depend somewhat on the file system and creation options. Normally you figure stride, which is the chunk size (32kiB) divided by the file system block size (usually 4kiB). In this case, it makes a stride of 8. This would be used to create your file system on the newly create RAID group.

      Hope this helps!


      • Thanks again for all the details, I am learning a lot!
        I did already set up the RAID5 according to your recommendation on chunk size. And, by the way, I am generally happy with the speed I get.
        The reason for my question was that with your tweaks I get an even better reading speed (almost double) but writing doesn’t change much. Which indeed is due, as you explained well, to the RAID5 limitations. I was hoping that playing with XOR could give something.
        But, as I said, I am anyway more than happy with the cost/performance results: with $20 for a PP4 and $80 for a 4-bay enclosure I have a perfect home NAS!

        Thanks a lot once more for your work!


      • Hi, Guido —

        If you do a “dmesg | grep -i xor”, you should see the xor driver loaded on boot. If so, I suspect you are already using XOR offload or write performance might be worse than it is. That said, in my experience, even fast hardware RAID controllers don’t totally compensate for the performance penalty associated with RAID5.

        If it weren’t a fault-tolerant array, I might also suggest trying to offload your file system journal onto a fast, alternate USB 3.0 drive. This would help eliminate the performance penalty from having a journal on the same disk(s). The only caveat here is that you are on a fault-tolerant array and putting a file system journal on a fast USB 3.0 drive would result in eliminating the file system integrity should the flash drive fail. But your committed changes would technically still be safe, even if that happened. This would be try at your own risk type stuff and it is hard to know if the CPU limitations of the Pogoplug v4 would come into play with that.

        Edit: I had another thought. I assume you’re mounting a RAID device, which means you should probably check the i/o scheduler for that to see that you’re using deadline. The udev rule I posted in the article was going to adjust the scheduler for the individual disks (sd[a-z]), but you may still need to do this for the RAID block device itself. Using another scheduler could certainly affect the read/write performance you’re getting, depending on what the bottlenecks are.

        Good luck!


      • I am not brave enough to play with the file system journal, but thanks for the tip 😉
        I have also tried to add
        ACTION=="add|change", KERNEL=="md[0-9]", ATTR{queue/scheduler}="deadline"
        to /etc/udev/rules.d/60-sd.rules but it doesn't change performance (measured, as you suggest, with dd).
        I have been doing some more accurate measurements and I get 28-30 MB/s writing and 85-90 MB/s reading. If I compare that with the performance I get from a comparable disk connected via USB3 (40-42 MB/s writing and 65-70 MB/s reading) I think they make sense, because, using a RAID5, I am penalised a bit in writing, due to the RAID overhead, but I benefit from parallel reading. And, as I read much more often than I write (I use it to share documents across all my home devices) I think the deal is fair! 🙂

        Thanks again, I improved significantly the performance of the system and while doing it I have learned a lot!


      • Hi, Guido — You’re welcome. I’m glad you found the article useful to you! My guess on your RAID configuration is what you already figured, that RAID 5 has a fair amount of overhead which is probably responsible for the write performance being low. Still, it isn’t as bad as I thought you were going to say, so I am guessing the XOR offload is working.


  2. would you be able to provide some insight into nfs tuning? its really hard to find relevant info for nfs tuning specific to this device.. i have the very similar goflex device running a few services and working well. thanks again for the info you already presented here!


      • my goflex home is my nfs server.. a total of 4-5 clients, never more than 2-3 at the same time connect to it.. it also runs nzbget, couchpotato and sickbeard downloading onto a sata connected harddrive.. which, thanks to you, is giving me ~100+ MBps read and ~65 MBps write..thanks for all your inputs.


  3. Excellent post! I was wondering how you got XFS to work. When I try to mount an XFS partition it says that it is unknown. It appears that XFS support wasn’t compiled. Did you recompile the kernel or how did you get XFS to work?


    • Couple thoughts. Yes, it should be part of the latest kernel and recent installations of ALARM should have the XFS tools installed.

      Chances are you may be specifying some unsupported mount options or something. Also make sure to specify XFS if you are mounting it manually without an fstab entry and that it is labeled as an XFS partition with parted.


  4. Hi,

    thanks for your guide. I’m using a similar GoFlex Net (2xSata, 1200 MHz) as NAS, but I also own a Pogoplug Mobile (with added Sata port) and v4. Actually you could add a second Sata port to both models if you wanted to. However, mine boot from SD card with a custom bootloader build.

    I’d be very interested in a NFS server performance tuning guide (as my networking knowledge is very limited, I don’t think I could find the best options myself…).

    By the way: Are you on kernel 3.16.x yet? I’m getting problems with samba when use sendfile is enabled starting with kernel 3.16: reproducible artifacts when streaming video (using VLC on Windows 7, XBMC on Raspberry Pi or VLC via mount.cifs-mounted directory) and CRC errors when mounting an iso image via Windows 7 network share. Disabling sendfile does the trick on 3.16; on 3.15 everything worked well with sendfile enabled.
    A git bisect session revealed “first bad commit: [3ae8f4e0b98b640aadf410c21185ccb6b5b02351] net: mv643xx_eth: Implement software TSO”, and indeed “ethtool -K eth0 tso off” does the trick on kernel 3.16: No more problems occur.
    Did you experience similar problems? The only drawback I see is losing a performance boost that didn’t even exist on 3.15.


    • Edit: I reread your comment and found the commit you were speaking of. It indeed looks like someone enabled software TSO in the driver, which is quite surprising to me. I am going to do some testing on this, as it was previously not an option. GSO may still be a better solution and not a hack. I want to take a look at the changes and see what they did. Thanks for the heads up.

      As far as NFS, I will be posting an article on that in the next week or two. I would recommend NFS for Linux-like systems anyway, since it is a native protocol for Linux.


  5. Hi,

    Really great information. It’s very technical although I have experience of unix programming way back in the year 1987, I still remember unix file system, mounting, nfs etc. I need your advice please.

    I have few devices PC, ipad, iphone, ATV3 and WD TV Live. Last year I purchased ATV 3 not knowing about jailbreak stuff. Otherwise I would have bought ATV2. Anyway I am using some hack on ATV3 to view movies from Plex server running on my PC. Recently I bought a USB WD hard disk 2TB. Last week I preordered Amazon Fire stick. I would like to run XBMC on that and access my movies. Since few days I was looking for a good NAS with primary objective to have movies repository and also have cloud environment to save ipad ( videos, photos ), iphone ( videos, photos ) and PC ( camera photos, downloaded movies,music files and some important files ). I have plan to upgrade / buy hard disk to 5 TB.

    After some research on Internet, I have seen how to configure NAS using old PC. I did not like that because I need to keep that big box running all the time. Then I looked for NAS box from DLink, Synology, QNAP etc. found them very expensive. Then I came across this cheap and best suitable device called Pogoplug .

    Now I need your advice here. Lot of people on the Internet were saying POGO-E02 having more clock speed and RAM is better that pogoplug v4. Although new version has USB 3.0 and SATA connectivity, old version is still in demand. What do you suggest? Shall I go with new version and implement tuning you suggest or go with old version POGO-E02?

    I am still trying to understand transcoding; please suggest whether to have XBMC or Plex on the Amazon Fire stick.

    I will be thankful for your advice.


    • My response is probably not very timely (sorry), but the answer to your question is: It depends. The Pogoplug v4 (mobile) has a slower (MHz) processor than the E02 and there is more memory in E02 (256MiB vs 128MiB, I think..), but E02 is limited to USB 2.0 only. With the high-end v4 model, you will get USB 3.0 which can provide better throughput than you would get from USB 2.0, even on the slower SoC. Honestly, if the Pogoplug folks had taken the faster E02 SoC and paired it with the USB 3.0, it would scream. There are other variations of this hardware though that you could search for. Getting back to your question, if you plan to use it simply as a file server with SAMBA, then go with the v4. On the other hand, if you plan to use it as a general purpose server and want more memory, or plan to use it more exclusively with NFS, go with the E02 model. As you have probably noticed, the v4 (with USB 3.0) is usually still cheaper than the E02. Good luck.


    • I’ve never actually used their cloud service for anything. They probably have an API of some sort for accessing that storage, but I am not familiar with what it is. You might try asking in their forums to see if anyone has something similar to a Google Drive capability for the Pogoplug Cloud service.


  6. I, too, would much appreciate any insight you could provide into NFS tuning. I’ve had trouble getting NFSv4 to perform acceptably on Kirkwood-family servers, especially after UDP support was dropped from the kernel and you’re forced to run it over TCP.


    • Thanks, Johan. I am sorry to everyone that I neglected this page for so long. My time was very limited since writing it and I didn’t get the chance to add anything about NFS to it. I should be able to get to it in the next few weeks, but as you’ll see, throughput is still less than with SAMBA (as tuned in this article).


    • I was intending to get instructions up for setting up NFS, as I am using it on mine. Interestingly enough though, while it is possible to get better performance than “stock” with NFS, it is still slower (throughput) than SAMBA when tuned as documented in this article. My time has been pretty limited since I wrote this article and never got around to doing it, but I will do that soon. The one thing I wanted to do was update this article to include fixes to prevent network data corruption because of some NIC driver “enhancements” — see the notes on TSO.


  7. I understand this is a bit of an old post. Is there any way to get the same results out of a Debian Wheezy install on one of these? I’m currently in the process of moving from an OptiPlex 760 USFF to a PPv4 running Debian off the hard drive, and would definitely enjoy some performance enhancement in the long run. especially since, moving to this will greatly improve upon the physical space I have for such things… http://goput.it/str/did.jpg

    I have not yet made the full move to systemd and probably won’t (I’m still a bit sore about it being forced upon the Linux community… thanks, Lennart.) so configuration is a bit different for me, I understand that much. I’m not looking to do anything over SMB or anything for it, I just want to take advantage of network and disk access improvements.

    Thanks for any input you can give towards this!


    • Hi, Sudos. You should be able to do much of what I wrote about here on Debian, but there may be some differences. For example, they may or may not have an OpenSSL package compiled with cryptodev support. Whether cryptodev is in their kernel is also a question. Many of the udev rules, such as for disk access (make sure hdparm is installed) should work fine in Debian.

      Most people have a love/hate relationship with systemd. My primary complaint about systemd is that it breaks one of the primary design rules in UNIX software, which is to do one thing and do it well. Systemd does everything. While most people think sysvinit is a relic, it is precisely the simplicity of it that made it work so well. It wasn’t designed to be everything.


  8. Just wanted to point out that the openssl-cryptodev package seems to be broken as of writing this. I was able to implement all of your recommendations except the openssl-cryptodev related ones. Installing openssl-cryptodev makes my system unbootable.


    • Sorry for the troubles. I think what happened is when you copy/pasted the line that echoes an entry into a new udev rule, the single quotes inserted unicode characters into the file. If you modify that file with ‘vi’ or manually type the line, it should work fine. When I was setting up a new Pogoplug v4 (USB 2.0 only version), I discovered this while going through my instructions from the site. When you have a situation where a pogoplug won’t boot, just unplug the boot drive and mount it on another Linux system to fix it.


    • Correct, thanks. I updated that. Not quite sure how I ended up with the wrong model number, except to say I never cracked the case open to look (always seem to damage the plastic when I do that stuff). Although there is no major difference in the CPU architecture, that correction did clear up why they used 400MHz DDR2, as that is all the SoC supported.


  9. Hi and a lot of thanks for tutorial. I’m new at Linux side. And I can not find the sysctl.conf file on /etc/sysctl/ path. I’m using pogo plug v4 with Alarm. Is this normal? How can I find the file?


    • You can create a new .conf file in /etc/sysctl.d if one does not exist. If there is nothing to change from the kernel defaults, then there isn’t typically one present. Good luck and I hope you learn a lot!


  10. Hi,

    This a wonderful tutorial… makes me want to buy the PPv4 asap. But one thing I would like to know is.. what would be the performance impact if I chose to keep the external HDD’s formatted in NTFS (I am a totally windows user, so this is what I am most comfortable with).

    Also will using NTFS cause it to be less reliable in cases where there is a power failure – this is rarely a problem in windows, but I have no idea about linux.

    Thanks !


    • Thanks for your comments.

      NTFS is supported, but you may find it doesn't perform very well on this device due to the limitations of the SoC. It isn't hard to create a new file system using EXT3, for example, provided you back up your data and can copy it back. My preference is to always go with a native Linux file system rather than use NTFS on Linux, but it is an option if you prefer.

      If you aren't too familiar with Linux, I'd suggest installing a copy in a VM using VirtualBox and learning the basics. It also makes sense to have a Linux system to work with the file system if you have problems with it booting, etc.

      If you are really worried most about being able to recover from a power outage where the file system might need repair, use EXT3. The system should be able to check and repair the file system automatically without any hassle.


      • thanks for your suggestion !

        Meanwhile looking at the alternatives I came across Radxa Rock. Can you consider writing about the Pro/Cons of Radxa Rock vs PogoPlug ?


      • Although I haven't specifically looked at the Radxa Rock before, it looks like a fairly typical development SBC. There is a much more powerful SoC on that than the PogoPlug, however these are really not apples-to-apples comparisons. That is to say, the main draw of getting a PogoPlug v4/Mobile is that it is dirt cheap, complete (everything you need minus the external storage), has Gigabit Ethernet (1000Mb/s), and may have USB 3.0 if you purchased the v4. In comparison, many of the development boards have USB 2.0 and Fast Ethernet (100Mb/s). Development boards that do have Gigabit Ethernet often end up having USB 2.0 interfaces. You can get USB 3.0 and Gigabit Ethernet on some development boards, but they are often pretty spendy. Additionally, the development boards often come as just the board, and adding the power adapter (and cables), case, etc., can add up to far more than the cost of a PogoPlug v4.

        Speaking for myself, I have a Odroid-C1. It has Gigabit Ethernet, but lacks USB 3.0. In my case, I’m using it as a headless backup server running a custom hacked (to run on ARM) version of CrashPlan. It simply NFS mounts one of my PogoPlug devices and backs data up from it to the cloud. I use this because CrashPlan needs ~1GiB of memory for large backup sets and it allows me to have continuous backup of my NAS without having a power-sucking PC running all the time.

        Anyway, point being that all of these have a different purposes in mind. If I could find a really cheap device that had a more powerful SoC, Gigabit Ethernet, USB 3.0, more memory, and complete with case and power, all with a low price tag like the PogoPlug, I’d jump on it. So far, I haven’t seen one yet, but I’m sure another diamond in the rough like the PogoPlug will show up at some point.


  11. do you have any comments on disabling kernel drivers and services etc that are not needed for the device on archlinuxarm?


    • I would probably recommend disabling ntpd and running it as a cronjob or updating time some other way, as mentioned in the blog article. There are potentially other services you may be able to disable, but this is really specific to your use case. If you’re doing a base install of Arch Linux on this device, there isn’t probably a whole lot of unneeded processes running, but that is one of them that can save about 4MiB of memory. When we’re talking about 128MiB of total RAM, that by itself is valuable savings if you plan on doing a lot on the system. If you do plan on running memory around maximum capacity, I’d consider adding some swap (probably not on a flash device though, unless it has TRIM, etc) to allow infrequently accessed memory pages to be swapped to disk freeing up more memory for other processes. In the event you do add swap, just ensure you’re not swapping/paging active memory pages to disk, or you will see your system performance grind to a halt.

      My best recommendation there is probably the obvious one, which is to spend a little time going through the running process list and determining whether those processes running perform a necessary function.

      As far as device drivers, modules not needed aren’t going to be loaded. However, if you have unneeded drivers in the kernel (very often the case), the only way to really slim things down is to build your own kernel without the unneeded drivers and functions. I have had embedded systems where I was able to build very slim kernels by shaving out things I didn’t absolutely need, saving large amounts of memory. Building your own kernel does take some knowledge and time, but it can be worth it in certain circumstances. That said, I haven’t specifically looked at the linux-kirkwood kernel configuration to see how it could be slimmed down, but I venture to say it could be. You’re still probably better off starting with unnecessary daemons and such before building your own custom kernel for it.


  12. Nice tutorial!
    Did this for Debian (Openmediavault) on a NSA325v2 (kirkwood at 1.6 ghz with 512MB ram and two native Sata + a usb 3.0 and a couple usb 2.0 I’m using for a flash drive for the OS), and I’m getting somewhere around 26-30 MB/s write and 40-45 MB/s read over Samba (Linux Mint Debian client).

    hdparm, ethtool and ifconfig in Debian are in /sbin/, by the way. So yeah, I did check that.

    The NAS is connected directly to an Ethernet port of the PC. Both are gigabit. Cat5e cable, 1 meter long.

    From htop CLI task manager (over SSH) I see that the processor is loaded at 60% (some peaks at 80-ish%), and that it is using all the free ram (Debian is using like 60 MB) for caching.
    (htop reports it is using 2% of CPU for itself).
    Also from top it’s the same.

    With direct data transfer I was getting 125 MB/s both read and write.

    With async I was getting 88 MB read and 125 MB/s write.

    Any idea on how to improve speeds?


    • Thanks for your comment. Are you asking how to improve performance with SAMBA specifically in relation to the raw i/o performance from your drive? If so, the best advice I can give is to ensure your sockets are properly tuned (as discussed in the article) and you are using the optimized SAMBA tunings. Since I am able to get the Pogoplug v4, with an 800MHz Kirkwood, performing roughly the same as what you stated, I think it should be possible to get better throughput in your case. There are also other considerations regarding the clients, how they are connected and how well they are optimized for network communications. Just to be sure, make sure you rebooted after making the tuning changes and verify that they actually worked (made the intended change). You probably did, but it's worth mentioning. You may also have some performance bottlenecks on your router/switch itself. It's really too hard to say unfortunately. I believe one of the other commenters in this posting did in fact have a similar device as you and was getting better throughput than you were noting from SAMBA, so I would expect you could do better.

      I am eagerly awaiting some enhancements to the Linux kernel that I plan to test as a way of improving disk throughput on the kirkwood SoC for more programs like SAMBA. Presumably some of these may be coming in the 4.2 kernel and have the potential to lower overhead to achieve better throughput. When I can do this depends on when the linux-kirkwood (Arch) package is updated to the 4.2 kernel. Typically I have seen Arch is pretty good about staying current. In the case of Debian, I don’t use it anywhere, so I am not sure what your kernel version is or whether 4.2 will be made available to you when it is released without building it yourself. Anyway, be sure to check back later on this year once 4.2 is released and hopefully I will have some answers on whether performance can be further improved beyond the current limitations.


      • >>>Are you asking how to improve performance with SAMBA specifically in relation to the raw i/o performance from your drive?

        Yep. I have a device that should blow yours out of the water, yet it isn’t.

        >>>If so, the best advise I can do is to ensure your sockets are properly tuned (as discussed in the article)

        I followed tutorials for tuning for Gigabit ethernet on Linux, since I have no ram shortage.

        this is my file in sysctl.d, you see something wrong?

        —————————–
        #Set maximum TCP window sizes to 12 megabytes:
        net.core.rmem_max = 11960320
        net.core.wmem_max = 11960320

        #Set minimum, default, and maximum TCP buffer limits:
        net.ipv4.tcp_rmem = 4096 524288 11960320
        net.ipv4.tcp_wmem = 4096 524288 11960320

        #Set maximum network input buffer queue length:
        net.core.netdev_max_backlog = 30000

        #maximum amount of option memory buffers
        net.core.optmem_max = 65535
        ——————————-

        >>>you are using the optimized SAMBA tunings

        The ones you mentioned in the blog post.

        >>>You may also may have some performance bottlenecks on your router/switch itself.

        As I said, it’s connected directly to my PC’s own gigabit ethernet port with a cable. No router, no switch.

        I connect to the internet with wifi, but during the tests it was disconnected.

        >>>I am not sure what your kernel version is

        3.17, tried also 3.18.5 and 4.0.something.

        >>>whether 4.2 will be made available to you when it is released without building it yourself

        the guy making the kernel and rootfs I’m using is very zealous about updating to latest kernels. He is using standard Debian configs, he mostly removes all useless drivers and adds some drivers for sensors and lights.
        Can compile it myself too.


      • One thing I noticed is that you really don’t need the socket memory tuned as you do. Tuning them as noted in the article should be sufficient. Unless you were doing long haul, high bandwidth connections, or had 10GbE, there is no reason to have your sockets that large.

        If you are directly connected to the device from the client, then I would check the client in more depth, but again there are a lot of places where this could be going wrong. One thing you could do, is to analyse a network capture from the client perspective to determine if you are having any network flow problems. Check resource utilization on the client while you’re performing your tests. Whether Linux or Windows, both can be tuned.

        Other things to verify. Enable SAR on your system by installing sysstat. Run ‘iostat -x 10’ on your server while running the tests. SAR data should also help determine if there is any system bottleneck on the server side and what it might be.

        Verify your smb.conf looks like this:

        strict allocate = Yes
        read raw = Yes
        write raw = Yes
        strict locking = No
        socket options = TCP_NODELAY IPTOS_LOWDELAY SO_RCVBUF=131072 SO_SNDBUF=131072
        min receivefile size = 0
        use sendfile = true
        aio read size = 0
        aio write size = 0
        oplocks = yes
        max xmit = 65535
        max connections = 8
        deadtime = 15

        .. and then you can try increasing the 128kiB to 256kiB for testing purposes.

        These are just suggestions. As you can imagine, there are really a lot of different things that could go wrong here. On the Arch Linux forums, there was someone that had your same hardware, applied my tunings, and had significantly higher throughput than what you’re stating. Because of all the differences here, it’s difficult for me to do anything other than offer advice on where to look.

