Linux Kernel 4.2 is here! The good and the bad... for the Pogoplug v4/mobile

Arch just released the 4.2 kernel last weekend and I was already jumping on it to test.  Unfortunately, we quickly hit a bug (actually two) in one of the big features I was looking forward to: DMA support in the crypto driver for Marvell SoCs.

  1. Non-DT (device tree) kernels, such as linux-kirkwood, showed the new driver (marvell_cesa, the replacement for mv_cesa) loading, but it was not showing up as available to the kernel crypto API.  The only algorithms available for crypto were the kernel’s built-in software ones, and we should have been seeing marvell_cesa there.  I pointed this out to the Arch developers and they removed marvell_cesa from linux-kirkwood and re-enabled mv_cesa.
  2. DT (linux-kirkwood-dt) kernels suffered from a different problem.  Working with one of the Arch developers, I found that a patch to the kirkwood device tree, which provides the compatible property the driver needs to use the TDMA block, was never merged for 4.2.  As a consequence, the device tree was still reporting a property that caused the driver to disable DMA.  Hence, performance was no better (in some cases worse, though more consistent) than with mv_cesa.

I am in the process of patching and compiling a new DT kernel to determine if we can get the new marvell_cesa driver working with DMA on the Pogoplug v4/mobile.  If it works, hopefully we can carry this patch in linux-kirkwood-dt until it is merged upstream.

For the issue with the non-DT kernel (linux-kirkwood), we may be out of luck without manually patching the marvell_cesa driver.  The authors did not intend to support DMA on non-DT kernels, although, judging by mailing list messages, they did intend the driver to work without DMA there.  In any case, if you don’t need the crypto acceleration at all, or don’t need DMA for it, then there is no major reason to use the DT kernel.

Assuming I can get this working by patching the kirkwood device tree, I should then be able to provide some impressions on the performance of the driver on the Pogoplug v4/mobile.  Please stay tuned for that.

Now, the other bad.  There was a feature called DAX that I was eyeing to help improve disk I/O performance for certain types of operations.  While this was primarily intended for memory devices, where it wouldn’t actually make sense to have a page cache, the page cache itself was part of the problem on the Pogoplug.  Unfortunately, while this code appears to have been merged into Linux 4.2 for EXT4 and XFS, it is not available to us on ARM.  There is a small paragraph under Shortcomings in the DAX Documentation that explains why.

The DAX code does not work correctly on architectures which have virtually
mapped caches such as ARM, MIPS and SPARC.

There was an old option in the EXT file systems, deprecated since roughly 2010, that might have been interesting to try: nobh.  Unfortunately, nobh is ignored by modern kernels and therefore cannot be tested.  You can still attempt to mount an EXT file system with this option, and while it won’t return an error, you can see from dmesg output that it was ignored.

The silver lining in this posting is that for those of us that were looking for DMA in the Marvell crypto driver, we may soon have a LUKS rock star in the Pogoplug v4/mobile.  It just didn’t work out of the box, and that is what we’re trying to fix.


Pogoplug v4 Performance Tuning Future Enhancements

There are some exciting changes coming to the Linux kernel for both ext4 and xfs file systems that have the potential to greatly increase I/O performance on the Pogoplug v4. As we know, one of the limits of the Pogoplug v4 is related to memory operation performance. Once the kernel has been released for linux-kirkwood, I will begin testing these changes for possible inclusion into my performance tuning guide.

I plan to make a posting about using LUKS with the Pogoplug to enhance security on your data. There is hardware crypto with the kirkwood that does work, but the mv_cesa driver has some design issues that affect performance on the kirkwood due to not using DMA. Recall the aforementioned issues with memory operations on this platform.

There is some evidence that the DMA issue with mv_cesa may be fixed in the future.  Stay tuned and if/when that comes, I’ll be sure to test it.  If they successfully implement DMA in mv_cesa, this should greatly increase throughput of the hardware crypto engine, speeding up LUKS, OpenSSL, and OpenSSH by extension.

Performance Tuning with Pogoplug v4 on Arch Linux ARM



Updated: August 4, 2015
Please stop by every now and then, even if you already applied some changes, as I will continue to update this page.



Architecture: ARMv5te
Processor: Marvell Kirkwood 800MHz 88F6192
RAM: 128MiB DDR2 (@400MHz)
Storage: 128MiB NAND Flash
Network: Gigabit Ethernet
Ports: (1) SATA II (Not Mobile), (2) USB 3.0 (Not Mobile), (1) USB 2.0, (1) SD Card
Other Features: Cryptographic Engine, XOR Engine



I have always enjoyed tinkering with technology in my spare time.  So recently I decided to purchase three Pogoplug v4 devices because they looked like a good project.  What attracted me to these devices is that they offer gigabit ethernet and reasonable specs (even USB 3.0 on one), are fully supported by the Arch Linux community, and cost only $6 for the mobile Pogoplug v4 and $18 for the better version with SATA II and USB 3.0 on Amazon.  My recommendation is to go with the better version for a home server and the cheaper “mobile” version for a backup device, where storage speed is less of a concern.

As a lead systems engineer specializing in performance tuning and deep-dive problem analysis, I thought it would be fun to address some of the common issues I have read about with these devices.  Generally they seem to revolve around network i/o, disk i/o, and file serving performance.  When you consider that these devices have limited memory and relatively weak CPUs, it requires some skill to tune them for optimal performance with such limited resources.

Installation of Arch Linux ARM on this device is easy and I won’t go into it here.  It is documented well and straightforward.  One thing to note is that because this is a fully supported device for Arch Linux ARM, you can update the kernel without any fuss, which makes this a great low-powered Linux platform.

As mentioned earlier, many posts exist on the web about how these devices don’t perform well.  Oftentimes, though, the people posting these comments lack the technical knowledge to understand why.  Additionally, few people out there know how to tune properly, so there are a lot of tuning recommendations on the Internet that are just wrong.  These devices offer very well balanced performance for a fantastic value.

The first rule of performance tuning is to test.  While there are some tunings that simply cannot be easily tested, most of the tuning changes I have made are a result of extensive testing.  Some changes are not easily understood or explained, but I will make efforts to explain high-level what is being tuned and why.

So… without further ado, let’s dive into this.

Issue #1:  Network Performance

First off, most reviewers and comments seem to indicate poor network performance.  Default tunings and certain disabled NIC features have a lot to do with this.  Likewise, some people mistakenly point to poor network performance when other, non-network bottlenecks are responsible.

By default, Arch Linux ARM had an incorrectly tuned network stack.  We will correct this and optimize it for Gigabit Ethernet on a local area network.


net.core.rmem_default = 163840
net.core.rmem_max = 163840
net.core.wmem_default = 163840
net.core.wmem_max = 163840

net.ipv4.tcp_rmem = 4096        87380   969536
net.ipv4.tcp_wmem = 4096        16384   969536

One may ask… what’s wrong with this?  Well, it requires an understanding of how Linux handles socket memory.  In a nutshell, the tcp memory maximums can be no larger than the core maximums.  So the first thing we need to do is fix that.  Likewise, we should recognize that we’re tuning for optimal Gigabit Ethernet performance but have limited memory available to us.  The only values I am going to adjust are the maximums, primarily because of limited memory.

In my case, I am going to be using this (non-mobile) Pogoplug v4 for a local file server, to provide both CIFS and NFS.  Therefore, I am not planning to have lots of connections (sockets) as one might if it were a web server open to the outside.  This is just an example of a consideration when tuning.

Add the following new tunings to /etc/sysctl.d/sysctl.conf:


net.core.rmem_max = 2801664
net.core.wmem_max = 2097152

net.ipv4.tcp_rmem = 4096        87380   2801664
net.ipv4.tcp_wmem = 4096        16384   2097152

Above we have re-tuned the maximum memory values for both tcp and core to 2MiB.  tcp_rmem dedicates 25% of its memory to the application buffer by default, so the receive maximum is (2097152 / 0.75), rounded up to 2801664 so that it falls on a 4096-byte memory page boundary.
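If the arithmetic helps, here it is sketched in shell. The 2MiB target and the 25% application-buffer fraction come from the text above; exactly how far to round up is a judgment call.

```shell
# Target effective TCP receive buffer: 2MiB
core_max=2097152
page=4096

# tcp_rmem reserves 25% for application buffering by default,
# so scale the target up by 4/3 to get the sysctl maximum
raw=$(( core_max * 4 / 3 ))

# Round up to a whole number of 4096-byte pages
tcp_rmem_max=$(( (raw + page - 1) / page * page ))

echo "$tcp_rmem_max"   # 2797568; the article rounds one page higher, to 2801664
```

Strict page-ceiling gives 2797568; the value used in the article, 2801664, is one 4096-byte page above that, which costs nothing and leaves a little headroom.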

Additionally, we should make some other changes.  The following may be added (optmem_max is not strictly necessary, but you may include it) to /etc/sysctl.d/sysctl.conf:


net.ipv4.tcp_timestamps = 0

net.core.optmem_max = 65535

net.core.netdev_max_backlog = 5000

I’m not going to go into much detail here, as you can read the documentation if you wish.  Disabling TCP timestamps may buy us back a little CPU overhead, and CPU is a precious commodity on this device.  On newer, high-performance servers this should be left on, as you will see no benefit from turning it off.
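After editing the file, the settings can be applied without a reboot; a quick way to load them and spot-check a value might be:

```shell
# Reload every configuration fragment under /etc/sysctl.d (procps-ng sysctl)
sysctl --system

# Spot-check that the new values took effect
sysctl net.core.rmem_max net.ipv4.tcp_timestamps
```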

By default, I see that generic segmentation offload is turned off.  GSO is an LSO (large segmentation offload) feature: it passes a very large packet (super packet) through the network stack and delays the CPU-hogging segmentation and processing until late in the stack, where the packet is broken into multiple pieces to be sent across the network.  Fewer, larger packets incur less CPU overhead than many smaller ones.  GSO works even in the absence of any LSO features in the network driver, and in my testing, enabling it provides a significant performance improvement on this device.

NOTE:  In recent kernel releases, as noted by one reader, the NIC driver was modified to enable software TSO.  This should be turned off, because it will cause data corruption.  GSO works perfectly.
UPDATED NOTE (8/4/2015): The software TSO data corruption bug may be fixed, but I have not tested this on the latest kernels yet.  See updates to the driver in recent kernels here: mv643xx_eth.  It is also not clear that it offers any improvement over GSO, so leaving it off is fine.


The defaults can be seen in the output of ethtool -k eth0:

generic-segmentation-offload: off
tcp-segmentation-offload: on

Output from ifconfig also shows we could increase the transmit queue.


txqueuelen 1000

Now, create a file called /etc/udev/rules.d/50-eth.rules and add:


ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth0", RUN+="/usr/bin/ethtool -K %k gso on"
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth0", RUN+="/usr/bin/ethtool -K %k tso off"
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth0", RUN+="/usr/bin/ifconfig eth0 txqueuelen 5000"

If ethtool is not installed, install it:

pacman -Sy ethtool

These entries tell udev to turn on generic segmentation offload, turn off TSO, and set the transmit queue length to 5000 when the Ethernet device is created.
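After a reboot (or after udev re-triggers the device), it is worth verifying the rules actually took effect; something like:

```shell
# GSO should now be on and TSO off
ethtool -k eth0 | grep -E 'generic-segmentation-offload|tcp-segmentation-offload'

# The transmit queue length should read 5000
cat /sys/class/net/eth0/tx_queue_len
```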

Another thought for improving performance may be to enable jumbo frames.  Jumbo frames (packets with an MTU >1500) come with other considerations and may not be supported by all networking equipment, so you should probably not do this unless you know what you are doing.  I do not have equipment at home to test what advantages jumbo frames would bring to throughput or CPU utilization on this device.  It should be noted that using jumbo frames may disable checksum offloading, thereby increasing CPU utilization.

Currently I have filed a bug on another feature that I would like to tune but cannot, due to a bug in v1.4 of the mv643xx_eth driver used by the Ethernet interface.  At the expense of a small amount of extra memory, I would like to be able to tune the network ring buffer slots to a higher value.  Ring buffers are where data sits before entering the socket on the way in, and where data sits after being sent from the transmit socket on the way out.


Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             128
RX Mini:        0
RX Jumbo:       0
TX:             512

NOTE: In recent kernels, TX descriptors were increased to 512 by default from 256.  Oddly, this is exactly what I was trying to set them to in the bug report.

When I attempt to modify these, which I should be able to do, the system becomes increasingly unresponsive until it drops network connections.  As a consequence, you should not attempt to change them.  Interestingly, I do see the additional memory allocated for the ring buffers, but the driver is clearly doing something bad when dynamically allocating them.

The problem with some of these tunings is that it can be difficult to measure on systems like this, which have limited CPU and memory resources.  For example, using tools like iperf to measure network throughput doesn’t work very well, because these tools end up bottle-necking due to CPU speed and memory bandwidth constraints related to the type of operations and calls they make.   It is possible that better compiler optimizations may help with this, but ultimately it may come down to how optimized the code is for resource limited systems.  Modern, high-powered systems just don’t need to worry as much about saving a call here or there, but it does make a difference on this device.

Issue #2: Storage Performance

There are significant improvements that can be realized by properly optimizing the storage i/o performance.  We need to optimize performance while minimizing impact on CPU and memory in order to achieve the best result.

Let’s first look at the types of storage devices/media we will be using.  With the Pogoplug mobile, you are limited to a single USB 2.0 interface, though a powered USB 2.0 hub could be used to extend this.  The higher-end Pogoplug v4, with USB 3.0 and SATA II, offers faster options that are worthwhile when using it as a home server.

On the higher-end model, which is what I am focusing on for this segment on storage performance, you can only boot from USB 2.0 and SATA due to limitations of the boot loader.  Therefore, I am going to use a USB 2.0/3.0 flash drive plugged into the USB 2.0 port, but only for the operating system itself. USB 3.0 ports will be used for a dual USB 3.0 (Icy Dock) external enclosure, which is where I will store my data.

(Note: There is an alternate boot loader option that does allow booting to USB 3.0, but it isn’t supported by Arch.)

My recommendation is to use a flash drive larger than 2GiB, as it will be challenging to install any additional packages or do updates without the extra space.  While performance is not of great importance for the OS drive, you may want to consider a more modern flash drive with reasonable read/write performance, which will help with package updates and when executing programs.  If write performance is of no concern, I would highly recommend going with a SanDisk Cruzer Fit for the higher-end Pogoplug v4, as they are tiny and allow you to place the cover back on the top.

With any flash media, it is important to optimize/align the layout of partitions and file systems for optimal performance.  Much of this has to do with the erase block and sector sizes.  You can find information out there on how to optimize the geometry and file system of a flash drive for performance, so I’m not going to get into it here.  You should consider doing this for your operating system drive when installing Arch Linux ARM.

It is also important to optimize the layout of partitions and file systems on spinning disks.  Again, there is plenty of information available on the Internet about doing this.  There are additional considerations if you have an Advanced Format drive, which typically has 4kiB physical sectors but may report them as 512 bytes or 4kiB.   RAID stripes also require additional tuning, which is discussed further in this article.
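As a starting point for alignment decisions, the kernel exposes the sector sizes it detected, which tells you whether a drive is Advanced Format (sdX is a placeholder for your device):

```shell
# physical=4096 with logical=512 indicates a 512-emulation Advanced Format drive;
# 512/512 is a legacy drive, 4096/4096 is native 4K
cat /sys/block/sdX/queue/physical_block_size
cat /sys/block/sdX/queue/logical_block_size
```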

In order to simplify things, we’re going to use parted to create a new partition on our external HDD, which will automatically optimize the partition alignment for us.  Note that you will destroy any data on your external disk when you do this, so please know what you’re doing.  You will see that I have chosen XFS as my file system (discussed later), which is why I am labeling the partition as XFS.

parted -a opt /dev/sdX # Where X is the device letter of your external disk

# You are now at the (parted) command line

mklabel gpt

mkpart primary xfs 0% 100%


Once you exit, a new GPT partition table will be written to the disk.  We will then need to format the partition with a new file system.  XFS was chosen after extensive testing with dd, bonnie++, and iozone.
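For reference, the same steps can be run non-interactively, and parted can confirm the alignment it chose (again, sdX is your external disk):

```shell
# Create the GPT label and one optimally aligned partition in a single scripted call
parted -s -a opt /dev/sdX mklabel gpt mkpart primary xfs 0% 100%

# Verify partition 1 is optimally aligned; this prints "1 aligned" on success
parted /dev/sdX align-check opt 1
```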

mkfs.xfs /dev/sdX1 # Where X is the device letter of your external disk and 1 is the new partition

XFS was chosen as the best file system for this device because of its low CPU utilization and impressive performance in a variety of circumstances.  Ext3/4 can offer reasonable performance and are certainly reliable, but my focus is on squeezing every last bit of i/o out of the disk as possible.

Note that if you use XFS, you should ensure the file system is unmounted cleanly on reboot, as the XFS repair utility may require more memory than is available on the Pogoplug.  In such a case, you may need to unplug the drive and repair the XFS file system on another computer.  There have been a couple of times when the system wouldn’t boot due to an XFS file system requiring repair; just unplugging the drive will allow it to boot normally until you can repair it.
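If you do have to repair, xfs_repair can also be told to cap its memory use, which may occasionally let it complete on the Pogoplug itself. This is a sketch; the 96MiB cap is just an illustration and may still not be enough for a large file system:

```shell
# Never run xfs_repair against a mounted file system
umount /dev/sdX1

# -m caps approximate memory use in MiB; 96 leaves some headroom on a 128MiB system
xfs_repair -m 96 /dev/sdX1
```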

Benchmark comparison tests were run against Ext3, Ext4, JFS and XFS, using the above-mentioned tools.  Running any of them without an accurate understanding of the factors involved in the design of a test may produce inaccurate results.  Additionally, understanding the results requires some experience and knowledge that is beyond the scope of this article.  You may prefer another file system or find another works better for your particular purpose.  Each file system has benefits and drawbacks, but I happened to think that the benefits of XFS I experienced in testing outweighed any drawbacks.

Modify your /etc/fstab as appropriate, using the following as a guide:

UUID="xxxxxxxx" /u01/external/   xfs     defaults,nofail,noatime,nodev       0       1

Some other optimizations, primarily for spinning disks, should be made.  These include the block i/o scheduler, fs read-ahead, and acoustic performance setting.  You may want to consider adjusting the power settings for the drive, which may offer some improvements.

bfq was the default block i/o scheduler, but I chose deadline, which offers significant performance improvements over cfq and bfq.  deadline is also suitable for SSD and flash devices, although you may consider noop for those, as it acts as a simple FIFO with no reordering.  Fair queuing schedulers are typically good for systems (desktops or general purpose servers) with many processes competing for i/o, but server applications may find deadline or noop (for SSD/flash) outperform them while being lighter on resources.  As a general rule, noop should not be used on spinning disks.

Create a new udev rule as /etc/udev/rules.d/60-sd.rules:

# set deadline scheduler
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="deadline"
# set read ahead sectors
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/usr/bin/hdparm -a 1024 /dev/%k"
# set maximum performance
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/usr/bin/hdparm -M 254 /dev/%k"

If hdparm is not installed, install it:

pacman -Sy hdparm

Sometimes flash drives report as rotational devices, so you may need to put additional logic in the udev rule to prevent hdparm from being used to change settings on them.  After these changes, you should reboot your device and ensure the changes took effect.
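You can check how a given device reports itself before trusting the rotational match in the rules above (sdX is a placeholder):

```shell
# 1 = rotational (spinning disk), 0 = non-rotational (SSD/flash)
cat /sys/block/sdX/queue/rotational

# The active i/o scheduler is shown in square brackets
cat /sys/block/sdX/queue/scheduler
```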

So… what kind of performance am I now getting versus before?  Darn near triple, in some cases.  My 7200RPM 2TiB Toshiba drive, using DIRECT I/O, is able to read at nearly 100MiB/s and write at nearly 85MiB/s!  My older 7200RPM 500GiB Samsung drive got about 85MiB/s read and 65MiB/s write.

You may ask: why Direct I/O?  Well, operations that use buffered i/o are VERY expensive on this device, and will result in less throughput and lower performance; in short, a bottleneck.  Buffered asynchronous reads and writes performed worse.  Similar to why iperf had poor performance when driving network throughput, the expense of these operations slowed block i/o due to CPU speed and memory bandwidth limitations.  Direct I/O is not the same as synchronous i/o, however.  In a nutshell, Direct I/O uses DMA to read and write directly from storage, bypassing the operating system caches.  Keep all this in mind for later…

Below is a simple test you can perform yourself.  Delete the test file after you are finished.  It writes a 1024MiB file, and then reads it.  The large file size will also help eliminate the potential impact of file system caching on this system, since it only has 128MiB of RAM.

Direct I/O

dd if=/dev/zero of=./bigfile bs=1M count=1024 oflag=direct # Write to “bigfile” in this directory
dd of=/dev/null if=./bigfile bs=1M count=1024 iflag=direct # Read from “bigfile” in this directory

Buffered Async I/O

dd if=/dev/zero of=./bigfile bs=1M count=1024 # Write to “bigfile” in this directory
dd of=/dev/null if=./bigfile bs=1M count=1024 # Read from “bigfile” in this directory

It is not really possible to bypass the file system cache for everything, though, so real-world performance will depend a lot on whether you have control over how a program performs reads and writes against the file system.  Some file systems offer a sync option, but this should be avoided, as it is unlikely to provide good throughput.

The Marvell Kirkwood SoC also has a hardware XOR engine to accelerate RAID 5.  I have not done anything with this, because I have no need for it.  It may also accelerate iSCSI CRC32 and possibly some memory operations.  In the case of the memory operations, I am going to research this more later.

Many people want to consider using some form of RAID (Redundant Array of Inexpensive Disks) to protect their data and/or offer some performance improvement.  Several considerations are necessary in order to determine how this should be best configured.  One of the most important considerations is trying to make sure your Pogoplug v4 isn’t bottlenecking in terms of bandwidth (USB, for example) or CPU.

For example, if you use software RAID 1 (mirror), disk write operations will double on the USB bus, as each write needs to be written independently to each disk.  Software RAID 5, which is a stripe with distributed parity across all disks (3 or more), requires CPU-intensive calculations when writing data, even with hardware XOR offload.  RAID 0, which is a non-redundant, simple data stripe across all disks (2 or more), may not perform any better than a single-disk configuration if disk performance isn’t the limiting factor.

If you are looking for optimal performance with a simple mirror, my recommendation would be to purchase an external enclosure (if using USB) that does hardware mirroring (RAID 1).  Hardware mirroring eliminates the need to create two or more i/o operations for each single write, which has a positive impact whether using USB 2.0 or 3.0.  If the enclosure controller is intelligent, it should be able to spread read operations across both disks to help improve read performance.

In the case of using SATA, such as with the Seagate GoFlex, you’d probably be better off sticking with software RAID 0 or 1, unless your device had USB 3.0 as an option.  Since this article focuses on the Pogoplug v4, that is what I will spend most time on.

RAID stripes (regardless of RAID 5 or other striped configurations) require special considerations with regard to chunk size and file system.  Large chunk sizes benefit large writes but hurt small writes.  On the other hand, decreasing chunk size can cause files to be broken into more pieces and spread across more spindles, potentially increasing transfer throughput but at the expense of positioning performance.  It is hard to know on the Pogoplug whether the increase in operations related to smaller chunk size would benefit or hurt. That would require testing.

You can figure out the ideal chunk size by looking at the typical i/o size.  If you have an existing file system, you can run 'iostat -x' to get your average request size (it may be in sectors, so check first) since the system was booted.  Then divide that typical i/o size (in kiB) by the number of disks in the RAID stripe, minus parity disks.  That would be 3 if you have a 4-disk RAID 5 stripe.  So if you had a typical i/o size of 96kiB, you’d divide 96kiB by 3, giving 32kiB.  Always round up to the nearest 4kiB (or page size).  This is what you need to properly create the RAID 5 array.
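The calculation can be sketched in shell, using the 96kiB average request size and the 4-disk RAID 5 from the example above (the mdadm line is illustrative only, not something from the original article):

```shell
avg_io_kib=96    # typical i/o size from iostat, converted to kiB
data_disks=3     # 4-disk RAID 5 = 4 disks minus 1 parity disk

# Divide, then round up to the nearest 4kiB
chunk_kib=$(( (avg_io_kib / data_disks + 3) / 4 * 4 ))
echo "$chunk_kib"   # 32

# Illustrative only: create the array with that chunk size
# mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=$chunk_kib /dev/sd[b-e]1
```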

Optimally laying out the file system on your RAID stripe depends somewhat on the file system and its creation options.  Normally you figure the stride, which is the chunk size (32kiB) divided by the file system block size (usually 4kiB).  In this case, that makes a stride of 8.  This would be used when creating your file system on the newly created RAID group.
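Continuing the example, the stride (and, for ext4, the related stripe-width) falls out of the same numbers. The mkfs.ext4 line is a hedged illustration; mkfs.xfs can usually detect md geometry on its own:

```shell
chunk_kib=32
block_kib=4
data_disks=3

stride=$(( chunk_kib / block_kib ))        # chunk size / block size
stripe_width=$(( stride * data_disks ))    # stride x data disks
echo "$stride $stripe_width"               # 8 24

# Illustrative only:
# mkfs.ext4 -E stride=$stride,stripe-width=$stripe_width /dev/md0
```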

Storage is only useful when you have something to use it for.  This will be the focus of our next section.

Issue #3: SAMBA Performance

This is probably what most people will be using it for, regardless of whether it is for backups or central storage of files on your network.  Many of the SAMBA tuning guides will result in poor performance on the Pogoplug v4.  This is because they are not optimized for devices with limited resources, such as is the case with buffered and async i/o.  On many more powerful systems, these could be a benefit to enhance performance, but they may have the opposite effect here.

Some of the tunings below may be defaults, but I have placed them there so that you know if they conflict with something you placed in the SAMBA configuration file.

Modify your /etc/samba/smb.conf file to include the following:

strict allocate = Yes
read raw = Yes
write raw = Yes
strict locking = No
min receivefile size = 0
use sendfile = true
aio read size = 0
aio write size = 0
oplocks = yes
max xmit = 65535
max connections = 8
deadtime = 15

You may feel free to experiment with these settings.  Generally speaking, when you ask the device to spend CPU and memory resources for better throughput, as with async i/o, it results in worse performance, not better.  The network socket buffers of 128kiB may be lowered if you want; that could save a small amount of memory without any negative performance impact, but you will need to test.  If you experience disconnects, check whether removing deadtime or raising max connections helps.
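Before restarting SAMBA, it is worth validating the edited configuration; testparm will parse smb.conf and report any errors:

```shell
# Parse the configuration and report problems; press Enter to dump the effective config
testparm /etc/samba/smb.conf

# Pick up the changes (on Arch the unit is typically smb.service)
systemctl restart smb
```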

End result?  26MiB/s write performance and 42MiB/s read performance over gigabit LAN, with a Windows 7 client, using the USB 3.0 drive as storage.  At that point, it is pretty clear that the CPU and memory are the bottleneck, and why the Pogoplug v2, with its 1200MHz CPU, could probably achieve faster throughput if it had USB 3.0… but it doesn’t, and so the performance on the Pogoplug v4, with its 800MHz CPU, is probably better.  There is also a device called the Seagate GoFlex that has both SATA and a 1200MHz processor, so you will probably see proportionally better throughput on that (>60MiB/s).

Looking at those numbers, that is between 200 and 350 megabits per second of throughput with SAMBA.  Keep in mind there is plenty of protocol overhead, so actual network throughput is more than this.  Pretty impressive, given I spent $18 on the device, minus the storage itself.  In the old days, this amount of throughput would have blown the doors off of some pretty expensive enterprise-class file servers.  But keep in mind, the reason for NAS (Network Attached Storage)  in the home is either to have central storage or to provide backups.  So even if these numbers don’t impress you, they should be more than adequate for your home network.  Additionally, you’ll be unlikely to saturate SAMBA on this device using a wireless client, as those often have real world performance maxing out at about 12MiB/s under good conditions.

Issue #4: (En/De)cryption Throughput

Encryption is expensive for the CPU, and if it weren’t for hardware encryption acceleration on the Marvell Kirkwood SoC, it would be something we would want to avoid.  Fortunately, the folks at Arch Linux ARM have created a package that enables this feature on the Pogoplug and allows anything that uses OpenSSL to benefit from hardware acceleration.

According to the instructions here, we can follow a few simple steps to enable it.  This doesn’t automatically mean that everything using a supported hardware encryption algorithm will see a benefit, as it depends on whether it uses OpenSSL directly, or whether it can be compiled or configured to use cryptodev itself.  In any case, I’ll provide the steps below, but use the power of Google and look for “cryptodev” in reference to other things, such as SSH, OpenVPN, etc.

NOTE: One reader commented that their system did not boot after installing openssl-cryptodev.  What likely happened is that copying and pasting the udev rule below inserted unicode characters into it, causing udev to barf.  To ensure that doesn’t happen, please type that line in manually.

# Make sure your package repos are in sync and the system up-to-date.
pacman -Syyu

# Replace the OpenSSL package with the cryptodev-enabled one
pacman -S openssl-cryptodev

# Create udev rule to open up permissions to the cryptodev device
echo 'KERNEL=="crypto", MODE="0666"' > /etc/udev/rules.d/99-cryptodev.rules

# Load the cryptodev kernel module now
modprobe cryptodev

# Load the cryptodev kernel module on boot
echo “cryptodev” > /etc/modules-load.d/cryptodev.conf

Using the document linked at the beginning of this article, you can experiment with the hardware encryption algorithms supported on the Kirkwood architecture by running the openssl command with the speed argument and the -evp flag, specifying the cipher, to see it in action.  It will accelerate AES, DES, 3DES, SHA1, and MD5.
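For example, to benchmark AES and SHA1 through the EVP interface, which is the path cryptodev accelerates:

```shell
# Compare these numbers with and without the cryptodev module loaded
openssl speed -evp aes-128-cbc
openssl speed -evp sha1
```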

Hardware encryption acceleration will help a lot if you can use it, such as with SSL web serving and OpenSSH.  Anything to offload tasks from the CPU or make them more efficient should be considered.  If you’re interested in disk encryption, you may use the hardware acceleration feature for it.  There is currently a problem with mv_cesa in that it doesn’t use DMA and therefore is only marginally faster than software only, but hopefully this will be fixed by Marvell sometime.  Not holding my breath there, though.

Once the openssl-cryptodev package is installed and the cryptodev module loaded with correct permissions on the device, OpenSSH will automatically begin using this when it is restarted.  Just make sure your Ciphers and MACs entries are configured to use a supported algorithm in sshd_config.  You may use lsof to check if sshd has the cryptodev device open.
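A quick way to confirm this (the /dev/crypto node comes from the udev rule above):

```shell
# If sshd is using the hardware engine, it will hold the cryptodev device open
lsof /dev/crypto
```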

Other Optimizations

It may be worthwhile to create a small swap partition on either your fast operating system flash drive or on the external USB 3.0 drive.  While I have found that it is quite possible to do a lot within the limitations of 128MiB of RAM, the OS will undoubtedly have some inactive memory pages that are better swapped to disk than taking up valuable RAM.

If you do decide to add some swap, consider making the following tuning changes.  One will reduce the tendency to swap unless under considerable memory pressure, while the other will more aggressively prune the page cache to try to keep more free memory available.

Add to /etc/sysctl.d/sysctl.conf:

vm.swappiness = 10 # Default is 60
vm.vfs_cache_pressure = 1000 # Default is 100
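If you would rather not repartition, a small swap file is the simplest way to experiment; the path and the 256MiB size here are arbitrary choices, not from the original article:

```shell
# Create and enable a 256MiB swap file
dd if=/dev/zero of=/swapfile bs=1M count=256
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Make it permanent
echo '/swapfile none swap defaults 0 0' >> /etc/fstab
```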

Additionally, you can conserve memory on the system by not running ntpd as a daemon.  Unless you have a lot running on the system, you are probably fine leaving it enabled, but it does occupy about 4MiB of RAM on a system with 128MiB.

Overall Architecture

The Marvell Kirkwood is really an excellent SoC for NAS applications, which is part of why it is used in so many NAS appliances in the home consumer market.  In my opinion, this makes a more ideal home server than a Raspberry Pi.  Raspberry Pis are also more expensive and do not offer the ideal set of features seen on the Pogoplug v4.

If you look at the architecture document, linked at the beginning of this article, you will notice that there is no USB 3.0.  When designing the Pogoplug v4, they utilized the PCIe interface for a USB 3.0 chipset.  The PCIe interface sits on the same high-speed bus as the USB 2.0, Gigabit Ethernet, and SATA II ports.  Interestingly enough, the internal NAND flash, which we don’t use, sits on a low-speed bus, as does the flash card reader.  Because the SD card reader is on the slow bus, you should probably avoid using it altogether.

Another interesting thing is that it has 400MHz DDR2 on the Pogoplug v4 board due to the SoC limitation.  If it had faster memory, it would likely have helped somewhat with some of the memory operations that seemed to hinder performance.

While the Pogoplug v4 has only 128MiB of RAM, I have not found this to be a problem.  I am able to run OpenSSH with multiple sessions, vsftpd, SAMBA with multiple sessions, and Deluge and Deluge-Web with 200 active connections and multiple torrents, all without running out of memory.  To be sure, it does run slim on memory, but it doesn’t run out.

Final Thoughts

The Pogoplug v4 makes a fantastic home server, especially considering the low cost and official Arch Linux ARM support and great community.  Even though I had heard of a Pogoplug before, I never looked into getting one until I saw how cheap they were and that they had official support for Arch Linux ARM.  While I do enjoy cross-compiling and hacking together embedded systems from scratch, it is a painfully slow and laborious effort, which is why this is so great.

If there is positive feedback to this article, I will consider posting more about the Pogoplug v4 in the future.  It certainly is a fun project and I’m sure you’ll enjoy it if you get one.