server gnome

tech jibber-jabber

HAProxy Manual

As a sysadmin, I've adopted a nice practice of always reading product manuals before fiddling around with things, be it some application or a dishwasher.
Considering that I like reading manuals, I also like searching them, having references, and actually having them in a format nicer than plain old text.
Especially complex 100+ page ones!

Recently I wanted to re-read HAProxy's 1.7 manual, but it's distributed only as plain text, and I find that unreadable to the point of wanting to change it.
Looks like I'm not the only one. Some nice people have already made a Python converter from HAProxy's standard .txt files to wonderfully searchable and indexed HTML.

In case anyone else finds it useful, I'm hosting it here and here

Enjoy!

Link Aggregation For The Masses

Living in a rural part of the country, I've encountered a serious problem: I can't get anything other than a 3G or satellite internet connection.
Using a satellite connection is out of the question; not so much because of its high price tag as because of its poor reliability and high round-trip time, especially during bad weather. Link reliability is key for me because I need persistent VPN connections to work, and any reconnect can destroy quite a bit of work. E.g. when round-trip time (commonly referred to as "ping" by gamers) goes over a few hundred milliseconds, VPN software decides that the connection is broken and needs to be re-established. That takes a minute to get back to normal, and whatever I was submitting at the time is lost forever.
Not to mention daily conference calls via VoIP and video, which simply can't work on unreliable or high-latency links.

Fortunately, I live in Croatia, which among a few other advantages has really well-built 3G and 4G networks. However, my specific location isn't covered by 4G. Only 3G (HSPA+) is available here, but it does work really well. Also, my specific location is a popular tourist destination, so during tourist season the number of mobile users in this tiny fishing village grows 50-fold, meaning a lot of service disruptions for any individual mobile user, especially for data traffic.

Challenges to overcome:

  • The only available flat-rate 3G connection is limited to 8Mbps/512kbps, which in real-life conditions is more like 6Mbps/500kbps
  • 3G performance is really bad when there are a lot of concurrent connections.
    • This is not a problem for smaller things, such as smartphones or 1 laptop, because 1 device will commonly keep open fewer than 100 connections, but routing the entire home network with several computers, mobile phones, Chromecast, video conferencing, company VPN, and ever-growing automatic updates for all of those is just too much.
  • 3G reliability is a problem too. A few times every hour, 3G connection RTT skyrockets for some 30 seconds, causing applications to drop connections as if the link broke. One can imagine how annoying it is to be in a business meeting via Hangouts and have the connection break twice every hour, forcing me to reconnect.

It's 2015. As a wannabe IT guy I refuse the notion that I can't get a reliable high speed internet connection somewhere in Europe.
So, I got a total of 8 3G links, 6 of which are used in this article, and decided to aggregate them in order to get a reliable high-speed connection.

Link aggregation is a colloquial term to describe using multiple links as one. From a technical perspective, there are multiple techniques commonly used to aggregate links. Some of them provide redundancy (high availability), some provide higher network throughput, while some provide both.

Note that there are several ways to try and solve the above problems. E.g. I've been doing policy-based routing for a long time, where all traffic to Google's servers is routed through one link, other HTTPS traffic through another, TV through the next one, VPN through the next, etc. In combination with QoS this worked really well.
However, any implementation of PCC-based (per connection classifier) or any other policy-based routing is not good enough, because it doesn't guarantee that e.g. the link carrying my VPN or video call connection won't break. It just makes the perceived user experience better by routing individual connections over individual links. If a link breaks or its RTT skyrockets, it won't do anything to transfer that connection to another link or keep it alive. The connection will still break. So, if the link carrying my VPN or video conference traffic breaks, I'm still disconnected, though other types of traffic still work because they are going over a different link. Which is quite alright, because policy-based routing is not supposed to do that. It's just supposed to balance the traffic according to some rules, not provide redundancy in any way.

Bonding to the rescue

No, not that kind of bonding dammit! Focus. Bonding as in computers. Bonding as in what you do when the 1Gbps connection to your file server is the bottleneck and your boss won't allocate the money to upgrade to a 10G network!

From Google:
Network bonding refers to the combination of network interfaces on one host for redundancy and/or increased throughput. Redundancy is the key factor: we want to protect our virtualized environment from loss of service due to failure of a single physical link.

It's worth noting that there are several methods of bonding and all of them provide redundancy. Some do it in a way that decreases throughput, while others do it in a way that increases it compared to just 1 physical link.
In this article, we'll only be looking at round-robin (balance-rr in Linux) and XOR (balance-xor in Linux) bonding methods because they provide the most benefits for WAN link aggregation. There are quite a few other methods that can and should be preferred in other environments, such as datacenters and local networks. Wikipedia, of course, has a wonderful article on link aggregation which includes an explanation of each method. In short, if you can connect everything to the same switches (e.g. in a datacenter), just make sure that your switches support 802.3ad and use the 802.3ad bonding method :)

Equipment used:

  • 6x 3G router that I got when subscribing to ISP's service (€30/month each)
    • to prevent confusion, we'll simply call these "modems" in the rest of the article even though they are actually routing traffic
  • 1x MikroTik RB2011UAS-2HnD (€80)
    • We'll call this one "router" in the rest of the article, even though all 8 devices are technically routers
  • 1x Dedicated server from Hetzner (€27/month at their auction)
    • This doesn't do anything other than run our virtual machine
    • FlexiPack and /29 subnet (€20/month)
    • Installed Ubuntu and KVM virtualization on it
  • 1x Virtual Machine with a routed public /29 network installed on the dedicated server above
    • RouterOS x86 Level 4 license (€30)
    • Installed RouterOS on that virtual machine
    • We'll call this one "concentrator" because it concentrates our VPN connections :-/

Setup explanation

network overview1

In general, for bonding to even work, the bonded network interfaces have to be physically connected (directly or using a network switch). Considering that this is all done over the internet, bonding can't work out of the box.
To overcome this hurdle, we first need to make sure that there are some directly connected network interfaces to bond.
The obvious thing would be to create some VPN between our router and concentrator and bond those VPN interfaces. But that doesn't really work as expected for several reasons:

  • I don't have a static public IP
    • I can't get it on my type of service because my ISP has some bureaucratic nonsense and apparently doesn't like taking my money
  • I don't even have access to the dynamic public IP allocated by the ISP on every reconnect
    • That's because my ISP doesn't have IPv6 and is NAT-ing IPv4 for all mobile network users, so modems don't even get a public IP.

Therefore, I can't use any VPN tunnels that rely on public IPs unless I want to fiddle with scripting on every reconnect. Instead, I'll just use some VPN that can traverse NAT (PPTP, OpenVPN...)
Considering that I'll be using this for all of my internet traffic including video streaming, I should minimize the use of compression and encryption because those cause havoc for any type of streaming by queueing and delaying packets while compressing or encrypting them.
So, less is more when deciding which tunneling method to use - good old PPTP for the win!

After we have our PPTP tunnels, we should bond them. Oooops. Forgot that PPTP tunnels aren't really bondable because they aren't a full-blown ethernet type of interface. They operate on a higher networking level.
So, to overcome this we need some virtual ethernet-type interfaces inside those VPN tunnels.
Enter EoIP.
EoIP is MikroTik's wonderful way of making ethernet tunnels on top of an existing IP layer. Meaning, we create an EoIP network interface on each side of every one of our 6 PPTP tunnels to get 6 full ethernet links on both our router and our concentrator. Once we have that, we can treat EoIP interfaces like any real ethernet interface, meaning that we can bond them too!

Once we have our 6 ethernet connections, we can easily bond them on each side. The resulting bond interfaces will operate as if they were one single connection but actually send traffic over all 6 interfaces.

Depending on the type of bonding algorithm used, the combined throughput on the bonding interface will be greater than the throughput of any single link.

I'm not going to create copy-paste commands for MikroTik in this article. Instead, I'll try to describe everything in detail so that it can be easily replicated on other platforms, such as any Linux-based distribution or specialized networking platforms (Cisco, OpenWRT, ZeroShell, Juniper...)

Interface overview

  • I've connected each "modem" to one ethernet port on the home router. It's perfectly fine to physically connect them any way that you want to, as long as internet is accessible from the home router via each link.
    • In my case, "modems" are actually all routers and are set to dial internet connection automatically and provide internet access without any user intervention
    • Modems have IPs "", "", etc.
  • eth1-6: ethernet interface on the home "router" to which modems are connected; ip addresses assigned to the interface are "", "" ,etc.
  • vpn1-6: pptp interface on the home router. Each vpn tunnel has 1 IP address automatically assigned by the concentrator when PPTP connection gets established.
      • VPN Client-side addresses:
      • VPN Server-side addresses:
  • eoip1-6: EOIP (virtual ethernet) interfaces. These don't have any IPs assigned to them. They are virtual interfaces established inside each PPTP tunnel to give us full ethernet interfaces to bond.
  • bndSrv: Bond interface that we're creating on the vpn concentrator side
  • bndCli: Bond interface that we're creating on our home router

Step 0:

  1. We need to configure our "concentrator" as a PPTP server and an internet gateway for PPTP clients; meaning profiles, username/password, routing, NAT, etc. I won't be covering this in this article.

  2. We need to configure our modems to work and you know... NAT/route traffic. I won't be covering this either.

Step 1: ISP Links

As previously mentioned, I'm using 6 3G routers. For all intents and purposes these behave like any other crappy WiFi router that one would get from their ISP.
Though we're using 3G links in this article, these can actually be any kind of link: ADSL, cable, fiber... anything. And they don't even have to be the same, so we can mix e.g. cable and 3G. One thing to keep in mind is that performance will vary according to the links used and the bonding mode that we select later on. So, if we're looking to use 3G only when fiber is overloaded or down, we can. But more on that later.

The point is to connect all of our "modems" to the home router in some way, set up isolated networks for every "modem" in order to avoid any networking conflicts and we're good to go.
In my case, it means connecting each modem's ethernet port to one of the home router's ethernet ports and configuring IP addresses on modems and home router to actually have networking all set up. I'm using static IPs for this because I really don't want my home router to wait for the modems' DHCP to allocate an IP before bringing the interface up after e.g. a modem or router reboot.

I've connected each "modem" to one ethernet port on the home router.
Modems have IPs: "", "", etc.
Home router's ethernet ports eth1-6 have IPs: "", "" ,etc.
Notice that I've used /24 network in every case in order to make the configuration easier by not having to remember any subnets.

Step 2: Static Routes

Considering that we want every PPTP tunnel to go over a different underlying ISP link, we need to configure static routes.
So, if our "concentrator" has public IPs , , etc. we need to create a static route for each one of those public IPs and set its gateway to match the "modem" gateway IP.
E.g. on RouterOS we can do it like this:

/ip route
add check-gateway=ping distance=1 dst-address= gateway=
add check-gateway=ping distance=1 dst-address= gateway=
add check-gateway=ping distance=1 dst-address= gateway=
... etc. for every public IP / ISP link combination that we have.

This way, when we dial the first PPTP connection (to , it will go over 1st ISP link. 2nd connection (to will go over the 2nd link, etc.
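
With more than a handful of links, it can be easier to generate those route lines than to type them. Here is a minimal Python sketch of that idea. The addresses in it are hypothetical placeholders from documentation ranges (203.0.113.x standing in for the concentrator's public IPs, 192.168.N.1 for the modems), so substitute your own, and treat the helper name build_routes as made up for this example.

```python
def build_routes(pairs):
    """Return RouterOS route commands, one per (concentrator_ip, modem_gateway) pair."""
    lines = ["/ip route"]
    for public_ip, gateway in pairs:
        # One host route per concentrator IP, pinned to one modem as gateway.
        lines.append(
            f"add check-gateway=ping distance=1 dst-address={public_ip}/32 gateway={gateway}"
        )
    return lines


# Hypothetical placeholder addresses, one pair per ISP link (6 links).
example_pairs = [(f"203.0.113.{n}", f"192.168.{n}.1") for n in range(1, 7)]

if __name__ == "__main__":
    print("\n".join(build_routes(example_pairs)))
```

The output can be pasted into a RouterOS terminal; on other platforms only the format string would need to change.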

Step 3: PPTP tunnels

Assuming that we've configured the "concentrator" to work as a PPTP server, we need to create 6 PPTP clients on the home router.
Note: if your ISP is providing you with static IPs on underlying links, you can completely skip over this step, use your modems in bridge mode and configure static IPs on your home router. Unfortunately, I can't, so I need to sacrifice some performance and introduce this complexity to get static IPs to use when establishing EOIP links.

On MikroTik, configuring a PPTP client is really simple. Also, feel free to use the same username and password for all of these tunnels, as long as you've configured that user's profile on the concentrator to allow multiple links.

/interface pptp-client
add allow=pap connect-to= disabled=no max-mru=1492 max-mtu=1492 mrru=1600 name=vpn1 user=myname-1
add allow=pap connect-to= disabled=no max-mru=1492 max-mtu=1492 mrru=1600 name=vpn2 user=myname-2
add allow=pap connect-to= disabled=no max-mru=1492 max-mtu=1492 mrru=1600 name=vpn3 user=myname-3
 ... etc

  • VPN Client-side addresses:
  • VPN Server-side addresses:

Depending on your platform of choice for the concentrator, you'll need to configure this correctly.
Potential problems to watch out for:
1) Make sure that your PPTP tunnel's MTU is set correctly. E.g. if you're on ADSL, you'll probably have to scale this down to 1492 bytes.
2) Make sure to enable MRRU (not MRU but MRRU!) - it will allow packets larger than the MTU to be split and recombined on PPTP level. That way even if you mess something up when setting MTU on EOIP level, it will still work, but much slower than it should.
3) In case you can't set MRRU, don't worry; just make sure to configure MTU and MRU correctly, as well as MSS mangling firewall rules.

Step 4: EOIP tunnels

Now, if we configured everything correctly we should have 6 PPTP tunnels established and running just fine.
Each side of that tunnel (concentrator and home router) has 6 interfaces - concentrator has 6 dynamic PPTP interfaces, and home router has 6 persistent PPTP "interfaces".
vpn-1 tunnel has a client side IP and server side IP .

We will now create an EOIP tunnel named eoip1 that works inside the PPTP tunnel named vpn1.
We need to pay special attention to the tunnel-id parameter because it has to match on both sides and be unique per tunnel in order to avoid conflicts (e.g. if we create multiple tunnels with the same IPs).

On home router:

/interface eoip
add clamp-tcp-mss=no keepalive=20s local-address= name=eoip1 remote-address= tunnel-id=101
add clamp-tcp-mss=no keepalive=20s local-address= name=eoip2 remote-address= tunnel-id=102
add clamp-tcp-mss=no keepalive=20s local-address= name=eoip3 remote-address= tunnel-id=103

On the concentrator:

/interface eoip
add clamp-tcp-mss=no keepalive=20s local-address= name=eoip1 remote-address= tunnel-id=101
add clamp-tcp-mss=no keepalive=20s local-address= name=eoip2 remote-address= tunnel-id=102
add clamp-tcp-mss=no keepalive=20s local-address= name=eoip3 remote-address= tunnel-id=103

Note that we don't care about EOIP tunnels' MTU here. That's because bonding them will override this value and set it to whatever we choose for the aggregated bond interface.

Step 5: Bonding

And finally, the voodoo. Note: if you can get static public IPs on both concentrator and "home router", you can skip all the steps and start from here.
Let's remember that the use case here is multiple WAN link aggregation. Meaning, I want a fast reliable internet connection for my home. I'll be using it for basically all the standard needs - surfing, video conferencing, VoIP, watching movies, torrenting and everything else that a typical home office would do.
The use case is the primary thing to consider when thinking about bonding methods because, as previously mentioned, there are several methods possible and they all come with different advantages and downsides.

For this specific use, there are only 2 bonding methods that would aggregate bandwidth without wasting any links on redundancy. Those are balance-rr and balance-xor.
I'll first show what I've chosen, then explain why and describe the other modes below.

After a lot of testing, I've decided to use balance-xor method. To do that, I've created a new bond interface on both the concentrator and home router, added all these eoip links to them as slaves, and chosen balance-xor with Layer 3 and 4 hashing on both sides. This way, connections are distributed across all slaves.
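
To get a feel for how layer 3 and 4 hashing spreads connections across slaves, here is a rough Python sketch. It assumes the simple formula described in older Linux bonding documentation, ((source port XOR destination port) XOR ((source IP XOR destination IP) AND 0xffff)) modulo slave count. RouterOS and newer kernels may hash differently, and all addresses and ports in the example are made up.

```python
import ipaddress
from collections import Counter


def pick_slave(src_ip, src_port, dst_ip, dst_port, slave_count):
    """Map a connection's layer 3+4 identifiers to a slave index (assumed formula)."""
    ip_bits = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_bits & 0xFFFF)) % slave_count


# Hypothetical example: 60 connections from one host to one HTTPS server,
# differing only in their source port, spread over 6 slaves.
spread = Counter(
    pick_slave("10.0.0.2", 40000 + n, "198.51.100.7", 443, 6) for n in range(60)
)

if __name__ == "__main__":
    for slave in sorted(spread):
        print(f"eoip{slave + 1}: {spread[slave]} connections")
```

The point to take away is that every packet of one connection hashes to the same slave, so a connection never gets split, while many connections end up spread over all slaves.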
Note: Make sure that you lower the bond interface's MTU. I've set mine to 1388. If you leave MTU at 1500, it will still work (because underlying PPTP tunnels have MRRU 1600) but it will work a lot slower than it should, because almost all packets are being split in two on PPTP level and reassembled on the receiving end. That's because they're exceeding PPTP's MRU limit. Naturally, a lower MTU means that you'll have to configure MSS mangling for TCP SYN packets on your firewall.
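
The MSS value to clamp to follows directly from the bond MTU: for plain IPv4 TCP it is the MTU minus 20 bytes of IP header and 20 bytes of TCP header. A tiny Python sketch of that arithmetic, assuming IPv4 and no IP or TCP options (the function name is just for illustration):

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


if __name__ == "__main__":
    for mtu in (1500, 1492, 1388):
        print(f"MTU {mtu} -> MSS {mss_for_mtu(mtu)}")
```

So with a bond MTU of 1388, the MSS to clamp to under these assumptions would be 1348.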

After configuring the bond interface, EOIP interfaces should show up as slaves, and the only things remaining are assigning an IP address to each bond interface, configuring NAT and setting the bond as the default route.

End Results

... will vary based on the use case.
As previously mentioned, maximum speed of each one of the 6 links that I'm using is ~8Mbps, but in reality it's usually around 6Mbps.
However, those links are quite unreliable and often drop down to ~2Mbps, so with 6 of them I'll usually have like 3 links at 6Mbps while others will be anywhere between 2Mbps and 5Mbps.

Therefore, the theoretical limit that I can hit is 6x8Mbps = 48Mbps. But that's only theoretical. It will almost always be less than that, depending on current link speeds and the workload.
In balance-rr mode I've been able to reach those limits less than 10% of the time.

As we can see in the screenshots below (balance-rr mode), different workloads give us different results.

Ookla's Speedtest will measure only 12Mbps because it's done using a single TCP connection, which requires retransmissions in case of packets received out of order. Considering that we're using balance-rr mode, which in fact causes packets to be received out of order, we are basically wasting 60% of available bandwidth to retransmit those packets.

2nd screenshot is apt-fast. A wonderful apt helper application for downloading Debian/Ubuntu/Mint... system updates from multiple update mirrors at once. So, if we set it to download using 11 concurrent connections and provide it with enough mirrors, we get some 25Mbps. We are still wasting a lot of available bandwidth on TCP retransmission, but at least we're using multiple connections to avoid transmission blocking.

3rd screenshot is Aria - a standard console download manager for Debian/Ubuntu. You'll notice that I've set -x (connections per server) to 7 and -s (total connections) to 30 and provided it with 3 mirrors in order to avoid server-imposed throttling. This allowed me to download at some 16Mbps. Again, it's all TCP, so there's a lot of overhead caused by out-of-order packets.

The last screenshot shows MikroTik's bandwidth test integrated in their RouterOS. This is a UDP test between the concentrator and my home router and, as we can see, it shows an average 24Mbps UDP download, which is quite nice for these unreliable links.

It's worth mentioning that I'd be quite happy with 12Mbps single TCP connection speed and just leave it at that.
However, getting that speed is a rare case, because if one underlying 3G connection drops to, say, 512Kbps (spikes happen often), my total aggregated TCP throughput will be 6x512Kbps = 3Mbps. And that's not per TCP connection, but the total maximum bandwidth. Since these 3G links are quite unreliable, this happens quite often.
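
The arithmetic behind those numbers can be put into a rough Python model. With strict round-robin, every link gets the same number of packets, so the slowest link sets the pace for all of them. This is a simplification that ignores retransmissions and protocol overhead, and the function names are made up for this sketch.

```python
def ideal_sum(link_speeds):
    """Upper bound: every link fully used, whatever the unit (kbps, Mbps...)."""
    return sum(link_speeds)


def round_robin_limit(link_speeds):
    """Strict round-robin: each link carries an equal share, paced by the slowest."""
    return len(link_speeds) * min(link_speeds)


if __name__ == "__main__":
    healthy = [8000] * 6            # kbps, the advertised 8Mbps per link
    one_bad = [6000] * 5 + [512]    # kbps, one link dropped to 512Kbps
    print("ideal sum:", ideal_sum(healthy), "kbps")
    print("round-robin with one slow link:", round_robin_limit(one_bad), "kbps")
```

With one link at 512Kbps, this model gives 3072kbps for the whole bond, which is the ~3Mbps mentioned above.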

aggregated-speedtest1 aggregated-apt-fast1 aggregated-download1 aggregated-udp1

In balance-xor mode however, things are quite different. Unfortunately, I can't really provide screenshots at this point because ... well... my 3g has been really slow for the past week or so due to ISP problems.
But the differences are as described - higher aggregated throughput for multiple TCP connections, lower aggregated throughput for UDP (limited by 1 underlying link's speed) and waaaay better user experience. That's because most sites use CDNs to off-load traffic, so everything loaded simply works faster by default because multiple connections are used to download content.

The only problem with balance-xor that I can see is the future.
The web has slowly but surely stepped into the HTTP/2 protocol. This means that all sites are moving towards SPDY-style connection multiplexing (generic connection multiplexing in the HTTP/2 RFC).
This means that instead of opening multiple TCP connections to download the content of one website from one server (parallel browser connections), the HTTP/2 protocol encourages the use of a single encrypted TCP connection, multiplexed to download content in parallel.
This is also bad for balance-rr because protocol encryption only makes the retransmission of out-of-order packets more visible to the end user.
So, let's hope that projects such as Facebook's , Google's Loon and SpaceX's satellite internet gain more momentum and finally make affordable high speed internet globally available.

A bit more on bonding in layman terms

  • balance-rr
    • Advantages
      • Fastest "raw" speed
      • Highest "raw" throughput possible
      • Aggregated bandwidth will be a sum of all the underlying bond members' bandwidths, but only effective if the traffic is resilient to out-of-order packets.
      • Works by distributing packets (layer 2), and not just connections (layer 3). Effectively, this means that it can split a single (TCP) connection across multiple links, thus aggregating bandwidth on packet level, and not just distributing multiple connections across bond members.
      • High(est?) level of redundancy with virtually no overhead - even if all but one bond members are down, it will still work and will only lose packets until link monitoring figures that the link is down. For all home/office intents and purposes, users will notice that the link speed has dropped, and basically all connections (even VPN!) will keep on working; TCP will just retransmit lost packets and won't break.
    • Disadvantages
      • Any given connection (think TCP download) is limited by the speed of the slowest bond member. So, if one bond member has 1Gbps throughput and another member has 10Mbps throughput, our TCP connection will actually be limited to only 20Mbps (10Mbps on the slow link and only 10Mbps of the 1Gbps link). That's because this bonding mode uses a literal round-robin algorithm to distribute traffic across bond members.
      • Packets are received out of order. That's because splitting a single connection's packets across multiple links of varying speed will cause packets sent first to be received 2nd or 3rd or whichever. This might not be such a huge problem if we're bonding only 2 reliable symmetrical links, such as 2x1Gbps in a stable server room environment. The chance of packets arriving at the destination out of order there is minimal and retransmission is cheap.
        However, it's a complete mess when bonding 6 links with varying and high latency (3G). Also, the more underlying links we have, the higher the chance of receiving packets out of order, which increases the number of packet retransmissions. Therefore single TCP connections are in my case slower when aggregated than they are on a single link. Naturally, I still get a higher total throughput with 6 bonded links than with a single link. But it's kinda annoying to see a download limited to like 2Mbps when downloading OS updates or anything using a browser, knowing that raw speed is 40Mbps.
      • Those 2 downsides result in aggregate speed effectively varying based on the maximum speed of the slowest link
      • more links = more chance of out-of-order packets = more retransmit overhead = lower effective speed
      • the bigger the speed difference between bond members, the more packets are received out of order = more retransmits = lower effective speed
    • Use cases
      • Perfect for broadcasts or unicasts of any kind
      • SIP server
      • CCTV server
      • Torrent seeding via UDP
      • Video broadcast of any kind (as long as it's actually a broadcast/multicast/unicast and not i.e. flash streaming via TCP :D)
      • Perfect for any traffic that allows packets to be received out of order
  • balance-xor
    • Advantages
      • Same as balance-rr minus splitting traffic on packet level. Meaning that it splits traffic only per layer 3 and layer 4 identifiers.
        • This means that it can take multiple TCP connections between the same hosts and move them across different bond members. It can also "move" the same persistent TCP connection from one bond member to another. But it does not "split" a TCP connection across multiple bond members.
        • This means that single connection's packets are always received in order, thus ensuring maximum throughput that the bond member can provide. In case there are multiple bond members of different speed, one TCP connection will have the throughput of one underlying link (i.e. 10Mbps) while the other one that's bound to 1Gbps link will have 1Gbps speed.
      • (Most?) Efficient use of multiple asymmetric bond members because the faster link will actually get more workload than the slower one
      • Aggregated bandwidth will be a sum of all the underlying bond members' bandwidths, but only effective if traffic is split across multiple connections
      • Very little resource overhead for any type of traffic
      • A lot more established connections possible than with balance-rr 
    • Disadvantages
      • Not efficient for splitting traffic on packet level
      • Any single connection will be limited by the speed of its assigned underlying bond member.
        • In contrast to balance-rr, where a connection can be split across multiple bond members and have its speed increased that way
        • e.g. a UDP connection will be limited by a single bond member's speed, whereas in balance-rr it can be a sum of all bond members
      • Have to use multiple connections to utilize all the available bandwidth of all bonding members
    • Use cases
      • Well... anywhere where you would use balance-rr but don't have symmetrical and stable underlying bond members
        • Such as my use case where I've got 3G connections with high and varying latency (60ms) with up to 2s of RTT deviation
      • Office WAN links.
        • e.g. if you've got a 30Mbps primary cable link and a 20Mbps ADSL backup link, it's perfectly fine to bond them this way. You get fantastic redundancy out of the box, and those 20Mbps get used all the time, not just during outages. Naturally, there are easier and less expensive ways to achieve this, such as PCC queueing, but your connections will break in case of an outage, whereas bonding will keep them running and just fail over to the other link.
      • Same goes if you've got faster links, though if it's just redundancy you're looking for, other bonding modes might be a wiser choice - especially if you've got multiple failover links and would like to prioritize them.
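
The reordering problem that hurts balance-rr on uneven links can be reproduced with a toy model. A rough Python sketch, assuming made-up one-way latencies and a fixed send interval: it sends packets round-robin and counts the ones that arrive after a later-sent packet. It says nothing about how a real TCP stack would react to that.

```python
def count_out_of_order(latencies_ms, packets, interval_ms=1.0):
    """Send packets round-robin over links; count those overtaken by a later packet."""
    arrivals = []
    for seq in range(packets):
        link = seq % len(latencies_ms)  # literal round-robin link choice
        arrivals.append((seq * interval_ms + latencies_ms[link], seq))
    arrivals.sort()  # order in which the receiver sees the packets

    highest_seen = -1
    late = 0
    for _, seq in arrivals:
        if seq < highest_seen:
            late += 1  # a packet sent later has already arrived
        else:
            highest_seen = seq
    return late


if __name__ == "__main__":
    # Hypothetical latencies: six equal links vs. one link lagging behind.
    print("equal links:", count_out_of_order([60] * 6, 600))
    print("one slow link:", count_out_of_order([60] * 5 + [200], 600))
```

With equal latencies nothing is overtaken, while a single lagging link makes every packet sent over it arrive late. With balance-xor the same lagging link would only slow down the connections hashed onto it.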

Final words

  • Massive thanks to Dino Vizinger for thinking this through, testing it with me and peer reviewing it.
  • Pay attention to MTU. Always pay attention to damn MTU.
  • Always avoid contracts for any Telecom/ISP services.
  • Never buy good-looking network equipment or hardware. Best hardware always looks like it time-traveled from the '60s.

btrfs - is it better or just butter

Surprised by the number of articles all around t3h interwebz about btrfs having problems and saying it corrupts data, hangs or adds overhead, I've decided to describe my really pleasant desktop use experience.

The use case is quite simple. I have 3 drives that are used to keep data on my computer. Data being video and audio files, all kinds of Linux-based distro images, torrents and virtual machines that I use for testing purposes. I consider all of this data expendable, meaning that I would not be sorry if any of it got deleted or corrupted. Those are just test machines and multimedia that I can download or recreate again if I need them.

However, after I've seen the performance and management benefits of using btrfs to store this stuff, and tested it all thoroughly, I've decided to have my root filesystem, as well as my /home directory on it. Backups are still being stored on a completely separate drive, but even after 9 months of really abusing btrfs for all these purposes, I have not had any kind of data corruption, performance degradation or overhead. At all. Not on disk space, or in iops.

So, here is a little scenario that might help you get started with btrfs. It's more of a showcase for desktop use, than a performance benchmark of any kind.

First of all, let's give credit to what is IMHO the best entry-level howto that I've found. You might want to check it out to get familiar with btrfs usage:

We'll be using /run/shm when copying things, so that we get clean performance data (copy from and to RAM to prevent disk latency). In Ubuntu Oneiric and later, you can modify its size like this:

mount -o remount,size=2048M /run/shm

Ignoring the OS drive, here is a list of devices used in this article. These are 3 really slow 2.5" laptop drives that I'm using because they are small and quiet. As we can see, there are already some btrfs partitions on them, but we'll focus on the 5th partition of every drive - that's the only logical partition and it is currently formatted as ext4.

parted -l

Model: ATA WDC WD2500BMVS-1 (scsi)
Disk /dev/sdb: 250GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type      File system  Flags
 1      32.3kB  12.8GB  12.8GB  primary   btrfs
 2      12.8GB  25.6GB  12.8GB  primary   btrfs
 3      25.6GB  178GB   153GB   primary   btrfs
 4      179GB   250GB   70.7GB  extended
 5      179GB   197GB   17.2GB  logical   ext4

Model: ATA ST9250410AS (scsi)
Disk /dev/sdc: 250GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type      File system  Flags
 1      32.3kB  12.8GB  12.8GB  primary   btrfs
 2      12.8GB  25.6GB  12.8GB  primary   btrfs
 3      25.6GB  178GB   153GB   primary   btrfs
 4      179GB   250GB   70.8GB  extended
 5      179GB   196GB   17.2GB  logical   ext4

Model: ATA MAXTOR STM316081 (scsi)
Disk /dev/sdd: 160GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type      File system  Flags
 1      32.3kB  10.7GB  10.7GB  primary   btrfs
 2      10.7GB  21.5GB  10.7GB  primary   btrfs
 3      21.5GB  88.2GB  66.7GB  primary   btrfs
 4      89.2GB  160GB   70.8GB  extended
 5      89.2GB  106GB   17.2GB  logical   ext4


mkdir /mnt/sdb5 /mnt/sdc5 /mnt/sdd5 /mnt/raid

ls -lah /mnt/
total 0
drwxr-xr-x 1 root root 32 Mar 16 09:08 .
drwxr-xr-x 1 root root 302 Mar 6 17:59 ..
drwxr-xr-x 1 root root 0 Mar 16 09:08 raid
drwxr-xr-x 1 root root 0 Mar 16 09:08 sdb5
drwxr-xr-x 1 root root 0 Mar 16 09:08 sdc5
drwxr-xr-x 1 root root 0 Mar 16 09:08 sdd5

Ext4 single device

Let's do some benchmarks using ext4 on just one device, to get a feel for the performance we can expect. All 3 drives are more or less the same (actually not, but it does not really matter).

mkfs.ext4 /dev/sdb5
tune2fs -o journal_data_writeback /dev/sdb5
mount -t ext4 -o defaults,noatime,data=writeback /dev/sdb5 /mnt/sdb5

mount |grep sdb5
/dev/sdb5 on /mnt/sdb5 type ext4 (rw,noatime,data=writeback)
df -h |grep sdb5
/dev/sdb5        16G  369M   15G   3% /mnt/sdb5

And here is our video file to be copied:

ls -lah /run/shm/video.m4v 
-rw-r--r-- 1 root root 1.8G Mar 16 09:12 /run/shm/video.m4v

In these examples, I'll use "pv" instead of "cp", so that we can see the copy progress. Note that "cp" is some 10% faster in my case, but if we use the same tool all the time, we should get consistent results.

time pv /run/shm/video.m4v > /mnt/sdb5/video.m4v

1.73GB 0:00:43 [40.4MB/s] [==================>] 100%
real    0m45.696s
user    0m0.036s
sys     0m2.728s

OK, so that's our starting value, using only one drive formatted as ext4. That was my typical setup before I started using multiple-device setups, one of which is btrfs. I've run this test several times, and it always ends up with those numbers.
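
Repeating the copy is easy to script. Here is a minimal sketch, assuming GNU date (for the %N nanosecond format). bench_copy is a helper name made up for this example, and it uses plain cat instead of pv, so absolute numbers may differ a bit from the ones above:

```shell
# bench_copy SRC DST [RUNS]: copy SRC to DST several times and print the
# wall-clock seconds of each run. Hypothetical helper, not part of pv.
bench_copy() {
    src=$1; dst=$2; runs=${3:-3}
    i=1
    while [ "$i" -le "$runs" ]; do
        rm -f -- "$dst"
        start=$(date +%s.%N)
        cat -- "$src" > "$dst"
        sync    # flush dirty pages so the actual disk write is included
        end=$(date +%s.%N)
        awk -v s="$start" -v e="$end" -v n="$i" \
            'BEGIN { printf "run %d: %.3f s\n", n, e - s }'
        i=$((i + 1))
    done
}
```

It could be called as "bench_copy /run/shm/video.m4v /mnt/sdb5/video.m4v 3". The sync call is there so that data still sitting in the page cache does not make a run look faster than the disk really is.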

Ext4 on software RAID0

Now, let's see what this test looks like on my old setup: a big ext4 partition on software raid0 which consists of 3 partitions, one on each drive (sdb5, sdc5 and sdd5). Those are the same partitions that we'll use in our btrfs test. Keep in mind that software (kernel) raid0 and raid1 levels perform better than any affordable raid controller you can buy. Performance benefits from using a dedicated hardware controller are visible only when that controller is a real hardware controller (usually $1k or more). And even then the performance difference is minimal. However, hardware controllers do have better rebuild rates.

mdadm --create /dev/md0 --chunk=4 --level=0 --raid-devices=3 /dev/sdb5 /dev/sdc5 /dev/sdd5

mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
cat /proc/mdstat 
Personalities : [raid0]
md0 : active raid0 sdd5[2] sdc5[1] sdb5[0]
      50336508 blocks super 1.2 4k chunks

tune2fs -o journal_data_writeback /dev/md0p1
mount -t ext4 -o defaults,noatime,data=writeback /dev/md0p1 /mnt/raid

df -h |grep raid
/dev/md0p1       48G  853M   45G   2% /mnt/raid
time pv /run/shm/video.m4v > /mnt/raid/video.m4v
1.73GB 0:00:16 [ 107MB/s] [==================>] 100%
real    0m16.789s
user    0m0.000s
sys     0m2.860s

OK, so as expected, writing to 3 striped (raid0) devices is much faster than writing to just one. The gain is close to linear (45.7s down to 16.8s, about 2.7x with 3 drives), so everything is working as expected.

btrfs in raid0 setup

Now for the fun part...

WARNING: do not try to specify nodesize, leafsize and sectorsize larger than 4KB. Although mkfs will create the filesystem and it will mount, the kernel will lock up when trying to do anything else (including unmounting it or shutting the machine down). This is just a wild guess, but it might have something to do with the kernel PAGESIZE. I'm not sure if it's a feature. For what it's worth, I'm running Linux 3.2.0-18-generic #29-Ubuntu SMP Fri Mar 9 21:36:08 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
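
If you want a guard against that mistake in your own scripts, something like the following might do. It is only a sketch that encodes my page size guess from the warning above, not a documented btrfs rule, and check_nodesize is a made-up helper name:

```shell
# check_nodesize BYTES: succeed only if BYTES does not exceed the kernel
# page size. Encodes the guess from the warning, not an official limit.
check_nodesize() {
    page=$(getconf PAGESIZE)
    if [ "$1" -gt "$page" ]; then
        echo "nodesize $1 is larger than page size $page, refusing" >&2
        return 1
    fi
    echo "nodesize $1 is within page size $page"
}
```

For example, "check_nodesize 4096 && mkfs.btrfs ..." would only run mkfs when the requested size fits in one page.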

mkfs.btrfs -m raid0 -d raid0 -n 4096 -l 4096 -s 4096 -L testing /dev/sdb5 /dev/sdc5 /dev/sdd5

adding device /dev/sdc5 id 2
adding device /dev/sdd5 id 3
fs created label testing on /dev/sdb5
        nodesize 4096 leafsize 4096 sectorsize 4096 size 48.01GB
Btrfs Btrfs v0.19
btrfs filesystem show testing
Label: 'testing'  uuid: f765325b-fa2e-4df3-8242-8c101a914f5f
        Total devices 3 FS bytes used 28.00KB
        devid    3 size 16.00GB used 2.00GB path /dev/sdd5
        devid    1 size 16.00GB used 2.02GB path /dev/sdb5
        devid    2 size 16.00GB used 2.00GB path /dev/sdc5

mount -o noatime,compress=lzo,space_cache /dev/sdb5 /mnt/raid

df -h |grep raid
/dev/sdb5        49G   28K   45G   1% /mnt/raid
btrfs filesystem df /mnt/raid
Data, RAID0: total=3.00GB, used=0.00
Data: total=8.00MB, used=0.00
System, RAID0: total=15.94MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, RAID0: total=3.00GB, used=24.00KB
Metadata: total=8.00MB, used=0.00

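As we can see, the sizes don't match: df reports the full 49G, while "btrfs filesystem df" reports only a few GB in total. That's perfectly normal and expected, because the totals in "btrfs filesystem df" show what is currently allocated, not the capacity. And an interesting thing: we can see it's automatically resized. So, btrfs expands and shrinks the allocated space on the fly. That's just fine IMHO, so we can proceed.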
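
To see how much is actually allocated on the devices, we can sum the "used" column of "btrfs filesystem show". This is a rough sketch that assumes the v0.19 output format pasted above and values in GB only. allocated_gb is a made-up helper name:

```shell
# allocated_gb: read "btrfs filesystem show" output on stdin and sum the
# per-device "used" values. Assumes every value carries a GB suffix.
allocated_gb() {
    awk '$1 == "devid" { sub(/GB$/, "", $6); sum += $6 }
         END { printf "%.2f\n", sum }'
}

# Sample input copied from the output above.
allocated_gb <<'EOF'
        devid    3 size 16.00GB used 2.00GB path /dev/sdd5
        devid    1 size 16.00GB used 2.02GB path /dev/sdb5
        devid    2 size 16.00GB used 2.00GB path /dev/sdc5
EOF
```

For the sample this prints 6.02, which is roughly the sum of the Data, Metadata and System totals reported by "btrfs filesystem df". On a live system the same function could be fed with "btrfs filesystem show testing | allocated_gb".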

time pv /run/shm/video.m4v > /mnt/raid/video.m4v
1.73GB 0:00:14 [ 125MB/s] [==================>] 100%
real    0m14.571s
user    0m0.000s
sys     0m2.200s

So, results are at least as good as with kernel raid0, and write time drops from 45.7s (single device) to 14.6s. That's quite linear for 3 drives, and that's what I call cool.
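
To put a number on "linear", here is the arithmetic on the "real" times from the three runs above. speedup is just a throwaway helper for this example:

```shell
# speedup BASE_SECONDS NEW_SECONDS: print how many times faster the new run is.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

echo "md raid0 vs single ext4:    $(speedup 45.696 16.789)x"
echo "btrfs raid0 vs single ext4: $(speedup 45.696 14.571)x"
```

This gives 2.72x for md raid0 and 3.14x for btrfs raid0, against an ideal of 3x for 3 drives. One run each is not a proper benchmark, so treat the small difference between the two with caution.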


I've repeated these tests for 9 months, with many small files, large files, concurrent processes and everything I could think of. Results are always the same, or even much better with btrfs. I have not noticed any performance drops. No matter what file it is.

My point is that btrfs raid0 is no slower than software raid0, which is no slower than hardware raid0, so we are not going to have any penalties from using it.
What do we gain, and why do I prefer it to any other solution?

Simplicity. I have 3 btrfs filesystems at this time. Root, home and media (media actually has 2 subvolumes, vm and Videos with different mount options).

And just for comparison, simplicity comes into play with slightly more complicated setups, be it several arrays, or more complex arrays such as raid10.

For example, to add a device to software raid0 setup, we have to:

  1. create that partition or device
  2. add that device to raid0
  3. extend ext4 volume to utilize the new size

With btrfs, we only have to:

  1. create that partition or device
  2. add that device to raid0

With raid10, it gets a bit more complicated. A typical raid10 setup that many people use on Linux is:

  • raid1 volume 1 (2 redundant devices)
  • raid1 volume 2 (2 redundant devices)
  • and then they create an LVM volume group consisting of those 2 raid arrays. This gives us speed, redundancy and easy recovery.

So, to add another pair of devices to that setup, we would:

  1. create the new partitions or devices
  2. build a new raid1 array from those devices
  3. add the new raid1 array to the volume group as a PV
  4. extend the LV to utilize the new PV space
  5. extend the ext4 filesystem to utilize the new LV size
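
Steps 2 to 5 of the list above could look roughly like this. It is a sketch, not a copy of my actual setup: the device names, vg0 and data are placeholders, and partitioning (step 1) is left out. By default it only prints the commands. Set DRY_RUN=0 to really run them as root:

```shell
# run CMD...: print the command, or execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# grow_lvm_raid10 NEW_MD DEV_A DEV_B VG LV (all names are placeholders)
grow_lvm_raid10() {
    new_md=$1; dev_a=$2; dev_b=$3; vg=$4; lv=$5
    run mdadm --create "$new_md" --level=1 --raid-devices=2 "$dev_a" "$dev_b"
    run pvcreate "$new_md"                     # turn the new array into a PV
    run vgextend "$vg" "$new_md"               # add the PV to the volume group
    run lvextend -l +100%FREE "/dev/$vg/$lv"   # grow the LV into the new space
    run resize2fs "/dev/$vg/$lv"               # grow ext4 to the new LV size
}

grow_lvm_raid10 /dev/md2 /dev/sde5 /dev/sdf5 vg0 data
```

That is five different tools (partitioning included) that all have to agree with each other, which is exactly the kind of thing I'd rather not do at 2 AM.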

With btrfs, we only need 2 or 3 steps:

  1. create that partition or device
  2. add that device to raid10
  3. rebalance the array (or ignore this step and allow btrfs to do it automatically over time)
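
Steps 2 and 3 of the btrfs list above could be sketched like this. Device and mount point are placeholders, and the commands are only printed unless DRY_RUN=0 is set. Note that "btrfs balance start" is the syntax of newer btrfs-progs. On the v0.19 tools used in this article the equivalent should be "btrfs filesystem balance", so check your man page first:

```shell
# run CMD...: print the command, or execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# grow_btrfs DEVICE MOUNTPOINT (both are placeholders)
grow_btrfs() {
    dev=$1; mnt=$2
    run btrfs device add "$dev" "$mnt"   # the filesystem grows immediately
    run btrfs balance start "$mnt"       # optional: spread existing data
}

grow_btrfs /dev/sde5 /mnt/raid
```

Both commands work on a mounted filesystem, so there is no separate resize step afterwards.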

And this is just when we consider RAID setups. btrfs comes with much, much more, and eliminates the need for RAID/LVM completely.
For example, it's got built-in subvolume and snapshot management. And it's copy-on-write (COW), so snapshots perform really well. But that's not covered in this blog entry.

Also, notice that I've used the "compress=lzo,space_cache" mount options to mount btrfs? Yup, it comes with lzo and zlib compression, and space_cache is just there to make allocation faster. Note that "compress" does not mean "slow". Quite the contrary, it's got a smart little algorithm that will use compression only for files which can benefit from it. And if you use this mount option on filesystems with many small files (such as root and home), you gain a LOT of performance. Also, I've found that using "lzo" instead of "zlib" compression makes things less CPU-intensive. At least for my use case. More about that can be found at
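
Why compressing only the files that benefit makes sense is easy to show from userspace. The sketch below uses gzip as a stand-in compressor. It is not btrfs's internal logic, just an illustration of compressible versus incompressible data, and ratio is a made-up helper name:

```shell
# ratio FILE: print compressed size divided by original size.
# gzip stands in for the filesystem compressor here.
ratio() {
    orig=$(wc -c < "$1")
    packed=$(gzip -c "$1" | wc -c)
    awk -v o="$orig" -v p="$packed" 'BEGIN { printf "%.2f\n", p / o }'
}

tmp=$(mktemp -d)
head -c 1000000 /dev/zero | tr '\0' 'a' > "$tmp/text"   # highly compressible
head -c 1000000 /dev/urandom > "$tmp/random"            # incompressible
echo "text:   $(ratio "$tmp/text")"
echo "random: $(ratio "$tmp/random")"
rm -rf "$tmp"
```

The repetitive file shrinks to a tiny fraction of its size, while the random one does not shrink at all. Already compressed data such as video behaves like the random file, so spending CPU on it would be wasted effort.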

BTW, if you prefer a btrfs crash-course video, here's one you might find entertaining:

That's all for now :D