btrfs

is it better or just butter

Surprised by the number of articles all around t3h interwebz claiming that btrfs has problems (corruption, hangs, overhead), I’ve decided to describe my really pleasant experience of using it on the desktop.

The use case is quite simple. I have 3 drives that are used to keep data on my computer. The data is video and audio files, all kinds of Linux distro images, torrents and virtual machines that I use for testing purposes. I consider all of this data expendable, meaning that I would not be sorry if any of it got deleted or corrupted. Those are just test machines and multimedia that I can download or recreate if I need them.

However, after I’d seen the performance and management benefits of using btrfs to store this stuff, and tested it all thoroughly, I decided to put my root filesystem, as well as my /home directory, on it. Backups are still stored on a completely separate drive, but even after 9 months of really abusing btrfs for all these purposes, I have not had any kind of data corruption, performance degradation or overhead. At all. Not in disk space, nor in IOPS.

So, here is a little scenario that might help you get started with btrfs. It’s more of a showcase for desktop use, than a performance benchmark of any kind.

First of all, let’s give credit to what is, IMHO, the best entry-level howto that I’ve found. You might want to check it out to get familiar with btrfs usage: BTRFS Fun

We’ll be using /run/shm as the source when copying things, so that we get clean performance data (the source file is read from RAM, so read latency of a second disk does not skew the write numbers). In Ubuntu Oneiric and later, you can modify its size like this:

mount -o remount,size=2048M /run/shm

Ignoring the OS drive, here is the list of devices used in this article. These are 3 really slow 2.5-inch laptop drives that I’m using because they are small and quiet. As we can see, there are already some btrfs partitions on them, but we’ll focus on the 5th partition of every drive: it’s the only logical partition, and it is currently formatted as ext4.

parted -l

Model: ATA WDC WD2500BMVS-1 (scsi)
Disk /dev/sdb: 250GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number  Start   End     Size    Type      File system  Flags
 1      32.3kB  12.8GB  12.8GB  primary   btrfs
 2      12.8GB  25.6GB  12.8GB  primary   btrfs
 3      25.6GB  178GB   153GB   primary   btrfs
 4      179GB   250GB   70.7GB  extended
 5      179GB   197GB   17.2GB  logical   ext4

Model: ATA ST9250410AS (scsi)
Disk /dev/sdc: 250GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number  Start   End     Size    Type      File system  Flags
 1      32.3kB  12.8GB  12.8GB  primary   btrfs
 2      12.8GB  25.6GB  12.8GB  primary   btrfs
 3      25.6GB  178GB   153GB   primary   btrfs
 4      179GB   250GB   70.8GB  extended
 5      179GB   196GB   17.2GB  logical   ext4

Model: ATA MAXTOR STM316081 (scsi)
Disk /dev/sdd: 160GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number  Start   End     Size    Type      File system  Flags
 1      32.3kB  10.7GB  10.7GB  primary   btrfs
 2      10.7GB  21.5GB  10.7GB  primary   btrfs
 3      21.5GB  88.2GB  66.7GB  primary   btrfs
 4      89.2GB  160GB   70.8GB  extended
 5      89.2GB  106GB   17.2GB  logical   ext4
 

Preparation

mkdir /mnt/sdb5 /mnt/sdc5 /mnt/sdd5 /mnt/raid

ls -lah /mnt/
total 0
drwxr-xr-x 1 root root 32 Mar 16 09:08 .
drwxr-xr-x 1 root root 302 Mar 6 17:59 ..
drwxr-xr-x 1 root root 0 Mar 16 09:08 raid
drwxr-xr-x 1 root root 0 Mar 16 09:08 sdb5
drwxr-xr-x 1 root root 0 Mar 16 09:08 sdc5
drwxr-xr-x 1 root root 0 Mar 16 09:08 sdd5
 

Ext4 single device

Let’s do some benchmarks using ext4 on just one device, to get a feel for the performance we can expect. All 3 drives are more or less the same (actually not, but it does not really matter).

mkfs.ext4 /dev/sdb5
tune2fs -o journal_data_writeback /dev/sdb5
mount -t ext4 -o defaults,noatime,data=writeback /dev/sdb5 /mnt/sdb5

mount |grep sdb5
/dev/sdb5 on /mnt/sdb5 type ext4 (rw,noatime,data=writeback)

df -h |grep sdb5
/dev/sdb5        16G  369M   15G   3% /mnt/sdb5

And here is our video file to be copied:

ls -lah /run/shm/video.m4v
-rw-r--r-- 1 root root 1.8G Mar 16 09:12 /run/shm/video.m4v

In these examples, I’ll use “pv” instead of “cp”, so that we can see the copy progress. Note that “cp” is some 10% faster in my case, but if we use the same tool all the time, we should get consistent results.

time pv /run/shm/video.m4v > /mnt/sdb5/video.m4v

1.73GB 0:00:43 [40.4MB/s] [==================>] 100%
real    0m45.696s
user    0m0.036s
sys     0m2.728s

OK, so that’s our starting value, using only one drive formatted as ext4. That was my typical setup before I started using multi-device setups, btrfs being one of them. I’ve run this test several times, and it always ends up with those numbers.
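Repeating a copy test like the one above by hand gets tedious, so here is a minimal sketch of a loop that does it. The helper name bench_copy is hypothetical and not part of any tool used in this article. It uses cat rather than pv so that it needs no extra packages, it reports whole seconds only, and it calls sync after each copy so that cached writes are counted, which the pv runs in this article do not do. Expect its numbers to differ somewhat from the ones shown here.

```shell
#!/bin/sh
# bench_copy SRC DST [RUNS]
# Hypothetical helper: copy SRC to DST RUNS times (default 3) and print
# the wall-clock duration of each run in whole seconds.
bench_copy() {
    src=$1
    dst=$2
    runs=${3:-3}
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        cat "$src" > "$dst"
        sync                      # flush cached writes before stopping the clock
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        rm -f "$dst"              # remove the copy so every run starts clean
        i=$((i + 1))
    done
}
```

For the setup in this article it would be called as: bench_copy /run/shm/video.m4v /mnt/sdb5/video.m4v 3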

Ext4 on software RAID0

Now, let’s see what this test looks like when we do it using my old setup: a big ext4 partition on software raid0 which consists of 3 partitions, one on each drive (sdb5, sdc5 and sdd5). Those are the same partitions that we’ll use in our btrfs test. Keep in mind that software (kernel) raid0 and raid1 levels perform better than any affordable raid controller you can buy. Performance benefits from using a dedicated controller are visible only when it is a real hardware controller (usually $1k or more). And even then the performance difference is minimal. However, hardware controllers do have better rebuild rates.

mdadm --create /dev/md0 --chunk=4 --level=0 --raid-devices=3 /dev/sdb5 /dev/sdc5 /dev/sdd5
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

cat /proc/mdstat
Personalities : [raid0]
md0 : active raid0 sdd5[2] sdc5[1] sdb5[0]
      50336508 blocks super 1.2 4k chunks

tune2fs -o journal_data_writeback /dev/md0p1
mount -t ext4 -o defaults,noatime,data=writeback /dev/md0p1 /mnt/raid

df -h |grep raid
/dev/md0p1       48G  853M   45G   2% /mnt/raid

time pv /run/shm/video.m4v > /mnt/raid/video.m4v
1.73GB 0:00:16 [ 107MB/s] [==================>] 100%
real    0m16.789s
user    0m0.000s
sys     0m2.860s

OK, so as expected, writing to 3 striped (raid0) devices is much faster than writing to just one: about 2.7 times faster, which is close to a linear gain, so everything is working as expected.

btrfs in raid0 setup

Now for the fun part…

WARNING: do not try to specify nodesize, leafsize and sectorsize larger than 4KB. Although the filesystem will be created and will mount, the kernel will lock up when trying to do anything else (including unmounting it or shutting the machine down). This is just a wild guess, but it might have something to do with the kernel PAGESIZE. Not sure if it’s a feature. For what it’s worth, I’m running Linux 3.2.0-18-generic #29-Ubuntu SMP Fri Mar 9 21:36:08 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

mkfs.btrfs -m raid0 -d raid0 -n 4096 -l 4096 -s 4096 -L testing /dev/sdb5 /dev/sdc5 /dev/sdd5
adding device /dev/sdc5 id 2
adding device /dev/sdd5 id 3
fs created label testing on /dev/sdb5
        nodesize 4096 leafsize 4096 sectorsize 4096 size 48.01GB
Btrfs Btrfs v0.19

btrfs filesystem show testing
Label: 'testing'  uuid: f765325b-fa2e-4df3-8242-8c101a914f5f
        Total devices 3 FS bytes used 28.00KB
        devid    3 size 16.00GB used 2.00GB path /dev/sdd5
        devid    1 size 16.00GB used 2.02GB path /dev/sdb5
        devid    2 size 16.00GB used 2.00GB path /dev/sdc5

mount -o noatime,compress=lzo,space_cache /dev/sdb5 /mnt/raid

df -h |grep raid
/dev/sdb5        49G   28K   45G   1% /mnt/raid

btrfs filesystem df /mnt/raid
Data, RAID0: total=3.00GB, used=0.00
Data: total=8.00MB, used=0.00
System, RAID0: total=15.94MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, RAID0: total=3.00GB, used=24.00KB
Metadata: total=8.00MB, used=0.00

time pv /run/shm/video.m4v > /mnt/raid/video.m4v
1.73GB 0:00:14 [ 125MB/s] [==================>] 100%
real    0m14.571s
user    0m0.000s
sys     0m2.200s

So, results are at least as good as with kernel raid0 (14.6s against 16.8s), and write time drops from 45.7s (single device) to 14.6s. That’s roughly a 3x gain on 3 drives, which is quite linear, and that’s what I call cool.
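To put a number on “linear”, the “real” times from the three runs can be turned into speedup ratios. This is just a small sketch: the function name speedup is made up for illustration, and the arithmetic is done by awk.

```shell
#!/bin/sh
# speedup BASELINE_SECONDS NEW_SECONDS
# Hypothetical helper: print how many times faster the new run is,
# rounded to two decimals.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

# "real" times taken from the runs in this article
echo "md raid0 vs single ext4:    $(speedup 45.696 16.789)x"
echo "btrfs raid0 vs single ext4: $(speedup 45.696 14.571)x"
```

With 3 drives, a perfectly linear gain would be 3x, so both setups land close to that.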

Conclusion

I’ve repeated these tests for 9 months, with many small files, large files, concurrent processes and everything else I could think of. Results are always the same, or even much better with btrfs. I have not noticed any performance drops, no matter what kind of file it is.

My point is that btrfs raid0 is no slower than software raid0, which is no slower than hardware raid0, so we are not going to pay any penalty for using it. So what do we gain, and why do I prefer it to any other solution?

Simplicity. I have 3 btrfs filesystems at this time. Root, home and media (media actually has 2 subvolumes, vm and Videos with different mount options).

And just for comparison, simplicity comes into play with slightly more complicated setups, be it several arrays, or more complex arrays such as raid10.

For example, to add a device to software raid0 setup, we have to:

  1. create that partition or device
  2. add that device to raid0
  3. extend ext4 volume to utilize the new size

With btrfs, we only have to:

  1. create that partition or device
  2. add that device to raid0

With raid10, it gets a bit more complicated. A typical raid10 setup that many people use on Linux is:

  • raid1 volume 1 (2 redundant devices)
  • raid1 volume 2 (2 redundant devices)
  • and then an LVM volume spanning those 2 raid1 arrays. This gives us speed, redundancy and easy recovery.

So, to add more partitions to that setup, we would:

  1. create the new partitions or devices
  2. create a new raid1 array from those devices
  3. add the new raid1 array to the LVM volume group as a PV
  4. extend the LV to utilize the new PV space
  5. extend the ext4 filesystem to utilize the new LV size

With btrfs, we only need 2 or 3 steps:

  1. create that partition or device
  2. add that device to raid10
  3. rebalance the array (or skip this step: new writes will use the new device, but existing data stays where it is until a balance is run)
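As a rough sketch, the btrfs steps map onto two commands. Everything specific here is an assumption for illustration: /dev/sde5 is a hypothetical new partition, grow_btrfs and run are made-up helper names, and the script only prints the commands unless DRY_RUN=0 is set. The balance command is written in the btrfs-progs v0.19 form used in this article; newer btrfs-progs releases spell it “btrfs balance start”.

```shell
#!/bin/sh
# run CMD...: print the command by default, execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# grow_btrfs NEW_DEVICE MOUNT_POINT
# Hypothetical helper: add a device to a mounted btrfs filesystem,
# then spread the existing data across all of its devices.
grow_btrfs() {
    new_dev=$1        # e.g. /dev/sde5 (hypothetical, already partitioned)
    mount_point=$2    # e.g. /mnt/raid
    run btrfs device add "$new_dev" "$mount_point"
    run btrfs filesystem balance "$mount_point"   # optional; can take a while
}

grow_btrfs /dev/sde5 /mnt/raid
```

The dry-run default is deliberate: both commands need root and change the filesystem, so it is worth reading what would be executed first.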

And this is just when we consider RAID setups. btrfs comes with much, much more, and eliminates the need for a separate RAID/LVM layer completely. For example, it’s got built-in subvolume and snapshot management. And it’s copy-on-write (COW), so snapshots perform really, really well. But that’s not covered in this blog entry.

Also, did you notice that I’ve used the “compress=lzo,space_cache” mount options to mount btrfs? Yup, it comes with lzo and zlib compression, and space_cache is just there to make allocation faster. Note that “compress” does not mean “slow”. Quite the contrary, it’s got a smart little algorithm that will use compression only for files which can benefit from it. And if you use this mount option on filesystems with lots of small files (such as root and home), you gain a LOT of performance. Also, I’ve found that using “lzo” instead of “zlib” compression makes things less CPU-intensive. At least for my use case. More about that can be found in Phoronix’s article.

BTW, if you prefer a btrfs crash-course video, here’s one you might find entertaining: