Flashcache at Facebook: From 2010 to 2013 and beyond
Facebook Engineering
By Domas Mituzas
We recently released a new version of Flashcache, kicking off the flashcache-3.x series. We’ve spent the last few months working on this release, and the work has paid off in measurable performance improvements: the average hit rate rose from 60% to 80%, and the disk operation rate was cut nearly in half.
History of Flashcache at Facebook
Facebook first started using Flashcache in 2010. At the time, we were trying to find a way around having to choose between a SAS or SATA disk-based solution and a purely flash-based solution. None of the options were ideal: The SAS and SATA options in 2010 were slow (SATA) or required many disks (SAS), and the cost per gigabyte for flash in 2010 was high.
One possible solution to this would have been to split our databases into multiple tiers – a handful of tiers that handle the most active data and need high-performance hardware, and others that handle colder data and can run on lower-performance gear. This would have been technically possible, as our data access patterns typically conform to a Zipfian distribution: Even with all the RAM-based caches we use (memcache, TAO, InnoDB buffer pool, etc.), hot data is usually accessed 10x more than average data. But such a solution would have been relatively complex, and at our scale additional complexity is less desirable.
Looking at this in 2010, we saw an opportunity to try to solve this problem at the software layer. We evaluated adding support for an L2 cache directly into InnoDB, but concluded it was better to make it transparent to applications like MySQL. So we instead implemented Flashcache as a Linux kernel device mapper target and deployed it into production on a large scale. 
Performance analysis and optimization
In the years that followed, the performance profile of our systems changed: InnoDB compression allowed us to store more logical data, which in turn demanded more IOPS, and we moved various archival bits to other tiers and optimized our data to consume less space while maintaining accessibility. The number of disk operations required for our workload increased, and we saturated the disk I/O limits on some servers. Given this, we decided to take a deeper look at Flashcache performance in production to understand whether we could make it more efficient.
The performance characteristics of different disk drive types are driven by several factors, including the rotational speed of the disk, the speed at which the head moves, and the number of reads that can be squeezed into a single rotation. In the past, SATA drives have generally performed worse than enterprise SAS components, so we wanted to come up with optimizations in our software stack that would let us utilize our systems efficiently.
Though tools like ‘iostat’ are useful for understanding general system performance, we needed deeper inspection. We used the ‘blktrace’ facility in Linux to trace every request issued by the database software and analyze how it was served by our flash- and disk-based devices. Doing so helped identify a number of areas for improvement, including three major ones: read-write distribution, cache eviction, and write efficiency.
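For illustration, here is a rough Python sketch of how such traces can be turned into a stream of block requests for offline analysis. It assumes the default text output of blkparse (one event per line, with an action code and a "sector + length" token); the exact field positions depend on how blktrace and blkparse are invoked, so treat this as a starting point rather than the parser we ran.

def parse_blkparse(lines, action="D"):
    """Yield (sector, nr_sectors, is_write) tuples for matching events."""
    for line in lines:
        fields = line.split()
        # Assumed default blkparse layout: dev cpu seq time pid action rwbs sector + count [proc]
        if len(fields) < 10 or fields[5] != action or fields[8] != "+":
            continue
        try:
            sector = int(fields[7])
            count = int(fields[9])
        except ValueError:
            continue
        yield sector, count, "W" in fields[6]

# Example: collapse 512-byte sectors into 4KB cache blocks for a cache simulator.
def blocks(events, block_sectors=8):
    for sector, count, _ in events:
        for s in range(sector, sector + count, block_sectors):
            yield s // block_sectors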
1. Read-write distribution
Our analysis showed that a few regions on disk accounted for the majority of writes, and the distribution of reads was very uneven. We added more instrumentation in Flashcache to learn about our workload patterns and get better metrics on caching behavior.
At a high level, our situation looked like this:
To simplify cache maintenance operations, our cache device was split into many small (2MB) sets, and 2MB segments of the overall storage were linearly mapped to the cache. However, this architecture resulted in hot tables being collocated in the same cache sets while colder tables occupied other sets that sat mostly idle. (This is not unlike the “birthday paradox,” in which – contrary to most people’s expectations – there’s a 50 percent chance that two people in a group of at least 23 will share a birthday.)
Fixing this required either a fancy collocation algorithm that would take the cacheability of small segments into account, or an increase in the variety of data in each set. We chose the latter and implemented relatively simple policy changes.
These changes dispersed our hot data over more of the cache. Before the change, 50% of the cache accounted for 80% of the disk operations; after it, 50% of the cache accounted for 50% of the disk operations.
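To make the idea concrete, here is a hypothetical Python sketch (not the actual Flashcache mapping code, nor the exact policy change we shipped) that compares a linear segment-to-set mapping with a block-interleaved one on a synthetically skewed workload; the geometry and skew numbers are invented for the example.

import random
from collections import Counter

SET_BLOCKS = 512                          # 2MB set of 4KB blocks (assumed geometry)
NUM_SETS = 1024
DISK_BLOCKS = NUM_SETS * SET_BLOCKS * 8   # assume the disk is 8x larger than the cache

def linear_set(block_no):
    # Whole 2MB disk segments map to single sets, so a hot segment heats exactly one set.
    return (block_no // SET_BLOCKS) % NUM_SETS

def interleaved_set(block_no):
    # Neighboring blocks fan out across sets, mixing hot and cold data in every set.
    return block_no % NUM_SETS

def skewed_workload(n=200000, hot_fraction=0.05):
    # 90% of accesses hit the 5% "hot" prefix of the disk (roughly Zipfian).
    for _ in range(n):
        if random.random() < 0.9:
            yield random.randrange(int(DISK_BLOCKS * hot_fraction))
        else:
            yield random.randrange(DISK_BLOCKS)

for mapping in (linear_set, interleaved_set):
    counts = Counter(mapping(b) for b in skewed_workload())
    busiest_half = sum(c for _, c in counts.most_common(NUM_SETS // 2))
    total = sum(counts.values())
    print(mapping.__name__, "-> top 50% of sets receive",
          round(100 * busiest_half / total), "% of accesses")

Under the linear mapping, the busiest half of the sets absorbs the vast majority of accesses, while the interleaved mapping spreads them almost evenly, which is the shape of the improvement described above.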
2. Cache eviction 
Our database servers use small logical block sizes – 4KB or 8KB for compressed InnoDB tables, and 16KB for uncompressed. When we were using 2MB cache sets, we did not see big differences between various cache eviction algorithms and chose FIFO instead of LRU. The workload changed after we increased the Flashcache set sizes, so we began evaluating alternatives to FIFO.
We didn’t need to build full implementations to model the behavior of different cache eviction algorithms, as we could use the traces provided by the blktrace subsystem. Eviction algorithms tend to be very simple – they have to be, since they govern every action going through the cache. An LRU decorator for Python took less than 20 lines to implement, and adding mid-point insertion capabilities increased the code by another 15 lines (an example can be found here). We ended up writing simple simulators to model the behavior of different eviction algorithms on our collected data, and found that LRU with mid-point insertion was effective – but we still needed to determine the best mid-point in the LRU at which to insert newly read blocks.
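As a minimal sketch of what such a simulator can look like (not the code linked above), the following LRU takes a configurable mid-point insertion position and can be replayed over a trace of block numbers; the class and parameter names are ours, and the O(n) list operations are acceptable only because this is an offline tool.

class MidpointLRU:
    def __init__(self, capacity, insert_pct=50):
        self.capacity = capacity
        self.insert_pct = insert_pct   # 0 = head (classic LRU), 100 = tail
        self.order = []                # index 0 = head (hottest end)
        self.resident = set()
        self.hits = 0
        self.misses = 0

    def access(self, block):
        if block in self.resident:
            self.hits += 1
            # Re-referenced blocks are promoted from wherever they sit to the head.
            self.order.remove(block)
            self.order.insert(0, block)
            return
        self.misses += 1
        if len(self.order) >= self.capacity:
            victim = self.order.pop()          # evict from the tail
            self.resident.discard(victim)
        # First-time reads enter at the mid-point rather than the head, so a
        # burst of once-read blocks cannot flush frequently read pages out.
        insert_at = len(self.order) * self.insert_pct // 100
        self.order.insert(insert_at, block)
        self.resident.add(block)

def hit_rate(trace, capacity, insert_pct):
    cache = MidpointLRU(capacity, insert_pct)
    for block in trace:
        cache.access(block)
    return cache.hits / max(1, cache.hits + cache.misses)

Sweeping insert_pct over a recorded trace is how an insertion point such as the 75th percentile described below can be evaluated.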
With mid-point insertion, blocks that are referenced multiple times are promoted from the mid-point to the head of the LRU. If newly read blocks were instead inserted at the head, the many blocks that are read only once would push more frequently read pages out of the LRU. Inserting at the head corresponds to the 0th percentile of the LRU; inserting halfway from the head corresponds to the 50th percentile.
Our simulations showed that the cache remains effective until the insertion point reaches roughly the 85th percentile, beyond which the cache stops working.
This behavior is workload-specific, and understanding it has allowed us to make Flashcache more effective. We are currently running Flashcache with the mid-point insertion (implemented as LRU-2Q) set to the 75th percentile. This conservative setting reserves 25% of the cache for old pages yet is still better than standard LRU; we chose it because operational actions like rebuilding or migrating instances caused cache behavior changes that our initial performance modeling did not foresee.
We run multiple database instances per machine, and the one that’s been running longest gets preferential treatment in the well-established old-pages area, while new ones never get out of the nursery.
3. Write efficiency
Another issue we wanted to address was disk write efficiency. Flashcache can act as a reliable write-behind cache, and many writes to disk can be merged beforehand on flash.
Previously, we simply tried to enforce a fixed percentage of dirty pages per cache set. Since different cache sets have different behaviors, under this model we ended up either underallocating or overallocating cache to modified pages. Some segments were written out constantly, while others cached dirty pages for a week, which penalized read caching.
To address this, we implemented a straightforward dirty-data eviction method that does not segregate write and read operations. In the new method, all pages are treated equally: if the cache wants to reclaim a page, it looks at the oldest entries in the LRU. If the oldest entry is dirty, the cache schedules a background eviction of that entry, then reclaims the next clean one and uses it for new data.
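A rough sketch of that reclaim policy, assuming a single recency list holding both clean and dirty pages and a callback that queues background flushes (the structure and names are illustrative, not the kernel implementation):

from collections import OrderedDict

class UnifiedLRU:
    def __init__(self, capacity, writeback):
        self.capacity = capacity
        self.writeback = writeback       # queues a background flush for a block
        self.pages = OrderedDict()       # block -> "clean" or "dirty", oldest first

    def _reclaim(self):
        # Walk from the oldest entry toward newer ones: dirty pages get a
        # background flush queued, and the first clean page found is evicted.
        for block, state in list(self.pages.items()):
            if state == "dirty":
                self.writeback(block)    # becomes clean once the flush completes
                continue
            del self.pages[block]
            return True
        return False                     # everything is dirty; a real cache would wait

    def insert(self, block, dirty=False):
        if len(self.pages) >= self.capacity and not self._reclaim():
            return False                 # no clean slot available yet
        self.pages[block] = "dirty" if dirty else "clean"
        return True

# Example: a tiny cache whose "writeback" just records what would be flushed.
flushed = []
cache = UnifiedLRU(capacity=2, writeback=flushed.append)
cache.insert(1, dirty=True)
cache.insert(2)
cache.insert(3)   # reclaims block 2 (oldest clean) and queues block 1 for flushing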
This smoothed out our write operations while preserving the maximum write-merging efficiency and instant-write capabilities of a write-behind cache. It also increased the amount of space that can be used for reads, improving our overall cache efficiency.
Future work
With these three changes implemented, we are now turning our attention to future work. We’ve already spent some time restructuring metadata structures to allow for more efficient data access, but we may still look at some changes to support our next-generation systems built on top of multi-TB cache devices and spanning tens of TB of disk storage. We’re also working on fine-grained locking to support parallel data access by multiple CPU cores.
Also, while the cost per GB for flash is coming down, it’s still not where it needs to be, and this introduces more complexities into capacity planning. SSDs have limited write cycles, so we have to make sure that we’re not writing too much. Every cache miss results in a flash data write, so having flash devices that are too small may be problematic. In this case it’s nice to have hard disks that aren’t too fast, as any cache hierarchy has to depend on steep performance differences between multiple levels.
With these improvements, Flashcache has become a building block in the Facebook stack. We’re already running thousands of servers on the new branch, with much-improved performance over flashcache-1.x. Our busiest systems saw a 40% reduction in read I/O and a 75% reduction in write I/O, allowing us to serve more than a billion users more efficiently with the flip of a kernel module.
Please check out the flashcache-3.x series at https://github.com/facebook/flashcache and let us know what you think.
Domas Mituzas is a production database engineer at Facebook.
Joe Antonucci
The link inside the article is wrong. I think you meant https://github.com/facebook/flashcache
Kirsten Stewart
Go Domas!!
Paul Graydon
"We run multiple database instances per machine".. do you have an explanation why it was decided to do this, and what advantages you see from it?
Domas Mituzas
It allows us to pack databases onto physical machines more efficiently and to replicate from multiple sources. That lets us establish M:N relationships between datacenters and mix different generations of hardware.

Of course, it adds operational complexity, but at large scale that is a relatively tiny change in how things are done, while the benefit is huge.

Also, multiple instances mean multiple replication channels, multiple transaction logs, etc - we spread the load over multiple bottlenecks.
Oleksiy Kovyrin
Do you simply run multiple mysqld instances on the same box or do you use some kind of containers around them?
Alon Edelman
You guys ROCK !
Robert Obryk
Did you try using LRU-2 (based on the time of the second-to-last access) and, if so, in which ways was it less preferable than LRU with a non-head insertion point?
Samuel Marks
Nice work

However, have you considered using sketches instead - like maybe the Cascade Filter?

Also, not sure if you've been following the cache-oblivious drop-in InnoDB alternatives; TokuDB is quite interesting (and open source). Might be worth testing their performance and then drafting a cost/benefit migration analysis in an official proposal?
Nathan William McSween
Why not improve bcache? It's already mainlined, doesn't rely on dm, works directly with bio's and should in theory be more performant if everything is equal.
Domas Mituzas
Oleksiy, yes, multiple mysqlds, no containers
Domas Mituzas
Part of that was memory complexity; LRU-2Q is essentially as expensive as LRU from that perspective, and the gains are similar, IIRC.
Domas Mituzas
We're optimizing more for reads here than for writes – streaming B-trees such as TokuDB solve the insert-efficiency problem, but you still need a massive cache for read efficiency. Our web/graph patterns are read-heavy.
Domas Mituzas
There are multiple reasons. One of them is that bcache was nowhere near mainline when we started, and it still required a bleeding-edge kernel this year.
Our primary goal is reducing disk IOPS, and cache performance is more than sufficient. Having a lightweight module allows us to iterate faster on the most important problems. Whether it is "more performant" depends on various factors; for example, in-memory efficiency is a major issue.
We have to take pragmatic approaches that work with our hardware layouts (large caches, but even larger cache:data ratios).
Muhammad Jawad
(Y)
Samuel Marks
Okay, but what about the bloom filter idea?

Depending on how you set it up, you can reduce the lookup cost from O(k) (where k is the number of hash functions) to O(1), e.g. as shown in:
Kun Huang, Jie Zhang, Dafang Zhang, Gaogang Xie, Kave Salamatian, Alex X. Liu, Wei Li, "A Multi-partitioning Approach to Building Fast and Accurate Counting Bloom Filters," ipdps, pp.1159-1170, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, 2013
http://www.cse.msu.edu/~alexliu/publications/CountingBloomFilter/CountingBloomFilter_IPDPS.pdf
Nathan William McSween
Assuming everything is equal (block size, etc.), bcache will be faster because it works directly on bios.
Domas Mituzas
Bloom filters answer "is my stuff there?" They don't work that nicely with range reads, nor with data that is always there.
Domas Mituzas
Performance at that level does not have a material impact. A 7200rpm disk does only 120 rotations a second, and that is mostly what we have to think about. It is relatively easy to have a good-enough solution for cache access; the primary impact comes from reducing misses.
Facebook Engineering
For more on the workload for which flashcache+mysql are important, see https://www.facebook.com/notes/facebook-engineering/linkbench-a-database-benchmark-for-the-social-graph/10151391496443920
Kartheek Muthyala
Domas!!..An interesting set of optimizations you brought in there. Nice work...
Paul Graydon
Thanks, I appreciate the answer :)
Mike Snitzer
Sure would be nice if FB invested in getting this code upstream. Instead upstream Linux now has dm-cache and bcache.
Gergely Kovács
Domas Mituzas: I've cloned and installed from git today, and the Flashcache version still seems to be the old one

Flashcache Version : flashcache-2.0
git commit: 1.0-218-g0d4eff77014b

any idea when 3.0 will be available?
Valentijn Scholten
Great to see these new developments. Do you think the lack of TRIM support in flashcache is hurting your performance?
Domas Mituzas
(a bit late, it is in a separate v3 branch)
Zolker Yang
What are the size and use cases of the Flashcache deployment at FB?
Giancarlo Attod
A newbie question: is it possible to use only a partition of the SSD instead of the whole SSD?