Facebook and the kernel
Architecture and statistics
In a brief overview of Facebook's architecture, Chris Mason noted that there is a web tier that is CPU- and network-bound. It handles requests from users as well as sending replies. Behind that is a memcached tier that caches data, mostly from MySQL queries. Those queries are handled by a storage tier that is a collection of several different database systems: MySQL, Hadoop, RocksDB, and others.
Within Facebook, anyone can look at and change the code in its source repositories. The facebook.com site has its code updated twice daily, he said, so the barrier to getting new code in the hands of users is low. Those changes can be fixes or new features.
As an example, he noted that the "Look Back" videos, which Facebook created for each user as a review of all of their posts to the service, added a huge amount of data and required a lot more network bandwidth. The process of creating and serving all of those videos was the topic of a Facebook engineering blog post. In all, 720 million videos were created, which required an additional 11 petabytes of storage, as well as consuming 450 Gb/second of peak network bandwidth for people viewing the videos. The Look Back feature was conceived, provisioned, and deployed in only 30 days, he said.
The code changes quickly, so when performance or other problems crop up, he and other kernel developers can tell others in the company that "you're doing it wrong". In fact, he said, "they love that". It does mean that he has to come up with concrete suggestions on how to do it right, but Facebook is not unwilling to change its code.
Facebook runs a variety of kernel versions. The "most conservative" hosts run a 2.6.38-based kernel. Others run the 3.2 stable series with roughly 250 patches. Other servers run the 3.10 stable series with around 60 patches. Most of the patches are in the networking and tracing subsystems, with a few memory-management patches as well.
One thing that seemed to surprise Mason was the high failure tolerance that the Facebook production system has. He mentioned the 3.10 pipe race condition that Linus Torvalds fixed. It is a "tiny race", he said, but Facebook was hitting it (and recovering from it) 500 times per day. The architecture of the system is such that it could absorb that kind of failure rate without users noticing anything wrong.
Pain points
Mason asked around within Facebook to try to determine what the worst problem is that the company has with the kernel. In the end, two features were mentioned the most frequently: stable pages and the completely fair queueing (CFQ) I/O scheduler. "I hope we never find those guys", he said with a laugh, since Btrfs implements stable pages. In addition, James Bottomley noted that Facebook already employs another CFQ developer (Jens Axboe).
Another area that was problematic for Facebook is surprises with buffered I/O latency, especially for append-only database files. Most of the time, those writes go fast, but sometimes they are quite slow. He would like to see the kernel avoid latency spikes like that.
He would like to see kernel-style spinlocks be available from user space. Rik van Riel suggested that perhaps POSIX locks could use adaptive locking, which would spin for a short time then switch to sleeping if the lock did not become available quickly. The memcached tier has a kind of user-space spinlock, Mason said, but it is "very primitive compared to the kernel".
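The spin-then-sleep idea that van Riel described can be sketched in a few lines. This is only an illustration of the control flow, not memcached's lock or any kernel or POSIX implementation: the `AdaptiveLock` class and its `spin_attempts` knob are hypothetical names, and in CPython the global interpreter lock limits what spinning can gain.

```python
import threading


class AdaptiveLock:
    """Spin briefly on a lock, then fall back to a blocking (sleeping) acquire."""

    def __init__(self, spin_attempts=100):
        self._lock = threading.Lock()
        self._spin_attempts = spin_attempts  # hypothetical tuning knob

    def acquire(self):
        # Phase 1: spin, polling the lock without blocking.
        for _ in range(self._spin_attempts):
            if self._lock.acquire(blocking=False):
                return
        # Phase 2: stop spinning and sleep until the lock becomes free.
        self._lock.acquire()

    def release(self):
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()


def run_counter_demo(threads=4, increments=1000):
    """Increment a shared counter from several threads under the lock."""
    lock = AdaptiveLock()
    total = [0]

    def worker():
        for _ in range(increments):
            with lock:
                total[0] += 1

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total[0]
```

The spin phase pays off only when the lock is usually held for a very short time; otherwise the blocking fallback keeps waiters from burning CPU.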
Fine-grained I/O priorities are another wish-list item for Facebook (and for the PostgreSQL developers as well). There are always cleaners and compaction threads that need to do I/O, but shouldn't hold off the higher-priority "foreground" I/O. Mason was asked how the priorities would be specified, by I/O operation or file range, for example, and how fine-grained they needed to be. He replied that either way of specifying the priorities would be reasonable, and that Facebook really only needs two (or a few) priority levels: low and high.
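To make the two-level idea concrete, here is a small user-space sketch of a queue that always dispatches high-priority requests before low-priority ones. It is not a kernel interface; `TwoLevelIOQueue`, `HIGH`, and `LOW` are hypothetical names chosen for illustration, and the requests are plain strings.

```python
import heapq
import itertools

HIGH, LOW = 0, 1  # only two levels, as suggested in the discussion


class TwoLevelIOQueue:
    """Dispatch HIGH requests first; keep FIFO order within each level."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def dispatch(self):
        """Return the next request, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, request = heapq.heappop(self._heap)
        return request


queue = TwoLevelIOQueue()
queue.submit(LOW, "compaction write")
queue.submit(HIGH, "foreground read")
queue.submit(LOW, "cleaner write")
order = [queue.dispatch() for _ in range(3)]
# order is ["foreground read", "compaction write", "cleaner write"]
```

Strict priority like this can starve the background work if foreground requests never stop arriving, so a real scheduler would need some form of fairness or aging.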
The subject of ionice was raised again. One of the problems with that as a solution is that it only works with the (disabled by Facebook) CFQ scheduler. Bottomley suggested making ionice work with all of the schedulers, which Mason said might help. In order to do that, though, Ted Ts'o noted that the writeback daemon will have to understand the ionice settings.
Another problem area is logging. Facebook logs a lot of data and the logging workloads have to use fadvise() and madvise() to tell the kernel that those pages should not be saved in the page cache. "We should do better than that." Van Riel suggested that the page replacement patches in recent kernels may make things better. Mason said that Facebook does not mind explicitly telling the kernel which processes are sequentially accessing the log files, but continually calling *advise() seems excessive.
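For readers unfamiliar with the pattern, the following sketch shows roughly what a logging process has to do after each batch of writes. It assumes that the hint in question is `POSIX_FADV_DONTNEED`, which the talk did not specify, and the helper names are hypothetical. The call is skipped on platforms where Python does not expose `os.posix_fadvise()`.

```python
import os
import tempfile


def append_and_drop_cache(path, data):
    """Append data to a log file, then hint that its pages are not needed."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        # Dirty pages cannot be dropped, so flush them to storage first.
        os.fsync(fd)
        if hasattr(os, "posix_fadvise"):
            # offset=0 and len=0 together mean "the whole file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return os.path.getsize(path)


def demo():
    """Write two records to a temporary log and return the final file size."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.log")
        append_and_drop_cache(path, b"first line\n")
        return append_and_drop_cache(path, b"second line\n")
```

Having to repeat the flush-and-advise step after every batch is the overhead Mason described as excessive.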
Josef Bacik has also been working on a small change to Btrfs to allow rate limiting buffered I/Os. It was easy to do in Btrfs, Mason said, but if the idea pans out, it would move elsewhere for more general availability. Jan Kara was concerned that limiting only buffered I/O would be difficult, since there are other kinds of I/O bound for the disk at any given time. Mason agreed, saying that the solution would not be perfect but might help.
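The session did not describe how the Btrfs change works. As a general illustration of rate limiting writes, here is a token-bucket sketch in user space; the technique is swapped in for illustration and is not Bacik's patch. The `TokenBucket` class and its parameters are hypothetical, and the clock is injectable so the behavior can be checked without sleeping.

```python
import time


class TokenBucket:
    """Allow about `rate` bytes per second, with bursts of up to `capacity` bytes."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, up to capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def delay_for(self, nbytes):
        """Charge nbytes to the budget; return seconds the caller should wait."""
        self._refill()
        self._tokens -= nbytes  # the balance may go negative (a debt)
        if self._tokens >= 0:
            return 0.0
        return -self._tokens / self.rate


now = [0.0]  # fake clock, advanced by hand
bucket = TokenBucket(rate=100, capacity=100, clock=lambda: now[0])
first = bucket.delay_for(100)   # fits within the initial burst
second = bucket.delay_for(50)   # over budget, so the caller should wait
```

A writer would sleep for the returned delay before issuing the write. As Kara's objection suggests, a limiter like this sees only the writes routed through it, not the other I/O competing for the same disk.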
Bottomley noted that ionice is an existing API that should be reused to help with these kinds of problems. Similar discussions of using other mechanisms in the past have run aground on "painful arguments about which API is right", he said. Just making balance_dirty_pages() aware of the ionice priority may solve 90% of the problem. Other solutions can be added later.
Mason explained that Facebook stores its logs in a large Hadoop database, but that the tools for finding problems in those logs are fairly primitive: essentially grep. He said that he would "channel Lennart [Poettering] and Kay [Sievers]" briefly to wish for a way to tag kernel messages. Bottomley's suggestion that Mason bring it up with Linus Torvalds at the next Kernel Summit was met with widespread chuckling.
Danger tier
While 3.10 is fairly close to the latest kernels, Mason would like to run even more recent kernels. To that end, he is creating something he calls the "danger tier". He ported the 60 patches that Facebook currently adds to 3.10.x to the current mainline Git tree and is carving out roughly 1000 machines to test that kernel in the web tier. He will be able to gather lots of performance metrics from those systems.
As a simple example of the kinds of data he can gather, he put up a graph of request response times (without any units) that was gathered over 3 days. It showed a steady average response time line all the way at the bottom as well as the ten worst systems' response times. Those not only showed large spikes in the response times, but also that the baseline for those systems was roughly twice that of the average. He can determine which systems those are, ssh in, and try to diagnose what is happening with them.
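The kind of analysis behind such a graph can be sketched as follows. The data layout (a mapping from host name to a list of response times), the host names, and the choice of the median as the "baseline" are all assumptions made for illustration; the talk did not say how Facebook computes these figures.

```python
from statistics import mean, median


def worst_hosts(samples, n=10):
    """Return (fleet_average, [(host, baseline, peak), ...]) for the n slowest hosts."""
    # Fleet-wide average, taken here as the mean of the per-host means.
    fleet_average = mean(mean(times) for times in samples.values())
    # Rank hosts from slowest to fastest by their own mean response time.
    ranked = sorted(samples, key=lambda host: mean(samples[host]), reverse=True)
    report = [(host, median(samples[host]), max(samples[host])) for host in ranked[:n]]
    return fleet_average, report


# Hypothetical measurements, in arbitrary units.
samples = {
    "web1": [10, 10, 10],
    "web2": [10, 10, 10],
    "web3": [20, 20, 80],  # baseline twice that of its peers, plus a spike
}
fleet_average, report = worst_hosts(samples, n=1)
```

Comparing each outlier's baseline and peak against the fleet average separates hosts that are consistently slow from hosts that only spike occasionally.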
He said that was just an example. Eventually he will be able to share more detailed information that can be used to try to diagnose problems in newer kernels and get them fixed more quickly. He asked for suggestions of metrics to gather for the future. With that, his session slot expired.
[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]
