The 4k Story

Fri 08 October 2010
  • Tricks

tags:
  • linux
  • mmap
  • mongodb
  • nosql
  • redis

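Ever since Redis 2.0, people have kept asking the author of Redis why he reinvented the VM wheel himself. He explained his reasons in the FAQ, but plenty of people still disagree.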

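Poul-Henning Kamp, a developer of the reverse proxy Varnish, wrote an article, What's wrong with 1975 programming?, that pulls no punches. It is aimed squarely at the competitor Squid, and its broad sweep also catches Antirez, the author of Redis. He says: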

I have spent many years working on the FreeBSD kernel, and only rarely did I venture into userland programming, but when I had occation to do so, I invariably found that people programmed like it was still 1975.
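Kamp seems to feel that stepping into user space threw him back decades overnight, as if he were shaking a drink bottle at Antirez and saying: you are out of date! Perhaps because its author is a kernel developer, Varnish fully trusts the operating system's virtual memory mechanism and calls Squid's manual memory management "wasted work". "So Welcome to Varnish, a 2006 architecture program. "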

One user also raised this point:

Redis doesn’t use OS swap. According to Salvatore Sanfilippo, the creator of Redis, it was because the page size of 4KB was too big. I personally don’t think that helps but it’d be better if Redis preallocated specified amount of buffer pool and bring related objects to the same page to increase locality of reference, instead of letting the heap manager blindly fragment objects. In my opinion, the page size of 32 bytes is too small, considering that the hardware architectures and the compilers are optimized for the conventional page size. In that scale, even the latency of reading something from RAM could be dominant (RAM is too slow for CPU, therefore it’s got L1/L2 cache), and RAM has the pipelined burst mode to pre-fetche memory contents at a few clock cycles, before they are actually requested.

On the 5th, Antirez hit back on his blog with What's wrong with 2006 programming?, arguing that:

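  • OS swap can block clients in some situations
  • A 4K page may hold many keys, and some of them are always being accessed, so the operating system cannot swap those pages out
  • Implementing its own paging gives the program a great deal of freedom, including the data compression, new data structures, and custom expiration algorithms that the author says 2.2 will introduce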
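The second point can be made concrete with a little arithmetic. The sketch below is only an illustration, not anything taken from Redis: it assumes objects of a fixed size are spread uniformly at random across pages and that a fixed fraction of them is "hot". The function name `cold_page_fraction` and the sizes in the loop are hypothetical.

```python
def cold_page_fraction(hot_fraction: float, object_size: int,
                       page_size: int = 4096) -> float:
    """Expected fraction of pages that contain no hot object.

    Assumption (not from the original post): objects have a fixed size and
    hot objects are placed uniformly at random, independently per slot.
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    if object_size <= 0 or page_size <= 0:
        raise ValueError("sizes must be positive")
    # An object larger than a page still occupies at least one slot.
    objects_per_page = max(1, page_size // object_size)
    # A page is swappable only if every object in it is cold.
    return (1.0 - hot_fraction) ** objects_per_page


if __name__ == "__main__":
    # With 10% of the objects hot, smaller objects leave fewer cold pages.
    for size in (4096, 1024, 256, 64):
        share = cold_page_fraction(0.10, size)
        print(f"{size:5d}-byte objects: {share:.4%} of pages are fully cold")
```

Under these assumptions, the smaller the objects, the less likely it is that a whole 4K page is cold, which is the concern the bullet describes.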

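A while ago, when Foursquare was running MongoDB, an oversight in its sharding approach concentrated a large amount of data on a single machine, and one EC2 instance ran out of memory and stopped working. MongoDB's internal mechanism is mmap. A colleague of mine ran related tests: once memory is exhausted, reads and writes all go to disk, and at that point MongoDB's performance is completely unusable. Afterwards, when 10gen developer Horowitz summarized the causes of the problem, one very important point was: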

Document size is less than 4k. Such documents, when moved, may be too small to free up pages and, thus, memory.
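The effect in the quote can be illustrated with a small simulation. It is a sketch under stated assumptions, not a model of MongoDB's actual storage layout or migration order: documents are packed contiguously into pages, and the documents that get moved away are chosen uniformly at random. The name `pages_freed_by_moving` and all parameters are hypothetical.

```python
import random


def pages_freed_by_moving(num_pages: int, docs_per_page: int,
                          move_fraction: float, seed: int = 0) -> int:
    """Count pages that become completely empty after moving documents away.

    Assumption (not from the original post): documents fill pages
    contiguously and the moved documents are a uniform random sample.
    """
    rng = random.Random(seed)
    total_docs = num_pages * docs_per_page
    moved = set(rng.sample(range(total_docs), int(total_docs * move_fraction)))
    freed = 0
    for page in range(num_pages):
        first_doc = page * docs_per_page
        # A page can be released only if every document on it was moved.
        if all(doc in moved
               for doc in range(first_doc, first_doc + docs_per_page)):
            freed += 1
    return freed


if __name__ == "__main__":
    # Move half of the documents; compare page-sized and small documents.
    for docs_per_page in (1, 4, 16):
        freed = pages_freed_by_moving(1000, docs_per_page, 0.5)
        print(f"{docs_per_page:2d} docs/page: {freed} of 1000 pages freed")
```

With one document per page, moving half the documents frees half the pages. With many small documents per page, almost no page ends up entirely empty, so little memory is returned.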

On seeing this explanation, the author of Redis was delighted on Twitter: "Real world instance of my 4k page + small objects concerns"