The 4k Story

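Ever since Redis 2.0, the author of Redis has been asked again and again why he reinvented the wheel by building his own VM. He explained his reasons in the FAQ, but plenty of people still disagree.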

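Poul-Henning Kamp, the developer of the reverse proxy Varnish, wrote an article titled What's wrong with 1975 programming?. It pulls no punches and aims squarely at the competing Squid, and its broad sweep also catches Antirez, the author of Redis. He said: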

I have spent many years working on the FreeBSD kernel, and only rarely did I venture into userland programming, but when I had occation to do so, I invariably found that people programmed like it was still 1975.

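Kamp sounds as if arriving in user space threw him back decades overnight, and as if he were shaking a soda bottle at Antirez and saying: you are out of date! Perhaps because its author is a kernel developer, Varnish fully trusts the operating system's virtual memory mechanism and calls Squid's manual memory management wasted work. "So Welcome to Varnish, a 2006 architecture program."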

A user also raised this objection:

Redis doesn’t use OS swap. According to Salvatore Sanfilippo, the creator of Redis, it was because the page size of 4KB was too big. I personally don’t think that helps but it’d be better if Redis preallocated specified amount of buffer pool and bring related objects to the same page to increase locality of reference, instead of letting the heap manager blindly fragment objects. In my opinion, the page size of 32 bytes is too small, considering that the hardware architectures and the compilers are optimized for the conventional page size. In that scale, even the latency of reading something from RAM could be dominant (RAM is too slow for CPU, therefore it’s got L1/L2 cache), and RAM has the pipelined burst mode to pre-fetche memory contents at a few clock cycles, before they are actually requested.

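On the 5th, Antirez hit back on his blog with What's wrong with 2006 programming?. His argument:

  • OS swap can block clients in some situations
  • A 4K page may hold many keys, and some of them are always being accessed, so the operating system can never swap those pages out
  • Implementing paging himself gives the program a great deal of freedom, including the data compression, new data structures, and custom expiry algorithms he says are coming in 2.2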
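
The second point can be put in numbers. The sketch below is a rough model, not a measurement: it assumes objects are packed back to back in a page and that each object is hot independently with the same probability, which real workloads will not follow exactly. The function name and the sample figures are made up for illustration; the 32-byte size is the Redis page size mentioned in the user comment quoted above.

```python
def fraction_of_pages_kept_in_ram(page_size, object_size, hot_fraction):
    """Estimate the share of pages that hold at least one hot object.

    Illustrative model only: objects are packed back to back, and each
    object is hot independently with probability hot_fraction.
    """
    objects_per_page = max(1, page_size // object_size)
    # A page can be swapped out only if every object on it is cold.
    all_cold = (1.0 - hot_fraction) ** objects_per_page
    return 1.0 - all_cold


if __name__ == "__main__":
    # 4 KB OS pages filled with 32-byte objects, 1% of the objects hot.
    print(fraction_of_pages_kept_in_ram(4096, 32, 0.01))
    # One object per page: only the hot objects themselves stay resident.
    print(fraction_of_pages_kept_in_ram(4096, 4096, 0.01))
```

Under these assumptions, with 128 objects per 4 KB page and 1% of them hot, about 72% of the pages contain at least one hot object and cannot be swapped out, although 99% of the data is cold.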

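Some time ago, while Foursquare was running MongoDB, an oversight in their sharding approach concentrated a large amount of data on one machine, and that EC2 instance ran out of memory and stopped working. MongoDB is built on mmap internally. A colleague of mine has tested this: once memory is exhausted, reads and writes all go to disk, and at that point MongoDB's performance is unusable. When 10gen developer Horowitz later summarized the causes of the problem, one of the most important points was: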

Document size is less than 4k. Such documents, when moved, may be too small to free up pages and, thus, memory.

On reading this, the author of Redis tweeted with delight: "Real world instance of my 4k page + small objects concerns"

Visualize call tree of a C function

Requirement

You want to visualize a call hierarchy of a C function.

Solution

You need the following utilities: cflow, cflow2dot, and dot (part of Graphviz).

Take ‘rdbSaveBackground’ (redis/rdb.c) for example:

cflow --format=posix --omit-arguments --level-indent='0=\t' --level-indent='1=\t' --level-indent=start='\t' -m 'rdbSaveBackground' ~/osprojects/redis/src/rdb.c | cflow2dot | dot -Tjpg -o rdb.jpg

Output: [image: visualization of a call tree]

Source: unix diary

The post is brought to you by lekhonee v0.7

Bayeux Protocol

Running a CometD demo is very simple. Just create a Maven project (CometD Howtos):
$ mvn archetype:generate -DarchetypeCatalog=http://cometd.org

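Maven prompts you to choose an archetype. The choices cover CometD version 1 and version 2, Jetty 6 and Jetty 7 on the server side, and dojo or jQuery on the client side. Here you can pick the latest: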
http://cometd.org -> cometd-archetype-dojo-jetty7 (2.0.0 – CometD archetype for creating a server-side event-driven web application)

Once the project is created, run mvn jetty:run and open http://127.0.0.1:8080/{artifactId}.

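CometD accommodates all the major browsers. On Chromium 5, for example, dojo uses the WebSocket transport, while on Firefox 3, which does not support WebSocket, it uses long-polling. Bayeux is an application protocol and CometD is an implementation of Bayeux, a relationship a bit like the chicken and the egg.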

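With yesterday's experience of watching the WebSocket protocol in Chromium, let's first look at CometD's WebSocket implementation.
Handshake: the client requests /{artifactId}/cometd/handshake
with these headers: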

GET /cometd-jetty/cometd/handshake HTTP/1.1
Upgrade: WebSocket
Connection: Upgrade
Host: 127.0.0.1:8080
Origin: http://127.0.0.1:8080
Cookie: JSESSIONID=12jqq6hbsfkfic8vzqpevxtrw

This is the standard WebSocket handshake, and the server returns:

HTTP/1.1 101 Web Socket Protocol Handshake
Upgrade: WebSocket
Connection: Upgrade
WebSocket-Origin: http://127.0.0.1:8080
WebSocket-Location: ws://127.0.0.1:8080/cometd-jetty/cometd/handshake

Both sides have now established the WebSocket connection. The client then sends JSON over the WebSocket to perform the Bayeux handshake:

[{"version":"1.0","minimumVersion":"0.9","channel":"/meta/handshake","supportedConnectionTypes":["websocket","long-polling","callback-polling"],"advice":{"timeout":60000,"interval":0},"id":"1"}]

The server returns JSON that issues a clientId and completes the handshake:

[{"channel":"/meta/handshake","clientId":"8g6dbnlqr2k6jfo1tdpaeb7iw","version":"1.0","successful":true,"minimumVersion":"1.0","id":"1","supportedConnectionTypes":["websocket","long-polling","callback-polling"]}]

The handshake is complete and the Bayeux connection is established.
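
The handshake exchange above can be sketched in a few lines of Python. This is only an illustration of the message shapes seen in the capture, not CometD's client code: the function names build_handshake and extract_client_id are hypothetical, and the optional advice field is left out.

```python
import json


def build_handshake(message_id, connection_types):
    """Build a /meta/handshake message shaped like the captured one."""
    return [{
        "version": "1.0",
        "minimumVersion": "0.9",
        "channel": "/meta/handshake",
        "supportedConnectionTypes": list(connection_types),
        "id": str(message_id),
    }]


def extract_client_id(response_text, message_id):
    """Return the clientId from the reply that matches our handshake id."""
    for message in json.loads(response_text):
        if (message.get("channel") == "/meta/handshake"
                and message.get("id") == str(message_id)):
            if not message.get("successful"):
                raise ValueError("handshake was rejected by the server")
            return message["clientId"]
    raise ValueError("no handshake reply found in the response")


if __name__ == "__main__":
    request = build_handshake(1, ["websocket", "long-polling", "callback-polling"])
    print(json.dumps(request))
```

Every later message from this client carries the returned clientId, as the following captures show.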

In the demo, the client registers a handshake listener:

    function _metaHandshake(handshake)
    {
        if (handshake.successful === true)
        {
            cometd.batch(function()
            {
                cometd.subscribe('/hello', function(message)
                {
                    dojo.byId('body').innerHTML += '<div>Server Says: ' + message.data.greeting + '</div>';
                });
                // Publish on a service channel since the message is for the server only
                cometd.publish('/service/hello', { name: 'World' });
            });
        }
    }

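So once the handshake completes, the client sends one batched request that subscribes to the /hello channel and publishes a JSON message to /service/hello. A message sent to a /service channel is private communication between that client and the server and is not forwarded to other clients.
The id distinguishes individual requests. According to the Bayeux spec, requests sent to /meta and /service must carry an id field, which is used to match each response to its request.
The requests end up aggregated into a single JSON array: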

[{"channel":"/meta/subscribe","subscription":"/hello","id":"2","clientId":"8g6dbnlqr2k6jfo1tdpaeb7iw"},{"channel":"/service/hello","data":{"name":"World"},"id":"3","clientId":"8g6dbnlqr2k6jfo1tdpaeb7iw"}]

The server responds: the request with id=2 succeeded, so the subscription to /hello is in place.

[{"channel":"/meta/subscribe","successful":true,"id":"2","subscription":"/hello"}]

After that, the server sends back the message on the /hello channel:

[{"channel":"/hello","data":{"greeting":"Hello, World"}},{"channel":"/service/hello","successful":true,"id":"3"}]
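
One way to read these batched arrays is to separate acknowledgements, which echo a request id and carry a successful flag, from messages delivered on a subscribed channel. The sketch below does that for the captures above. The name split_replies is hypothetical and the rule it applies is inferred from these captures only, so treat it as an assumption rather than the spec's definition.

```python
import json


def split_replies(request_text, response_text):
    """Split a Bayeux response array into acknowledgements and deliveries.

    Returns (acknowledged, delivered): a dict mapping request id to the
    "successful" flag, and a list of the remaining messages.
    """
    request_ids = {m["id"] for m in json.loads(request_text) if "id" in m}
    acknowledged = {}
    delivered = []
    for message in json.loads(response_text):
        message_id = message.get("id")
        if message_id in request_ids and "successful" in message:
            acknowledged[message_id] = message["successful"]
        else:
            delivered.append(message)
    return acknowledged, delivered
```

Applied to the batched request and the last response above, this yields an acknowledgement for id 3 and one delivered /hello message; the acknowledgement for id 2 arrived in the earlier response.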

The client must also send connect requests periodically to keep the connection alive:

[{"channel":"/meta/connect","connectionType":"websocket","advice":{"timeout":0},"id":"4","clientId":"8g6dbnlqr2k6jfo1tdpaeb7iw"}]

The server replies that the connect succeeded:

[{"channel":"/meta/connect","advice":{"reconnect":"retry","interval":2500,"timeout":15000},"successful":true,"id":"4"}]

The connect request is what maintains the connection between client and server. The Bayeux specification says (1, 2):

A transport MUST maintain one and only one outstanding connect message. When a HTTP response that contains a /meta/connect response terminates, the client MUST wait at least the interval specified in the last received advice before following the advice to reestablish the connection

The client MUST maintain only a single outstanding connect message. If the server does not have a current outstanding connect and a connect is not received within a configured timeout, then the server SHOULD act as if a disconnect message has been received.
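
The second rule can be sketched as a small bookkeeping class on the server side. This is a hypothetical illustration, not how CometD or Jetty implements it: the class name, method names, and the use of plain numbers for time are all assumptions made for the example.

```python
class ConnectTracker:
    """Track, per client, whether a connect is outstanding or how long
    the client has been without one."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        # client_id -> time the last connect was answered,
        # or None while a connect is still outstanding.
        self.idle_since = {}

    def connect_received(self, client_id):
        self.idle_since[client_id] = None

    def connect_answered(self, client_id, now):
        self.idle_since[client_id] = now

    def expire(self, now):
        """Return clients to treat as disconnected and forget them."""
        expired = sorted(
            client_id
            for client_id, since in self.idle_since.items()
            if since is not None and now - since > self.timeout_seconds
        )
        for client_id in expired:
            del self.idle_since[client_id]
        return expired
```

A client with an outstanding connect is never expired here; only a client whose last connect was answered and who has not sent a new one within the timeout is dropped.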

From here on, the CometD client can subscribe and publish on the /hello channel.
On Chromium, everything happens over a single WebSocket connection.

To disconnect, the client sends the server:

[{"channel":"/meta/disconnect","id":"188","clientId":"a8iutjvfp7dtwhzrfujeonk5q"}]

The server responds:

[{"channel":"/meta/disconnect","successful":true,"id":"188"}]

Bayeux can more or less be understood as an application protocol on top of WebSocket.

Now for the implementation on Firefox 3.6. Firefox 3.6 does not support WebSocket, so all communication has to go through XHR.
The handshake is done with an XHR POST request:

POST /{artifactId}/cometd/handshake HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.6) Gecko/20100628 Ubuntu/10.04 (lucid) Firefox/3.6.6
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: UTF-8,*
Keep-Alive: 115
Connection: keep-alive
Content-Type: application/json;charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: http://127.0.0.1:8080/{artifactId}/
Content-Length: 182
Cookie: JSESSIONID=fjnyxb28raih1cnaljrijl1ic
Pragma: no-cache
Cache-Control: no-cache

The server responds:

HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8
Set-Cookie: BAYEUX_BROWSER=df92-h8q89f416mutgbpxrwb8185u;Path=/
Content-Length: 213
Server: Jetty(7.1.5.v20100705)

[{"channel":"/meta/handshake","clientId":"9185k23lo482oq1po3ivxup2cj","version":"1.0","successful":true,"minimumVersion":"1.0","id":"1","supportedConnectionTypes":["websocket","long-polling","callback-polling"]}]

The handshake is complete and the client-defined callback runs. It sends the Bayeux requests over a new XHR:
[{"channel":"/meta/subscribe","subscription":"/hello","id":"2","clientId":"9185k23lo482oq1po3ivxup2cj"},{"channel":"/service/hello","data":{"name":"World"},"id":"3","clientId":"9185k23lo482oq1po3ivxup2cj"}]

The server returns three Bayeux messages in a single response:

[{"channel":"/meta/subscribe","successful":true,"id":"2","subscription":"/hello"},{"channel":"/hello","data":{"greeting":"Hello, World"}},{"channel":"/service/hello","successful":true,"id":"3"}]

The client then starts sending connect requests:

[{"channel":"/meta/connect","connectionType":"long-polling","advice":{"timeout":0},"id":"4","clientId":"9185k23lo482oq1po3ivxup2cj"}]

Note that the connection type here is long-polling, which dojo chooses based on the browser's capabilities.

Long-polling server implementations attempt to hold open each request until there are events to deliver; the goal is to always have a pending request available to use for delivering events as they occur, thereby minimizing the latency in message delivery.

If there are no new messages, the server holds the request for ten seconds and then returns:

[{"channel":"/meta/connect","successful":true,"id":"7"}]

As soon as the client receives the reply, it issues a new connect request.

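When a new message arrives, the connect request being held on the server returns immediately and carries the new message with it, for example: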

[{"channel":"/hello","data":{"name":"555"},"id":"6"},{"channel":"/meta/connect","successful":true,"id":"619"}]
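
The hold-then-return behavior can be imitated in-process with a queue. The sketch below is a stand-in for the server side of one connect request, under the assumption that pending events sit in a queue.Queue; long_poll, pending, and hold_seconds are hypothetical names, and nothing here is CometD's actual server code.

```python
import queue


def long_poll(pending, hold_seconds, connect_id):
    """Answer one /meta/connect: wait up to hold_seconds for an event,
    then return any queued events followed by the connect reply."""
    connect_reply = {
        "channel": "/meta/connect",
        "successful": True,
        "id": str(connect_id),
    }
    try:
        # Block until an event arrives or the hold time runs out.
        first = pending.get(timeout=hold_seconds)
    except queue.Empty:
        return [connect_reply]
    events = [first]
    # Drain whatever else is already waiting, without blocking again.
    while True:
        try:
            events.append(pending.get_nowait())
        except queue.Empty:
            break
    return events + [connect_reply]
```

With an empty queue the call returns only the connect reply after the hold time; with an event queued it returns at once, event first and connect reply last, matching the shape of the capture above.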

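A message published by this same client, on the other hand, comes back in the response to the publish request itself and does not affect the pending connect request, for example: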

[{"channel":"/hello","data":{"name":"nihao"},"id":"715"},{"channel":"/hello","successful":true,"id":"715"}]

Disconnecting is again an XHR POST of a Bayeux message to the server:

[{"channel":"/meta/disconnect","id":"750","clientId":"9185k23lo482oq1po3ivxup2cj"}]

The server responds:

[{"channel":"/meta/disconnect","successful":true,"id":"750"}]

That covers a CometD client speaking Bayeux over long polling. Long polling still gets near-real-time delivery by pulling through connect requests, which is different from the true push that WebSocket provides.


Websocket Protocol

This afternoon I wrote a simple WebIM program with Jetty's WebSocketServlet, which gave me my first real glimpse of what WebSocket looks like on the wire.

Server: Jetty 7.1.5
Client: Chromium 5.0.375.86

A Wireshark capture gave me the following data:
var _ws = new WebSocket("ws://127.0.0.1:8080/nothing")
This step creates the WebSocket. The browser performs a handshake with the server and sends this request:

GET /nothing HTTP/1.1
Upgrade: WebSocket
Connection: Upgrade
Host: 127.0.0.1:8080
Origin: http://127.0.0.1:8080

The client sends an Upgrade header, which is defined in RFC 2616 section 14.42:

The Upgrade general-header allows the client to specify what additional communication protocols it supports and would like to use if the server finds it appropriate to switch protocols.

Upgrade must also be listed in the Connection header to mark this as an upgrade request.
Connection is defined in RFC 2616 section 14.10:

The Connection general-header field allows the sender to specify options that are desired for that particular connection and MUST NOT be communicated by proxies over further connections.

The Origin header has not made it into an RFC yet and currently exists as a standards draft. The W3C draft Cross-Origin Resource Sharing defines the Origin header as follows:

The Origin header indicates where the cross-origin request or preflight request originates from.

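The Origin header was proposed to address the risk of CSRF. It tells the server where a request came from, so the server can judge whether the request is legitimate. In other words, responsibility for cross-origin security checks moves to the server and the browser takes a trusting stance, instead of the old approach of blocking cross-origin requests outright.
Jetty 7's org.eclipse.jetty.servlets.CrossOriginFilter handles this header.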
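
A server-side Origin check can be as small as an allowlist lookup. The sketch below shows the idea only; it is not Jetty's CrossOriginFilter, and is_origin_allowed and its parameters are hypothetical names chosen for the example.

```python
def is_origin_allowed(headers, allowed_origins):
    """Return True if the request's Origin header is on the allowlist.

    headers is a dict of header name to value; header names are matched
    case-insensitively. A request without an Origin header is rejected
    here, which is a policy choice made for this example.
    """
    origin = None
    for name, value in headers.items():
        if name.lower() == "origin":
            origin = value.strip()
    return origin is not None and origin in allowed_origins


if __name__ == "__main__":
    request_headers = {
        "Upgrade": "WebSocket",
        "Connection": "Upgrade",
        "Host": "127.0.0.1:8080",
        "Origin": "http://127.0.0.1:8080",
    }
    print(is_origin_allowed(request_headers, {"http://127.0.0.1:8080"}))
```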

In addition, the handshake request may carry a Sec-WebSocket-Protocol header, which names a subprotocol (an application protocol) for the server to use.

The server answers:

HTTP/1.1 101 Web Socket Protocol Handshake
Upgrade: WebSocket
Connection: Upgrade
WebSocket-Origin: http://127.0.0.1:8080
WebSocket-Location: ws://127.0.0.1:8080/nothing
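
A client could verify a reply like this one with a few string checks. The sketch below only covers the header names seen in this capture (Upgrade, Connection, WebSocket-Origin); later versions of the protocol use different handshake headers, so this should not be read as a general WebSocket client. The function names are hypothetical.

```python
def parse_http_head(raw):
    """Split a raw HTTP response head into its status line and a dict
    of lower-cased header names to values."""
    lines = raw.split("\r\n")
    headers = {}
    for line in lines[1:]:
        if not line:
            break  # blank line marks the end of the headers
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers


def is_websocket_upgrade(raw_response, expected_origin):
    """Check for a 101 reply that switches to WebSocket for our origin."""
    status_line, headers = parse_http_head(raw_response)
    parts = status_line.split(" ", 2)
    return (
        len(parts) >= 2
        and parts[1] == "101"
        and headers.get("upgrade", "").lower() == "websocket"
        and headers.get("connection", "").lower() == "upgrade"
        and headers.get("websocket-origin") == expected_origin
    )
```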

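The WebSocket connection is now established. From here on, server and client can communicate bidirectionally, and the message body is simply the plain text passed to websocket.send(msg). Emulating this kind of mechanism over plain HTTP takes at least two connections between browser and server, whereas a WebSocket carries both directions on one. At the moment the WebSocket protocol does not specify a limit on how many connections a client may open to a server. On that limit, RFC 2616 (HTTP/1.1) says: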

Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy.

The Bayeux protocol, by contrast, already sets an explicit limit:

the Bayeux protocol MUST NOT require any more than 2 HTTP requests to be simultaneously handled by a server in order to handle all application (Bayeux based or otherwise) requests from a client.

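At this point the client and server have a full-duplex channel, which is the biggest advantage of having the WebSocket protocol implemented in the browser itself. Firefox 3.x, IE, and other browsers without it can only emulate WebSocket on top of the existing HTTP connection mechanism, for example through long polling or callback polling, and that never amounts to true full-duplex communication.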


fixing "libmozjs" missing when using mongodb on Ubuntu lucid

Problem
When running mongod/mongo/mongos, you get a message like this:
mongod: error while loading shared libraries: libmozjs.so: cannot open shared object file: No such file or directory

Solution
Make sure you have xulrunner-dev installed:
sudo apt-get install xulrunner-dev

then find libmozjs on your filesystem:
sudo locate libmozjs

On lucid, it should be located at:
/usr/lib/xulrunner-1.9.2.6/libmozjs.so

(and some other directories, such as firefox / thunderbird / seamonkey)

Just create a symbolic link:
sudo ln -s /usr/lib/xulrunner-1.9.2.6/libmozjs.so /usr/lib/

try to restart mongodb:
sudo service mongodb start

take a look at process list:
ps aux | grep mongo

it works.
