winston posted on 2012-3-19 22:42:26

tcpcopy helped us avoid two major nginx problems

When you deploy a particular version of nginx, you may not know what problems that version carries. How do you avoid problems that only show up after going live?
Below are two such problems (which other load-testing tools either could not find or could find only with great difficulty) that we successfully avoided by using tcpcopy, a tool we developed. We hope it is a useful reference.
The first was a bug in the keepalive module written by Maxim Dounin, the number-two nginx developer. Details:
http://blog.csdn.net/wangbin579/article/details/6995456

The second was the discovery that nginx-0.9.5 through nginx-1.0.0 behave badly under high load. Details:
http://blog.csdn.net/wangbin579/article/details/7360662


tcpcopy: a powerful tool for regression testing

You have improved your code, but you don't know whether the change introduces new problems, or whether there are cases you haven't considered. What do you do?
tcpcopy can solve these problems for you; it is especially well suited to regression testing.
tcpcopy works by copying live online requests (for example, HTTP requests) to a test server. Because copying the request packets is not very different from ordinary routing, it has almost no impact on the online system, while on the test machine the results are almost identical to what you would see in production.
It is worth noting that tcpcopy works best with HTTP requests.
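The mirroring idea above can be sketched in a few lines of Python. This is only a conceptual model with made-up names; the real tcpcopy duplicates packets at the IP layer rather than calling request handlers:

```python
# Conceptual model of tcpcopy: every request the online server handles
# is also delivered, unchanged, to a test server.  This is NOT how
# tcpcopy is implemented (it copies packets at the IP layer); it only
# illustrates the "mirror live traffic" idea.

def mirror_traffic(requests, online_server, test_server):
    """Feed each live request to the online server and, as a side
    effect, replay an identical copy against the test server."""
    responses = []
    for req in requests:
        responses.append(online_server(req))  # production path, unchanged
        test_server(req)                      # mirrored copy; response ignored
    return responses

# Toy servers: the online one answers, the test one only records what it saw.
seen_by_test = []
online = lambda req: f"200 OK for {req}"
test = seen_by_test.append

live = ["GET /a", "GET /b", "POST /c"]
answers = mirror_traffic(live, online, test)

# The test server saw exactly the production traffic.
assert seen_by_test == live
```

Because the test server receives the same request stream as production, any bug it exposes is one production would have hit, which is why this works so well for regression testing.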

Author: wangbin579, posted 2012-3-19 18:21:58



winston posted on 2012-3-19 22:43:05

Using tcpcopy, we found a serious nginx bug
Last week the ops team used tcpcopy to stress-test the nginx hmux module. Once the load reached more than 2000 requests per second, every nginx worker process went to 100% CPU.
We used gcore to produce a core file and found nginx stuck in the following infinite loop:

(gdb) bt
#0  0x00000000004690d8 in ngx_http_upstream_get_round_robin_peer (pc=0x8a51bd8, data=0x899fc88)
    at src/http/ngx_http_upstream_round_robin.c:482
#1  0x00000000004a45a0 in ngx_http_upstream_get_keepalive_peer (pc=0x8a51bd8, data=0x899fc58)
    at ../ngx_http_upstream_keepalive-c6396fef9295/ngx_http_upstream_keepalive_module.c:227
#2  0x0000000000428c53 in ngx_event_connect_peer (pc=0x8a51bd8) at src/event/ngx_event_connect.c:24
#3  0x0000000000460465 in ngx_http_upstream_connect (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1081
#4  0x000000000046428a in ngx_http_upstream_next (r=0x8a4b770, u=0x8a51bc8, ft_type=2) at src/http/ngx_http_upstream.c:2892
#5  0x0000000000461483 in ngx_http_upstream_process_header (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1525
#6  0x000000000046104f in ngx_http_upstream_send_request (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1392
#7  0x00000000004608a7 in ngx_http_upstream_connect (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1182
#8  0x000000000045f634 in ngx_http_upstream_init_request (r=0x8a4b770) at src/http/ngx_http_upstream.c:620
#9  0x000000000045ee79 in ngx_http_upstream_init (r=0x8a4b770) at src/http/ngx_http_upstream.c:429
#10 0x0000000000455dd4 in ngx_http_read_client_request_body (r=0x8a4b770, post_handler=0x45ed95 <ngx_http_upstream_init>)
    at src/http/ngx_http_request_body.c:153
#11 0x000000000049e3bd in ngx_hmux_handler (r=0x8a4b770) at ../hmux/ngx_hmux_module.c:673
#12 0x000000000043eb06 in ngx_http_core_content_phase (r=0x8a4b770, ph=0x89bc6c0) at src/http/ngx_http_core_module.c:1350
#13 0x000000000043d75c in ngx_http_core_run_phases (r=0x8a4b770) at src/http/ngx_http_core_module.c:852
#14 0x000000000043d6d8 in ngx_http_handler (r=0x8a4b770) at src/http/ngx_http_core_module.c:835
#15 0x000000000044af21 in ngx_http_process_request (r=0x8a4b770) at src/http/ngx_http_request.c:1641
#16 0x00000000004498a5 in ngx_http_process_request_headers (rev=0x2af3569ee180) at src/http/ngx_http_request.c:1084
#17 0x0000000000449092 in ngx_http_process_request_line (rev=0x2af3569ee180) at src/http/ngx_http_request.c:889
#18 0x00000000004483bd in ngx_http_init_request (rev=0x2af3569ee180) at src/http/ngx_http_request.c:514
#19 0x0000000000433a6f in ngx_epoll_process_events (cycle=0x8995420, timer=79, flags=1)
    at src/event/modules/ngx_epoll_module.c:642
#20 0x0000000000425079 in ngx_process_events_and_timers (cycle=0x8995420) at src/event/ngx_event.c:245
#21 0x00000000004318ae in ngx_worker_process_cycle (cycle=0x8995420, data=0x0) at src/os/unix/ngx_process_cycle.c:795
#22 0x000000000042e621 in ngx_spawn_process (cycle=0x8995420, proc=0x4316df <ngx_worker_process_cycle>, data=0x0,
    name=0x571ef6 "worker process", respawn=-3) at src/os/unix/ngx_process.c:196
#23 0x0000000000430791 in ngx_start_worker_processes (cycle=0x8995420, n=8, type=-3) at src/os/unix/ngx_process_cycle.c:355
#24 0x000000000042fe9a in ngx_master_process_cycle (cycle=0x8995420) at src/os/unix/ngx_process_cycle.c:136
#25 0x0000000000403bf3 in main (argc=1, argv=0x7fff5446ef18) at src/core/nginx.c:401


When the ops team investigated, the disappointing finding was that this bug had already been fixed, in nginx 1.0.7; we were running the 0.8 series. The upstream changeset message reads:

Fixing cpu hog with all upstream servers marked "down".

The following configuration causes nginx to hog cpu due to infinite loop
in ngx_http_upstream_get_peer():

    upstream backend {
      server 127.0.0.1:8080 down;
      server 127.0.0.1:8080 down;
    }

    server {
       ...
       location / {
         proxy_pass http://backend;
       }
    }

Make sure we don't loop infinitely in ngx_http_upstream_get_peer() but stop
after resetting peer weights once.

Return 0 if we are stuck. This is guaranteed to work as peer 0 always exists,
and eventually ngx_http_upstream_get_round_robin_peer() will do the right
thing falling back to backup servers or returning NGX_BUSY.
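The fix described above can be modeled in Python. This is a simplified sketch of weighted round-robin peer selection with the one-reset guard, not nginx's actual C code; the peer fields are reduced to the essentials:

```python
# Simplified model of nginx's round-robin peer selection, with the
# guard quoted above: instead of looping forever when no peer is
# selectable, reset the weights at most once and then fall back to
# peer 0 (which always exists).  A sketch, not nginx's real code.

def get_peer(peers):
    """peers: list of dicts with 'down', 'weight', 'current_weight'.
    Returns the index of the chosen peer."""
    for attempt in range(2):              # at most one weight reset
        for i, p in enumerate(peers):
            if p["down"]:
                continue
            if p["current_weight"] > 0:
                p["current_weight"] -= 1
                return i
        # no selectable peer: reset weights once and retry
        for p in peers:
            p["current_weight"] = p["weight"]
    return 0  # stuck (e.g. all peers down): stop instead of spinning

# With every upstream server marked "down" (the config that hogged the
# CPU), the guarded version returns instead of looping infinitely.
all_down = [{"down": True, "weight": 1, "current_weight": 0},
            {"down": True, "weight": 1, "current_weight": 0}]
assert get_peer(all_down) == 0

healthy = [{"down": False, "weight": 1, "current_weight": 1}]
assert get_peer(healthy) == 0
```

Without the `range(2)` bound, the all-down case would cycle between "find nothing" and "reset weights" forever, which is exactly the 100% CPU loop seen in the backtrace.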


We then tested with nginx 1.0.10 and, so far, have not reproduced last week's bug.

If we had tested this module earlier, perhaps we would have been the first to discover this bug.
There will be more chances, though: with tcpcopy we can reproduce any kind of online traffic, so we are not worried about failing to find bugs in popular open-source software.



The machine from this morning did not reproduce the bug, but a new machine did.
The core file shows:



#0  0x000000000046ae0b in ngx_http_upstream_get_round_robin_peer (pc=0x1b822678, data=0x1b826e10)
    at src/http/ngx_http_upstream_round_robin.c:482
#1  0x00000000004a6a60 in ngx_http_upstream_get_keepalive_peer (pc=0x1b822678, data=0x1b826de0)
    at ../ngx_http_upstream_keepalive-c6396fef9295/ngx_http_upstream_keepalive_module.c:227
#2  0x0000000000429fab in ngx_event_connect_peer (pc=0x1b822678) at src/event/ngx_event_connect.c:24
#3  0x000000000046219c in ngx_http_upstream_connect (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1103
#4  0x0000000000465f61 in ngx_http_upstream_next (r=0x1b821230, u=0x1b822668, ft_type=2)
    at src/http/ngx_http_upstream.c:2900
#5  0x00000000004631c0 in ngx_http_upstream_process_header (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1547
#6  0x0000000000462d8c in ngx_http_upstream_send_request (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1414
#7  0x00000000004625de in ngx_http_upstream_connect (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1204
#8  0x000000000046129d in ngx_http_upstream_init_request (r=0x1b821230) at src/http/ngx_http_upstream.c:631
#9  0x0000000000460a6d in ngx_http_upstream_init (r=0x1b821230) at src/http/ngx_http_upstream.c:432
#10 0x000000000045782b in ngx_http_read_client_request_body (r=0x1b821230, post_handler=0x460989 <ngx_http_upstream_init>)
    at src/http/ngx_http_request_body.c:154
#11 0x00000000004a0875 in ngx_hmux_handler (r=0x1b821230) at ../hmux/ngx_hmux_module.c:673
#12 0x000000000043ffd1 in ngx_http_core_content_phase (r=0x1b821230, ph=0x1b5d4f88) at src/http/ngx_http_core_module.c:1365
#13 0x000000000043ebf5 in ngx_http_core_run_phases (r=0x1b821230) at src/http/ngx_http_core_module.c:861
#14 0x000000000043eb71 in ngx_http_handler (r=0x1b821230) at src/http/ngx_http_core_module.c:844
#15 0x000000000044c9a1 in ngx_http_process_request (r=0x1b821230) at src/http/ngx_http_request.c:1665
#16 0x000000000044b34d in ngx_http_process_request_headers (rev=0x2b3f10a3f4f0) at src/http/ngx_http_request.c:1111
#17 0x000000000044ab26 in ngx_http_process_request_line (rev=0x2b3f10a3f4f0) at src/http/ngx_http_request.c:911
#18 0x0000000000449d7d in ngx_http_init_request (rev=0x2b3f10a3f4f0) at src/http/ngx_http_request.c:518
#19 0x0000000000434ea7 in ngx_epoll_process_events (cycle=0x1b5ae090, timer=1352, flags=1)
    at src/event/modules/ngx_epoll_module.c:678
#20 0x0000000000426480 in ngx_process_events_and_timers (cycle=0x1b5ae090) at src/event/ngx_event.c:245
#21 0x0000000000432d0c in ngx_worker_process_cycle (cycle=0x1b5ae090, data=0x0) at src/os/unix/ngx_process_cycle.c:801
#22 0x000000000042faad in ngx_spawn_process (cycle=0x1b5ae090, proc=0x432b60 <ngx_worker_process_cycle>, data=0x0,
    name=0x57450e "worker process", respawn=-3) at src/os/unix/ngx_process.c:196
#23 0x0000000000431c12 in ngx_start_worker_processes (cycle=0x1b5ae090, n=8, type=-3) at src/os/unix/ngx_process_cycle.c:360
#24 0x00000000004312fe in ngx_master_process_cycle (cycle=0x1b5ae090) at src/os/unix/ngx_process_cycle.c:136
#25 0x0000000000403c8c in main (argc=1, argv=0x7fffba823048) at src/core/nginx.c:405
(gdb) info lo
rrp = 0x1b826e10
now = 1321860194
m = 8
rc = 461502464
i = 18446744061305903636        <-- note the value of i here
n = 0
c = 0x40
peer = 0x1b5cf2c0
peers = 0x1
(gdb) p *rrp
$5 = {peers = 0x1b5cf230, current = 3, tried = 0x1b826e28, data = 4}
(gdb) p rrp->peers->peer
$6 = {sockaddr = 0x1b5c63b8, socklen = 16, name = {len = 19, data = 0x1b5c63c8 "xxx.xxx.xxx.xxx:36901server"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860193, max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}
(gdb) p rrp->peers->peer
$7 = {sockaddr = 0x1b5c6418, socklen = 16, name = {len = 19, data = 0x1b5c6428 "xxx.xxx.xxx.xxx:36902server"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860194, max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}
(gdb) p rrp->peers->peer
$8 = {sockaddr = 0x1b5c6478, socklen = 16, name = {len = 19, data = 0x1b5c6488 "xxx.xxx.xxx.xxx:36903server"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860194, max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}
(gdb) p rrp->peers->peer
$9 = {sockaddr = 0x1b5c64d8, socklen = 16, name = {len = 19, data = 0x1b5c64e8 "xxx.xxx.xxx.xxx:36904keepalive"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860193, max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}

(gdb) up 1
#1  0x00000000004a6a60 in ngx_http_upstream_get_keepalive_peer (pc=0x1b822678, data=0x1b826de0)
    at ../ngx_http_upstream_keepalive-c6396fef9295/ngx_http_upstream_keepalive_module.c:227
227         rc = kp->original_get_peer(pc, kp->data);
(gdb) info lo
kp = 0x1b826de0
item = 0x6300000000
rc = 0
q = 0x1b826d51
cache = 0x0
c = 0x1b821c50
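The suspicious value of `i` flagged in the locals above is easier to read once you notice that `i` is stored in an unsigned 64-bit variable (`ngx_uint_t` on this platform), so a negative value wraps around to just below 2^64. A quick arithmetic check:

```python
# i is an unsigned 64-bit integer in nginx (ngx_uint_t), so a negative
# value wraps around to a huge number just below 2**64.
i = 18446744061305903636          # the value seen in the core file

# Interpret the same 64-bit pattern as a signed integer.
signed_i = i - 2**64 if i >= 2**63 else i
print(signed_i)                   # -12403647980

# A sane loop index would be a small non-negative number; a value of
# this magnitude suggests the index underflowed or was corrupted,
# which is consistent with the infinite loop in the backtrace.
assert signed_i == -12403647980
```
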

nginx version and build options:
# /home/nginx/sbin/nginx -V
nginx: nginx version: nginx/1.0.10
nginx: built by gcc 4.1.2 20080704 (Red Hat 4.1.2-48)
nginx: TLS SNI support enabled
nginx: configure arguments: --prefix=/home/nginx --user=nginx --group=nginx --without-select_module --without-poll_module --with-http_ssl_module --with-http_realip_module --with-http_gzip_static_module --with-http_stub_status_module --without-http_fastcgi_module --without-http_memcached_module --without-http_empty_gif_module --with-pcre=../pcre-7.9 --with-openssl=../openssl-0.9.8r --add-module=../hmux --add-module=../ngx_http_upstream_keepalive-c6396fef9295 --with-debug
Part of the nginx configuration:
events {
    worker_connections 8192;
    use epoll;
    epoll_events 4096;
    accept_mutex off;
}

...
    upstream yyy {
      server xxx.xxx.xxx.xxx:36901;
      server xxx.xxx.xxx.xxx:36902;
      server xxx.xxx.xxx.xxx:36903;
      server xxx.xxx.xxx.xxx:36904;

      keepalive 1024;
    }
...

nginx experts, please help us confirm whether this is indeed an nginx bug. Thanks in advance.
Contact:
wangbin5790@hotmail.com

winston posted on 2012-3-19 22:43:33

Be careful: don't upgrade nginx blindly
Recently we added some new features and took the opportunity to upgrade nginx to 1.0.0, only to get burned by an nginx bug.
Below are the test results for various nginx versions, driven by tcpcopy and analyzed from access.log:

nginx-1.0.14: no problem
    $ grep '16/Mar/2012:11:37' access.log | wc -l
    107244
    $ grep '16/Mar/2012:11:37' access.log | wc -l
    107043

nginx-1.0.5: no problem
    $ grep '16/Mar/2012:11:43' access.log | wc -l
    103962
    $ grep '16/Mar/2012:11:43' access.log | wc -l
    103832

nginx-1.0.2: no problem
    $ grep '16/Mar/2012:11:48' access.log | wc -l
    104612
    $ grep '16/Mar/2012:11:48' access.log | wc -l
    104383

nginx-1.0.1: no problem
    $ grep '16/Mar/2012:11:52' access.log | wc -l
    102697
    $ grep '16/Mar/2012:11:52' access.log | wc -l
    102645

nginx-1.0.0: problem
    $ grep '16/Mar/2012:10:13' access.log | wc -l
    117089
    $ grep '16/Mar/2012:10:13' access.log | wc -l
    95544

nginx-0.9.6: problem
    $ grep '16/Mar/2012:10:52' access.log | wc -l
    116557
    $ grep '16/Mar/2012:10:52' access.log | wc -l
    95552

nginx-0.9.5: problem
    $ grep '16/Mar/2012:11:1' access.log | wc -l
    1130305
    $ grep '16/Mar/2012:11:1' access.log | wc -l
    941842

nginx-0.9.4: no problem
    $ grep '16/Mar/2012:10:47' access.log | wc -l
    114023
    $ grep '16/Mar/2012:10:47' access.log | wc -l
    113723

nginx-0.9.0
    $ grep '16/Mar/2012:10:40' access.log | wc -l
    115428
    $ grep '16/Mar/2012:10:40' access.log | wc -l
    115203

nginx-0.8.53: no problem
    $ grep '16/Mar/2012:10:30' access.log | wc -l
    117593
    $ grep '16/Mar/2012:10:30' access.log | wc -l
    117466

nginx-0.7.63: no problem

With tcpcopy, a version can be considered healthy if the loss rate of the forwarded requests is low.
From the results above, nginx-0.9.5 through nginx-1.0.0 show a very high request loss rate; analysis of tcpcopy's logs shows large numbers of connection requests being refused.
The other versions do not have this problem.
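The loss rate here is just the relative difference between the two per-minute counts. A quick check with two of the numbers above (the "problem" threshold itself is a judgment call, not something from the logs):

```python
# Request loss rate between the two access.log counts for the same
# minute (numbers taken from the results above).
def loss_rate(sent, logged):
    """Fraction of forwarded requests missing from the second log count."""
    return (sent - logged) / sent

# nginx-1.0.0: high loss -> problem
print(round(loss_rate(117089, 95544) * 100, 1))   # 18.4 (percent)

# nginx-1.0.14: negligible loss -> healthy
print(round(loss_rate(107244, 107043) * 100, 2))  # 0.19 (percent)
```

An 18% loss per minute versus a fraction of a percent is what separates the "problem" versions from the healthy ones in the table.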

Developers and ops engineers: be careful when upgrading. Don't assume a newer version is free of problems; do the honest work of regression testing with tcpcopy.

