winston posted on 2012-3-19 22:42:26

tcpcopy helped us avoid two major nginx problems

When you deploy a particular version of nginx, you may not know what problems that version carries. How do you avoid problems that only show up after going live?
Below are two such problems (which other load-testing tools either could not find or could find only with great difficulty) that we successfully avoided by using tcpcopy, a tool we developed. We hope it is a useful reference.
The first was a bug in the keepalive module written by Maxim Dounin, the number-two nginx developer. Details:
http://blog.csdn.net/wangbin579/article/details/6995456

The second was the discovery that nginx-0.9.5 through nginx-1.0.0 behave badly under high load. Details:
http://blog.csdn.net/wangbin579/article/details/7360662


tcpcopy: a powerful tool for regression testing

You have improved your code, but you don't know whether the change introduces new problems, or whether there are cases you haven't considered. What do you do?
tcpcopy can solve these problems for you; it is especially well suited to regression testing.
tcpcopy works by copying live online requests (for example, HTTP requests) to a test server. Because copying the request packets is not very different from ordinary routing, it has almost no impact on the online system, while on the test machine the results are almost identical to what you would see in production.
It is worth noting that tcpcopy works best with HTTP requests.
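The mirroring idea above can be sketched in a few lines of Python. This is only a conceptual model with made-up names; the real tcpcopy duplicates packets at the IP layer rather than calling request handlers:

```python
# Conceptual model of tcpcopy: every request the online server handles
# is also delivered, unchanged, to a test server.  This is NOT how
# tcpcopy is implemented (it copies packets at the IP layer); it only
# illustrates the "mirror live traffic" idea.

def mirror_traffic(requests, online_server, test_server):
    """Feed each live request to the online server and, as a side
    effect, replay an identical copy against the test server."""
    responses = []
    for req in requests:
        responses.append(online_server(req))  # production path, unchanged
        test_server(req)                      # mirrored copy; response ignored
    return responses

# Toy servers: the online one answers, the test one only records what it saw.
seen_by_test = []
online = lambda req: f"200 OK for {req}"
test = seen_by_test.append

live = ["GET /a", "GET /b", "POST /c"]
answers = mirror_traffic(live, online, test)

# The test server saw exactly the production traffic.
assert seen_by_test == live
```

Because the test server receives the same request stream as production, any bug it exposes is one production would have hit, which is why this works so well for regression testing.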

Author: wangbin579, posted 2012-3-19 18:21:58



winston posted on 2012-3-19 22:43:05

Using tcpcopy, we found a serious nginx bug
Last week the ops team used tcpcopy to stress-test the nginx hmux module. Once the load reached more than 2000 requests per second, every nginx worker process went to 100% CPU.
We used gcore to produce a core file and found nginx stuck in the following infinite loop:

(gdb) bt
#0  0x00000000004690d8 in ngx_http_upstream_get_round_robin_peer (pc=0x8a51bd8, data=0x899fc88)
    at src/http/ngx_http_upstream_round_robin.c:482
#1  0x00000000004a45a0 in ngx_http_upstream_get_keepalive_peer (pc=0x8a51bd8, data=0x899fc58)
    at ../ngx_http_upstream_keepalive-c6396fef9295/ngx_http_upstream_keepalive_module.c:227
#2  0x0000000000428c53 in ngx_event_connect_peer (pc=0x8a51bd8) at src/event/ngx_event_connect.c:24
#3  0x0000000000460465 in ngx_http_upstream_connect (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1081
#4  0x000000000046428a in ngx_http_upstream_next (r=0x8a4b770, u=0x8a51bc8, ft_type=2) at src/http/ngx_http_upstream.c:2892
#5  0x0000000000461483 in ngx_http_upstream_process_header (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1525
#6  0x000000000046104f in ngx_http_upstream_send_request (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1392
#7  0x00000000004608a7 in ngx_http_upstream_connect (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1182
#8  0x000000000045f634 in ngx_http_upstream_init_request (r=0x8a4b770) at src/http/ngx_http_upstream.c:620
#9  0x000000000045ee79 in ngx_http_upstream_init (r=0x8a4b770) at src/http/ngx_http_upstream.c:429
#10 0x0000000000455dd4 in ngx_http_read_client_request_body (r=0x8a4b770, post_handler=0x45ed95 <ngx_http_upstream_init>)
    at src/http/ngx_http_request_body.c:153
#11 0x000000000049e3bd in ngx_hmux_handler (r=0x8a4b770) at ../hmux/ngx_hmux_module.c:673
#12 0x000000000043eb06 in ngx_http_core_content_phase (r=0x8a4b770, ph=0x89bc6c0) at src/http/ngx_http_core_module.c:1350
#13 0x000000000043d75c in ngx_http_core_run_phases (r=0x8a4b770) at src/http/ngx_http_core_module.c:852
#14 0x000000000043d6d8 in ngx_http_handler (r=0x8a4b770) at src/http/ngx_http_core_module.c:835
#15 0x000000000044af21 in ngx_http_process_request (r=0x8a4b770) at src/http/ngx_http_request.c:1641
#16 0x00000000004498a5 in ngx_http_process_request_headers (rev=0x2af3569ee180) at src/http/ngx_http_request.c:1084
#17 0x0000000000449092 in ngx_http_process_request_line (rev=0x2af3569ee180) at src/http/ngx_http_request.c:889
#18 0x00000000004483bd in ngx_http_init_request (rev=0x2af3569ee180) at src/http/ngx_http_request.c:514
#19 0x0000000000433a6f in ngx_epoll_process_events (cycle=0x8995420, timer=79, flags=1)
    at src/event/modules/ngx_epoll_module.c:642
#20 0x0000000000425079 in ngx_process_events_and_timers (cycle=0x8995420) at src/event/ngx_event.c:245
#21 0x00000000004318ae in ngx_worker_process_cycle (cycle=0x8995420, data=0x0) at src/os/unix/ngx_process_cycle.c:795
#22 0x000000000042e621 in ngx_spawn_process (cycle=0x8995420, proc=0x4316df <ngx_worker_process_cycle>, data=0x0,
    name=0x571ef6 "worker process", respawn=-3) at src/os/unix/ngx_process.c:196
#23 0x0000000000430791 in ngx_start_worker_processes (cycle=0x8995420, n=8, type=-3) at src/os/unix/ngx_process_cycle.c:355
#24 0x000000000042fe9a in ngx_master_process_cycle (cycle=0x8995420) at src/os/unix/ngx_process_cycle.c:136
#25 0x0000000000403bf3 in main (argc=1, argv=0x7fff5446ef18) at src/core/nginx.c:401


When the ops team investigated, the disappointing finding was that this bug had already been fixed, in nginx 1.0.7; we were running the 0.8 series. The upstream changeset message reads:

Fixing cpu hog with all upstream servers marked "down".

The following configuration causes nginx to hog cpu due to infinite loop
in ngx_http_upstream_get_peer():

    upstream backend {
      server 127.0.0.1:8080 down;
      server 127.0.0.1:8080 down;
    }

    server {
       ...
       location / {
         proxy_pass http://backend;
       }
    }

Make sure we don't loop infinitely in ngx_http_upstream_get_peer() but stop
after resetting peer weights once.

Return 0 if we are stuck. This is guaranteed to work as peer 0 always exists,
and eventually ngx_http_upstream_get_round_robin_peer() will do the right
thing falling back to backup servers or returning NGX_BUSY.
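The fix described above can be modeled in Python. This is a simplified sketch of weighted round-robin peer selection with the one-reset guard, not nginx's actual C code; the peer fields are reduced to the essentials:

```python
# Simplified model of nginx's round-robin peer selection, with the
# guard quoted above: instead of looping forever when no peer is
# selectable, reset the weights at most once and then fall back to
# peer 0 (which always exists).  A sketch, not nginx's real code.

def get_peer(peers):
    """peers: list of dicts with 'down', 'weight', 'current_weight'.
    Returns the index of the chosen peer."""
    for attempt in range(2):              # at most one weight reset
        for i, p in enumerate(peers):
            if p["down"]:
                continue
            if p["current_weight"] > 0:
                p["current_weight"] -= 1
                return i
        # no selectable peer: reset weights once and retry
        for p in peers:
            p["current_weight"] = p["weight"]
    return 0  # stuck (e.g. all peers down): stop instead of spinning

# With every upstream server marked "down" (the config that hogged the
# CPU), the guarded version returns instead of looping infinitely.
all_down = [{"down": True, "weight": 1, "current_weight": 0},
            {"down": True, "weight": 1, "current_weight": 0}]
assert get_peer(all_down) == 0

healthy = [{"down": False, "weight": 1, "current_weight": 1}]
assert get_peer(healthy) == 0
```

Without the `range(2)` bound, the all-down case would cycle between "find nothing" and "reset weights" forever, which is exactly the 100% CPU loop seen in the backtrace.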


We then tested with nginx 1.0.10 and, so far, have not reproduced last week's bug.

If we had tested this module earlier, perhaps we would have been the first to discover this bug.
There will be more chances, though: with tcpcopy we can reproduce any kind of online traffic, so we are not worried about failing to find bugs in popular open-source software.



The machine from this morning did not reproduce the bug, but a new machine did.
The core file shows:



#0  0x000000000046ae0b in ngx_http_upstream_get_round_robin_peer (pc=0x1b822678, data=0x1b826e10)
    at src/http/ngx_http_upstream_round_robin.c:482
#1  0x00000000004a6a60 in ngx_http_upstream_get_keepalive_peer (pc=0x1b822678, data=0x1b826de0)
    at ../ngx_http_upstream_keepalive-c6396fef9295/ngx_http_upstream_keepalive_module.c:227
#2  0x0000000000429fab in ngx_event_connect_peer (pc=0x1b822678) at src/event/ngx_event_connect.c:24
#3  0x000000000046219c in ngx_http_upstream_connect (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1103
#4  0x0000000000465f61 in ngx_http_upstream_next (r=0x1b821230, u=0x1b822668, ft_type=2)
    at src/http/ngx_http_upstream.c:2900
#5  0x00000000004631c0 in ngx_http_upstream_process_header (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1547
#6  0x0000000000462d8c in ngx_http_upstream_send_request (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1414
#7  0x00000000004625de in ngx_http_upstream_connect (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1204
#8  0x000000000046129d in ngx_http_upstream_init_request (r=0x1b821230) at src/http/ngx_http_upstream.c:631
#9  0x0000000000460a6d in ngx_http_upstream_init (r=0x1b821230) at src/http/ngx_http_upstream.c:432
#10 0x000000000045782b in ngx_http_read_client_request_body (r=0x1b821230, post_handler=0x460989 <ngx_http_upstream_init>)
    at src/http/ngx_http_request_body.c:154
#11 0x00000000004a0875 in ngx_hmux_handler (r=0x1b821230) at ../hmux/ngx_hmux_module.c:673
#12 0x000000000043ffd1 in ngx_http_core_content_phase (r=0x1b821230, ph=0x1b5d4f88) at src/http/ngx_http_core_module.c:1365
#13 0x000000000043ebf5 in ngx_http_core_run_phases (r=0x1b821230) at src/http/ngx_http_core_module.c:861
#14 0x000000000043eb71 in ngx_http_handler (r=0x1b821230) at src/http/ngx_http_core_module.c:844
#15 0x000000000044c9a1 in ngx_http_process_request (r=0x1b821230) at src/http/ngx_http_request.c:1665
#16 0x000000000044b34d in ngx_http_process_request_headers (rev=0x2b3f10a3f4f0) at src/http/ngx_http_request.c:1111
#17 0x000000000044ab26 in ngx_http_process_request_line (rev=0x2b3f10a3f4f0) at src/http/ngx_http_request.c:911
#18 0x0000000000449d7d in ngx_http_init_request (rev=0x2b3f10a3f4f0) at src/http/ngx_http_request.c:518
#19 0x0000000000434ea7 in ngx_epoll_process_events (cycle=0x1b5ae090, timer=1352, flags=1)
    at src/event/modules/ngx_epoll_module.c:678
#20 0x0000000000426480 in ngx_process_events_and_timers (cycle=0x1b5ae090) at src/event/ngx_event.c:245
#21 0x0000000000432d0c in ngx_worker_process_cycle (cycle=0x1b5ae090, data=0x0) at src/os/unix/ngx_process_cycle.c:801
#22 0x000000000042faad in ngx_spawn_process (cycle=0x1b5ae090, proc=0x432b60 <ngx_worker_process_cycle>, data=0x0,
    name=0x57450e "worker process", respawn=-3) at src/os/unix/ngx_process.c:196
#23 0x0000000000431c12 in ngx_start_worker_processes (cycle=0x1b5ae090, n=8, type=-3) at src/os/unix/ngx_process_cycle.c:360
#24 0x00000000004312fe in ngx_master_process_cycle (cycle=0x1b5ae090) at src/os/unix/ngx_process_cycle.c:136
#25 0x0000000000403c8c in main (argc=1, argv=0x7fffba823048) at src/core/nginx.c:405
(gdb) info lo
rrp = 0x1b826e10
now = 1321860194
m = 8
rc = 461502464
i = 18446744061305903636        <-- note the value of i here
n = 0
c = 0x40
peer = 0x1b5cf2c0
peers = 0x1
(gdb) p *rrp
$5 = {peers = 0x1b5cf230, current = 3, tried = 0x1b826e28, data = 4}
(gdb) p rrp->peers->peer
$6 = {sockaddr = 0x1b5c63b8, socklen = 16, name = {len = 19, data = 0x1b5c63c8 "xxx.xxx.xxx.xxx:36901server"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860193, max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}
(gdb) p rrp->peers->peer
$7 = {sockaddr = 0x1b5c6418, socklen = 16, name = {len = 19, data = 0x1b5c6428 "xxx.xxx.xxx.xxx:36902server"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860194, max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}
(gdb) p rrp->peers->peer
$8 = {sockaddr = 0x1b5c6478, socklen = 16, name = {len = 19, data = 0x1b5c6488 "xxx.xxx.xxx.xxx:36903server"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860194, max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}
(gdb) p rrp->peers->peer
$9 = {sockaddr = 0x1b5c64d8, socklen = 16, name = {len = 19, data = 0x1b5c64e8 "xxx.xxx.xxx.xxx:36904keepalive"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860193, max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}

(gdb) up 1
#1  0x00000000004a6a60 in ngx_http_upstream_get_keepalive_peer (pc=0x1b822678, data=0x1b826de0)
    at ../ngx_http_upstream_keepalive-c6396fef9295/ngx_http_upstream_keepalive_module.c:227
227         rc = kp->original_get_peer(pc, kp->data);
(gdb) info lo
kp = 0x1b826de0
item = 0x6300000000
rc = 0
q = 0x1b826d51
cache = 0x0
c = 0x1b821c50
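The suspicious value of `i` flagged in the locals above is easier to read once you notice that `i` is stored in an unsigned 64-bit variable (`ngx_uint_t` on this platform), so a negative value wraps around to just below 2^64. A quick arithmetic check:

```python
# i is an unsigned 64-bit integer in nginx (ngx_uint_t), so a negative
# value wraps around to a huge number just below 2**64.
i = 18446744061305903636          # the value seen in the core file

# Interpret the same 64-bit pattern as a signed integer.
signed_i = i - 2**64 if i >= 2**63 else i
print(signed_i)                   # -12403647980

# A sane loop index would be a small non-negative number; a value of
# this magnitude suggests the index underflowed or was corrupted,
# which is consistent with the infinite loop in the backtrace.
assert signed_i == -12403647980
```
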

nginx version and build options:
# /home/nginx/sbin/nginx -V
nginx: nginx version: nginx/1.0.10
nginx: built by gcc 4.1.2 20080704 (Red Hat 4.1.2-48)
nginx: TLS SNI support enabled
nginx: configure arguments: --prefix=/home/nginx --user=nginx --group=nginx --without-select_module --without-poll_module --with-http_ssl_module --with-http_realip_module --with-http_gzip_static_module --with-http_stub_status_module --without-http_fastcgi_module --without-http_memcached_module --without-http_empty_gif_module --with-pcre=../pcre-7.9 --with-openssl=../openssl-0.9.8r --add-module=../hmux --add-module=../ngx_http_upstream_keepalive-c6396fef9295 --with-debug
Part of the nginx configuration:
events {
    worker_connections 8192;
    use epoll;
    epoll_events 4096;
    accept_mutex off;
}

...
    upstream yyy {
      server xxx.xxx.xxx.xxx:36901;
      server xxx.xxx.xxx.xxx:36902;
      server xxx.xxx.xxx.xxx:36903;
      server xxx.xxx.xxx.xxx:36904;

      keepalive 1024;
    }
...

nginx experts, please help us confirm whether this is indeed an nginx bug. Thanks in advance.
Contact:
wangbin5790@hotmail.com

winston posted on 2012-3-19 22:43:33

Be careful: don't upgrade nginx blindly
Recently we added some new features and took the opportunity to upgrade nginx to 1.0.0, only to get burned by an nginx bug.
Below are the test results for various nginx versions, driven by tcpcopy and analyzed from access.log:

nginx-1.0.14: no problem
    $ grep '16/Mar/2012:11:37' access.log | wc -l
    107244
    $ grep '16/Mar/2012:11:37' access.log | wc -l
    107043

nginx-1.0.5: no problem
    $ grep '16/Mar/2012:11:43' access.log | wc -l
    103962
    $ grep '16/Mar/2012:11:43' access.log | wc -l
    103832

nginx-1.0.2: no problem
    $ grep '16/Mar/2012:11:48' access.log | wc -l
    104612
    $ grep '16/Mar/2012:11:48' access.log | wc -l
    104383

nginx-1.0.1: no problem
    $ grep '16/Mar/2012:11:52' access.log | wc -l
    102697
    $ grep '16/Mar/2012:11:52' access.log | wc -l
    102645

nginx-1.0.0: problem
    $ grep '16/Mar/2012:10:13' access.log | wc -l
    117089
    $ grep '16/Mar/2012:10:13' access.log | wc -l
    95544

nginx-0.9.6: problem
    $ grep '16/Mar/2012:10:52' access.log | wc -l
    116557
    $ grep '16/Mar/2012:10:52' access.log | wc -l
    95552

nginx-0.9.5: problem
    $ grep '16/Mar/2012:11:1' access.log | wc -l
    1130305
    $ grep '16/Mar/2012:11:1' access.log | wc -l
    941842

nginx-0.9.4: no problem
    $ grep '16/Mar/2012:10:47' access.log | wc -l
    114023
    $ grep '16/Mar/2012:10:47' access.log | wc -l
    113723

nginx-0.9.0
    $ grep '16/Mar/2012:10:40' access.log | wc -l
    115428
    $ grep '16/Mar/2012:10:40' access.log | wc -l
    115203

nginx-0.8.53: no problem
    $ grep '16/Mar/2012:10:30' access.log | wc -l
    117593
    $ grep '16/Mar/2012:10:30' access.log | wc -l
    117466

nginx-0.7.63: no problem

With tcpcopy, a version can be considered healthy if the loss rate of the forwarded requests is low.
From the results above, nginx-0.9.5 through nginx-1.0.0 show a very high request loss rate; analysis of tcpcopy's logs shows large numbers of connection requests being refused.
The other versions do not have this problem.
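The loss rate here is just the relative difference between the two per-minute counts. A quick check with two of the numbers above (the "problem" threshold itself is a judgment call, not something from the logs):

```python
# Request loss rate between the two access.log counts for the same
# minute (numbers taken from the results above).
def loss_rate(sent, logged):
    """Fraction of forwarded requests missing from the second log count."""
    return (sent - logged) / sent

# nginx-1.0.0: high loss -> problem
print(round(loss_rate(117089, 95544) * 100, 1))   # 18.4 (percent)

# nginx-1.0.14: negligible loss -> healthy
print(round(loss_rate(107244, 107043) * 100, 2))  # 0.19 (percent)
```

An 18% loss per minute versus a fraction of a percent is what separates the "problem" versions from the healthy ones in the table.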

Developers and ops engineers: be careful when upgrading. Don't assume a newer version is free of problems; do the honest work of regression testing with tcpcopy.

