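I recently spent some time learning ProxySQL. After trying it out in a non-production environment I liked it, so I routed a small share of production traffic through it. Then something strange showed up: clients had a high chance of failing to connect to the MySQL servers, with the following error: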

Max connect timeout reached while reaching hostgroup 3034 after 10000ms

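This was frustrating. I searched online, but none of the cases I found matched mine. I confirmed that every server in the runtime_mysql_servers table was ONLINE, that the checks recorded in monitor.mysql_server_connect_log, monitor.mysql_server_ping_log, and monitor.mysql_server_replication_lag_log all looked normal, and that connecting to MySQL directly was fast.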

I tried raising the following variables, but it made no difference:

SET mysql-connect_retries_delay = 500;  -- default 1ms
SET mysql-connect_timeout_server = 5000; -- default 3000ms
SET mysql-connect_timeout_server_max = 35000; -- default 10000ms

SET mysql-connection_delay_multiplex_ms = 100; -- default 0ms

-- result in "Aborted connection" warnings
SET mysql-connection_max_age_ms = 900000; -- default 0ms

With no other options left, I went to read the source. The error message made it easy to find the relevant code:

MySrvC *MyHGC::get_random_MySrvC() {
  MySrvC *mysrvc=NULL;
  unsigned int j;
  unsigned int sum=0;
  unsigned int TotalUsedConn=0;
  unsigned int l=mysrvs->cnt();

  ...

    if ((len * sum) <= (TotalUsedConn * mysrvc->weight * 1.5 + 1)) {
 
  ...
}

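Here len is the number of in-use TCP connections on the MySQL server currently being visited in this host group, sum is the total weight of all ONLINE MySQL servers, TotalUsedConn is the number of in-use TCP connections across the whole host group, and mysrvc->weight is the weight of the server currently being visited. All four variables are unsigned int, which is 4 bytes on common platforms, so their products can easily overflow (wrap around modulo 2^32). On the right-hand side, TotalUsedConn * mysrvc->weight is evaluated in unsigned int before the result is multiplied by 1.5 as a double, so that product can wrap as well. This looked very suspicious.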

I checked the monitoring data. At the time of the failure, the host group looked like this:

db-1  weight=99999999 conn=22
db-2  weight=99999999 conn=20
db-3  weight=1        conn=1

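db-3 is the primary. It sits in the read-only host group with a weight of 1 so that a fallback remains if the replicas exceed the replication lag limit and get kicked out of the host group, while under normal conditions the primary takes as little read traffic as possible.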

I wrote a small program modeled on the ProxySQL code to verify this:

#include <cstdio>
#include <cstdlib>

__thread unsigned int g_seed;

inline int fastrand() {
    g_seed = (214013*g_seed+2531011);
    return (g_seed>>16)&0x7FFF;
}

int main(int argc, char** argv) {
    unsigned int usedConns[] = { 22,       20,       1};
    unsigned int weights[]   = { 99999999, 99999999, 1};

    unsigned int sum = 0;
    unsigned int TotalUsedConn = 0;
    unsigned int l = 3;
    unsigned int j;

    for (j = 0; j < l; j++) {
        sum += weights[j];
        TotalUsedConn += usedConns[j];
        printf("j=%u weight=%u usedConn=%u sum=%u TotalUsedConn=%u\n", j, weights[j], usedConns[j], sum, TotalUsedConn);
    }
    printf("\n");

    unsigned int New_sum=0;
    unsigned int New_TotalUsedConn=0;

    for (j = 0; j < l; j++) {
        unsigned int len = usedConns[j];
        unsigned int weight = weights[j];
        printf("\nj=%u len=%u weight=%u TotalUsedConn=%u sum=%u\n", j, len, weight, TotalUsedConn, sum);
        printf("        len*sum=%u TotalUsedConn*weight=%u TotalUsedConn*weight*1.5+1=%lf\n",
                len * sum, TotalUsedConn * weight, TotalUsedConn * weight * 1.5 + 1);

        if ((len * sum) <= (TotalUsedConn * weight * 1.5 + 1)) {
            printf("j=%u old New_sum=%u New_TotalUsedConn=%u\n", j, New_sum, New_TotalUsedConn);
            New_sum += weight;
            New_TotalUsedConn += len;
            printf("j=%u now New_sum=%u New_TotalUsedConn=%u\n", j, New_sum, New_TotalUsedConn);
        } else {
            printf(" !!! NOT\n");
        }
    }

    printf("\nNew_sum=%u New_TotalUsedConn=%u\n", New_sum, New_TotalUsedConn);
    if (New_sum == 0) {
        printf("ERROR\n");
        return 0;
    }

    unsigned int k;
    if (New_sum > 32768) {
        k = rand() % New_sum;
    } else {
        k = fastrand() % New_sum;
    }

    New_sum = 0;
    for (j = 0; j < l; j++) {
        unsigned int len = usedConns[j];
        unsigned int weight = weights[j];

        printf("\nj=%u len=%u weight=%u k=%u\n", j, len, weight, k);
        if ((len * sum) <= (TotalUsedConn * weight * 1.5 + 1)) {
            New_sum += weight;
            if (k <= New_sum) {
                printf("got %u because k(%u) <= New_sum(%u)!!!\n", j, k, New_sum);
                break;
            } else {
                printf(" !!! NOT ENOUGH\n");
            }
        } else {
            printf(" !!! NOT\n");
        }
    }

    return 0;
}

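Running it confirmed the suspicion. The integers overflowed and the comparisons went wrong: for db-1 and db-2 the wrapped products make the test fail, and db-3 fails it as intended because of its weight of 1, so all three servers were excluded: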

$ clang -o a a.cpp
$ ./a
j=0 weight=99999999 usedConn=22 sum=99999999 TotalUsedConn=22
j=1 weight=99999999 usedConn=20 sum=199999998 TotalUsedConn=42
j=2 weight=1 usedConn=1 sum=199999999 TotalUsedConn=43


j=0 len=22 weight=99999999 TotalUsedConn=43 sum=199999999
        len*sum=105032682 TotalUsedConn*weight=5032661 TotalUsedConn*weight*1.5+1=7548992.500000
 !!! NOT

j=1 len=20 weight=99999999 TotalUsedConn=43 sum=199999999
        len*sum=3999999980 TotalUsedConn*weight=5032661 TotalUsedConn*weight*1.5+1=7548992.500000
 !!! NOT

j=2 len=1 weight=1 TotalUsedConn=43 sum=199999999
        len*sum=199999999 TotalUsedConn*weight=43 TotalUsedConn*weight*1.5+1=65.500000
 !!! NOT

New_sum=0 New_TotalUsedConn=0
ERROR

Poor C/C++! Poor C/C++ programmers! Poor language designers and programmers who put performance first and correctness second!

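Until the upstream ProxySQL code fixes this, how large a weight is safe? The two multiplications above are connUsed * total_weight and totalConnUsed * weight. Assuming a single MySQL server holds at most 10,000 connections and there are 100 MySQL servers, then: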

 connUsed * total_weight = 10000 * (100 * weight) <= 2^32 - 1
 totalConnUsed * weight = (10000 * 100) * weight <= 2^32 - 1

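So the upper bound for weight is a little over 4000: (2^32 - 1) / 10^6 is about 4294. Since our environment has far fewer MySQL servers than that, we could go up to 10000, or more conservatively 1000. Compared with the primary's weight=1, either is enough to keep the read load on the primary as low as possible.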