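ProxySQL Integer Overflow
I recently picked up ProxySQL. After a quick trial in a non-production environment it looked good, so I routed a small share of production traffic through it. Then something odd showed up: clients would, with high probability, fail to connect to the MySQL servers and report this error: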
Max connect timeout reached while reaching hostgroup 3034 after 10000ms
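This was frustrating. Nothing I found online matched my situation. I confirmed that every server in the runtime_mysql_servers table was ONLINE, that the checks recorded in the monitor.mysql_server_connect_log, monitor.mysql_server_ping_log, and monitor.mysql_server_replication_lag_log tables all looked perfectly normal, and that connecting to MySQL directly was very fast.
I tried raising the following variables, which did not help either: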
SET mysql-connect_retries_delay = 500; -- default 1ms
SET mysql-connect_timeout_server = 5000; -- default 3000ms
SET mysql-connect_timeout_server_max = 35000; -- default 10000ms
SET mysql-connection_delay_multiplex_ms = 100; -- default 0ms
-- result in "Aborted connection" warnings
SET mysql-connection_max_age_ms = 900000; -- default 0ms
With no other option left, I went to read the source. Starting from the error message, the relevant code was easy to find:
MySrvC *MyHGC::get_random_MySrvC() {
MySrvC *mysrvc=NULL;
unsigned int j;
unsigned int sum=0;
unsigned int TotalUsedConn=0;
unsigned int l=mysrvs->cnt();
...
if ((len * sum) <= (TotalUsedConn * mysrvc->weight * 1.5 + 1)) {
...
}
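Here len is the number of TCP connections currently in use on the MySQL server being visited in the current host group, sum is the total weight of all ONLINE MySQL servers, TotalUsedConn is the number of TCP connections in use across the whole host group, and mysrvc->weight is the weight of the server being visited. All four variables are unsigned int, which is four bytes by default, so each product can easily overflow (wrap around modulo 2^32) before it is ever compared. That looked very suspicious.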
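The arithmetic behind that suspicion can be sketched as follows. This is only an illustration, not ProxySQL code: mul_wraps_u32 and mul_u32 are hypothetical helper names, and the sketch assumes the 32-bit unsigned int found on common platforms.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of ProxySQL): true when a * b does not fit
// in a 32-bit unsigned integer and would therefore wrap around.
bool mul_wraps_u32(std::uint32_t a, std::uint32_t b) {
    // Multiply in 64 bits, where two 32-bit operands can never overflow.
    const std::uint64_t wide = static_cast<std::uint64_t>(a) * b;
    return wide > std::numeric_limits<std::uint32_t>::max();
}

// Hypothetical helper: the value a 32-bit unsigned multiplication yields,
// that is, the true product reduced modulo 2^32.
std::uint32_t mul_u32(std::uint32_t a, std::uint32_t b) {
    const std::uint64_t wide = static_cast<std::uint64_t>(a) * b;
    return static_cast<std::uint32_t>(wide);  // keeps only the low 32 bits
}
```

For instance, mul_wraps_u32(100000, 100000) is true, and mul_u32(100000, 100000) is 1410065408 rather than 10000000000. No error is raised anywhere, so a comparison built on such a product silently uses the wrong number.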
The monitoring data showed the following state at the time of the failure:
db-1 weight=99999999 conn=22
db-2 weight=99999999 conn=20
db-3 weight=1 conn=1
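db-3 is the master. It sits in the readonly host group with a weight of 1 so that, if the replicas are kicked out of the host group for exceeding the replication lag limit, the master remains as a fallback; while the replicas are healthy, the master should take on as little read load as possible.
I wrote a small program modeled on the ProxySQL code to verify this: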
#include <cstdio>
#include <cstdlib>

__thread unsigned int g_seed;

// Small linear congruential generator; returns a value in [0, 32767].
inline int fastrand() {
    g_seed = (214013*g_seed+2531011);
    return (g_seed>>16)&0x7FFF;
}

int main(int argc, char** argv) {
    // Values taken from the monitoring data: db-1, db-2, db-3.
    unsigned int usedConns[] = { 22, 20, 1};
    unsigned int weights[] = { 99999999, 99999999, 1};
    unsigned int sum = 0;
    unsigned int TotalUsedConn = 0;
    unsigned int l = 3;
    unsigned int j;

    // Pass 1: total weight and total connections in use.
    for (j = 0; j < l; j++) {
        sum += weights[j];
        TotalUsedConn += usedConns[j];
        printf("j=%u weight=%u usedConn=%u sum=%u TotalUsedConn=%u\n", j, weights[j], usedConns[j], sum, TotalUsedConn);
    }
    printf("\n");

    // Pass 2: keep only servers that pass the load check.
    // Both products are computed in 32-bit unsigned arithmetic and can wrap.
    unsigned int New_sum=0;
    unsigned int New_TotalUsedConn=0;
    for (j = 0; j < l; j++) {
        unsigned int len = usedConns[j];
        unsigned int weight = weights[j];
        printf("\nj=%u len=%u weight=%u TotalUsedConn=%u sum=%u\n", j, len, weight, TotalUsedConn, sum);
        printf(" len*sum=%u TotalUsedConn*weight=%u TotalUsedConn*weight*1.5+1=%lf\n",
               len * sum, TotalUsedConn * weight, TotalUsedConn * weight * 1.5 + 1);
        if ((len * sum) <= (TotalUsedConn * weight * 1.5 + 1)) {
            printf("j=%u old New_sum=%u New_TotalUsedConn=%u\n", j, New_sum, New_TotalUsedConn);
            New_sum += weight;
            New_TotalUsedConn += len;
            printf("j=%u now New_sum=%u New_TotalUsedConn=%u\n", j, New_sum, New_TotalUsedConn);
        } else {
            printf(" !!! NOT\n");
        }
    }
    printf("\nNew_sum=%u New_TotalUsedConn=%u\n", New_sum, New_TotalUsedConn);

    // No server passed the check, so there is nothing to pick from.
    if (New_sum == 0) {
        printf("ERROR\n");
        return 0;
    }

    // Pass 3: weighted random pick among the servers that passed.
    unsigned int k;
    if (New_sum > 32768) {
        k = rand() % New_sum;
    } else {
        k = fastrand() % New_sum;
    }
    New_sum = 0;
    for (j = 0; j < l; j++) {
        unsigned int len = usedConns[j];
        unsigned int weight = weights[j];
        printf("\nj=%u len=%u weight=%u k=%u\n", j, len, weight, k);
        if ((len * sum) <= (TotalUsedConn * weight * 1.5 + 1)) {
            New_sum += weight;
            if (k <= New_sum) {
                printf("got %u because k(%u) <= New_sum(%u)!!!\n", j, k, New_sum);
                break;
            } else {
                printf(" !!! NOT ENOUGH\n");
            }
        } else {
            printf(" !!! NOT\n");
        }
    }
    return 0;
}
Running it confirmed the suspicion: the integer overflow makes the comparison go wrong, and all three servers end up excluded:
$ clang -o a a.cpp
$ ./a
j=0 weight=99999999 usedConn=22 sum=99999999 TotalUsedConn=22
j=1 weight=99999999 usedConn=20 sum=199999998 TotalUsedConn=42
j=2 weight=1 usedConn=1 sum=199999999 TotalUsedConn=43
j=0 len=22 weight=99999999 TotalUsedConn=43 sum=199999999
len*sum=105032682 TotalUsedConn*weight=5032661 TotalUsedConn*weight*1.5+1=7548992.500000
!!! NOT
j=1 len=20 weight=99999999 TotalUsedConn=43 sum=199999999
len*sum=3999999980 TotalUsedConn*weight=5032661 TotalUsedConn*weight*1.5+1=7548992.500000
!!! NOT
j=2 len=1 weight=1 TotalUsedConn=43 sum=199999999
len*sum=199999999 TotalUsedConn*weight=43 TotalUsedConn*weight*1.5+1=65.500000
!!! NOT
New_sum=0 New_TotalUsedConn=0
ERROR
Poor C/C++! Poor C/C++ programmers! Poor language designers and programmers who put performance first and correctness second!
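For comparison, here is a minimal sketch of the same eligibility test with both products widened to 64 bits before the comparison. It is one possible way to avoid the wraparound, not the upstream ProxySQL fix; is_eligible and its parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical rewrite of the load check (not ProxySQL's actual code).
//   len        : connections in use on this server
//   sum        : total weight of all ONLINE servers in the host group
//   total_used : connections in use across the host group
//   weight     : weight of this server
bool is_eligible(std::uint32_t len, std::uint32_t sum,
                 std::uint32_t total_used, std::uint32_t weight) {
    // Widen one operand first so each product is computed in 64 bits.
    const std::uint64_t lhs = static_cast<std::uint64_t>(len) * sum;
    const std::uint64_t rhs = static_cast<std::uint64_t>(total_used) * weight;
    // A double represents integers exactly only up to 2^53; beyond that the
    // comparison is approximate, which this sketch accepts.
    return static_cast<double>(lhs) <= static_cast<double>(rhs) * 1.5 + 1;
}
```

With the figures from the monitoring data (sum=199999999, TotalUsedConn=43), this returns true for db-1 (len=22) and db-2 (len=20), so a replica can be picked. It still returns false for db-3, whose single connection already exceeds the share its weight of 1 allows.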
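Until the official ProxySQL code fixes this problem, how large a weight is safe? The two multiplications above are connUsed * total_weight and totalConnUsed * weight. Assuming a single MySQL server holds at most 10,000 connections and there are 100 MySQL servers:
connUsed * total_weight = 10000 * (100 * weight) <= 2^32-1
totalConnUsed * weight = (10000 * 100) * weight <= 2^32-1
Both inequalities reduce to 1000000 * weight <= 2^32-1, so the upper bound for weight is a little over 4,000 (about 4294). Since our environment has far fewer MySQL servers than that, we can set it to 10000, or more conservatively to 1000. Compared with the master's weight=1, that is still enough to keep the master's share of the read load as small as possible.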
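The bound above can be computed for other cluster sizes with a small helper. This is a sketch under the same assumptions as the estimate: every server carries the same weight and holds at most the given number of connections. max_safe_weight is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: largest weight for which both 32-bit products
//   connUsed * total_weight  = max_conns * (servers * weight)
//   totalConnUsed * weight   = (max_conns * servers) * weight
// stay within 2^32 - 1, assuming all servers share the same weight.
std::uint32_t max_safe_weight(std::uint32_t max_conns_per_server,
                              std::uint32_t server_count) {
    // Both products reduce to max_conns * servers * weight.
    const std::uint64_t factor =
        static_cast<std::uint64_t>(max_conns_per_server) * server_count;
    if (factor == 0) {
        return UINT32_MAX;  // no connections or no servers: nothing can wrap
    }
    return static_cast<std::uint32_t>(UINT32_MAX / factor);
}
```

max_safe_weight(10000, 100) gives 4294, matching the estimate, while max_safe_weight(10000, 10) gives 42949, which is why a weight of 10000 is still safe in a smaller deployment.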