2023-09-03
Doris
Incident Report
1. Symptoms
At 21:25 on the evening of the 24th, the FE on node thing-14 became unavailable.
2023-08-24 21:24:57,976 WARN (doris-mysql-nio-pool-21504|1159845) [ReadListener.lambda$handleEvent$0():58] Exception happened in one session([remote ip: 10.64.18.82]).
java.io.IOException: Error happened when receiving packet.
at org.apache.doris.qe.ConnectProcessor.processOnce(ConnectProcessor.java:534) ~[doris-fe.jar:1.0-SNAPSHOT]
at org.apache.doris.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:50) ~[doris-fe.jar:1.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
2023-08-24 21:24:57,976 WARN (doris-mysql-nio-pool-21500|1159757) [ReadListener.lambda$handleEvent$0():58] Exception happened in one session([remote ip: 10.64.0.130]).
java.io.IOException: Error happened when receiving packet.
at org.apache.doris.qe.ConnectProcessor.processOnce(ConnectProcessor.java:534) ~[doris-fe.jar:1.0-SNAPSHOT]
at org.apache.doris.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:50) ~[doris-fe.jar:1.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
2023-08-24 21:24:57,977 WARN (heartbeat mgr|32) [HeartbeatMgr.runAfterCatalogReady():139] get bad heartbeat response: type: BROKER, status: BAD, msg: java.net.ConnectException: Connection refused (Connection refused), name: hdfs_broker, host: 10.64.3.138, port: 8000
2023-08-24 21:24:57,978 WARN (thrift-server-pool-95589|1159853) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,978 WARN (thrift-server-pool-95568|1159778) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,978 WARN (thrift-server-pool-95550|1159748) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,978 WARN (thrift-server-pool-95497|1159476) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,978 WARN (thrift-server-pool-95500|1159479) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,978 WARN (thrift-server-pool-95547|1159682) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,978 WARN (thrift-server-pool-95468|1159269) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,978 WARN (thrift-server-pool-95548|1159683) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,978 WARN (thrift-server-pool-95475|1159276) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,978 WARN (thrift-server-pool-95566|1159776) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:24:57,980 WARN (thrift-server-pool-95550|1159748) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,984 WARN (thrift-server-pool-95497|1159476) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:24:57,985 WARN (thrift-server-pool-95587|1159844) [FrontendServiceImpl.loadTxn2PC():895] failed to commit txn 49285774: errCode = 2, detailMessage = transaction [49285774] is already committed, not pre-committed.
2023-08-24 21:24:57,985 WARN (thrift-server-pool-95475|1159276) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:24:57,985 WARN (thrift-server-pool-95468|1159269) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:24:57,985 WARN (thrift-server-pool-95547|1159682) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:24:57,985 WARN (thrift-server-pool-95568|1159778) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:24:57,985 WARN (thrift-server-pool-95548|1159683) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:24:57,985 WARN (thrift-server-pool-95566|1159776) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:24:57,985 WARN (thrift-server-pool-95500|1159479) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:24:57,986 WARN (thrift-server-pool-95550|1159748) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,986 WARN (thrift-server-pool-95534|1159668) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,986 WARN (thrift-server-pool-95475|1159276) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,987 WARN (thrift-server-pool-95468|1159269) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,987 WARN (thrift-server-pool-95547|1159682) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,987 WARN (thrift-server-pool-95500|1159479) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,988 WARN (thrift-server-pool-95548|1159683) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,988 WARN (thrift-server-pool-95534|1159668) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:24:57,988 WARN (thrift-server-pool-95568|1159778) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,989 WARN (thrift-server-pool-95468|1159269) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:57,992 WARN (thrift-server-pool-95550|1159748) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:24:58,015 WARN (thrift-server-pool-95588|1159852) [FrontendServiceImpl.loadTxn2PC():895] failed to commit txn 49285772: errCode = 2, detailMessage = transaction [49285772] is already committed, not pre-committed.
2023-08-24 21:24:58,026 WARN (thrift-server-pool-95594|1159880) [FrontendServiceImpl.loadTxn2PC():895] failed to commit txn 49285711: errCode = 2, detailMessage = transaction [49285711] is already committed, not pre-committed.
2023-08-24 21:24:58,029 WARN (thrift-server-pool-95574|1159784) [FrontendServiceImpl.loadTxn2PC():895] failed to commit txn 49285692: errCode = 2, detailMessage = transaction [49285692] is already committed, not pre-committed.
2023-08-24 21:24:58,030 WARN (thrift-server-pool-95593|1159879) [FrontendServiceImpl.loadTxn2PC():895] failed to commit txn 49285722: errCode = 2, detailMessage = transaction [49285722] is already visible, not pre-committed.
2023-08-24 21:25:18,171 WARN (thrift-server-pool-95475|1159276) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,174 WARN (thrift-server-pool-95548|1159683) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,174 WARN (thrift-server-pool-95572|1159782) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,174 WARN (thrift-server-pool-95566|1159776) [MasterImpl.finishTask():122] cannot find task. type: PUBLISH_VERSION, backendId: 19177, signature: 49285704
2023-08-24 21:25:18,174 WARN (thrift-server-pool-95593|1159879) [MasterImpl.finishTask():122] cannot find task. type: PUBLISH_VERSION, backendId: 10002, signature: 49285783
2023-08-24 21:25:18,174 WARN (thrift-server-pool-95537|1159671) [MasterImpl.finishTask():122] cannot find task. type: PUBLISH_VERSION, backendId: 10002, signature: 49285792
2023-08-24 21:25:18,174 WARN (thrift-server-pool-95574|1159784) [MasterImpl.finishTask():122] cannot find task. type: PUBLISH_VERSION, backendId: 10003, signature: 49285685
2023-08-24 21:25:18,174 WARN (doris-mysql-nio-pool-21504|1159845) [ConnectProcessor.processOnce():533] Null packet received from network. remote: 10.64.18.68:22908
2023-08-24 21:25:18,174 WARN (doris-mysql-nio-pool-21504|1159845) [ReadListener.lambda$handleEvent$0():58] Exception happened in one session([remote ip: 10.64.18.68]).
java.io.IOException: Error happened when receiving packet.
at org.apache.doris.qe.ConnectProcessor.processOnce(ConnectProcessor.java:534) ~[doris-fe.jar:1.0-SNAPSHOT]
at org.apache.doris.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:50) ~[doris-fe.jar:1.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
2023-08-24 21:25:18,175 WARN (thrift-server-pool-95576|1159786) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,175 WARN (thrift-server-pool-95497|1159476) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,175 WARN (thrift-server-pool-95559|1159769) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,175 WARN (thrift-server-pool-95500|1159479) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,175 WARN (thrift-server-pool-91242|1130575) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:25:18,176 WARN (thrift-server-pool-95572|1159782) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:25:18,176 WARN (thrift-server-pool-90434|1121706) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,176 WARN (thrift-server-pool-93493|1146530) [MasterImpl.finishTask():122] cannot find task. type: PUBLISH_VERSION, backendId: 19177, signature: 49285783
2023-08-24 21:25:18,176 WARN (thrift-server-pool-93416|1146005) [MasterImpl.finishTask():122] cannot find task. type: PUBLISH_VERSION, backendId: 19177, signature: 49285704
2023-08-24 21:25:18,176 WARN (thrift-server-pool-95497|1159476) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:25:18,176 WARN (thrift-server-pool-95582|1159839) [MasterImpl.finishTask():122] cannot find task. type: PUBLISH_VERSION, backendId: 19177, signature: 49285783
2023-08-24 21:25:18,177 WARN (thrift-server-pool-95534|1159668) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:25:18,177 WARN (thrift-server-pool-91097|1129145) [MasterImpl.finishTask():122] cannot find task. type: PUBLISH_VERSION, backendId: 10003, signature: 49285685
2023-08-24 21:25:18,177 WARN (thrift-server-pool-91184|1129790) [MasterImpl.finishTask():122] cannot find task. type: PUBLISH_VERSION, backendId: 10002, signature: 49285792
2023-08-24 21:25:18,177 WARN (thrift-server-pool-95447|1158984) [MasterImpl.finishTask():122] cannot find task. type: PUBLISH_VERSION, backendId: 10002, signature: 49285783
2023-08-24 21:25:18,177 WARN (thrift-server-pool-95572|1159782) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,178 WARN (thrift-server-pool-95497|1159476) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,179 WARN (thrift-server-pool-95468|1159269) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,179 WARN (thrift-server-pool-95596|1159886) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,179 WARN (thrift-server-pool-91184|1129790) [FrontendServiceImpl.loadTxnBegin():759] duplicate request for stream load. request id: 7547ca8125770199-c8c667cd4db77bb4, txn: 49285874
2023-08-24 21:25:18,179 WARN (thrift-server-pool-95447|1158984) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,181 WARN (heartbeat mgr|32) [HeartbeatMgr.runAfterCatalogReady():139] get bad heartbeat response: type: BROKER, status: BAD, msg: java.net.ConnectException: Connection refused (Connection refused), name: hdfs_broker, host: 10.64.3.138, port: 8000
2023-08-24 21:25:18,181 WARN (doris-mysql-nio-pool-21506|1159883) [ConnectProcessor.processOnce():533] Null packet received from network. remote: 10.64.17.140:46231
2023-08-24 21:25:18,181 WARN (doris-mysql-nio-pool-21506|1159883) [ReadListener.lambda$handleEvent$0():58] Exception happened in one session([remote ip: 10.64.17.140]).
java.io.IOException: Error happened when receiving packet.
at org.apache.doris.qe.ConnectProcessor.processOnce(ConnectProcessor.java:534) ~[doris-fe.jar:1.0-SNAPSHOT]
at org.apache.doris.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:50) ~[doris-fe.jar:1.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
2023-08-24 21:25:18,183 WARN (thrift-server-pool-91242|1130575) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,183 WARN (leaderCheckpointer|138) [Checkpoint.checkMemoryEnoughToDoCheckpoint():313] the memory used percent 99 exceed the checkpoint memory threshold: 70
2023-08-24 21:25:18,184 WARN (thrift-server-pool-95534|1159668) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,185 WARN (thrift-server-pool-95468|1159269) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,186 WARN (thrift-server-pool-95577|1159833) [FrontendServiceImpl.loadTxn2PC():895] failed to commit txn 49285866: errCode = 2, detailMessage = transaction [49285866] is already committed, not pre-committed.
2023-08-24 21:25:18,189 WARN (thrift-server-pool-95497|1159476) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:25:18,189 WARN (thrift-server-pool-95596|1159886) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:25:18,189 WARN (thrift-server-pool-90434|1121706) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:25:18,189 WARN (thrift-server-pool-95458|1159258) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:25:18,189 WARN (thrift-server-pool-95500|1159479) [FrontendServiceImpl.loadTxn2PC():895] failed to abort txn -1: errCode = 2, detailMessage = transaction [-1] not found
2023-08-24 21:25:18,190 WARN (thrift-server-pool-95534|1159668) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,190 WARN (thrift-server-pool-95587|1159844) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,190 WARN (thrift-server-pool-91184|1129790) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,190 WARN (thrift-server-pool-95475|1159276) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 21:25:18,199 WARN (thrift-server-pool-95572|1159782) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
2023-08-24 22:53:21,671 WARN (UNKNOWN 10.64.2.34_9011_1675406978447(-1)|1) [Catalog.notifyNewFETypeTransfer():2318] notify new FE type transfer: UNKNOWN
2023-08-24 22:53:21,694 WARN (RepNode 10.64.2.34_9011_1675406978447(-1)|65) [Catalog.notifyNewFETypeTransfer():2318] notify new FE type transfer: FOLLOWER
From the morning of the 25th, node thing-15 was also unavailable, and it could not be recovered by restarting the FE.
2023-08-25 10:14:50,112 INFO (replayer|78) [Catalog.replayJournal():2444] replayed journal id is 144013318, replay to journal id is 155973007
2023-08-25 10:14:50,119 ERROR (replayer|78) [BDBJournalCursor.<init>():84] Can not find the key:144013319, fail to get journal cursor. will exit.
Starting it again also failed.
2. Duration
Node 14: 2023-08-24 21:25 ~ 2023-08-24 22:53
Node 15: from 08:00 to 11:00 on the morning of the 25th
3. Root Cause
The earliest logs show:
2023-08-24 21:25:18,183 WARN (leaderCheckpointer|138) [Checkpoint.checkMemoryEnoughToDoCheckpoint():313] the memory used percent 99 exceed the checkpoint memory threshold: 70
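This entry shows that memory was full. Looking further up the log, there is a message reporting that a table could not be found: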
2023-08-24 21:24:57,978 WARN (thrift-server-pool-95475|1159276) [FrontendServiceImpl.loadTxnBegin():766] failed to begin: errCode = 7, detailMessage = unknown table, tableName=t_device_info_wide
So the FE ran out of memory, and that caused the crash.
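To check this reading against a full warning log rather than a few quoted lines, the two messages can be tallied with a short script. This is only a sketch, assuming the log format quoted in this report; the function name `summarize_fe_warnings` is made up for illustration and is not part of Doris.

```python
import re
from collections import Counter

# Patterns copied from the log lines quoted in this report.
MEM_RE = re.compile(
    r"the memory used percent (\d+) exceed the checkpoint memory threshold: (\d+)"
)
UNKNOWN_TABLE_RE = re.compile(r"unknown table, tableName=(\S+)")


def summarize_fe_warnings(lines):
    """Return (peak memory percent or None, Counter of unknown-table names)."""
    peak = None
    unknown = Counter()
    for line in lines:
        mem = MEM_RE.search(line)
        if mem:
            pct = int(mem.group(1))
            peak = pct if peak is None else max(peak, pct)
        table = UNKNOWN_TABLE_RE.search(line)
        if table:
            unknown[table.group(1)] += 1
    return peak, unknown


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python summarize.py /path/to/fe.warn.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        peak, unknown = summarize_fe_warnings(fh)
    print("peak memory percent:", peak)
    for name, count in unknown.most_common():
        print(name, count)
```

A high count for a single table next to a memory percentage near 100 would be consistent with the conclusion above, although it does not prove causation by itself.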
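The crash of machine 15 the next morning was also a memory problem: the job that writes to table t_device_info_wide had not yet been taken offline, so memory kept growing.
However, restarting the FE on machine 15 reported a metadata error. This means that the metadata on node 15 was corrupted and the FE could not start.
4. Resolution
The underlying cause above is that table t_device_info_wide could not be found, which led to memory exhaustion. First remove the job that writes to table t_device_info_wide, then raise the FE node memory from 8G to 16G (the memory setting is configured in the fe.conf configuration file). Monitoring also showed high memory usage, so the increase was needed in any case.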
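The memory change can be scripted so that it is applied the same way on every FE node. The sketch below assumes the heap is set with a `-Xmx` flag on the `JAVA_OPTS` lines of fe.conf; the helper name `set_fe_heap` and the sample file contents are illustrative only, so check them against the real fe.conf before use.

```python
import re


def set_fe_heap(conf_text, heap_mb):
    """Return conf_text with -Xmx on active JAVA_OPTS lines set to heap_mb MB."""

    def fix_line(line):
        stripped = line.lstrip()
        # Leave comments and unrelated settings untouched.
        if stripped.startswith("#") or not stripped.startswith("JAVA_OPTS"):
            return line
        return re.sub(r"-Xmx\d+[kKmMgG]?", f"-Xmx{heap_mb}m", line)

    return "\n".join(fix_line(line) for line in conf_text.split("\n"))


if __name__ == "__main__":
    # Sample contents are an assumption, not copied from the cluster.
    sample = 'JAVA_OPTS="-Xmx8192m -XX:+UseMembar"\nhttp_port = 8030'
    print(set_fe_heap(sample, 16 * 1024))  # 8G -> 16G
```

The FE has to be restarted for a new heap size to take effect, so the change is best rolled out one node at a time.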

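Because the metadata on node 15 was corrupted, the only option was to stop the FE process manually and delete the metadata by hand (it can be backed up to another directory first).
Log in with a client and run the command that removes the node-15 FE.
Remove the FE node
Use the following command to remove the corresponding FE node: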
ALTER SYSTEM DROP FOLLOWER[OBSERVER] "fe_host:edit_log_port";
Manually sync the metadata from the master and start node 15
For the first start, run the following command:
./bin/start_fe.sh --helper leader_fe_host:edit_log_port --daemon
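Here leader_fe_host is the IP of the node where the Master runs, and edit_log_port is set in the Master's configuration file fe.conf. The --helper parameter is only needed the first time a follower or observer starts.
After it starts successfully, add the new FE back in.
Add the Follower or Observer to the cluster
Add the Follower or Observer. Use mysql-client to connect to an FE that is already running, and execute: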
ALTER SYSTEM ADD FOLLOWER "follower_host:edit_log_port";
or
ALTER SYSTEM ADD OBSERVER "observer_host:edit_log_port";
With that, the problem was resolved.
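For a runbook, the recovery steps above can be written out in order for a specific node. The following sketch only formats the commands quoted in this report; the function name `rejoin_follower_steps` is hypothetical, the hosts and port in the usage example are placeholders, and it assumes the node and the leader use the same edit_log_port.

```python
def rejoin_follower_steps(node_host, leader_host, edit_log_port):
    """List the recovery steps from this report, in order, for one follower FE."""
    node = f"{node_host}:{edit_log_port}"
    leader = f"{leader_host}:{edit_log_port}"
    return [
        # Manual step; the metadata directory location depends on the deployment.
        ("node shell", "stop the FE process, then back up and clear its metadata"),
        ("mysql client", f'ALTER SYSTEM DROP FOLLOWER "{node}";'),
        ("node shell", f"./bin/start_fe.sh --helper {leader} --daemon"),
        ("mysql client", f'ALTER SYSTEM ADD FOLLOWER "{node}";'),
    ]


if __name__ == "__main__":
    # Placeholder hosts and port, not the real cluster addresses.
    for where, action in rejoin_follower_steps("fe-15.example", "fe-master.example", 9010):
        print(f"[{where}] {action}")
```

Printing the steps rather than executing them keeps a person in the loop for the destructive part, which is deleting the metadata.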
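5. Summary
The FE has a daemon process to keep it running, but it did not work: after the crash the FE was not brought back up. There was no liveness monitoring for the process, and no monitoring or alerting on the process's memory usage.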
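The missing liveness check could start as something as small as a TCP probe run from a scheduler. This is a rough sketch, assuming that a successful TCP connection to an FE port (for example 9030) is an acceptable liveness signal; the function name `fe_port_alive` is hypothetical, and the probe does not cover memory usage, which needs a separate alert.

```python
import socket


def fe_port_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Node names are taken from this report; the port is an assumption.
    for node in ("thing-14", "thing-15"):
        state = "up" if fe_port_alive(node, 9030) else "DOWN"
        print(node, state)
```

A port that accepts connections does not guarantee that the FE is healthy, so this is a floor for monitoring, not a replacement for metrics-based alerts.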
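Although the FE is deployed as a high-availability setup, every business system connects to a single FE, so there is still a single point of failure.
Solution: connect to multiple nodes, or put a load balancer (LB) in front of the FE nodes (ports such as 8030 and 9030). When a single node has a problem, traffic then switches automatically to another node, and the load is also balanced, so each node is under less pressure.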
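Where a load balancer is not yet available, a client can approximate the same behaviour by trying a list of FE endpoints in order. The sketch below illustrates that idea under stated assumptions: the function name `pick_fe` is hypothetical, reachability is judged by a plain TCP connect, and the endpoint list in the usage example is illustrative.

```python
import socket


def pick_fe(endpoints, timeout=2.0, probe=None):
    """Return the first (host, port) in endpoints that passes the probe."""

    def tcp_probe(host, port):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    check = probe or tcp_probe
    for host, port in endpoints:
        if check(host, port):
            return host, port
    raise ConnectionError("no FE endpoint reachable")


if __name__ == "__main__":
    # Node names are from this report; the port is an assumption.
    candidates = [("thing-14", 9030), ("thing-15", 9030)]
    try:
        print("connect to", pick_fe(candidates))
    except ConnectionError as exc:
        print(exc)
```

Trying endpoints in a fixed order gives failover but not load spreading; shuffling the list per client, or using a real LB as proposed above, covers the second goal.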
6. References
https://doris.apache.org/zh-CN/docs/1.2/admin-manual/cluster-management/elastic-expansion/