gs_sdr
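Background
Vastbase provides the gs_sdr tool for cross-Region remote disaster recovery without any additional storage media. The tool supports streaming disaster recovery setup, disaster recovery failover (promoting the disaster recovery cluster to primary), planned primary/standby switchover, disaster recovery removal, and disaster recovery status monitoring, and it can display help and version information.
Prerequisites
The gs_sdr command must be run as the operating system user that installed Vastbase.
Syntax
Disaster recovery setup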
gs_sdr -t start -m [primary|disaster_standby] [-U DR_USERNAME] [-W DR_PASSWORD] [-X XMLFILE] [--json JSONFILE] [--time-out=SECS] [-l LOGFILE]
Disaster recovery failover
gs_sdr -t failover [-l LOGFILE]
Planned primary/standby switchover
gs_sdr -t switchover -m [primary|disaster_standby] [--time-out=SECS] [-l LOGFILE]
Disaster recovery removal
gs_sdr -t stop [-X XMLFILE] [--json JSONFILE] [-l LOGFILE]
Disaster recovery status monitoring
gs_sdr -t query [-l LOGFILE]
Parameter description
The gs_sdr parameters fall into the following categories:
Common parameters:
-t
Type of the gs_sdr command.
Value range: start, failover, switchover, stop, query.
-l
Specifies the log file and the path where it is stored.
Default value: $GAUSSLOG/om/gs_sdr-YYYY-MM-DD_hhmmss.log
-?, --help
Displays help information.
-V, --version
Displays version information.
Disaster recovery setup parameters:
-m
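The role this cluster is expected to take in the disaster recovery relationship.
Value range: primary (primary cluster) or disaster_standby (disaster recovery cluster).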
-U
Name of the disaster recovery user, which must have the streaming replication permission.
-W
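Password of the disaster recovery user.
- Before setting up the disaster recovery relationship, create a disaster recovery user on the primary cluster for disaster recovery authentication. The primary and standby clusters must use the same disaster recovery username and password. Once the relationship has been set up, the password of this user cannot be changed. To change the disaster recovery username or password, remove the disaster recovery relationship and set it up again with a new disaster recovery user. The password must not contain any of the following characters: | ; & $ < > ` \ ' " { } ( ) [ ] ~ * ? ! as well as the newline character (\n) and whitespace.
- If -U and -W are not given on the setup command line, they can be entered interactively during setup.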
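The character restriction above can be checked before a password is passed to gs_sdr. The following is a minimal sketch, not part of the tool: the name find_forbidden_chars is hypothetical, and the character set is an assumption transcribed from the note above (including the reading of the backslash as a forbidden character), not taken from the gs_sdr implementation.

```python
from typing import List

# Assumption: this set is transcribed from the note in this section,
# not from the gs_sdr source code.
FORBIDDEN_CHARS = set("|;&$<>`\\'\"{}()[]~*?!")


def find_forbidden_chars(password: str) -> List[str]:
    """Return the forbidden characters found in password, in order of first appearance."""
    found: List[str] = []
    for ch in password:
        # str.isspace() covers the newline character and other whitespace.
        if (ch in FORBIDDEN_CHARS or ch.isspace()) and ch not in found:
            found.append(ch)
    return found


if __name__ == "__main__":
    for candidate in ("Admin@123", "Bad pass$1"):
        bad = find_forbidden_chars(candidate)
        print(candidate, "->", "ok" if not bad else f"rejected: {bad}")
```

An empty list means the password passes this local check; gs_sdr itself remains the authority on whether a password is accepted.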
-X
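The XML file used when the cluster was installed. Disaster recovery information for the setup can also be configured in this file by extending the installation XML with three fields: "localStreamIpmap1", "remoteStreamIpmap1", and "remotedataPortBase".
A configuration example of the new fields follows. Each line is explained by a comment.
<!-- Node deployment information on each server -->
<DEVICELIST>
  <DEVICE sn="pekpomdev00038">
    <!-- Number of primary DNs to deploy on the current host -->
    <PARAM name="dataNum" value="1"/>
    <!-- Base port number of the primary DN -->
    <PARAM name="dataPortBase" value="26000"/>
    <!-- Mapping between the SSH trusted-channel IP and the streaming replication IP of each node in the DN shard of the local cluster -->
    <PARAM name="localStreamIpmap1" value="(10.244.44.216,172.31.12.58),(10.244.45.120,172.31.0.91)"/>
    <!-- Mapping between the SSH trusted-channel IP and the streaming replication IP of each node in the DN shard of the peer cluster -->
    <PARAM name="remoteStreamIpmap1" value="(10.244.45.144,172.31.2.200),(10.244.45.40,172.31.0.38),(10.244.46.138,172.31.11.145),(10.244.48.60,172.31.9.37),(10.244.47.240,172.31.11.125)"/>
    <!-- Port number of the primary DN of the peer cluster -->
    <PARAM name="remotedataPortBase" value="26000"/>
  </DEVICE>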
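The value of "localStreamIpmap1" and "remoteStreamIpmap1" is a comma-separated list of (SSH IP,streaming replication IP) pairs. As an illustration only, the sketch below renders such a value from a list of pairs; the helper name format_stream_ipmap is hypothetical, and the address validation is an added convenience that gs_sdr is not stated to require.

```python
import ipaddress


def format_stream_ipmap(pairs):
    """Render [(ssh_ip, stream_ip), ...] as "(ssh_ip,stream_ip),(ssh_ip,stream_ip)"."""
    parts = []
    for ssh_ip, stream_ip in pairs:
        # ip_address() raises ValueError if an address is malformed.
        ipaddress.ip_address(ssh_ip)
        ipaddress.ip_address(stream_ip)
        parts.append(f"({ssh_ip},{stream_ip})")
    return ",".join(parts)


if __name__ == "__main__":
    local_nodes = [
        ("10.244.44.216", "172.31.12.58"),
        ("10.244.45.120", "172.31.0.91"),
    ]
    print(format_stream_ipmap(local_nodes))
```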
--json
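A JSON file that contains the disaster recovery information.
A configuration example of the JSON file follows. Text after # is explanatory.
{
    "remoteClusterConf": {
        "port": 26000,
        "shards": [
            [
                {"ip": "10.244.45.144", "dataIp": "172.31.2.200"},
                {"ip": "10.244.45.40", "dataIp": "172.31.0.38"},
                {"ip": "10.244.46.138", "dataIp": "172.31.11.145"},
                {"ip": "10.244.48.60", "dataIp": "172.31.9.37"},
                {"ip": "10.244.47.240", "dataIp": "172.31.11.125"}
            ]
        ]
    },
    "localClusterConf": {
        "port": 26000,
        "shards": [
            [
                {"ip": "10.244.44.216", "dataIp": "172.31.12.58"},
                {"ip": "10.244.45.120", "dataIp": "172.31.0.91"}
            ]
        ]
    }
}
Parameter description:
# remoteClusterConf: DN shard information of the peer cluster. port is the port of the primary DN of the peer cluster, and {"ip": "10.244.45.144", "dataIp": "172.31.2.200"} is the mapping between the SSH trusted-channel IP and the streaming replication IP of a node in the DN shard of the peer cluster.
# localClusterConf: DN shard information of the local cluster. port is the port of the primary DN of the local cluster, and {"ip": "10.244.44.216", "dataIp": "172.31.12.58"} is the mapping between the SSH trusted-channel IP and the streaming replication IP of a node in the DN shard of the local cluster.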
Use either -X or --json to supply the disaster recovery information. If both parameters are given on the command line, the JSON file takes precedence.
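A file in the --json layout can be generated rather than written by hand. The sketch below is one way to do so, assuming only the keys shown in the example in this section ("remoteClusterConf", "localClusterConf", "port", "shards", "ip", "dataIp"); the function names build_cluster_conf and build_dr_json are hypothetical.

```python
import json


def build_cluster_conf(port, shards):
    """Build one cluster entry.

    shards is a list of shards; each shard is a list of
    (ssh_ip, stream_ip) tuples, one tuple per node.
    """
    return {
        "port": port,
        "shards": [
            [{"ip": ssh_ip, "dataIp": stream_ip} for ssh_ip, stream_ip in shard]
            for shard in shards
        ],
    }


def build_dr_json(local_port, local_shards, remote_port, remote_shards):
    """Return the JSON text describing the local and peer clusters."""
    conf = {
        "remoteClusterConf": build_cluster_conf(remote_port, remote_shards),
        "localClusterConf": build_cluster_conf(local_port, local_shards),
    }
    return json.dumps(conf, indent=4)


if __name__ == "__main__":
    text = build_dr_json(
        26000, [[("10.244.44.216", "172.31.12.58"), ("10.244.45.120", "172.31.0.91")]],
        26000, [[("10.244.45.144", "172.31.2.200"), ("10.244.45.40", "172.31.0.38")]],
    )
    print(text)
```

The printed text can be saved to a file and passed with --json. Using json.dumps avoids key typos such as a misspelled "dataIp".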
--time-out=SECS
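Specifies the timeout, that is, how long the primary cluster waits for the standby cluster to connect. If the timeout is exceeded, the operation is treated as failed and the OM script exits automatically. Unit: s.
Value range: a positive integer. The recommended value is 1200.
Default value: 1200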
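When gs_sdr is driven from a script, the setup parameters above can be validated before the command is launched. The sketch below only assembles an argument list from the options documented in this section; build_start_command is a hypothetical helper, and it deliberately omits -W so that the password is entered interactively instead of appearing on the command line.

```python
VALID_ROLES = ("primary", "disaster_standby")


def build_start_command(role, xml_file=None, json_file=None, dr_user=None, timeout=1200):
    """Return the argument list for "gs_sdr -t start" after basic checks."""
    if role not in VALID_ROLES:
        raise ValueError(f"-m must be one of {VALID_ROLES}, got {role!r}")
    # --time-out must be a positive integer (bool is excluded explicitly).
    if isinstance(timeout, bool) or not isinstance(timeout, int) or timeout <= 0:
        raise ValueError("--time-out must be a positive integer")

    cmd = ["gs_sdr", "-t", "start", "-m", role]
    if dr_user:
        cmd += ["-U", dr_user]
    if xml_file:
        cmd += ["-X", xml_file]
    if json_file:
        cmd += ["--json", json_file]
    cmd.append(f"--time-out={timeout}")
    return cmd


if __name__ == "__main__":
    print(build_start_command("primary", xml_file="/opt/install_streaming_primary_cluster.xml",
                              dr_user="hadr_user"))
```

The returned list can be handed to subprocess.run by the installation user; passing a list rather than a shell string avoids quoting problems in the arguments.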
Disaster recovery failover parameters:
None
Disaster recovery removal parameters:
-X
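The XML file used when the cluster was installed. It must additionally contain the disaster recovery information, that is, the three extended fields "localStreamIpmap1", "remoteStreamIpmap1", and "remotedataPortBase".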
--json
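A JSON file that contains the disaster recovery information of the local and peer clusters.
For how to configure -X and --json, see the disaster recovery setup parameters in this section.
Disaster recovery query parameters:
None
The disaster recovery status query result is described as follows:
Examples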
Set up the disaster recovery relationship on the primary cluster:
gs_sdr -t start -m primary -X /opt/install_streaming_primary_cluster.xml --time-out=1200 -U 'hadr_user' -W 'Admin@123'
--------------------------------------------------------------------------------
Streaming disaster recovery start 2b9bc268d8a111ecb679fa163e2f2d28
--------------------------------------------------------------------------------
Start create streaming disaster relationship ...
Got step:[-1] for action:[start].
Start first step of streaming start.
Start common config step of streaming start.
Start generate hadr key files.
Streaming key files already exist.
Finished generate and distribute hadr key files.
Start encrypt hadr user info.
Successfully encrypt hadr user info.
Start save hadr user info into database.
Successfully save hadr user info into database.
Start update pg_hba config.
Successfully update pg_hba config.
Start second step of streaming start.
Successfully check cluster status is: Normal
Successfully check instance status.
Successfully check cm_ctl is available.
Successfully check cluster is not under upgrade opts.
Start checking disaster recovery user.
Successfully check disaster recovery user.
Start prepare secure files.
Start copy hadr user key files.
Successfully copy secure files.
Start fourth step of streaming start.
Starting reload wal_keep_segments value: 16384.
Successfully reload wal_keep_segments value: 16384.
Start fifth step of streaming start.
Successfully set [/omm/CMServer/backup_open][0].
Start sixth step of streaming start.
Start seventh step of streaming start.
Start eighth step of streaming start.
Waiting main standby connection..
Main standby already connected.
Successfully check cluster status is: Normal
Start ninth step of streaming start.
Starting reload wal_keep_segments value: {'6001': '128'}.
Successfully reload wal_keep_segments value: {'6001': '128'}.
Successfully removed step file.
Successfully do streaming disaster recovery start.
Set up the disaster recovery relationship on the standby cluster:
gs_sdr -t start -m disaster_standby -X /opt/install_streaming_standby_cluster.xml --time-out=1200 -U 'hadr_user' -W 'Admin@123'
--------------------------------------------------------------------------------
Streaming disaster recovery start e34ec1e4d8a111ecb617fa163e77e94a
--------------------------------------------------------------------------------
Start create streaming disaster relationship ...
Got step:[-1] for action:[start].
Start first step of streaming start.
Start common config step of streaming start.
Start update pg_hba config.
Successfully update pg_hba config.
Start second step of streaming start.
Successfully check cluster status is: Normal
Successfully check instance status.
Successfully check cm_ctl is available.
Successfully check cluster is not under upgrade opts.
Start build key files from remote cluster.
Start copy hadr user key files.
Successfully build and distribute key files to all nodes.
Start fourth step of streaming start.
Start fifth step of streaming start.
Successfully set [/omm/CMServer/backup_open][2].
Stopping the cluster by node.
Successfully stopped the cluster by node for streaming cluster.
Start sixth step of streaming start.
Start seventh step of streaming start.
Start eighth step of streaming start.
Starting the cluster.
Successfully started primary instance. Please wait for standby instances.
Waiting cluster normal...
Successfully started standby instances.
Successfully check cluster status is: Normal
Start ninth step of streaming start.
Successfully removed step file.
Successfully do streaming disaster recovery start.
Planned demotion of the primary cluster to standby:
gs_sdr -t switchover -m disaster_standby
--------------------------------------------------------------------------------
Streaming disaster recovery switchover 6897d15ed8a411ec82acfa163e2f2d28
--------------------------------------------------------------------------------
Start streaming disaster switchover ...
Streaming disaster cluster switchover...
Successfully check cluster status is: Normal
Parse cluster conf from file.
Successfully parse cluster conf from file.
Successfully check cluster is not under upgrade opts.
Got step:[-1] for action:[switchover].
Stopping the cluster.
Successfully stopped the cluster.
Starting the cluster.
Successfully started primary instance. Please wait for standby instances.
Waiting cluster normal...
Successfully started standby instances.
Start checking truncation, please wait...
Stopping the cluster.
Successfully stopped the cluster.
Starting the cluster.
Successfully started primary instance. Please wait for standby instances.
Waiting cluster normal...
Successfully started standby instances.
.
The cluster status is Normal.
Successfully removed step file.
Successfully do streaming disaster recovery switchover.
Planned promotion of the standby cluster to primary:
gs_sdr -t switchover -m primary
--------------------------------------------------------------------------------
Streaming disaster recovery switchover 20542bbcd8a511ecbbdbfa163e77e94a
--------------------------------------------------------------------------------
Start streaming disaster switchover ...
Streaming disaster cluster switchover...
Waiting for cluster and instances normal...
Successfully check cluster status is: Normal
Parse cluster conf from file.
Successfully parse cluster conf from file.
Successfully check cluster is not under upgrade opts.
Waiting for switchover barrier...
Got step:[-1] for action:[switchover].
Stopping the cluster by node.
Successfully stopped the cluster by node for streaming cluster.
Starting the cluster.
Successfully started primary instance. Please wait for standby instances.
Waiting cluster normal...
Successfully started standby instances.
Successfully check cluster status is: Normal
Successfully removed step file.
Successfully do streaming disaster recovery switchover.
Failover: promote the disaster recovery cluster to primary:
gs_sdr -t failover
--------------------------------------------------------------------------------
Streaming disaster recovery failover 65535214d8a611ecb804fa163e2f2d28
--------------------------------------------------------------------------------
Start streaming disaster failover ...
Got step:[-1] for action:[failover].
Successfully check cluster status is: Normal
Successfully check cluster is not under upgrade opts.
Parse cluster conf from file.
Successfully parse cluster conf from file.
Got step:[-1] for action:[failover].
Starting drop all node replication slots
Finished drop all node replication slots
Stopping the cluster by node.
Successfully stopped the cluster by node for streaming cluster.
Start remove replconninfo for instance:6001
Start remove replconninfo for instance:6002
Start remove replconninfo for instance:6003
Start remove replconninfo for instance:6005
Start remove replconninfo for instance:6004
Successfully removed replconninfo for instance:6001
Successfully removed replconninfo for instance:6004
Successfully removed replconninfo for instance:6003
Successfully removed replconninfo for instance:6002
Successfully removed replconninfo for instance:6005
Start remove pg_hba config.
Finished remove pg_hba config.
Starting the cluster.
Successfully started primary instance. Please wait for standby instances.
Waiting cluster normal...
Successfully started standby instances.
Successfully check cluster status is: Normal
Try to clean hadr user info.
Successfully clean hadr user info from database.
Successfully removed step file.
Successfully do streaming disaster recovery failover.
Remove the disaster recovery relationship on the primary cluster:
gs_sdr -t stop -X /opt/install_streaming_standby_cluster.xml
--------------------------------------------------------------------------------
Streaming disaster recovery stop dae8539ed8a611ecade9fa163e77e94a
--------------------------------------------------------------------------------
Start remove streaming disaster relationship ...
Got step:[-1] for action:[stop].
Start first step of streaming stop.
Start second step of streaming start.
Successfully check cluster status is: Normal
Check cluster type succeed.
Successfully check cluster is not under upgrade opts.
Start third step of streaming stop.
Start remove replconninfo for instance:6001
Start remove replconninfo for instance:6002
Successfully removed replconninfo for instance:6001
Successfully removed replconninfo for instance:6002
Start remove cluster file.
Finished remove cluster file.
Start fourth step of streaming stop.
Start remove pg_hba config.
Finished remove pg_hba config.
Start fifth step of streaming start.
Starting drop all node replication slots
Finished drop all node replication slots
Start sixth step of streaming stop.
Successfully check cluster status is: Normal
Try to clean hadr user info.
Successfully clean hadr user info from database.
Successfully removed step file.
Successfully do streaming disaster recovery stop.
Query the disaster recovery status:
gs_sdr -t query
--------------------------------------------------------------------------------
Streaming disaster recovery query 1201b062d8a411eca83efa163e2f2d28
--------------------------------------------------------------------------------
Start streaming disaster query ...
Successfully check cluster is not under upgrade opts.
Start check archive.
Start check recovery.
Start check RPO & RTO.
Successfully execute streaming disaster recovery query, result:
{'hadr_cluster_stat': 'archive', 'hadr_failover_stat': '', 'hadr_switchover_stat': '', 'RPO': '0', 'RTO': '0'}
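For monitoring scripts, the query result can be extracted from the captured output. The sketch below assumes the output has the shape shown in the example above: the text "result:" followed by a dictionary written with single quotes, as the last item in the output. The function name parse_query_result is hypothetical, and the format is inferred from this example only, not from a documented interface.

```python
import ast


def parse_query_result(output):
    """Extract the result dictionary from captured "gs_sdr -t query" output."""
    marker = "result:"
    idx = output.rfind(marker)
    if idx == -1:
        raise ValueError("no 'result:' marker found in output")
    # The dictionary uses single quotes, so it is not valid JSON;
    # ast.literal_eval parses it safely as a Python literal.
    return ast.literal_eval(output[idx + len(marker):].strip())


if __name__ == "__main__":
    sample = (
        "Start check RPO & RTO.\n"
        "Successfully execute streaming disaster recovery query, result:\n"
        "{'hadr_cluster_stat': 'archive', 'hadr_failover_stat': '', "
        "'hadr_switchover_stat': '', 'RPO': '0', 'RTO': '0'}\n"
    )
    status = parse_query_result(sample)
    print(status["hadr_cluster_stat"], status["RPO"], status["RTO"])
```

RPO and RTO are returned as strings, as in the sample output; convert them explicitly if numeric comparison is needed.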