gs_ddr

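Background

gs_ddr is the disaster recovery tool for resource-pooling dual clusters. In a resource-pooling intra-city dual-center deployment, it automates establishing the relationship between the primary cluster and the disaster recovery cluster, switching over between the two clusters, and forcibly promoting the disaster recovery cluster to primary.

Prerequisites

- The primary cluster is installed and deployed through OM: after preinstallation with gs_preinstall, the dorado-cluster-mode parameter is set to primary in the gs_install stage.
- The disaster recovery cluster is installed and deployed through OM: after preinstallation with gs_preinstall, the dorado-cluster-mode parameter is set to standby in the gs_install stage.
- When the relationship between the two clusters is established for the first time, the replication relationship of the LUN that backs the synchronized shared-disk XLOG volume configured for the primary and disaster recovery clusters must be in the split state, and secondary resource protection must have been cancelled, so that both the primary and standby clusters can read and write the LUN. Follow the prompts during setup.
- Run the gs_ddr command as the cluster installation user.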

Syntax

Establish the resource-pooling dual-cluster relationship

gs_ddr -t start -m [primary|disaster_standby] [-X XMLFILE] [--json JSONFILE] [--time-out=SECS] [-l LOGFILE]

Promote the resource-pooling disaster recovery cluster to primary (failover)

gs_ddr -t failover [-r | --restart] [-l LOGFILE]

Switch over the resource-pooling dual-cluster relationship automatically

gs_ddr -t switchover -m [primary|disaster_standby] [-r | --restart] [--time-out=SECS] [-l LOGFILE]

Remove the relationship between the resource-pooling primary cluster and disaster recovery cluster
```
gs_ddr -t stop [-X XMLFILE] [--json JSONFILE]
```
Query the resource-pooling dual-cluster status
```
gs_ddr -t query [-l LOGFILE]
```
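
The syntax lines above can be restated as a small validation table. The following is a minimal sketch, not part of gs_ddr itself: `build_gs_ddr_argv` and its keyword names are hypothetical, `-m` is assumed to be mandatory for `start` and `switchover`, and `-l` is assumed to be accepted by every action because it is documented as a common parameter.

```python
from typing import List, Optional

# Options accepted per action, transcribed from the syntax lines above.
# Assumption: -l is allowed everywhere because it is a common parameter.
ACTIONS = {
    "start": {"mode", "xml", "json", "timeout", "log"},
    "failover": {"restart", "log"},
    "switchover": {"mode", "restart", "timeout", "log"},
    "stop": {"xml", "json", "log"},
    "query": {"log"},
}
MODES = ("primary", "disaster_standby")


def build_gs_ddr_argv(action: str, mode: Optional[str] = None,
                      xml: Optional[str] = None,
                      json_file: Optional[str] = None,
                      timeout: Optional[int] = None,
                      log: Optional[str] = None,
                      restart: bool = False) -> List[str]:
    """Hypothetical helper: build an argv list for gs_ddr and reject
    options that the syntax does not list for the given action."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    given = {
        "mode": mode is not None, "xml": xml is not None,
        "json": json_file is not None, "timeout": timeout is not None,
        "log": log is not None, "restart": restart,
    }
    extra = sorted(name for name, used in given.items()
                   if used and name not in ACTIONS[action])
    if extra:
        raise ValueError(f"{action} does not accept: {', '.join(extra)}")
    # Assumption: start and switchover always need an explicit role.
    if action in ("start", "switchover") and mode not in MODES:
        raise ValueError("-m must be primary or disaster_standby")
    if timeout is not None and timeout <= 0:
        raise ValueError("--time-out must be a positive integer")

    argv = ["gs_ddr", "-t", action]
    if mode is not None:
        argv += ["-m", mode]
    if xml is not None:
        argv += ["-X", xml]
    if json_file is not None:
        argv += ["--json", json_file]
    if timeout is not None:
        argv.append(f"--time-out={timeout}")
    if restart:
        argv.append("-r")
    if log is not None:
        argv += ["-l", log]
    return argv


print(" ".join(build_gs_ddr_argv("switchover", mode="primary", restart=True)))
```

Returning a list rather than a single string keeps the result usable with `subprocess.run` without shell quoting.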

Parameters

gs_ddr parameters fall into the following categories:

Common parameters

-t

The type of gs_ddr command.

Value range: start, failover, switchover, stop, query.
-r | --restart

Restarts the cluster during a switchover or failover.
-l

Specifies the log file and its path.

Default value: $GAUSSLOG/om/gs_ddr-YYYY-MM-DD_hhmmss.log
-?, --help

Displays help information.
-V, --version

Displays version information.

Disaster recovery setup parameters

-m

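The role that this cluster is expected to take in the disaster recovery relationship.

Value range: primary (primary cluster) or disaster_standby (disaster recovery cluster).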
-X

The XML configuration file that was used when the cluster was installed.
--json

A JSON file that contains the disaster recovery information.

An example JSON configuration:
```
{"remoteClusterConf": {"port": 26000, "shards": [[{"ip": "10.244.45.144", "dataIp": "172.31.2.200"}, {"ip": "10.244.45.40", "dataIp": "172.31.0.38"}, {"ip": "10.244.46.138", "dataIp": "172.31.11.145"}, {"ip": "10.244.48.60", "dataIp": "172.31.9.37"}, {"ip": "10.244.47.240", "dataIp": "172.31.11.125"}]]}, "localClusterConf": {"port": 26000, "shards": [[{"ip": "10.244.44.216", "dataIp": "172.31.12.58"}, {"ip": "10.244.45.120", "dataIp": "172.31.0.91"}]]}}
```
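Parameter description:
- remoteClusterConf: DN shard information of the peer cluster. port is the port of the peer cluster's primary DN. Each entry such as {"ip": "10.244.45.144", "dataIp": "172.31.2.200"} maps, for one node in the peer cluster's DN shard, the IP used for the SSH trusted channel to the IP used for streaming replication.
- localClusterConf: DN shard information of the local cluster. port is the port of the local cluster's primary DN. Each entry such as {"ip": "10.244.44.216", "dataIp": "172.31.12.58"} maps, for one node in the local cluster's DN shard, the IP used for the SSH trusted channel to the IP used for streaming replication.

Either -X or --json can be used to supply the disaster recovery information. If both parameters are given on the command line, the JSON file takes precedence.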
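
The JSON file for --json can be generated rather than typed by hand. The following is a minimal sketch that assumes a single DN shard per cluster, as in the example above; `cluster_conf` and `ddr_json` are hypothetical helper names, and only the first two remote nodes from the example are reused for brevity.

```python
import json
from typing import Dict, List, Tuple

# (IP for the SSH trusted channel, IP for streaming replication)
Node = Tuple[str, str]


def cluster_conf(port: int, nodes: List[Node]) -> Dict:
    """Build one remoteClusterConf/localClusterConf value with one DN shard."""
    shard = [{"ip": ip, "dataIp": data_ip} for ip, data_ip in nodes]
    return {"port": port, "shards": [shard]}


def ddr_json(local: Dict, remote: Dict) -> str:
    """Serialize the two cluster descriptions in the layout shown above."""
    return json.dumps({"remoteClusterConf": remote, "localClusterConf": local})


local = cluster_conf(26000, [("10.244.44.216", "172.31.12.58"),
                             ("10.244.45.120", "172.31.0.91")])
remote = cluster_conf(26000, [("10.244.45.144", "172.31.2.200"),
                              ("10.244.45.40", "172.31.0.38")])
print(ddr_json(local, remote))
```

Note that remoteClusterConf and localClusterConf are relative to the cluster on which the file is used, so the file for the peer cluster swaps the two arguments.
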
--time-out=SECS

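Specifies the timeout, in seconds, that the primary cluster waits for the standby cluster to connect. If the timeout expires, the operation is judged to have failed and the OM script exits automatically.

Value range: a positive integer. Recommended value: 1200.

Default value: 1200

Note that building the cluster and starting the cluster each have their own timeout. For the build, the default timeout is 1209600 seconds (14 days); if the build does not finish within that time, it exits automatically. For the start, the default timeout is 604800 seconds (3600 × 24 × 7, one week); if the start does not finish within one week, it exits automatically. If --time-out=SECS is not specified, neither the build nor the start exits automatically after 1200 seconds.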
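
The two default timeouts in the preceding note are whole numbers of days expressed in seconds, with one week being 3600 × 24 × 7 seconds. The arithmetic can be checked with a short sketch; the variable names are illustrative only.

```python
SECONDS_PER_DAY = 3600 * 24

build_timeout = 14 * SECONDS_PER_DAY  # default for the build: 14 days
start_timeout = 7 * SECONDS_PER_DAY   # default for the start: one week

print(build_timeout, start_timeout)  # 1209600 604800
```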

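Failover (promotion) parameters

None.

Switchover parameters for the primary cluster and the disaster recovery cluster

On the primary cluster, specify -m disaster_standby. On the disaster recovery cluster, specify -m primary.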
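
In a switchover, the -m value names the role a cluster is about to take, not the role it currently has. A minimal sketch of that mapping follows; `switchover_target` is a hypothetical helper, not part of gs_ddr.

```python
def switchover_target(current_role: str) -> str:
    """Return the -m value to pass on a cluster that currently has the given role."""
    targets = {"primary": "disaster_standby", "disaster_standby": "primary"}
    if current_role not in targets:
        raise ValueError(f"unknown role: {current_role}")
    return targets[current_role]


for role in ("primary", "disaster_standby"):
    print(f"on the {role} cluster: gs_ddr -t switchover -m {switchover_target(role)}")
```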

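Disaster recovery removal parameters

-X

The XML configuration file that was used when the cluster was installed.
--json

A JSON file that contains the disaster recovery information of the local and peer clusters.

For how to configure -X and --json, see the disaster recovery setup parameters in this section.

Disaster recovery query parameters

None.

The disaster recovery status query results are described below: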

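| Item | Meaning | Value | Description | Remarks |
| --- | --- | --- | --- | --- |
| ddr_cluster_stat | Status of the database instance in resource-pooling disaster recovery | normal | The database instance is not participating in resource-pooling disaster recovery | - |
| | | full_backup | Full data replication of the primary database instance is in progress | Only the primary database instance has this status |
| | | archive | Streaming log replication of the primary database instance is in progress | Only the primary database instance has this status |
| | | backup_fail | Full data replication of the primary database instance failed | Only the primary database instance has this status |
| | | archive_fail | Log replication of the primary database instance failed | Only the primary database instance has this status |
| | | switchover | A planned primary/standby switchover is in progress | Both the primary and the disaster recovery database instances have this status |
| | | restore | Full data restoration of the disaster recovery database instance is in progress | Only the disaster recovery database instance has this status |
| | | restore_fail | Full backup restoration of the disaster recovery database instance failed | Only the disaster recovery database instance has this status |
| | | recovery | Log replication of the disaster recovery database instance is in progress | Only the disaster recovery database instance has this status |
| | | recovery_fail | Log replication of the disaster recovery database instance failed | Only the disaster recovery database instance has this status |
| | | promote | The disaster recovery database instance is being promoted to primary | Only the disaster recovery database instance has this status |
| | | promote_fail | Promotion of the disaster recovery database instance failed | Only the disaster recovery database instance has this status |
| ddr_switchover_stat | Progress of the planned switchover between the primary and disaster recovery database instances | Percentage | Switchover progress | - |
| ddr_failover_stat | Progress of promoting the disaster recovery database instance to primary | Percentage | Promotion progress | - |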
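
When monitoring a deployment, it can help to know which side of the relationship may report a given ddr_cluster_stat value and whether the value indicates a failure. The following is a minimal sketch that transcribes the table above into a lookup; `SIDE_BY_STAT` and `describe_stat` are hypothetical names, and treating every value ending in `_fail` as a failure is an assumption drawn from the value names.

```python
from typing import Tuple

# Which side can report each ddr_cluster_stat value, per the table above.
SIDE_BY_STAT = {
    "normal": "not in disaster recovery",
    "full_backup": "primary",
    "archive": "primary",
    "backup_fail": "primary",
    "archive_fail": "primary",
    "switchover": "both",
    "restore": "disaster_standby",
    "restore_fail": "disaster_standby",
    "recovery": "disaster_standby",
    "recovery_fail": "disaster_standby",
    "promote": "disaster_standby",
    "promote_fail": "disaster_standby",
}


def describe_stat(stat: str) -> Tuple[str, bool]:
    """Return (side, failed) for a ddr_cluster_stat value."""
    if stat not in SIDE_BY_STAT:
        raise ValueError(f"unknown ddr_cluster_stat: {stat}")
    # Assumption: the *_fail suffix marks the failure states.
    return SIDE_BY_STAT[stat], stat.endswith("_fail")


print(describe_stat("archive_fail"))  # ('primary', True)
```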

Examples

Establish the disaster recovery relationship on the primary cluster.

gs_ddr -t start -m primary -X /opt/software-omm/xml/cluster_config.xml --json /home/omm/json_file
--------------------------------------------------------------------------------
Dorado disaster recovery start c1d7276eb6a711ee857204e8925383b2
--------------------------------------------------------------------------------
Start create dorado storage disaster relationship.
Got the step for action:[start].
Successfully check cluster status is: Normal.
Successfully check instance status.
Start update pg_hba config.
Starting set application_name param
Successfully set application_name param.
Stopping the cluster.
Successfully stopped the cluster.
Starting the cluster.
Successfully started primary instance. Please wait for standby instances.
Waiting cluster normal.
Successfully started standby instances.
Please ensure that the "Remote Replication Pairs" configured correctly between the primary cluster and the disaster recovery cluster, with Replication Mode in "Synchronous" state.
Ready to move on (yes/no)? yes
Waiting for the main standby connection.
Main standby already connected.
Successfully check cluster status is: Normal.
Successfully removed step file.
Successfully do dorado disaster recovery start.

Establish the disaster recovery relationship on the standby cluster.

gs_ddr -t start -m disaster_standby -X /opt/software-omm/xml/cluster_config.xml --json /home/omm/json_file
--------------------------------------------------------------------------------
Dorado disaster recovery start eb7068ceb6a711eeb0f4989449022b00
--------------------------------------------------------------------------------
Start create dorado storage disaster relationship.
Got the step for action:[start].
Successfully check cluster status is: Normal.
Successfully check instance status.
Start update pg_hba config.
Starting set application_name param
Successfully set application_name param.
Stopping the cluster.
Successfully stopped the cluster.
Start start dssserver in main standby node.
Successfully Start dssserver on node [host154]
Start build main standby datanode in disaster standby cluster.
Successfully build main standby in disaster standby cluster on node [host154]
Stop dssserver instance on main standby node.
Successfully stop dssserver before start cluster on node [host154]
Start set all dss instance CLUSTER_RUN_MODE.
Successfully set dss cfg CLUSTER_RUN_MODE to cluster_standby.
Please ensure that the "Remote Replication Pairs" configured correctly between the primary cluster and the disaster recovery cluster, with Replication Mode in "Synchronous" state.
Ready to move on (yes/no)? yes
Starting the cluster.
Successfully started primary instance. Please wait for standby instances.
Waiting cluster normal.
Successfully started standby instances.
Successfully check cluster status is: Normal.
Successfully removed step file.
Successfully do dorado disaster recovery start.

Planned demotion of the primary cluster to standby.

gs_ddr -t switchover -m disaster_standby
--------------------------------------------------------------------------------
Dorado disaster recovery switchover 20b22160b6aa11eeb98804e8925383b2
--------------------------------------------------------------------------------
Start dorado disaster switchover.
Waiting for cluster and all instances normal.
Parse cluster conf from file.
Successfully parse cluster conf from file.
Got the step for action:[switchover].
Waiting for cluster and all instances normal.
Stopping the cluster.
Successfully stopped the cluster.
Please manually switchover the primary and secondary replication relationship of the "Remote Replication Pairs" in Device Manager, and ensure the "Local Resource Role" is Secondary.Ready to move on (yes/no)? yes
Start set all dss instance CLUSTER_RUN_MODE.
Successfully set dss cfg CLUSTER_RUN_MODE to cluster_standby.
Starting the cluster.
Successfully started primary instance. Please wait for standby instances.
Waiting cluster normal.
Successfully started standby instances.
The cluster status is Normal.
Successfully removed step file.
Successfully do dorado disaster recovery switchover.

Planned promotion of the standby cluster to primary.

gs_ddr -t switchover -m primary
--------------------------------------------------------------------------------
Dorado disaster recovery switchover 1f9fb738b6aa11ee818e989449022b00
--------------------------------------------------------------------------------
Start dorado disaster switchover.
Waiting for cluster and all instances normal.
Parse cluster conf from file.
Successfully parse cluster conf from file.
Got the step for action:[switchover].
Waiting for cluster and all instances normal.
Please ensure that the "Remote Replication Pairs" configured correctly, and check the "Local Resource Role" is Primary.Ready to move on (yes/no)? yes
Start reload cm_agent and cm_server param.
Successfully reload cm guc param on all nodes.
Start set all dss instance CLUSTER_RUN_MODE.
Successfully set dss cfg CLUSTER_RUN_MODE to cluster_primary.
Start failover main standby datanode in disaster standby cluster.
Successfully Failover main standby in disaster standby cluster on node [host154]
Waiting cluster normal.
Successfully started datanode instances.
Waiting for the main standby connection.
Main standby already connected.
Successfully removed step file.
Successfully do dorado disaster recovery switchover.

Promote the disaster recovery cluster to primary (failover).

gs_ddr -t failover
--------------------------------------------------------------------------------
Dorado disaster recovery failover dd019cd8b6aa11ee9ec104e8925383b2
--------------------------------------------------------------------------------
Start dorado disaster recovery failover.
Got the step for action:[failover].
Successfully check cluster status is: Normal.
Parse cluster conf from file.
Successfully parse cluster conf from file.
Please ensure that the "Remote Replication Pairs" configured correctly, and check the "Local Resource Role" is Primary.Ready to move on (yes/no)? yes
Start reload cm_agent and cm_server param.
Successfully reload cm guc param on all nodes.
Start set all dss instance CLUSTER_RUN_MODE.
Successfully set dss cfg CLUSTER_RUN_MODE to cluster_primary.
Start failover main standby datanode in disaster standby cluster.
Successfully Failover main standby in disaster standby cluster on node [host58]
Waiting cluster normal.
Successfully started datanode instances.
Successfully removed step file.
Finished remove streaming dir.
Successfully do dorado disaster recovery failover.

Remove the disaster recovery relationship on the primary cluster.

gs_ddr -t stop -X /opt/software-omm/xml/cluster_config.xml --json /home/omm/json_file
--------------------------------------------------------------------------------
Dorado disaster recovery stop 6fafea88b6ad11eeae2d04e8925383b2
--------------------------------------------------------------------------------
Start remove dorado disaster recovery relationship.
Got the step for action:[stop].
Dorado disaster recover tmp dir [/data/omm/openGauss/tmp/ddr_cabin] not exist.
Successfully check cluster status is: Normal.
Check cluster type succeed.
Starting remove all node dn instances repl infos.
Successfully remove all node dn instances repl infos.
Start remove pg_hba config.
Finished remove pg_hba config.
Start remove cluster file.
Finished remove cluster file.
Successfully check cluster status is: Normal.
Finished remove streaming dir.
Successfully do dorado disaster recovery stop.

Query the disaster recovery status.

gs_ddr -t query
--------------------------------------------------------------------------------
Dorado disaster recovery query 296a5a9cb6b511eeaacf04e8925383b2
--------------------------------------------------------------------------------
Start dorado disaster query.
Start check archive.
Start check recovery.
Successfully executed dorado disaster recovery query, result:
{'ddr_cluster_stat': 'normal', 'ddr_failover_stat': '', 'ddr_switchover_stat': ''}
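
The last line of the query output above has the form of a Python dictionary literal, so a script can read it without a custom parser. The following is a minimal sketch under that assumption (the output format is inferred from this one example and is not a documented interface); `parse_query_result` is a hypothetical helper.

```python
import ast


def parse_query_result(output: str) -> dict:
    """Return the status dictionary found in captured `gs_ddr -t query` output."""
    # Scan from the end, since the result is printed last in the example.
    for line in reversed(output.splitlines()):
        line = line.strip()
        if line.startswith("{") and line.endswith("}"):
            result = ast.literal_eval(line)  # parses literals only, runs no code
            if isinstance(result, dict):
                return result
    raise ValueError("no result dictionary found in output")


sample = """Start dorado disaster query.
Start check archive.
Start check recovery.
Successfully executed dorado disaster recovery query, result:
{'ddr_cluster_stat': 'normal', 'ddr_failover_stat': '', 'ddr_switchover_stat': ''}
"""
print(parse_query_result(sample)["ddr_cluster_stat"])  # normal
```

`ast.literal_eval` is used instead of `json.loads` because the example output uses single quotes, which are not valid JSON.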
