Contents
  1. Adapting the entrypoint
  2. StatefulSet configuration
  3. Observing behavior
    3.1. Pod IPs stay stable across restarts
    3.2. Stable PVC naming
    3.3. Scaling out and in
      3.3.1. Scaling out
      3.3.2. Scaling in
    3.4. Simulating a node failure

Compared with a Deployment, a StatefulSet manages a group of stateful pods and is well suited to running stateful clusters: the "state" is that each pod keeps a fixed name, a fixed IP, and a fixed mounted disk.
A ZooKeeper cluster is relatively simple, so it makes a good test case.

Adapting the entrypoint

The official ZooKeeper Docker image uses docker-entrypoint.sh as its entrypoint, generating zoo.cfg at startup from environment variables passed into the container.
Because StatefulSet pod names follow a fixed rule and map deterministically onto the headless Service name, we can compute the most important part of the configuration: the cluster member list. The script below adapts the official one to generate that cluster configuration.
docker-entrypoint.sh after modification:

#!/bin/bash
# ZOO_SERVICE_NAME = name of the k8s headless service
# ZOO_COUNT        = number of replicas (cluster size)
set -e

# Allow the container to be started with `--user`
if [[ "$1" = 'zkServer.sh' && "$(id -u)" = '0' ]]; then
    if [ "$ZOO_DATA_DIR" != "/data" ]; then
        mkdir -p "$ZOO_DATA_DIR"
        mkdir -p "$ZOO_DATA_LOG_DIR"
    fi
    chown -R zookeeper "$ZOO_DATA_DIR" "$ZOO_DATA_LOG_DIR" "$ZOO_LOG_DIR"
    exec gosu zookeeper "$0" "$@"
fi

# Generate the config only if it doesn't exist
if [[ ! -f "$ZOO_CONF_DIR/zoo.cfg" ]]; then
    CONFIG="$ZOO_CONF_DIR/zoo.cfg"
    {
        echo "dataDir=$ZOO_DATA_DIR"
        echo "dataLogDir=$ZOO_DATA_LOG_DIR"

        echo "tickTime=$ZOO_TICK_TIME"
        echo "initLimit=$ZOO_INIT_LIMIT"
        echo "syncLimit=$ZOO_SYNC_LIMIT"

        echo "autopurge.snapRetainCount=$ZOO_AUTOPURGE_SNAPRETAINCOUNT"
        echo "autopurge.purgeInterval=$ZOO_AUTOPURGE_PURGEINTERVAL"
        echo "maxClientCnxns=$ZOO_MAX_CLIENT_CNXNS"
        echo "standaloneEnabled=$ZOO_STANDALONE_ENABLED"
        echo "admin.enableServer=$ZOO_ADMINSERVER_ENABLED"
    } >> "$CONFIG"

    if [[ -n $ZOO_ELECTION_PORT_BIND_RETRY ]]; then
        echo "electionPortBindRetry=$ZOO_ELECTION_PORT_BIND_RETRY" >> "$CONFIG"
    fi
    if [[ -n $ZOO_4LW_COMMANDS_WHITELIST ]]; then
        echo "4lw.commands.whitelist=$ZOO_4LW_COMMANDS_WHITELIST" >> "$CONFIG"
    fi

    for cfg_extra_entry in $ZOO_CFG_EXTRA; do
        echo "$cfg_extra_entry" >> "$CONFIG"
    done

    name=${HOSTNAME%%-*}
    if [ -z "$ZOO_SERVICE_NAME" ]; then
        ZOO_SERVICE_NAME=$name
    fi

    # Derive the member list from the cluster size; pod ordinals run from 0 to N-1
    if [[ $ZOO_COUNT -gt 0 ]]; then
        for id in $(seq 1 "$ZOO_COUNT"); do
            podid=$((id - 1))
            server="server.${id}=${name}-${podid}.${ZOO_SERVICE_NAME}:2888:3888;2181"
            echo "$server" >> "$CONFIG"
        done
    else
        echo "ZOO_COUNT is not set."
        exit 1
    fi

fi

# Write myid only if it doesn't exist
# myid = pod ordinal + 1
PODID=${HOSTNAME##*-}
ZOO_MY_ID=$((PODID + 1))
if [[ ! -f "$ZOO_DATA_DIR/myid" ]]; then
    echo "${ZOO_MY_ID:-1}" > "$ZOO_DATA_DIR/myid"
fi

sleep 180
exec "$@"
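The hostname parsing at the heart of the script can be sanity-checked outside the container. A minimal sketch, with the pod name hard-coded for illustration:

```shell
#!/bin/bash
# StatefulSet pods are named {statefulset name}-{ordinal}, e.g. zk-2.
HOSTNAME=zk-2
name=${HOSTNAME%%-*}    # strip the longest "-*" suffix  -> "zk"
podid=${HOSTNAME##*-}   # strip the longest "*-" prefix  -> "2"
myid=$((podid + 1))     # ZooKeeper ids are 1-based, pod ordinals 0-based
echo "name=$name podid=$podid myid=$myid"
```

Note that `%%-*` strips the longest matching suffix, so a StatefulSet name that itself contains a hyphen (say my-zk) would be truncated to my; `${HOSTNAME%-*}` would be the safer pattern in that case.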

StatefulSet configuration

A StatefulSet requires persistent storage; NFS or other backends work, and cloud block storage is used here. A headless Service is also required so that DNS can resolve each member's address.
The cluster environment variables ZOO_COUNT and ZOO_SERVICE_NAME are set on the containers.
Headless Service configuration:

apiVersion: v1
kind: Service            # object type
metadata:
  name: zksvc
  labels:
    app: zookeeper
spec:
  ports:
  - name: zk             # client port for inter-pod communication
    port: 2181
  - name: zk2888         # follower port
    port: 2888
  - name: zk3888         # leader-election port
    port: 3888
  selector:
    app: zookeeper       # select pods labeled app: zookeeper
  clusterIP: None        # must be None: this makes the Service headless

StatefulSet configuration:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zksvc     # name of the headless service
  replicas: 3
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
      - name: zookeeper
        image: zookeeper:3.6.3
        env:
        - name: ZOO_SERVICE_NAME   # headless service name; the entrypoint uses it to build member hostnames
          value: zksvc
        - name: ZOO_COUNT          # cluster size = replicas
          value: "3"
        - name: ZOO_DATA_DIR
          value: /zkdata/data
        - name: ZOO_DATA_LOG_DIR
          value: /zkdata/datalog
        - name: ZOO_MAX_CLIENT_CNXNS
          value: "600"
        - name: ZOO_STANDALONE_ENABLED
          value: "false"
        - name: ZOO_ADMINSERVER_ENABLED
          value: "false"
        - name: ZOO_AUTOPURGE_SNAPRETAINCOUNT
          value: "3"
        - name: ZOO_AUTOPURGE_PURGEINTERVAL
          value: "24"
        - name: ZOO_4LW_COMMANDS_WHITELIST
          value: "ruok,srvr"
        - name: ZOO_ELECTION_PORT_BIND_RETRY
          value: "300"
        - name: JVMFLAGS
          value: "-Xms128m -Xmx128m"
        ports:
        - containerPort: 2181
        resources:
          limits:
            cpu: 100m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          exec:
            command:
            - /bin/bash
            - -c
            - |
              ok=$(echo ruok | nc localhost 2181)
              if [ "$ok" = "imok" ]; then
                exit 0
              else
                echo "zk is not ok. 2181 closed"
                exit 1
              fi
          timeoutSeconds: 1
        volumeMounts:              # storage mounted into the pod
        - name: data
          mountPath: /zkdata
        - name: zkentrypoint
          mountPath: /docker-entrypoint.sh
          subPath: docker-entrypoint.sh
      volumes:
      - name: zkentrypoint
        configMap:
          name: zkcm
          defaultMode: 360
      securityContext:
        fsGroup: 1000
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: cbs-premium   # storage class backing the persistent volumes
  • ZOO_ELECTION_PORT_BIND_RETRY maps to electionPortBindRetry, the number of retries when binding the leader-election port (default 3). In Kubernetes it should be raised so the node keeps retrying while the peer DNS records propagate; otherwise startup fails easily.
  • fsGroup sets the group ownership of the mounted volume; without it the mount is owned by root.
  • defaultMode sets the file mode of the mounted entrypoint; it must grant execute permission. The manifest takes a decimal value, so it needs octal-to-decimal conversion: 360 here is octal 0550.
  • volumeClaimTemplates creates PVCs dynamically, so a storage class must be configured first.
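The decimal/octal conversion for defaultMode can be verified with a one-liner:

```shell
# Manifests take defaultMode in decimal, but file modes are written in octal;
# printf interprets a leading 0 as octal, so this converts 0550 to decimal.
printf '%d\n' 0550   # -> 360 (r-xr-x---)
```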

Tie the pieces together with kustomize, and the ZooKeeper cluster can be created:

cat <<EOF >./kustomization.yaml
namespace: default
resources:
- zksvc.yml
- zkcluster.yml
configMapGenerator:
- name: zkcm
  files:
  - docker-entrypoint.sh
EOF
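With the manifests and the entrypoint script in one directory, applying the kustomization creates the Service, the ConfigMap, and the cluster. A sketch (assumes a working cluster and the storage class above):

```shell
kubectl apply -k .                   # build and apply the kustomization
kubectl get po -l app=zookeeper -w   # watch zk-0, zk-1, zk-2 come up in order
```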

Observing behavior

Pod IPs stay stable across restarts

Pods are named {statefulset name}-{id}; on a rolling restart they are recreated in reverse ordinal order, from 2 down to 0.

ubuntu@k8s-dev-m1:~$ kubectl get po -o wide
NAME   READY   STATUS    RESTARTS   AGE    IP             NODE            NOMINATED NODE   READINESS GATES
zk-0   1/1     Running   0          29m    172.31.0.11    k8s-dev-node4   <none>           <none>
zk-1   1/1     Running   0          29m    172.31.0.129   k8s-dev-node6   <none>           <none>
zk-2   1/1     Running   0          81s    172.31.0.41    k8s-dev-node4   <none>           <none>
ubuntu@k8s-dev-m1:~$
ubuntu@k8s-dev-m1:~$ kubectl get po -o wide
NAME   READY   STATUS    RESTARTS   AGE    IP             NODE            NOMINATED NODE   READINESS GATES
zk-0   1/1     Running   1          130m   172.31.0.11    k8s-dev-node4   <none>           <none>
zk-1   1/1     Running   1          129m   172.31.0.129   k8s-dev-node6   <none>           <none>
zk-2   1/1     Running   1          102m   172.31.0.41    k8s-dev-node4   <none>           <none>

Stable PVC naming

PVC names also follow a fixed rule: {volumeClaimTemplates name}-{statefulset name}-{id}.

NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-zk-0   Bound    pvc-0dfd9fa6-9540-4f90-adca-1a62eb9dd6ec   10Gi       RWO            cbs-premium    7h2m
data-zk-1   Bound    pvc-e70c6a5a-48d2-465d-b235-ae1835ad3ec7   10Gi       RWO            cbs-premium    6h50m
data-zk-2   Bound    pvc-28d360a8-9577-4e2b-bbec-20fbdd146dd9   10Gi       RWO            cbs-premium    5h30m

Scaling out and in

Scaling in either direction is done with a patch that changes replicas and ZOO_COUNT together:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: zookeeper
        env:
        - name: ZOO_COUNT
          value: "5"

Then add the patch to kustomization.yaml:

namespace: default
resources:
- zksvc.yml
- zkcluster.yml
configMapGenerator:
- name: zkcm
  files:
  - docker-entrypoint.sh
patchesStrategicMerge:
- replicas.yml

Apply the patch and watch the cluster change: a StatefulSet creates pods in order from 0 to N-1 and terminates them in the reverse order.
By default the StatefulSet controller creates replicas serially; to create and delete pods in parallel, set the .spec.podManagementPolicy field to "Parallel" (the default is "OrderedReady").
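The field and its two allowed values can be inspected directly from the cluster (requires kubectl access):

```shell
# Show the schema and allowed values (OrderedReady | Parallel) of the field
kubectl explain statefulset.spec.podManagementPolicy
```

Note that podManagementPolicy is set at creation time; it cannot be changed on a live StatefulSet.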

Scaling out

  1. The two new pods are created first

    NAME   READY   STATUS    RESTARTS   AGE   IP             NODE                   NOMINATED NODE   READINESS GATES
    zk-0   1/1     Running   0          22h   172.31.0.12    k8s-dev-node4          <none>           <none>
    zk-1   1/1     Running   0          22h   172.31.0.164   k8s-dev-node6          <none>           <none>
    zk-2   1/1     Running   0          22h   172.31.0.34    k8s-dev-node4          <none>           <none>
    zk-3   1/1     Running   0          54s   172.31.0.91    k8s-dev-node3.bxr.cn   <none>           <none>
    zk-4   0/1     Pending   0          9s    <none>         <none>                 <none>           <none>
  2. Then the existing pods are restarted one by one

    NAME   READY   STATUS              RESTARTS   AGE   IP             NODE                   NOMINATED NODE   READINESS GATES
    zk-0   1/1     Running             0          22h   172.31.0.12    k8s-dev-node4          <none>           <none>
    zk-1   1/1     Running             0          22h   172.31.0.164   k8s-dev-node6          <none>           <none>
    zk-2   0/1     ContainerCreating   0          7s    <none>         k8s-dev-node4          <none>           <none>
    zk-3   1/1     Running             0          90s   172.31.0.91    k8s-dev-node3.bxr.cn   <none>           <none>
    zk-4   1/1     Running             0          45s   172.31.0.20    k8s-dev-node4          <none>           <none>
    ubuntu@k8s-dev-m1:~/k8s/zkcluster$
    ubuntu@k8s-dev-m1:~/k8s/zkcluster$ kubectl get po -o wide
    NAME   READY   STATUS        RESTARTS   AGE   IP             NODE                   NOMINATED NODE   READINESS GATES
    zk-0   1/1     Running       0          22h   172.31.0.12    k8s-dev-node4          <none>           <none>
    zk-1   0/1     Terminating   0          22h   172.31.0.164   k8s-dev-node6          <none>           <none>
    zk-2   1/1     Running       0          16s   172.31.0.14    k8s-dev-node4          <none>           <none>
    zk-3   1/1     Running       0          99s   172.31.0.91    k8s-dev-node3.bxr.cn   <none>           <none>
    zk-4   1/1     Running       0          54s   172.31.0.20    k8s-dev-node4          <none>           <none>

Scaling in

  1. The two newest pods are terminated first

    NAME   READY   STATUS        RESTARTS   AGE
    zk-0   1/1     Running       0          39m
    zk-1   1/1     Running       0          39m
    zk-2   1/1     Running       0          40m
    zk-3   0/1     Terminating   0          41m
  2. Then the remaining pods are restarted in turn

    ubuntu@k8s-dev-m1:~/k8s/zkcluster$ kubectl get po
    NAME   READY   STATUS              RESTARTS   AGE
    zk-0   1/1     Running             0          39m
    zk-1   1/1     Running             0          40m
    zk-2   0/1     ContainerCreating   0          7s

    ubuntu@k8s-dev-m1:~/k8s/zkcluster$ kubectl get po
    NAME   READY   STATUS    RESTARTS   AGE
    zk-0   1/1     Running   0          79s
    zk-1   1/1     Running   0          2m2s
    zk-2   1/1     Running   0          2m39s

Simulating a node failure

Freeze a node with the cordon command, then delete its pod and watch the cluster.
Initial state:

ubuntu@k8s-dev-m1:~$ kubectl get po -o wide
NAME   READY   STATUS    RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES
zk-0   1/1     Running   0          3h17m   172.31.0.57    k8s-dev-node4   <none>           <none>
zk-1   1/1     Running   0          3h18m   172.31.0.184   k8s-dev-node6   <none>           <none>
zk-2   1/1     Running   0          3h18m   172.31.0.43    k8s-dev-node4   <none>           <none>
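The failure drill can be run with standard kubectl commands (node and pod names taken from the session above):

```shell
kubectl cordon k8s-dev-node6     # mark the node unschedulable
kubectl delete pod zk-1          # evict the pod; the StatefulSet recreates it elsewhere
kubectl uncordon k8s-dev-node6   # restore the node when done
```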

After simulating the failure of node6, zk-1 migrated cleanly to another node:

NAME   READY   STATUS    RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES
zk-0   1/1     Running   0          4h41m   172.31.0.57    k8s-dev-node4   <none>           <none>
zk-1   1/1     Running   0          65s     172.31.0.113   k8s-dev-node3   <none>           <none>
zk-2   1/1     Running   0          4h42m   172.31.0.43    k8s-dev-node4   <none>           <none>