"EtcdException: Could not get the list of servers" when running the Calico rkt container on CoreOS

I have two CoreOS stable v1122.2.0 machines, each running etcd2 configured with TLS.

I created the certificates using https://github.com/coreos/etcd/tree/master/hack/tls-setup.

Now I am trying to configure calico-node to run under rkt on my CoreOS master node.

I have the following in my cloud-config:

    write_files:
      - path: "/etc/kubernetes/cni/net.d/10-calico.conf"
        content: |
          {
            "name": "calico",
            "type": "flannel",
            "delegate": {
              "type": "calico",
              "etcd_endpoints": "https://10.79.218.2:2379,https://10.79.218.3:2379",
              "log_level": "none",
              "log_level_stderr": "info",
              "hostname": "10.79.218.2",
              "policy": {
                "type": "k8s",
                "k8s_api_root": "http://127.0.0.1:8080/api/v1/"
              }
            }
          }
      - path: "/etc/kubernetes/manifests/policy-controller.yaml"
        content: |
          apiVersion: v1
          kind: Pod
          metadata:
            name: calico-policy-controller
            namespace: calico-system
          spec:
            hostNetwork: true
            containers:
              # The Calico policy controller.
              - name: k8s-policy-controller
                image: calico/kube-policy-controller:v0.2.0
                env:
                  - name: ETCD_ENDPOINTS
                    value: "https://10.79.218.2:2379,https://10.79.218.3:2379"
                  - name: K8S_API
                    value: "http://127.0.0.1:8080"
                  - name: LEADER_ELECTION
                    value: "true"
              # Leader election container used by the policy controller.
              - name: leader-elector
                image: quay.io/calico/leader-elector:v0.1.0
                imagePullPolicy: IfNotPresent
                args:
                  - "--election=calico-policy-election"
                  - "--election-namespace=calico-system"
                  - "--http=127.0.0.1:4040"
    ...
    units:
      - name: calico-node.service
        enable: true
        command: start
        content: |
          [Unit]
          Description=Calico per-host agent
          Requires=network-online.target
          After=network-online.target

          [Service]
          Slice=machine.slice
          Environment=CALICO_DISABLE_FILE_LOGGING=true
          Environment=HOSTNAME=10.79.218.2
          Environment=IP=10.79.218.2
          Environment=FELIX_FELIXHOSTNAME=10.79.218.2
          Environment=CALICO_NETWORKING=false
          Environment=NO_DEFAULT_POOLS=true
          Environment=ETCD_ENDPOINTS=https://10.79.218.2:2379,https://10.79.218.3:2379
          ExecStart=/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci \
            --volume=modules,kind=host,source=/lib/modules,readOnly=false \
            --mount=volume=modules,target=/lib/modules \
            --trust-keys-from-https quay.io/calico/node:v0.19.0
          KillMode=mixed
          Restart=always
          TimeoutStartSec=0

          [Install]
          WantedBy=multi-user.target

Please ignore the whitespace and indentation; I don't think I copied and pasted it correctly :)

When I try to start the calico-node service, I get the following error:

    Sep 14 05:45:17 localhost systemd[1]: Started Calico per-host agent.
    Sep 14 05:45:17 localhost rkt[1644]: image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
    Sep 14 05:45:18 localhost rkt[1644]: image: using image from local store for image name quay.io/calico/node:v0.19.0
    Sep 14 05:45:25 localhost rkt[1644]: Traceback (most recent call last):
    Sep 14 05:45:25 localhost rkt[1644]: File "startup.py", line 292, in <module>
    Sep 14 05:45:25 localhost rkt[1644]: client = IPAMClient()
    Sep 14 05:45:25 localhost rkt[1644]: File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 228, in __init__
    Sep 14 05:45:25 localhost rkt[1644]: "%s" % (ETCD_CA_CERT_FILE_ENV, etcd_ca))
    Sep 14 05:45:25 localhost rkt[1644]: pycalico.datastore_errors.DataStoreError: Invalid ETCD_CA_CERT_FILE. Certificate Authority cert is required and m
    Sep 14 05:45:25 localhost rkt[1644]: Calico node failed to start
    Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Main process exited, code=exited, status=1/FAILURE
    Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Unit entered failed state.
    Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Failed with result 'exit-code'.
    Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Service hold-off time over, scheduling restart.
    Sep 14 05:45:25 localhost systemd[1]: Stopped Calico per-host agent.
    Sep 14 05:45:25 localhost systemd[1]: Started Calico per-host agent.
    Sep 14 05:45:25 localhost rkt[1714]: image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
    Sep 14 05:45:26 localhost rkt[1714]: image: using image from local store for image name quay.io/calico/node:v0.19.0
    Sep 14 05:45:28 localhost rkt[1714]: Traceback (most recent call last):
    Sep 14 05:45:28 localhost rkt[1714]: File "startup.py", line 292, in <module>
    Sep 14 05:45:28 localhost rkt[1714]: client = IPAMClient()
    Sep 14 05:45:28 localhost rkt[1714]: File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 228, in __init__
    Sep 14 05:45:28 localhost rkt[1714]: "%s" % (ETCD_CA_CERT_FILE_ENV, etcd_ca))
    Sep 14 05:45:28 localhost rkt[1714]: pycalico.datastore_errors.DataStoreError: Invalid ETCD_CA_CERT_FILE. Certificate Authority cert is required and m

So I get "Invalid ETCD_CA_CERT_FILE". I never actually told Calico which keys to use, so I assume I am missing some configuration.

I have the following relevant keys in /etc/ssl/etcd:

    8 -rw-------. 1 etcd etcd 1050 Sep 14 05:45 ca.pem
    8 -rw-------. 1 etcd etcd  289 Sep 14 05:45 etcd1-key.pem
    8 -rw-------. 1 etcd etcd 1058 Sep 14 05:45 etcd1.pem
    8 -rw-------. 1 etcd etcd  227 Sep 12 03:49 server1-key.pem
    8 -rw-------. 1 etcd etcd  822 Sep 12 03:49 server1.pem

I tried adding Environment=ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem to the calico-node systemd unit file, but I get exactly the same result.

Any ideas?

Update

So I tried running Calico manually instead of through systemd, and I also set all the environment variables that Calico requires:

    export CALICO_DISABLE_FILE_LOGGING=true
    export HOSTNAME=10.79.218.2
    export IP=10.79.218.2
    export FELIX_FELIXHOSTNAME=10.79.218.2
    export CALICO_NETWORKING=false
    export NO_DEFAULT_POOLS=true
    export ETCD_ENDPOINTS=https://10.79.218.2:2379,https://10.79.218.3:2379
    export ETCD_AUTHORITY=10.79.218.2:2379
    export ETCD_SCHEME=https
    export ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem
    export ETCD_CERT_FILE=/etc/ssl/etcd/etcd1.pem
    export ETCD_KEY_FILE=/etc/ssl/etcd/etcd1-key.pem

When I try to run the Calico container:

    /usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci \
      --volume=modules,kind=host,source=/lib/modules,readOnly=false \
      --mount=volume=modules,target=/lib/modules \
      --trust-keys-from-https quay.io/calico/node:v0.19.0

I get:

    image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
    image: using image from local store for image name quay.io/calico/node:v0.19.0
    Traceback (most recent call last):
      File "startup.py", line 292, in <module>
        client = IPAMClient()
      File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 221, in __init__
        ETCD_CERT_FILE_ENV, etcd_cert))
    pycalico.datastore_errors.DataStoreError: Cannot read ETCD_KEY_FILE and/or ETCD_CERT_FILE. Both must be readable file paths. Values provided: ETCD_KEY_FILE=/etc/ssl/etcd/etcd1-key.pem, ETCD_CERT_FILE=/etc/ssl/etcd/etcd1.pem

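I changed the permissions of the certificate files to 666, but that did not solve the problem. I know these certificates are valid, because etcd TLS is working correctly. So what am I missing?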

Update 2

It seems I had forgotten to mount the certificate directory into the Calico container.
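
A quick way to confirm this kind of problem is to check, from inside the container, whether the paths named in the TLS environment variables are actually readable there. The following is only a minimal sketch: `unreadable_tls_files` is a hypothetical helper, not part of Calico, and it assumes the three variable names from the exports above.

```python
import os

# Environment variables that point at TLS material (taken from the exports above).
TLS_ENV_VARS = ("ETCD_CA_CERT_FILE", "ETCD_CERT_FILE", "ETCD_KEY_FILE")


def unreadable_tls_files(env=os.environ):
    """Return {variable: path} for every configured path that is not a readable file."""
    problems = {}
    for name in TLS_ENV_VARS:
        path = env.get(name)
        if not path:
            continue  # variable not set, nothing to check
        # The path must exist as a regular file and be readable by this process.
        if not (os.path.isfile(path) and os.access(path, os.R_OK)):
            problems[name] = path
    return problems


if __name__ == "__main__":
    for name, path in unreadable_tls_files().items():
        print("%s=%s is not readable from this environment" % (name, path))
```

A path that is readable on the host but reported here from inside the container points to a missing volume mount rather than to file permissions.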

So now I am running the Calico container like this:

    /usr/bin/rkt run --volume etcd-ssl,kind=host,source=/etc/ssl/etcd/,readOnly=true \
      --inherit-env --stage1-from-dir=stage1-fly.aci \
      --volume=modules,kind=host,source=/lib/modules,readOnly=false \
      --mount=volume=modules,target=/lib/modules \
      --trust-keys-from-https quay.io/calico/node:v0.19.0 \
      --mount volume=etcd-ssl,target=/etc/ssl/etcd

I get the following output:

    image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
    image: using image from local store for image name quay.io/calico/node:v0.19.0
    Traceback (most recent call last):
      File "startup.py", line 292, in <module>
        client = IPAMClient()
      File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 246, in __init__
        allow_reconnect=True)
      File "/usr/lib/python2.7/site-packages/etcd/client.py", line 204, in __init__
        set(self.machines))
      File "/usr/lib/python2.7/site-packages/etcd/client.py", line 299, in machines
        return self.machines
      File "/usr/lib/python2.7/site-packages/etcd/client.py", line 301, in machines
        raise etcd.EtcdException("Could not get the list of servers, "
    etcd.EtcdException: Could not get the list of servers, maybe you provided the wrong host(s) to connect to?
    Calico node failed to start

I am a bit closer, but still have no solution.

Update 3

I tried pointing ETCD_ENDPOINTS at the etcd server on this CoreOS machine by running export ETCD_ENDPOINTS=https://10.79.218.2:2379, and now when I try to run the Calico rkt image I get:

    image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
    image: using image from local store for image name quay.io/calico/node:v0.19.0
    Traceback (most recent call last):
      File "startup.py", line 295, in <module>
        main()
      File "startup.py", line 251, in main
        warn_if_hostname_conflict(ip)
      File "startup.py", line 192, in warn_if_hostname_conflict
        current_ipv4, _ = client.get_host_bgp_ips(hostname)
      File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 132, in wrapped
        "running?" % (fn.__name__, e.message))
    pycalico.datastore_errors.DataStoreError: get_host_bgp_ips: Error accessing etcd (Connection to etcd failed due to SSLError(CertificateError("hostname '10.79.218.2' doesn't match u'etcd'",),)). Is etcd running?
    Calico node failed to start

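I ran into this problem as well, and eventually found its root cause by looking at the etcd connection logic, the code of the libraries it uses, and some pointers from the Calico team's Slack channel.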

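The problem is that current versions of Calico (at least 0.22.0) use a Python etcd client that does not support IP SANs (Subject Alternative Names) in TLS certificates. As a result, the certificate you are using cannot be matched against the etcd server address you configured, even though the IP address is listed in the certificate.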
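
To illustrate the failure mode, here is a simplified sketch of a hostname check that only considers DNS-type SAN entries and falls back to the subject's common name. It is not the actual code of the etcd client or its dependencies, and it skips wildcard handling. The function name and the certificate contents are assumptions made for illustration; the dictionary layout follows what Python's `ssl.SSLSocket.getpeercert()` returns.

```python
def match_hostname_dns_only(cert, hostname):
    """Simplified check that ignores 'IP Address' SAN entries.

    `cert` uses the dict layout returned by ssl.SSLSocket.getpeercert().
    Raises ValueError when the hostname matches none of the accepted names.
    """
    # Collect only DNS-type subjectAltName entries; IP entries are skipped.
    names = [value
             for kind, value in cert.get("subjectAltName", ())
             if kind == "DNS"]
    if not names:
        # No DNS SANs found: fall back to the subject's commonName.
        names = [value
                 for rdn in cert.get("subject", ())
                 for key, value in rdn
                 if key == "commonName"]
    if hostname not in names:
        raise ValueError("hostname %r doesn't match %s"
                         % (hostname, ", ".join(repr(n) for n in names)))


# Hypothetical certificate: CN "etcd", with the IP present only as an IP SAN.
ip_san_cert = {
    "subject": ((("commonName", "etcd"),),),
    "subjectAltName": (("IP Address", "10.79.218.2"),),
}

try:
    match_hostname_dns_only(ip_san_cert, "10.79.218.2")
except ValueError as exc:
    print(exc)
```

Because the IP SAN is never consulted, the check compares the IP address against the common name only, which is the same shape as the "doesn't match u'etcd'" error in the question.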

This is described in this GitHub issue.

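To fix this, you can wait until a new version of the urllib library is released, the etcd client picks it up and publishes a new release, and Calico is updated to use that new etcd client. Alternatively, you can regenerate the certificates with FQDNs instead of IP addresses in the SAN field. That means you need to make sure your servers are reachable by those names, either through DNS or by setting up /etc/hosts correctly. The OpenSSL configuration used to generate the certificates should contain something like this: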

    [alt_names]
    DNS.1 = $ENV::FQDN

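The link describing how the certificates were generated uses CFSSL, so I suggest reading its documentation to see how to switch to hostnames instead of IP addresses. I believe it may be as simple as modifying the JSON configuration, like this:

    "hosts": [
        "example.com",
        "www.example.com"
    ],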

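As a rough sketch of that idea, the snippet below builds a CFSSL-style CSR document whose `hosts` list contains only names and refuses IP literals. `build_csr` is a hypothetical helper, and the key algorithm, key size, and the name `etcd1.example.com` are placeholder assumptions; check the CFSSL documentation and the tls-setup configuration for the values you actually need.

```python
import ipaddress
import json


def build_csr(common_name, hosts):
    """Return a CFSSL-style CSR dict, rejecting IP literals in `hosts`."""
    for host in hosts:
        try:
            ipaddress.ip_address(host)
        except ValueError:
            continue  # not an IP literal, so it is treated as a hostname
        raise ValueError("use a hostname instead of the IP address %s" % host)
    return {
        "CN": common_name,
        "hosts": list(hosts),
        # Placeholder key parameters, assumed for this example.
        "key": {"algo": "rsa", "size": 2048},
    }


# Hypothetical FQDN; it must resolve to the etcd server via DNS or /etc/hosts.
print(json.dumps(build_csr("etcd1.example.com", ["etcd1.example.com"]), indent=2))
```

The endpoints passed to Calico would then need to use the same names, since the client compares the name it connects to against the certificate.
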
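I found that with this fragile library I could succeed if: the client opens the connection to an IP address; the server's certificate asserts that IP address in its subject; and the server certificate has no DNS-type entries in its Subject Alternative Name list. Here is an example server certificate, excerpted from the output of openssl x509 -text ..., that works when the client opens the connection using the IP address 10.10.10.1 to identify the server: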

    ...
    Subject: CN=10.10.10.1
    ...
    X509v3 extensions:
        X509v3 Basic Constraints:
            CA:FALSE
        X509v3 Key Usage:
            Digital Signature, Non Repudiation, Key Encipherment
        X509v3 Subject Alternative Name:
            IP Address:100.127.0.2, IP Address:100.127.0.2, IP Address:10.10.10.1
    ...
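
The conditions above can be expressed as a small check. This is only a sketch of the rule as I observed it, not code from any library: `fits_ip_only_pattern` is a hypothetical name, and the certificate is written in the dictionary layout that Python's `ssl.SSLSocket.getpeercert()` returns.

```python
def fits_ip_only_pattern(cert, connect_ip):
    """True if the subject CN equals the IP and the SAN list has no DNS entries."""
    common_names = [value
                    for rdn in cert.get("subject", ())
                    for key, value in rdn
                    if key == "commonName"]
    has_dns_entry = any(kind == "DNS"
                        for kind, _ in cert.get("subjectAltName", ()))
    return connect_ip in common_names and not has_dns_entry


# The example certificate shown above, in getpeercert() layout.
example_cert = {
    "subject": ((("commonName", "10.10.10.1"),),),
    "subjectAltName": (
        ("IP Address", "100.127.0.2"),
        ("IP Address", "100.127.0.2"),
        ("IP Address", "10.10.10.1"),
    ),
}

print(fits_ip_only_pattern(example_cert, "10.10.10.1"))
```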

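Also, there are newer versions of the Calico image. I have heard only two bad things about calico/node:v0.23.0. One came from someone else: https://calicousers.slack.com/archives/kubernetes/p1478206011002345 . I did some testing of this image myself and filed only one issue, https://github.com/projectcalico/calico-containers/issues/1107 . There are currently 1.0.0 beta and rc1 releases, and I have not heard anything bad about them.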