root@ly:~# curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Downloading https://get.helm.sh/helm-v3.19.0-linux-amd64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm

root@ly:~# curl -x http://192.168.101.7:8983 https://ipinfo.io/ip
23.95.128.150
root@ly:~# curl --socks5 http://192.168.101.7:8983 https://ipinfo.io/ip
23.95.128.150
root@ly:~#

Set a proxy for Docker, then install GitLab:

root@ly:/opt/sre-lab/infra# cd /opt/sre-lab/infra && docker compose -f gitlab-compose.yml up -d
WARN[0000] /opt/sre-lab/infra/gitlab-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Running 2/10
 ⠴ gitlab [⣀⣿⡀⣿⠀⠀⠀⠀⠀] Pulling                                              34.6s
 ⠹ 953cdd413371 Downloading [=================>        ] 10.2MB/29.72MB      12.3s
 ✔ 05346a3e21a7 Download complete                                             3.5s
 ⠹ 5d603ffc0d9c Downloading [=======>                  ] 2.913MB/18.36MB     12.3s
 ✔ a3198e8161fd Download
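For reference, the Docker daemon proxy was set via a systemd drop-in roughly like this (a sketch; the proxy address and NO_PROXY list match the `systemctl show -p Environment docker` output later in these notes):

sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf >/dev/null <<'EOF'
[Service]
Environment="HTTP_PROXY=http://192.168.101.7:8983"
Environment="HTTPS_PROXY=http://192.168.101.7:8983"
Environment="NO_PROXY=localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local"
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker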

The GitLab startup logs show errors:

2025-09-23_17:04:23.77985 2025/09/23 17:04:22 [emerg] 1433#0: bind() to 0.0.0.0:8082 failed (98: Address already in use)
2025-09-23_17:04:24.28181 2025/09/23 17:04:22 [emerg] 1433#0: bind() to 0.0.0.0:8082 failed (98: Address already in use)
2025-09-23_17:04:24.78366 2025/09/23 17:04:22 [emerg] 1433#0: bind() to 0.0.0.0:8082 failed (98: Address already in use)
2025-09-23_17:04:25.28580 2025/09/23 17:04:22 [emerg] 1433#0: still could not bind()
2025-09-23_17:04:25.30501 2025/09/23 17:04:25 [emerg] 1434#0: bind() to 0.0.0.0:8082 failed (98: Address already in use)
2025-09-23_17:04:25.80607 2025/09/23 17:04:25 [emerg] 1434#0: bind() to 0.0.0.0:8082 failed (98: Address already in use)

==> /var/log/gitlab/gitlab-kas/current <==
2025-09-23_17:04:26.12712 {"time":"2025-09-23T17:04:26.126821467Z","level":"ERROR","msg":"Failed to get receptive agents","mod_name":"kas2agentk_tunnel","error":"Get \"http://192.168.101.100:8082/api/v4/internal/kubernetes/receptive_agents\": read tcp 172.18.0.2:50070->192.168.101.100:8082: read: connection reset by peer"}

==> /var/log/gitlab/nginx/current <==
2025-09-23_17:04:26.30885 2025/09/23 17:04:25 [emerg] 1434#0: bind() to 0.0.0.0:8082 failed (98: Address already in use)
2025-09-23_17:04:26.81137 2025/09/23 17:04:25 [emerg] 1434#0: bind() to 0.0.0.0:8082 failed (98: Address already in us

Symptom: root@ly:/opt/sre-lab/infra# curl -I http://192.168.101.100:8082/ returns nothing. The Windows machine and the Ubuntu machine behave the same.

So the root cause is that port 8082 is already in use inside the container. Conclusion: Nginx inside the container is trying to listen on 8082 (because external_url includes the port), while the host mapping is 8082:80 (or a leftover config produces a duplicate listener), so 8082 conflicts inside the container. Unify the configuration so the container listens on 80 and the host exposes 8082. You can confirm what is holding 8082 inside the container with the check below.
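A quick way to see what is bound to 8082 inside the GitLab container (a sketch; whether ss or netstat is available depends on the image):

docker exec -it gitlab bash -lc 'ss -lntp 2>/dev/null | grep 8082 || netstat -lntp | grep 8082'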

Fix: stop the container with docker compose -f gitlab-compose.yml down, then edit the compose file and remove the port from external_url; the original config duplicated the port. A sketch of the corrected fragment follows.
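A minimal sketch of what the corrected gitlab-compose.yml fragment could look like, assuming the service is named gitlab and keeping the 8082:80 / 2224:22 host mappings seen in docker ps later; adjust to your actual file:

services:
  gitlab:
    image: gitlab/gitlab-ce:latest
    environment:
      GITLAB_OMNIBUS_CONFIG: |
        # no port in external_url, so nginx inside the container listens on 80
        external_url 'http://192.168.101.100'
    ports:
      - "8082:80"   # host 8082 -> container 80
      - "2224:22"   # host 2224 -> container 22 (SSH)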

GitLab initial admin account and password (view/reset):
URL: http://192.168.101.100:8082/
Login account: root
Initial password location (generated automatically when the container starts; it expires after 24 hours):
On the host (the config directory is bind-mounted): sudo cat /opt/sre-lab/infra/gitlab/config/initial_root_password
Or inside the container: docker exec -it gitlab bash -lc 'cat /etc/gitlab/initial_root_password'
If the file is missing or the password has expired, reset the root password: docker exec -it gitlab gitlab-rake "gitlab:password:reset"

Get the registration token from the GitLab UI -> Admin -> Runners.

What does the step sudo docker exec -it gitlab-runner gitlab-runner register actually do?

It registers this Runner instance with your GitLab; only after registration can GitLab dispatch CI jobs to this Runner.
What you need to do (interactive):
In the GitLab admin area: Admin Area -> Runners -> copy the registration token (or go to a project's Settings -> CI/CD -> Runners and copy the project token).
Run the registration command (inside the container): sudo docker exec -it gitlab-runner gitlab-runner register
Fill in the prompts in order:
GitLab URL: http://192.168.101.100:8082
Registration token: paste the token from the previous step
Description: e.g. runner-docker
Tags: e.g. docker (can be left empty)
Run untagged jobs: y/n as needed
Executor: choose docker
Default Docker image: e.g. docker:24 or alpine:3

Non-interactive (one-shot) example; replace the URL/token with your actual values:
sudo docker exec -it gitlab-runner gitlab-runner register \
  --non-interactive \
  --url "http://192.168.101.100:8082" \
  --registration-token "YOUR_TOKEN" \
  --executor "docker" \
  --description "runner-docker" \
  --docker-image "docker:24" \
  --run-untagged="true"
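After registration, an optional local check that the runner is configured and can reach GitLab (standard gitlab-runner subcommands):

sudo docker exec -it gitlab-runner gitlab-runner list
sudo docker exec -it gitlab-runner gitlab-runner verify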

Important (pick one way to build images):
If your CI uses docker:dind (i.e. .gitlab-ci.yml has services: docker:24-dind and sets DOCKER_HOST), the Runner must be set to privileged=true (edit the runner's config.toml), and you do not need to mount /var/run/docker.sock.
If you instead mount the host Docker socket (your docker-compose already mounts /var/run/docker.sock), do not use the dind service or DOCKER_HOST in CI; just use image: docker:24 to build and push. See the config.toml sketch below.
Verification:
The Runners page in the GitLab admin area should show Online (green).
Push a simple pipeline and confirm the job is picked up and run by this Runner.
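A sketch of the relevant part of /etc/gitlab-runner/config.toml for the dind variant (standard gitlab-runner config fields; the name/URL assume the values used above):

[[runners]]
  name = "runner-docker"
  url = "http://192.168.101.100:8082"
  executor = "docker"
  [runners.docker]
    image = "docker:24"
    privileged = true   # required for docker:dind
    # For the socket-mount variant instead, keep privileged = false and use:
    # volumes = ["/var/run/docker.sock:/var/run/docker.sock"]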

Location: http://192.168.101.100:8082/admin -> CI/CD -> Runners. The token found on that page ("Create instance runner"): ejyBu5Qcd8R3M4tTFAHN

root@ly:/opt/sre-lab/infra# sudo docker exec -it gitlab-runner gitlab-runner register
Runtime platform                                    arch=amd64 os=linux pid=16 revision=139a0ac0 version=18.4.0
Running in system-mode.

Enter the GitLab instance URL (for example, https://gitlab.com/):
http://192.168.101.100:8082
Enter the registration token:
ejyBu5Qcd8R3M4tTFAHN
Enter a description for the runner:
[fae04b0b6d45]: runner-docker
Enter tags for the runner (comma-separated):

Enter optional maintenance note for the runner:
y
WARNING: Support for registration tokens and runner parameters in the 'register' command has been deprecated in GitLab Runner 15.6 and will be replaced with support for authentication tokens. For more information, see https://docs.gitlab.com/ci/runners/new_creation_workflow/
Registering runner... succeeded                     correlation_id=01K5VSM44W9V7TX220FT7A9FD6 runner=ejyBu5Qcd
Enter an executor: docker-windows, kubernetes, docker-autoscaler, parallels, docker, docker+machine, instance, custom, shell, ssh, virtualbox:
docker
Enter the default Docker image (for example, ruby:3.3):
docker:24
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!

Configuration (with the authentication token) was saved in "/etc/gitlab-runner/config.toml"

[screenshot: image-20250924012539915]

Log in to Harbor:

http://192.168.101.100:8083/harbor/projects

[screenshot: image-20250924012959549]

root@ly:/opt/sre-lab/infra/harbor# docker login 192.168.101.100:8083 -u admin -p 'Harbor12345'
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get "https://192.168.101.100:8083/v2/": http: server gave HTTP response to HTTPS client
root@ly:/opt/sre-lab/infra/harbor# docker tag demo/web:latest 192.168.101.100:8083/library/demo-web:1.0.0
Error response from daemon: No such image: demo/web:latest

The web UI login works, but the CLI login fails.

Cause: this Harbor install serves plain HTTP on 8083, while Docker accesses private registries over HTTPS by default. Without an "insecure registry" entry it reports: http: server gave HTTP response to HTTPS client.
"No such image: demo/web:latest" means that image does not exist locally; build it first, or tag an existing image and push that.
One-shot fix (Docker config + test push):

First test while bypassing the proxy (only affects the single command):
Direct to the local port: curl --noproxy 192.168.101.100 http://192.168.101.100:8083/v2/  or  curl --noproxy '*' http://127.0.0.1:8083/v2/
Self-check inside the container (confirm Harbor is OK internally): docker exec -it nginx sh -c "apk add curl || true; curl -sI http://127.0.0.1:8080/v2/"
Temporarily clear the proxy in the current shell and test again (so curl/apt stop going through the proxy):
unset http_proxy https_proxy all_proxy HTTP_PROXY HTTPS_PROXY ALL_PROXY
curl -I http://192.168.101.100:8082/
Permanently add the address to no_proxy (so internal addresses are never proxied).
Effective immediately:
export no_proxy="127.0.0.1,localhost,192.168.101.100"
export NO_PROXY="127.0.0.1,localhost,192.168.101.100"
Persist it in /etc/environment (takes effect on next login): add the no_proxy/NO_PROXY above alongside http_proxy/https_proxy; see the sketch below.
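A sketch of the /etc/environment entries (values assumed from the proxy and Harbor addresses used in these notes):

http_proxy="http://192.168.101.7:8983"
https_proxy="http://192.168.101.7:8983"
no_proxy="127.0.0.1,localhost,192.168.101.100"
NO_PROXY="127.0.0.1,localhost,192.168.101.100"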

  1. Declare the HTTP private registry to Docker (insecure-registries)

sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
  "proxies": {
    "default": {
      "httpProxy": "http://192.168.101.7:8983",
      "httpsProxy": "http://192.168.101.7:8983",
      "noProxy": "localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,192.168.101.100"
    }
  },
  "registry-mirrors": ["https://docker.m.daocloud.io"],
  "insecure-registries": ["192.168.101.100:8083"],
  "dns": ["8.8.8.8","1.1.1.1"]
}
EOF

2) Restart Docker:
sudo systemctl restart docker
3) Log in to Harbor (stdin is safer):
echo 'Harbor12345' | docker login 192.168.101.100:8083 -u admin --password-stdin
4) Test the push with a ready-made image (verify with busybox/nginx first):
docker pull busybox
docker tag busybox 192.168.101.100:8083/library/busybox:latest
docker push 192.168.101.100:8083/library/busybox:latest
If you get "access denied": first confirm in the Harbor UI that the library project exists (it does by default), or push to a project you created (e.g. demo).
5) Then push your own image (build it first if it does not exist locally).
Quickly build a sample image and push it:
cat > Dockerfile <<'EOF'
FROM nginx:alpine
RUN echo "hello harbor" > /usr/share/nginx/html/index.html
EOF

docker build -t 192.168.101.100:8083/library/demo-web:1.0.0 .
docker push 192.168.101.100:8083/library/demo-web:1.0.0
Notes:
You are logged in to Harbor via the UI, but the Docker CLI and daemon still use HTTPS; insecure-registries must be configured before an HTTP registry can be used.
If you later switch Harbor to HTTPS (recommended for production), remove insecure-registries and trust the Harbor certificate on each Docker node.

root@ly:/opt/sre-lab/infra/harbor# sudo systemctl restart docker
root@ly:/opt/sre-lab/infra/harbor# sudo systemctl daemon-reload
root@ly:/opt/sre-lab/infra/harbor# sudo systemctl restart docker
root@ly:/opt/sre-lab/infra/harbor# systemctl show -p Environment docker
Environment=HTTP_PROXY=http://192.168.101.7:8983 HTTPS_PROXY=http://192.168.101.7:8983 NO_PROXY=localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.sv>
root@ly:/opt/sre-lab/infra/harbor# 
root@ly:/opt/sre-lab/infra/harbor# docker info | grep -i proxy
 HTTP Proxy: http://192.168.101.7:8983
 HTTPS Proxy: http://192.168.101.7:8983
 No Proxy: localhost,127.0.0.1,::1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local

root@ly:/opt/sre-lab/infra/harbor# cat /etc/docker/daemon.json
{
  "insecure-registries": ["192.168.101.100:8083"],
  "registry-mirrors": ["https://docker.m.daocloud.io"],
  "dns": ["8.8.8.8","1.1.1.1"]
}
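After the restart you can confirm the daemon picked up the insecure registry; plain docker info output contains an "Insecure Registries" section:

docker info | grep -iA3 'insecure registries'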

After restarting Docker, a new problem appeared:

root@ly:/opt/sre-lab/infra/harbor# cat /etc/docker/daemon.json
{
  "insecure-registries": ["192.168.101.100:8083"],
  "registry-mirrors": ["https://docker.m.daocloud.io"],
  "dns": ["8.8.8.8","1.1.1.1"]
}
root@ly:/opt/sre-lab/infra/harbor# echo 'Harbor12345' | docker login 192.168.101.100:8083 -u admin --password-stdin
Error response from daemon: Get "http://192.168.101.100:8083/v2/": dial tcp 192.168.101.100:8083: connect: connection refused
root@ly:/opt/sre-lab/infra/harbor# docker ps -a
CONTAINER ID   IMAGE                                   COMMAND                  CREATED          STATUS                             PORTS                                                                                       NAMES
a46bb0d51727   goharbor/harbor-jobservice:v2.10.0      "/harbor/entrypoint.…"   17 minutes ago   Exited (128) 8 minutes ago                                                                                                     harbor-jobservice
faeaeea248b7   goharbor/nginx-photon:v2.10.0           "nginx -g 'daemon of…"   17 minutes ago   Exited (128) 8 minutes ago                                                                                                     nginx
18ecff9c565e   goharbor/harbor-core:v2.10.0            "/harbor/entrypoint.…"   17 minutes ago   Exited (128) 8 minutes ago                                                                                                     harbor-core
8a50610b28cb   goharbor/trivy-adapter-photon:v2.10.0   "/home/scanner/entry…"   17 minutes ago   Exited (128) 8 minutes ago                                                                                                     trivy-adapter
3e0d90913c45   goharbor/harbor-portal:v2.10.0          "nginx -g 'daemon of…"   17 minutes ago   Exited (128) 8 minutes ago                                                                                                     harbor-portal
9ea1bff7fcc4   goharbor/redis-photon:v2.10.0           "redis-server /etc/r…"   17 minutes ago   Exited (128) 8 minutes ago                                                                                                     redis
49b5e050eae0   goharbor/harbor-db:v2.10.0              "/docker-entrypoint.…"   17 minutes ago   Exited (128) 8 minutes ago                                                                                                     harbor-db
ed34aec5018c   goharbor/registry-photon:v2.10.0        "/home/harbor/entryp…"   17 minutes ago   Exited (128) 8 minutes ago                                                                                                     registry
691e281a1639   goharbor/harbor-registryctl:v2.10.0     "/home/harbor/start.…"   17 minutes ago   Exited (128) 8 minutes ago                                                                                                     registryctl
9f2876672621   goharbor/harbor-log:v2.10.0             "/bin/sh -c /usr/loc…"   17 minutes ago   Up 57 seconds (healthy)            127.0.0.1:1514->10514/tcp                                                                   harbor-log
fae04b0b6d45   gitlab/gitlab-runner:alpine             "/usr/bin/dumb-init …"   32 minutes ago   Up 57 seconds                                                                                                                  gitlab-runner
bf9ff52b5d39   gitlab/gitlab-ce:latest                 "/assets/init-contai…"   36 minutes ago   Up 57 seconds (health: starting)   443/tcp, 0.0.0.0:2224->22/tcp, [::]:2224->22/tcp, 0.0.0.0:8082->80/tcp, [::]:8082->80/tcp   gitlab

Most of the Harbor containers have exited (Exited 128), which is why 8083 refuses connections. Bring Harbor back up as follows, then retry the login and push.
Steps:

  1. Enter the Harbor directory, regenerate the config, and start it:

cd /opt/sre-lab/infra/harbor/harbor

# If you are in the parent directory, cd harbor first, into the directory containing docker-compose.yml

# Regenerate the config (reads harbor.yml)

./prepare

# Stop any leftovers first

docker compose down    # if the command does not exist, use docker-compose

# Start

docker compose up -d

  2. Check the key container logs and confirm they are healthy:

docker ps -a | grep harbor
docker logs -f harbor-db
docker logs -f harbor-core
docker logs -f nginx

  3. Verify that the service on 8083 is up:

curl --noproxy 192.168.101.100 http://192.168.101.100:8083/v2/

  4. Docker CLI login and test push:

echo 'Harbor12345' | docker login 192.168.101.100:8083 -u admin --password-stdin

docker pull busybox
docker tag busybox 192.168.101.100:8083/library/busybox:latest
docker push 192.168.101.100:8083/library/busybox:latest

Finally, the login succeeded!!

2025-09-23 17:49:57.028 UTC [1] LOG:  listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
2025-09-23 17:49:57.033 UTC [7] LOG:  database system was shut down at 2025-09-23 17:37:43 UTC
2025-09-23 17:49:57.037 UTC [1] LOG:  database system is ready to accept connections
^Croot@ly:/opt/sre-lab/infra/harbor# echo 'Harbor12345' | docker login 192.168.101.100:8083 -u admin --password-stdin

WARNING! Your credentials are stored unencrypted in '/root/.docker/config.json'. Configure a credential helper to remove this warning. See https://docs.docker.com/go/credential-store/

Login Succeeded

The pushed image is now visible at: http://192.168.101.100:8083/harbor/projects/1/repositories

[screenshot: image-20250924015145450]

root@ly:/opt/sre-lab/infra/harbor# sudo tee /etc/rancher/k3s/registries.yaml >/dev/null <<'EOF'
mirrors:
  "192.168.101.100:8083":
    endpoint:
      - "http://192.168.101.100:8083"
EOF
root@ly:/opt/sre-lab/infra/harbor# sudo systemctl restart k3s
root@ly:/opt/sre-lab/infra/harbor#
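Optionally, confirm k3s/containerd can now pull from Harbor over HTTP (this assumes the library/busybox image pushed earlier exists and the project is public):

sudo k3s crictl pull 192.168.101.100:8083/library/busybox:latest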

root@ly:/opt/sre-lab/helm-values# helm upgrade --install kps prometheus-community/kube-prometheus-stack -n monitoring --create-namespace -f /opt/sre-lab/helm-values/kps-values.yaml
Error: Kubernetes cluster unreachable: Get "http://localhost:8080/version": dial tcp 127.0.0.1:8080: connect: connection refused
root@ly:/opt/sre-lab/helm-values#

root@ly:/opt/sre-lab/helm-values# sudo systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; preset: enabled)
     Active: active (running) since Wed 2025-09-24 01:52:23 CST; 2min 16s ago
       Docs: https://k3s.io
    Process: 146715 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 146716 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 146719 (k3s-server)
      Tasks: 20
     Memory: 338.5M (peak: 343.6M)
        CPU: 13.903s
     CGroup: /system.slice/k3s.service
             ├─146719 "/usr/local/bin/k3s server"
             └─146736 "containerd "

Sep 24 01:54:22 ly k3s[146719]: E0924 01:54:22.102614  146719 log.go:32] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed t>
Sep 24 01:54:22 ly k3s[146719]: E0924 01:54:22.102668  146719 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = f>
Sep 24 01:54:22 ly k3s[146719]: E0924 01:54:22.102689  146719 kuberuntime_manager.go:1252] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = >
Sep 24 01:54:22 ly k3s[146719]: E0924 01:54:22.102728  146719 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"helm-ins>
Sep 24 01:54:22 ly k3s[146719]: E0924 01:54:22.935767  146719 log.go:32] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed t>
Sep 24 01:54:22 ly k3s[146719]: E0924 01:54:22.935810  146719 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = f>
Sep 24 01:54:22 ly k3s[146719]: E0924 01:54:22.935829  146719 kuberuntime_manager.go:1252] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = >
Sep 24 01:54:22 ly k3s[146719]: E0924 01:54:22.936122  146719 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"helm-ins>
Sep 24 01:54:25 ly k3s[146719]: E0924 01:54:25.687615  146719 resource_quota_controller.go:446] "Unhandled Error" err="unable to retrieve the complete list of serv>
Sep 24 01:54:26 ly k3s[146719]: I0924 01:54:26.123794  146719 garbagecollector.go:787] "failed to discover some groups" groups="map[\"metrics.k8s.io/v1beta1\":\"st>

The logs show Pod sandbox creation failures, usually caused by the kubeconfig, the proxy, or the CNI not being ready. Fix in this order:

  1. In the current session, talk to the API directly (do not go through the proxy):

# kubeconfig

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

# Unset the proxy

unset http_proxy https_proxy all_proxy HTTP_PROXY HTTPS_PROXY ALL_PROXY

# Whitelist local/direct addresses

export NO_PROXY="127.0.0.1,localhost,192.168.101.100,10.42.0.0/16,10.43.0.0/16,.svc,.cluster.local"
kubectl get nodes -o wide

  2. Configure a proxy for k3s/containerd (so system images can be pulled), and set NO_PROXY:

sudo mkdir -p /etc/systemd/system/k3s.service.d
sudo tee /etc/systemd/system/k3s.service.d/http-proxy.conf >/dev/null <<'EOF'
[Service]
Environment="HTTP_PROXY=http://192.168.101.7:8983"
Environment="HTTPS_PROXY=http://192.168.101.7:8983"
Environment="NO_PROXY=127.0.0.1,localhost,192.168.101.100,10.42.0.0/16,10.43.0.0/16,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local"
EOF
sudo systemctl daemon-reload
sudo systemctl restart k3s

  3. Verify the container runtime can pull images:

sudo k3s crictl info | head -n1
sudo k3s crictl pull docker.io/library/busybox:latest

  4. Check the CNI and system component status:

ls /etc/cni/net.d
kubectl -n kube-system get pods -o wide

# flannel/metrics-server/coredns should gradually become Running

  5. If "CreatePodSandbox" errors persist, recreate the flannel Pods:

kubectl -n kube-system delete pod -l app=flannel

# Wait 30-60 s, then check again:

kubectl -n kube-system get pods

  6. Once the cluster is ready, install the monitoring stack:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Make sure this session still has no http/https proxy set

env | grep -i proxy
helm upgrade --install kps prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f /opt/sre-lab/helm-values/kps-values.yaml

Troubleshooting notes: the server in the kubeconfig is usually https://127.0.0.1:6443, so make sure 127.0.0.1 is in NO_PROXY. If crictl pull still times out, check the pull errors in sudo journalctl -u k3s -f first, then adjust the proxy or switch the image mirror.
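A quick way to confirm which API server address the current kubeconfig points at (so you know what must be in NO_PROXY):

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'; echo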

root@ly:/opt/sre-lab/helm-values# helm upgrade --install kps prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f /opt/sre-lab/helm-values/kps-values.yaml
Release "kps" does not exist. Installing it now.
NAME: kps
LAST DEPLOYED: Wed Sep 24 02:00:24 2025
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace monitoring get pods -l "release=kps"

Get Grafana 'admin' user password by running:

  kubectl --namespace monitoring get secrets kps-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo

Access Grafana local instance:

  export POD_NAME=$(kubectl --namespace monitoring get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kps" -oname)
  kubectl --namespace monitoring port-forward $POD_NAME 3000

Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.

Grafana error: Detected a time difference of 6h 17m 26.361s between your browser and the server. You may see unexpected time-shifted query results due to the time drift.

Fix: synchronize the clock by installing chrony.
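A minimal sketch of the chrony fix, assuming the Ubuntu node seen above (package and service are both named chrony on Ubuntu):

sudo apt-get update && sudo apt-get install -y chrony
sudo systemctl enable --now chrony
chronyc tracking   # confirm the clock offset shrinks
timedatectl        # double-check system time and timezone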

To do next:

5. Install Argo Rollouts (canary/blue-green)

kubectl create namespace argo-rollouts || true
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
# Optional: install the UI (dashboard):
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/dashboard-install.yaml
kubectl -n argo-rollouts port-forward svc/argo-rollouts-dashboard 3100:3100 # view locally
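Optional quick check that the controller and dashboard Pods came up before port-forwarding:

kubectl -n argo-rollouts get pods
kubectl -n argo-rollouts get svc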

root@ly:~# kubectl -n argo-rollouts port-forward svc/argo-rollouts-dashboard 3100:3100
Forwarding from 127.0.0.1:3100 -> 3100
Forwarding from [::1]:3100 -> 3100

The second line means the same port-forward is also listening on the local IPv6 loopback address.
"Forwarding from 127.0.0.1:3100 -> 3100": listening on port 3100 of the IPv4 loopback 127.0.0.1, forwarding to port 3100 of the in-cluster target.
"Forwarding from [::1]:3100 -> 3100": listening on port 3100 of the IPv6 loopback ::1, forwarding to the same target.
So you can access it via http://127.0.0.1:3100, http://localhost:3100 (may resolve to IPv4 or IPv6), or http://[::1]:3100, but only from this machine.
To allow LAN/external access, add an address flag (security risk):
kubectl -n argo-rollouts port-forward svc/argo-rollouts-dashboard 3100:3100 --address 0.0.0.0

Or listen on both IPv4 and IPv6:

kubectl -n argo-rollouts port-forward svc/argo-rollouts-dashboard 3100:3100 --address 0.0.0.0,::

http://192.168.101.100:3100/rollouts/

[screenshot: image-20250924083325085]

Example Rollout (using demo-web):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-web
  namespace: dev
spec:
  replicas: 3
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 60}
        - setWeight: 50
        - pause: {duration: 120}
      trafficRouting:
        nginx: {}
  selector:
    matchLabels: { app: demo-web }
  template:
    metadata: { labels: { app: demo-web } }
    spec:
      containers:
        - name: web
          image: 192.168.101.100:8083/library/demo-web:1.0.0
          ports:
            - containerPort: 8080

How to use this Rollout example (NGINX-based traffic splitting)
The YAML above only defines the Rollout; a Service and an Ingress are also needed before traffic can be split by weight. Below is a minimal working set of manifests plus the steps to use them.
Prerequisites (one-time):
The Argo Rollouts controller and the kubectl-argo-rollouts plugin are installed.
The NGINX Ingress Controller is installed.
The image 192.168.101.100:8083/library/demo-web:1.0.0 can be pulled by the cluster (configure registry credentials if necessary).
One-shot deployment manifests. Save the following as rollout-demo.yaml and apply it:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-web
  namespace: dev
spec:
  replicas: 3
  strategy:
    canary:
      stableService: demo-web
      canaryService: demo-web-canary
      trafficRouting:
        nginx:
          stableIngress: demo-web-ingress
      steps:
        - setWeight: 20
        - pause: {duration: 60}
        - setWeight: 50
        - pause: {duration: 120}
  selector:
    matchLabels:
      app: demo-web
  template:
    metadata:
      labels:
        app: demo-web
    spec:
      containers:
        - name: web
          image: 192.168.101.100:8083/library/demo-web:1.0.0
          ports:
            - containerPort: 8080
---

apiVersion: v1
kind: Service
metadata:
  name: demo-web
  namespace: dev
spec:
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: demo-web
---

apiVersion: v1
kind: Service
metadata:
  name: demo-web-canary
  namespace: dev
spec:
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: demo-web
---

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-web-ingress
  namespace: dev
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-web
                port:
                  number: 80

Deploy and verify.
Create the namespace and apply the manifests:
kubectl create ns dev
kubectl apply -f rollout-demo.yaml
Watch the rollout progress:
kubectl argo rollouts get rollout demo-web -n dev --watch
Access: resolve demo.example.com to the external address of your NGINX Ingress, then open http://demo.example.com/
Trigger a canary release by updating the image (Argo Rollouts shifts traffic per the steps, 20% -> 50%):
kubectl argo rollouts set image demo-web web=192.168.101.100:8083/library/demo-web:1.0.1 -n dev
Manually promote / skip a pause:
kubectl argo rollouts promote demo-web -n dev
Abort / roll back:
kubectl argo rollouts abort demo-web -n dev
Notes:
If you do not have an Ingress yet, you can drop trafficRouting and only shift canary replica ratios, but then there is no weight-based traffic splitting at the edge.
You can also watch via the dashboard: after the port-forward, open http://127.0.0.1:3100.

There are errors:

root@ly:~# kubectl create ns dev
namespace/dev created
root@ly:~# kubectl apply -f rollout-demo.yaml
rollout.argoproj.io/demo-web created
service/demo-web created
service/demo-web-canary created
Warning: annotation "kubernetes.io/ingress.class" is deprecated, please use 'spec.ingressClassName' instead
ingress.networking.k8s.io/demo-web-ingress created
root@ly:~# kubectl argo rollouts get rollout demo-web -n dev --watch
error: unknown command "argo" for "kubectl"
root@ly:~# kubectl argo rollouts get rollout demo-web -n dev --watch
error: unknown command "argo" for "kubectl"
root@ly:~# helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
"ingress-nginx" has been added to your repositories
root@ly:~# helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx --create-namespace
Error: Kubernetes cluster unreachable: Get "http://localhost:8080/version": dial tcp 127.0.0.1:8080: connect: connection refused
root@ly:~#

You are hitting two classes of problem:
The kubectl Argo Rollouts plugin is not installed, so kubectl argo rollouts ... reports "unknown command".
Helm cannot reach the cluster ("http://localhost:8080/version ... connection refused"), which usually means it is not picking up the right kubeconfig.

  1. Install the kubectl-argo-rollouts plugin (fixes unknown command "argo"):
curl -fsSL -o /usr/local/bin/kubectl-argo-rollouts \
  https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x /usr/local/bin/kubectl-argo-rollouts
kubectl argo rollouts version
Once installed, check the rollout status again:
kubectl argo rollouts get rollout demo-web -n dev --watch
  2. Fix Helm's cluster connection (KUBECONFIG pointing problem). Your kubectl apply succeeded, which means kubectl found a kubeconfig; Helm's error usually means it did not get the same kubeconfig. Pick the variant for your cluster type:
kubeadm (common):
mkdir -p ~/.kube
cp -i /etc/kubernetes/admin.conf ~/.kube/config
chown $(id -u):$(id -g) ~/.kube/config
export KUBECONFIG=~/.kube/config
k3s:
mkdir -p ~/.kube
cp -i /etc/rancher/k3s/k3s.yaml ~/.kube/config

# If you are not operating on the host itself, replace 127.0.0.1 with the k3s server IP

sed -i 's/127.0.0.1/<your-server-IP>/' ~/.kube/config

export KUBECONFIG=~/.kube/config

Verify immediately:
kubectl cluster-info
helm version
helm ls -A
Reinstall Ingress-NGINX:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  -n ingress-nginx --create-namespace
kubectl get pods -n ingress-nginx -w
Tip: if you do not want to change environment variables, pass the kubeconfig on the command line:

helm --kubeconfig ~/.kube/config upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  -n ingress-nginx --create-namespace

  3. Update the Ingress to the new syntax (removes the deprecation warning). Patch the existing Ingress in place:
kubectl patch ingress demo-web-ingress -n dev --type=json \
-p '[{"op":"remove","path":"/metadata/annotations/kubernetes.io~1ingress.class"}, {"op":"add","path":"/spec/ingressClassName","value":"nginx"}]'
  4. Verify the whole chain

# Rollout status (after the plugin is installed)

kubectl argo rollouts get rollout demo-web -n dev --watch

# NGINX Ingress is ready

kubectl get pods -n ingress-nginx kubectl get svc -n ingress-nginx

# Before visiting the domain, resolve demo.example.com to the Ingress external address

# Then open http://demo.example.com/ in a browser

How to resolve demo.example.com to the Ingress external address.
Find the Ingress external address. First check whether there is a load balancer IP (EXTERNAL-IP):
kubectl get svc -n ingress-nginx ingress-nginx-controller
If there is an EXTERNAL-IP, e.g. 10.1.2.3, use that IP with port 80 from here on.
If no EXTERNAL-IP is shown, use the NodePort instead:
# Find the NodePort mapped to port 80
kubectl get svc -n ingress-nginx ingress-nginx-controller -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}{"\n"}'
# Pick the IP of any node
kubectl get nodes -o wide

root@ly:~# kubectl get svc -n ingress-nginx
NAME                                 TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller             LoadBalancer   10.43.212.123                 80:31177/TCP,443:30129/TCP   71s
ingress-nginx-controller-admission   ClusterIP      10.43.129.34                  443/TCP                      71s
root@ly:~# kubectl get svc -n ingress-nginx ingress-nginx-controller
NAME                       TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller   LoadBalancer   10.43.212.123                 80:31177/TCP,443:30129/TCP   3m54s
root@ly:~# kubectl get svc -n ingress-nginx ingress-nginx-controller -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}{"\n"}'
31177
root@ly:~# kubectl get nodes -o wide
NAME   STATUS   ROLES                  AGE   VERSION        INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
ly     Ready    control-plane,master   9h    v1.33.4+k3s1   192.168.101.100                 Ubuntu 24.04.3 LTS   6.14.0-29-generic   containerd://2.0.5-k3s2

Note the node IP (192.168.101.100) and the NodePort found above (31177).
On Linux (this host): echo "192.168.101.100 demo.example.com" | sudo tee -a /etc/hosts
On Windows (run as administrator):

  1. Open C:\Windows\System32\drivers\etc\hosts

  2. Append a line at the end.
     With an EXTERNAL-IP: 10.1.2.3 demo.example.com
     With a NodePort (still use the node IP; the port goes in the URL later): 192.168.101.100 demo.example.com

  3. Save, then run: ipconfig /flushdns

    Quick verification without touching hosts (command line):

# Set the Host header to demo.example.com and hit the Ingress IP directly (or node IP:NodePort)

curl -H "Host: demo.example.com" http://10.1.2.3/ -i

# Or via the NodePort

curl -H "Host: demo.example.com" http://192.168.101.100:31177/ -i

Access:
With an EXTERNAL-IP (Service=LoadBalancer listening on 80/443): open http://demo.example.com/ in a browser.
NodePort scenario: open http://demo.example.com:<NodePort>/ (e.g. http://demo.example.com:31690/).

root@ly:~# curl -H "Host: demo.example.com" http://192.168.101.100:31177/ -i
HTTP/1.1 503 Service Temporarily Unavailable
Date: Wed, 24 Sep 2025 00:55:19 GMT
Content-Type: text/html
Content-Length: 190
Connection: keep-alive

503 Service Temporarily Unavailable
nginx

root@ly:~# kubectl get ingressclass
NAME      CONTROLLER                      PARAMETERS   AGE
nginx     k8s.io/ingress-nginx                         16m
traefik   traefik.io/ingress-controller                7h3m
root@ly:~# kubectl get ing -n dev demo-web-ingress -o yaml | egrep 'ingressClassName|kubernetes.io/ingress.class'
      {"apiVersion":"networking.k8s.io/v1","kind":"Ingress","metadata":{"annotations":{"kubernetes.io/ingress.class":"nginx"},"name":"demo-web-ingress","namespace":"dev"},"spec":{"rules":[{"host":"demo.example.com","http":{"paths":[{"backend":{"service":{"name":"demo-web","port":{"number":80}}},"path":"/","pathType":"Prefix"}]}}]}}
    kubernetes.io/ingress.class: nginx
root@ly:~# kubectl describe ing -n dev demo-web-ingress
Name:             demo-web-ingress
Labels:
Namespace:        dev
Address:
Ingress Class:
Default backend:
Rules:
  Host              Path  Backends
  demo.example.com
                    /     demo-web:80 ()
Annotations:        kubernetes.io/ingress.class: nginx
Events:
  Type    Reason  Age  From                      Message
  Normal  Sync    16m  nginx-ingress-controller  Scheduled for sync
root@ly:~# kubectl -n dev port-forward svc/demo-web 8080:80
error: unable to forward port because pod is not running. Current status=Pending

There appear to be two immediate problems:
The Ingress is not being handled: Ingress Class is empty (only the deprecated annotation is set; spec.ingressClassName is missing).
The backend Pods are not running: port-forward reports Pending, so there are no usable Endpoints and the Ingress will inevitably return 502/503.
Work through the following in order (continue only after each step succeeds):

  1. Fix the Ingress class name:
kubectl patch ingress demo-web-ingress -n dev --type=json \
  -p '[{"op":"remove","path":"/metadata/annotations/kubernetes.io~1ingress.class"}, {"op":"add","path":"/spec/ingressClassName","value":"nginx"}]'
kubectl describe ing -n dev demo-web-ingress | egrep 'Ingress Class|Rules|Backends'
Confirm it now shows Ingress Class: nginx.

root@ly:~# kubectl -n dev port-forward svc/demo-web 8080:80
error: unable to forward port because pod is not running. Current status=Pending
root@ly:~# kubectl patch ingress demo-web-ingress -n dev --type=json \
  -p '[{"op":"remove","path":"/metadata/annotations/kubernetes.io~1ingress.class"}, {"op":"add","path":"/spec/ingressClassName","value":"nginx"}]'
ingress.networking.k8s.io/demo-web-ingress patched
root@ly:~# kubectl describe ing -n dev demo-web-ingress | egrep 'Ingress Class|Rules|Backends'
Ingress Class:  nginx
Rules:
  Host              Path  Backends

  2. Find out why the Pods are Pending:

kubectl get pods -n dev -l app=demo-web -o wide --show-labels

# Pick any Pending Pod:

kubectl describe pod -n dev <pod-name>
kubectl get events -n dev --sort-by=.lastTimestamp | tail -n 50

Events:
  Type     Reason     Age                  From               Message
  Normal   Scheduled  26m                  default-scheduler  Successfully assigned dev/demo-web-698c68b94d-bcpj6 to ly
  Normal   Pulling    23m (x5 over 26m)    kubelet            Pulling image "192.168.101.100:8083/library/demo-web:1.0.0"
  Warning  Failed     23m (x5 over 26m)    kubelet            Failed to pull image "192.168.101.100:8083/library/demo-web:1.0.0": rpc error: code = NotFound desc = failed to pull and unpack image "192.168.101.100:8083/library/demo-web:1.0.0": failed to resolve reference "192.168.101.100:8083/library/demo-web:1.0.0": 192.168.101.100:8083/library/demo-web:1.0.0: not found
  Warning  Failed     23m (x5 over 26m)    kubelet            Error: ErrImagePull
  Normal   BackOff    77s (x111 over 26m)  kubelet            Back-off pulling image "192.168.101.100:8083/library/demo-web:1.0.0"
  Warning  Failed     77s (x111 over 26m)  kubelet            Error: ImagePullBackOff

3m    Normal   Pulling   pod/demo-web-698c68b94d-twfbs   Pulling image "192.168.101.100:8083/library/demo-web:1.0.0"
23m   Warning  Failed    pod/demo-web-698c68b94d-twfbs   Failed to pull image "192.168.101.100:8083/library/demo-web:1.0.0": rpc error: code = NotFound desc = failed to pull and unpack image "192.168.101.100:8083/library/demo-web:1.0.0": failed to resolve reference "192.168.101.100:8083/library/demo-web:1.0.0": 192.168.101.100:8083/library/demo-web:1.0.0: not found
23m   Normal   Pulling   pod/demo-web-698c68b94d-bxw2q   Pulling image "192.168.101.100:8083/library/demo-web:1.0.0"
23m   Warning  Failed    pod/demo-web-698c68b94d-bxw2q   Error: ErrImagePull
23m   Warning  Failed    pod/demo-web-698c68b94d-bxw2q   Failed to pull image "192.168.101.100:8083/library/demo-web:1.0.0": rpc error: code = NotFound desc = failed to pull and unpack image "192.168.101.100:8083/library/demo-web:1.0.0": failed to resolve reference "192.168.101.100:8083/library/demo-web:1.0.0": 192.168.101.100:8083/library/demo-web:1.0.0: not found
23m   Warning  Failed    pod/demo-web-698c68b94d-bcpj6   Error: ErrImagePull
23m   Warning  Failed    pod/demo-web-698c68b94d-bcpj6   Failed to pull image "192.168.101.100:8083/library/demo-web:1.0.0": rpc error: code = NotFound desc = failed to pull and unpack image "192.168.101.100:8083/library/demo-web:1.0.0": failed to resolve reference "192.168.101.100:8083/library/demo-web:1.0.0": 192.168.101.100:8083/library/demo-web:1.0.0: not found
23m   Normal   Pulling   pod/demo-web-698c68b94d-bcpj6   Pulling image "192.168.101.100:8083/library/demo-web:1.0.0"
93s   Warning  Failed    pod/demo-web-698c68b94d-bxw2q   Error: ImagePullBackOff
93s   Normal   BackOff   pod/demo-web-698c68b94d-bxw2q   Back-off pulling image "192.168.101.100:8083/library/demo-web:1.0.0"
92s   Normal   BackOff   pod/demo-web-698c68b94d-twfbs   Back-off pulling image "192.168.101.100:8083/library/demo-web:1.0.0"
92s   Warning  Failed    pod/demo-web-698c68b94d-twfbs   Error: ImagePullBackOff
90s   Warning  Failed    pod/demo-web-698c68b94d-bcpj6   Error: ImagePullBackOff
90s   Normal   BackOff   pod/demo-web-698c68b94d-bcpj6   Back-off pulling image "192.168.101.100:8083/library/demo-web:1.0.0"
77s   Normal   Sync      ingress/demo-web-ingress        Scheduled for sync

Fix according to the cause. If it is ImagePullBackOff/ErrImagePull (the private registry 192.168.101.100:8083 needs credentials):
kubectl -n dev create secret docker-registry regcred \
  --docker-server=192.168.101.100:8083 \
  --docker-username=admin --docker-password=Harbor12345

Then add to Rollout.spec.template.spec.imagePullSecrets:

      imagePullSecrets:
        - name: regcred

Then re-apply: kubectl apply -f rollout-demo.yaml

If it is Unschedulable (insufficient resources / taints / node unreachable): adjust the resource requests, remove the taint, or pick a node with capacity.
If it is CrashLoopBackOff: check the startup error with kubectl logs -n dev <pod-name>.
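In this case the events actually say the tag is not found in Harbor at all, so another fix (reusing the Dockerfile and push commands from the Harbor section above; run them in the directory containing that Dockerfile) is to publish the missing tag and let the pods retry:

docker build -t 192.168.101.100:8083/library/demo-web:1.0.0 .
docker push 192.168.101.100:8083/library/demo-web:1.0.0
# Speed up the retry instead of waiting out the image pull back-off
kubectl -n dev delete pod -l app=demo-web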

  3. Confirm the Services have Endpoints:
kubectl get svc -n dev demo-web demo-web-canary -o wide
kubectl get ep -n dev demo-web demo-web-canary -o wide
ENDPOINTS must be non-empty, and the port should be 8080 (what your container listens on).
If you are not sure which port the app listens on, verify inside the Pod:
kubectl exec -it -n dev <pod-name> -- sh -c 'ss -ltnp | grep 8080 || netstat -ltnp | grep 8080 || true'
kubectl exec -it -n dev <pod-name> -- sh -c 'curl -sS -m 2 -i http://127.0.0.1:8080/ || true'
  4. Verify through the NodePort again, using "any node IP:31177" (not a ClusterIP like 10.43.212.123):
curl -H "Host: demo.example.com" http://192.168.101.100:31177/ -i
Paste the describe pod output from step 2 and the svc/ep output from step 3, and I can give targeted fix commands.

Option A: use a public image directly (port 8080, no Service changes needed)

  1. Fix the Ingress class name (only needed once):
kubectl patch ingress demo-web-ingress -n dev --type=json \
  -p '[{"op":"remove","path":"/metadata/annotations/kubernetes.io~1ingress.class"}, {"op":"add","path":"/spec/ingressClassName","value":"nginx"}]'
  2. Change the Rollout's image to a pullable public image, registry.k8s.io/echoserver:1.10:
kubectl -n dev patch rollout demo-web --type=json \
  -p '[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"registry.k8s.io/echoserver:1.10"}]'
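After either fix, re-check with the commands already used above (node IP and NodePort as observed earlier); the Rollout should become Healthy and the curl should return 200:

kubectl argo rollouts get rollout demo-web -n dev --watch
kubectl get pods -n dev -l app=demo-web
curl -H "Host: demo.example.com" http://192.168.101.100:31177/ -i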