Proxmox VE 5.4 to 6.2 Live Upgrade, Part 1
Posted on 2020/08/12(Wed) 12:05 in technical
Introduction
I have been running the cluster from my earlier post on building the Proxmox VE 5.1 - 3 nodes cluster on my home servers, keeping it updated along the way (it is currently on 5.4).
I used to think a live upgrade was impossible, but it turns out that if you migrate Corosync first and work through the steps carefully, you can upgrade while keeping the VMs running.
Proxmox VE 5.x reached EOL on 2020/7/31, so I belatedly went ahead with the upgrade to 6.x.
Part 1 covers upgrading Proxmox VE 5.4 to 6.2 without taking any VMs down; Part 2 will cover upgrading the Ceph cluster from Luminous (12.x) to Nautilus (14.x).
Environment
- Proxmox VE 5.4 - 3 nodes cluster
- Ceph cluster - 3 nodes cluster
- Hardware: TX1320 M2
- /dev/sda: system - 32GB SSD
- /dev/sdb: ceph data (bluestore) - 1000GB SSD
- eno1: bond0 - 1GbE
- eno2: bond0 - 1GbE
- bond0
- Untagged: Admin and VM network
- VLAN 200: WAN network (for the BBR; see the interfaces sketch after this list)
- Roughly 15 VMs
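For reference, the bond0 / VLAN part of that layout corresponds roughly to the /etc/network/interfaces sketch below. This is an illustration of the topology rather than a copy of my actual config; in particular the bond mode is an assumption, and the address shown is pve01's.
# Bond eno1 + eno2 (802.3ad here is an assumption; active-backup would also fit)
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
# Untagged admin / VM network on top of the bond
auto vmbr0
iface vmbr0 inet static
        address 192.168.122.26
        netmask 255.255.255.0
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
# VLAN 200 (WAN for the BBR), tagged on the bond and bridged for the guest
auto bond0.200
iface bond0.200 inet manual
auto vmbr200
iface vmbr200 inet manual
        bridge-ports bond0.200
        bridge-stp off
        bridge-fd 0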
Procedure
I worked through the official procedure at https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0.
Corosync upgrade
First, check the current state with the pre-upgrade check tool pve5to6.
If it reports errors around VM configuration and the like, fix them beforehand; since that depends entirely on the individual setup, I will skip those details here.
Detect "FAIL: corosync 2.x installed, cluster-wide upgrade to 3.x needed!"
Run this on every node.
# pve5to6
= CHECKING VERSION INFORMATION FOR PVE PACKAGES =
Checking for package updates..
PASS: all packages uptodate
Checking proxmox-ve package version..
PASS: proxmox-ve package has version >= 5.4-2
Checking running kernel version..
PASS: expected running kernel '4.15.18-30-pve'.
Checking for installed stock Debian Kernel..
PASS: Stock Debian kernel package not installed.
= CHECKING CLUSTER HEALTH/SETTINGS =
PASS: systemd unit 'pve-cluster.service' is in state 'active'
PASS: systemd unit 'corosync.service' is in state 'active'
PASS: Cluster Filesystem is quorate.
Analzying quorum settings and state..
INFO: configured votes - nodes: 3
INFO: configured votes - qdevice: 0
INFO: current expected votes: 3
INFO: current total votes: 3
Checking nodelist entries..
PASS: pve03: ring0_addr is configured to use IP address '192.168.122.28'
WARN: pve01: ring0_addr 'pve01' resolves to '192.168.122.26'.
Consider replacing it with the currently resolved IP address.
WARN: pve02: ring0_addr 'pve02' resolves to '192.168.122.27'.
Consider replacing it with the currently resolved IP address.
Checking totem settings..
PASS: Corosync transport set to implicit default.
PASS: Corosync encryption and authentication enabled.
INFO: run 'pvecm status' to get detailed cluster status..
= CHECKING INSTALLED COROSYNC VERSION =
FAIL: corosync 2.x installed, cluster-wide upgrade to 3.x needed!
= CHECKING HYPER-CONVERGED CEPH STATUS =
INFO: hyper-converged ceph setup detected!
INFO: getting Ceph status/health information..
WARN: Ceph health reported as 'HEALTH_WARN'.
Use the PVE dashboard or 'ceph -s' to determine the specific issues and try to resolve them.
INFO: getting Ceph OSD flags..
PASS: all PGs have been scrubbed at least once while running Ceph Luminous.
INFO: getting Ceph daemon versions..
PASS: single running version detected for daemon type monitor.
PASS: single running version detected for daemon type manager.
SKIP: no running instances detected for daemon type MDS.
PASS: single running version detected for daemon type OSD.
PASS: single running overall version detected for all Ceph daemon types.
WARN: 'noout' flag not set - recommended to prevent rebalancing during upgrades.
INFO: checking Ceph config..
WARN: No 'mon_host' entry found in ceph config.
It's recommended to add mon_host with all monitor addresses (without ports) to the global section.
PASS: 'ms_bind_ipv6' not enabled
WARN: [global] config section contains 'keyring' option, which will prevent services from starting with Nautilus.
Move 'keyring' option to [client] section instead.
= CHECKING CONFIGURED STORAGES =
storage 'www' is not online
PASS: storage 'rdb_vm' enabled and active.
PASS: storage 'local-lvm' enabled and active.
PASS: storage 'rdb_ct' enabled and active.
PASS: storage 'local' enabled and active.
WARN: storage 'www' enabled but not active!
= MISCELLANEOUS CHECKS =
INFO: Checking common daemon services..
PASS: systemd unit 'pveproxy.service' is in state 'active'
PASS: systemd unit 'pvedaemon.service' is in state 'active'
PASS: systemd unit 'pvestatd.service' is in state 'active'
INFO: Checking for running guests..
WARN: 7 running guest(s) detected - consider migrating or stopping them.
INFO: Checking if the local node's hostname 'pve01' is resolvable..
INFO: Checking if resolved IP is configured on local node..
PASS: Resolved node IP '192.168.122.26' configured and active on single interface.
INFO: Check node certificate's RSA key size
PASS: Certificate 'pve-root-ca.pem' passed Debian Busters security level for TLS connections (4096 >= 2048)
PASS: Certificate 'pve-ssl.pem' passed Debian Busters security level for TLS connections (2048 >= 2048)
PASS: Certificate 'pveproxy-ssl.pem' passed Debian Busters security level for TLS connections (2048 >= 2048)
INFO: Checking KVM nesting support, which breaks live migration for VMs using it..
PASS: KVM nested parameter set, but currently no VM with a 'vmx' or 'svm' flag is running.
INFO: Checking VMs with OVMF enabled and bad efidisk sizes...
PASS: No VMs with OVMF and problematic efidisk found.
= SUMMARY =
TOTAL: 39
PASSED: 29
SKIPPED: 1
WARNINGS: 8
FAILURES: 1
ATTENTION: Please check the output for detailed information!
Try to solve the problems one at a time and then run this checklist tool again.
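Most of the WARN lines above are dealt with later (the Ceph ones are the topic of the second half). The 'noout' warning, for example, is normally cleared by setting the flag on any Ceph node just before starting and removing it again once everything is done; these are standard Ceph commands rather than output captured from this run.
# ceph osd set noout
and, once the whole upgrade has finished:
# ceph osd unset noout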
Fix "FAIL: corosync 2.x installed, cluster-wide upgrade to 3.x needed!"
Run the following on all nodes.
# echo "deb http://download.proxmox.com/debian/corosync-3/ stretch main" > /etc/apt/sources.list.d/corosync3.list
# apt update
Hit:1 http://security.debian.org stretch/updates InRelease
Ign:2 http://ftp.jp.debian.org/debian stretch InRelease
Hit:3 http://ftp.jp.debian.org/debian stretch Release
Hit:5 http://download.proxmox.com/debian/ceph-luminous stretch InRelease
Get:6 http://download.proxmox.com/debian/corosync-3 stretch InRelease [1,977 B]
Hit:7 http://download.proxmox.com/debian stretch InRelease
Get:8 http://download.proxmox.com/debian/corosync-3 stretch/main amd64 Packages [38.0 kB]
Fetched 40.0 kB in 3s (10.8 kB/s)
Reading package lists... Done
Building dependency tree
Reading state information... Done
7 packages can be upgraded. Run 'apt list --upgradable' to see them.
# apt list --upgradeable
Listing... Done
corosync/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
libcmap4/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
libcorosync-common4/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
libcpg4/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
libqb0/stable 1.0.5-1~bpo9+2 amd64 [upgradable from: 1.0.3-1~bpo9]
libquorum5/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
libvotequorum8/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
# apt dist-upgrade --download-only -y
# pvecm status
Quorum information
------------------
Date: Mon Aug 10 13:24:45 2020
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1/796
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.122.26 (local)
0x00000002 1 192.168.122.27
0x00000003 1 192.168.122.28
Once every node is ready, stop the HA services, upgrade the packages, and then restart the services.
This time I executed the commands on all nodes at the same time (what RLogin calls simultaneous send).
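If your terminal emulator has no such broadcast feature, a rough equivalent is to fan the same commands out over SSH in parallel, along the lines of the sketch below; it assumes root SSH access between the nodes and is not what I actually ran.
for node in pve01 pve02 pve03; do
    ssh root@"$node" 'systemctl stop pve-ha-lrm pve-ha-crm && apt dist-upgrade -y' &
done
wait    # block until all three nodes have finished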
# systemctl stop pve-ha-lrm
# systemctl stop pve-ha-crm
# apt dist-upgrade -y
# pvecm status
Quorum information
------------------
Date: Mon Aug 10 13:25:29 2020
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.d
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.122.26 (local)
0x00000002 1 192.168.122.27
0x00000003 1 192.168.122.28
# systemctl start pve-ha-lrm
# systemctl start pve-ha-crm
# systemctl status pve-ha-lrm
● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-08-10 13:25:35 JST; 1min 3s ago
Process: 277005 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
Main PID: 277019 (pve-ha-lrm)
Tasks: 1 (limit: 4915)
Memory: 80.6M
CPU: 458ms
CGroup: /system.slice/pve-ha-lrm.service
└─277019 pve-ha-lrm
Aug 10 13:25:35 pve01 systemd[1]: Starting PVE Local HA Ressource Manager Daemon...
Aug 10 13:25:35 pve01 pve-ha-lrm[277019]: starting server
Aug 10 13:25:35 pve01 pve-ha-lrm[277019]: status change startup => wait_for_agent_lock
Aug 10 13:25:35 pve01 systemd[1]: Started PVE Local HA Ressource Manager Daemon.
# systemctl status pve-ha-crm
● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-08-10 13:25:35 JST; 1min 8s ago
Process: 276958 ExecStart=/usr/sbin/pve-ha-crm start (code=exited, status=0/SUCCESS)
Main PID: 277004 (pve-ha-crm)
Tasks: 1 (limit: 4915)
Memory: 80.8M
CPU: 454ms
CGroup: /system.slice/pve-ha-crm.service
└─277004 pve-ha-crm
Aug 10 13:25:34 pve01 systemd[1]: Starting PVE Cluster Ressource Manager Daemon...
Aug 10 13:25:35 pve01 systemd[1]: pve-ha-crm.service: PID file /var/run/pve-ha-crm.pid not readable (yet?) after start: No such file or directory
Aug 10 13:25:35 pve01 pve-ha-crm[277004]: starting server
Aug 10 13:25:35 pve01 pve-ha-crm[277004]: status change startup => wait_for_quorum
Aug 10 13:25:35 pve01 systemd[1]: Started PVE Cluster Ressource Manager Daemon.
# pvecm status
Quorum information
------------------
Date: Mon Aug 10 13:27:34 2020
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.d
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.122.26 (local)
0x00000002 1 192.168.122.27
0x00000003 1 192.168.122.28
Corosync status
# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-08-10 13:25:22 JST; 1h 0min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 276851 (corosync)
Tasks: 9 (limit: 4915)
Memory: 149.1M
CPU: 1min 629ms
CGroup: /system.slice/corosync.service
└─276851 /usr/sbin/corosync -f
Aug 10 13:25:24 pve01 corosync[276851]: [TOTEM ] A new membership (1.9) was formed. Members joined: 2
Aug 10 13:25:24 pve01 corosync[276851]: [QUORUM] This node is within the primary component and will provide service.
Aug 10 13:25:24 pve01 corosync[276851]: [QUORUM] Members[2]: 1 2
Aug 10 13:25:24 pve01 corosync[276851]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 13:25:25 pve01 corosync[276851]: [KNET ] rx: host: 3 link: 0 is up
Aug 10 13:25:25 pve01 corosync[276851]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 10 13:25:25 pve01 corosync[276851]: [TOTEM ] A new membership (1.d) was formed. Members joined: 3
Aug 10 13:25:25 pve01 corosync[276851]: [QUORUM] Members[3]: 1 2 3
Aug 10 13:25:25 pve01 corosync[276851]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 13:25:25 pve01 corosync[276851]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 8885
# corosync-cfgtool -s
Printing link status.
Local node ID 1
LINK ID 0
addr = 192.168.122.26
status:
nodeid 1: localhost
nodeid 2: connected
nodeid 3: connected
# corosync-cpgtool
Group Name PID Node ID
pve_kvstore_v1\x00
276820 1 (192.168.122.26)
644968 3 (192.168.122.28)
620591 2 (192.168.122.27)
pve_dcdb_v1\x00
276820 1 (192.168.122.26)
644968 3 (192.168.122.28)
620591 2 (192.168.122.27)
Clear "CHECKING INSTALLED COROSYNC VERSION"
Run pve5to6 again and confirm that the = CHECKING INSTALLED COROSYNC VERSION = section now reports PASS: corosync 3.x installed.
# pve5to6
= CHECKING VERSION INFORMATION FOR PVE PACKAGES =
Checking for package updates..
PASS: all packages uptodate
Checking proxmox-ve package version..
PASS: proxmox-ve package has version >= 5.4-2
Checking running kernel version..
PASS: expected running kernel '4.15.18-30-pve'.
Checking for installed stock Debian Kernel..
PASS: Stock Debian kernel package not installed.
= CHECKING CLUSTER HEALTH/SETTINGS =
PASS: systemd unit 'pve-cluster.service' is in state 'active'
PASS: systemd unit 'corosync.service' is in state 'active'
PASS: Cluster Filesystem is quorate.
Analzying quorum settings and state..
INFO: configured votes - nodes: 3
INFO: configured votes - qdevice: 0
INFO: current expected votes: 3
INFO: current total votes: 3
Checking nodelist entries..
PASS: pve03: ring0_addr is configured to use IP address '192.168.122.28'
WARN: pve01: ring0_addr 'pve01' resolves to '192.168.122.26'.
Consider replacing it with the currently resolved IP address.
WARN: pve02: ring0_addr 'pve02' resolves to '192.168.122.27'.
Consider replacing it with the currently resolved IP address.
Checking totem settings..
PASS: Corosync transport set to implicit default.
PASS: Corosync encryption and authentication enabled.
INFO: run 'pvecm status' to get detailed cluster status..
= CHECKING INSTALLED COROSYNC VERSION =
PASS: corosync 3.x installed.
= CHECKING HYPER-CONVERGED CEPH STATUS =
<snip>
Proxmox VE 6 packages upgrade
Once the Corosync upgrade has succeeded, you can upgrade the nodes to Proxmox VE 6 one at a time by evacuating the VMs onto the nodes that are not being upgraded.
The pve5to6 tool has already confirmed that the PVE 5.4 packages themselves are up to date.
# pve5to6
= CHECKING VERSION INFORMATION FOR PVE PACKAGES =
Checking for package updates..
PASS: all packages uptodate
Checking proxmox-ve package version..
PASS: proxmox-ve package has version >= 5.4-2
Checking running kernel version..
PASS: expected running kernel '4.15.18-30-pve'.
Checking for installed stock Debian Kernel..
PASS: Stock Debian kernel package not installed.
<snip>
Pre-upgrade preparation
Point apt on all nodes at Buster and pre-download the packages.
- I use the no-subscription repository, so that gets updated as well
- I use Ceph, so its repository gets updated too (this is only the Debian stretch to buster change, not the Luminous to Nautilus upgrade)
# sed -i 's/stretch/buster/g' /etc/apt/sources.list
# sed -i -e 's/stretch/buster/g' /etc/apt/sources.list.d/pve-no-subscription.list
# echo "deb http://download.proxmox.com/debian/ceph-luminous buster main" > /etc/apt/sources.list.d/ceph.list
# apt update
# apt dist-upgrade --download-only -y
Upgrading the first node
Move every running VM off the first node
Nothing special here, just the usual approach. Roughly like this:
# qm migrate 1012 pve02 --online
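If there are a lot of guests, a small loop saves some typing. This is only a sketch: it handles KVM guests via qm (containers would need pct instead), column 3 of qm list is the status field, and pve02 as the target is arbitrary.
for vmid in $(qm list | awk '$3 == "running" {print $1}'); do
    qm migrate "$vmid" pve02 --online
done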
Upgrade the first node to PVE 6
Upgrade from PVE 5.4 to PVE 6.2 (because of when I did this, it goes straight to 6.2 rather than 6.0).
# apt dist-upgrade
A notice like the following appears; press Enter to continue.
W: (pve-apt-hook) !! ATTENTION !!
W: (pve-apt-hook) You are attempting to upgrade from proxmox-ve '5.4-2' to proxmox-ve '6.2-1'. Please make sure to read the Upgrade notes at
W: (pve-apt-hook) https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0
W: (pve-apt-hook) before proceeding with this operation.
W: (pve-apt-hook)
W: (pve-apt-hook) Press enter to continue, or C^c to abort.
For /etc/issue I answered Y and took the package maintainer's version.
Configuration file '/etc/issue'
==> Modified (by you or by a script) since installation.
==> Package distributor has shipped an updated version.
What would you like to do about it ? Your options are:
Y or I : install the package maintainer's version
N or O : keep your currently-installed version
D : show the differences between the versions
Z : start a shell to examine the situation
The default action is to keep your current version.
*** issue (Y/I/N/O/D/Z) [default=N] ? Y
When the libc6 upgrade asked whether to restart services automatically, I chose Yes.
For /etc/systemd/timesyncd.conf I answered N and kept my version, since it contains the internal NTP server settings (see the sketch after the prompt below).
Configuration file '/etc/systemd/timesyncd.conf'
==> Modified (by you or by a script) since installation.
==> Package distributor has shipped an updated version.
What would you like to do about it ? Your options are:
Y or I : install the package maintainer's version
N or O : keep your currently-installed version
D : show the differences between the versions
Z : start a shell to examine the situation
The default action is to keep your current version.
*** timesyncd.conf (Y/I/N/O/D/Z) [default=N] ? N
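For reference, the kept timesyncd.conf does little more than point systemd-timesyncd at the internal NTP server, roughly like this, with a placeholder hostname:
[Time]
# internal NTP server (hostname is a placeholder)
NTP=ntp.internal.example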
/etc/apt/sources.list.d/pve-enterprise.list is not needed here, so I answered N as well.
Configuration file '/etc/apt/sources.list.d/pve-enterprise.list'
==> Deleted (by you or by a script) since installation.
==> Package distributor has shipped an updated version.
What would you like to do about it ? Your options are:
Y or I : install the package maintainer's version
N or O : keep your currently-installed version
D : show the differences between the versions
Z : start a shell to examine the situation
The default action is to keep your current version.
*** pve-enterprise.list (Y/I/N/O/D/Z) [default=N] ? N
Once the package upgrade has finished, reboot the node.
# reboot
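Once the node is back up, it is worth confirming that it now reports a 6.x stack before moving on; pveversion gives a quick summary of the installed PVE packages.
# pveversion -v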
A SEIL/x86 issue?
After upgrading the first node, when I live-migrated the SEIL/x86 VM from PVE 5.4 to 6.2, VRRP kept flapping between master->backup and backup->master for some reason, and the PPPoE connection registered in its watch-group went down.
Restarting the VRRP service fixed it, but I never figured out the root cause.
# vrrp lan1 disable
# vrrp lan1 enable
Upgrading the second and subsequent nodes
If the first node upgraded cleanly, the cluster now runs a mix of PVE 5.4 and 6.2, and live migration between the two versions becomes possible.
Just as with the first node, evacuate the VMs from each remaining node and then upgrade it (steps omitted).
Note that live migration from 5.4 to 6.2 worked without issue, whereas migrating back from 6.2 to 5.4 fails with a parameter mismatch error.
Whether a guest can be live-migrated at all should already be flagged under = MISCELLANEOUS CHECKS = in the pve5to6 script, so if you resolve those findings beforehand there should be no particular problem.
Post-upgrade cleanup
Remove the repository that was temporarily added for the Corosync 3.x upgrade.
# rm /etc/apt/sources.list.d/corosync3.list
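A final apt update refreshes the package index so nothing references the removed repository any more.
# apt update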
Wrapping up
Apart from the small SEIL/x86 hiccup, the upgrade from Proxmox VE 5.4 to 6.2 went through with almost no problems.
Next comes upgrading the Ceph cluster from Luminous to Nautilus, but that is a story for the second half.