Ceph Basics

1. Ceph components

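1.1 OSD (Object Storage Daemon)

Function: Ceph OSDs (the object storage daemon, ceph-osd) store the data. Each disk on a host normally runs one OSD daemon, which handles data replication, recovery, and rebalancing for the cluster, and reports monitoring information to the Ceph monitors and managers by checking the heartbeats of other OSD daemons. At least three Ceph OSDs are needed for redundancy and high availability.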

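1.2 Mon (Monitor): the Ceph monitor

Function: a daemon running on a host that maintains maps of the cluster state, such as how many pools the cluster has, how many PGs each pool has, and how pools map to PGs. A Ceph cluster needs at least one Mon, and the count is normally odd (1, 3, 5, 7...). The critical cluster state that Ceph daemons need in order to coordinate with each other consists of the monitor map, the manager map, the OSD map, the MDS map, and the CRUSH map.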

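1.3 Mgr (Manager)

Function: a daemon running on a host. The Ceph Manager daemon keeps track of runtime metrics and the current state of the cluster, including storage utilization, current performance metrics, and system load. It also hosts Python-based modules that manage and expose cluster information, including the web-based Ceph Dashboard and the REST API. At least two managers are needed for high availability.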

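2. Ceph data read/write flow

Map the file to objects and compute the oid (object id) = ino + ono:
- ino: inode number (INO), the serial number of the file's metadata and the file's unique id
- ono: object number (ONO), the sequence number of one object produced by splitting the file; by default a file is split into 4 MB objects
Use a hash to find the PG in the pool that the object belongs to:

Object --> PG mapping, computed with a consistent hash: hash(oid) & mask --> pgid

Use CRUSH to map the object to the OSDs of that PG:

PG --> OSD mapping, computed with the CRUSH algorithm: [CRUSH(pgid) -> (osd1, osd2, osd3)]

The primary OSD of the PG writes the object to disk.
The primary OSD replicates the data to the replica OSDs and waits for their acknowledgements.
The primary OSD reports the completed write back to the client.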
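
The first two mapping steps above can be sketched in a few lines of shell. This is only an illustration under stated assumptions: cksum (CRC-32) stands in for Ceph's real object-name hash, the inode number and the "ino.ono" naming are hypothetical, and pg_num is assumed to be a power of two so that the mask is pg_num - 1. The CRUSH step (PG to OSDs) is not modelled.

```shell
# Illustrative only: cksum (CRC-32) is a stand-in for Ceph's real hash.
OBJECT_SIZE=$((4 * 1024 * 1024))   # default 4 MiB object size

# Number of objects a file of the given size (in bytes) is split into.
object_count() {
    echo $(( ($1 + OBJECT_SIZE - 1) / OBJECT_SIZE ))
}

# Map an object id to a PG number: hash(oid) & mask, with mask = pg_num - 1.
pg_for_oid() {
    h=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( h & ($2 - 1) ))
}

# A 20 MiB file with a hypothetical inode number, in a pool with 32 PGs.
ino=10000000000
count=$(object_count $((20 * 1024 * 1024)))
ono=0
while [ "$ono" -lt "$count" ]; do
    oid=$(printf '%s.%08x' "$ino" "$ono")
    echo "$oid -> pg $(pg_for_oid "$oid" 32)"
    ono=$((ono + 1))
done
```

The loop prints one line per object; the PG numbers depend on the stand-in hash and will not match what a real cluster reports.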

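Notes:

Pool: a storage pool, i.e. a logical partition; its capacity depends on the underlying storage.

PG (placement group): a pool can contain many PGs. Both pools and PGs are abstract, logical concepts, and the number of PGs for a pool can be worked out with a formula.

OSD (Object Storage Daemon): each disk is one OSD, and a host provides one or more OSDs.

Once the cluster is deployed, a pool must be created before any data can be written. Before a file is stored, a consistent hash is computed to pick the PG it is saved in, so every file belongs to one PG of one pool and reaches the OSDs through that PG. An object is written to the primary OSD first and then replicated to the secondary OSDs for high availability.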
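
The notes above mention that the PG count can be computed with a formula but do not give it. A commonly quoted rule of thumb is total PGs = (number of OSDs x target PGs per OSD) / replica count, rounded up to a power of two. The sketch below assumes that rule, a target of 100 PGs per OSD, and 3 replicas; none of these values come from this document, and the result is a cluster-wide budget to be shared among all pools, not a per-pool value.

```shell
# Rule-of-thumb PG budget (an assumption, not taken from this walkthrough):
#   (OSDs * target PGs per OSD) / replicas, rounded up to a power of two.
pg_count() {
    osds=$1
    replicas=$2
    target=${3:-100}          # assumed target PGs per OSD
    raw=$(( osds * target / replicas ))
    pgs=1
    while [ "$pgs" -lt "$raw" ]; do
        pgs=$(( pgs * 2 ))
    done
    echo "$pgs"
}

# 20 OSDs with 3 replicas, as in the environment of this walkthrough.
pg_count 20 3
```

For 20 OSDs and 3 replicas the raw value is 666, which rounds up to 1024.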

3. Installation

3.1 Environment

IP	Hostname	CPU/Memory	System disk	Data disks
192.168.3.101/172.31.0.101	ceph-mon1-101	2c2g	50g
192.168.3.102/172.31.0.102	ceph-mon2-102	2c2g	50g
192.168.3.103/172.31.0.103	ceph-mon3-103	2c2g	50g
192.168.3.104/172.31.0.104	ceph-mgr1-104	2c2g	50g
192.168.3.105/172.31.0.105	ceph-mgr2-105	2c2g	50g
192.168.3.106/172.31.0.106	ceph-node1-106	2c2g	50g	20g*5
192.168.3.107/172.31.0.107	ceph-node2-107	2c2g	50g	20g*5
192.168.3.108/172.31.0.108	ceph-node3-108	2c2g	50g	20g*5
192.168.3.109/172.31.0.109	ceph-node4-109	2c2g	50g	20g*5
192.168.3.110/172.31.0.110	ceph-deploy-110	2c2g	50g

3.2 Environment preparation

Time synchronization must be configured on every server


sudo apt update
sudo apt install chrony -y
sudo vim /etc/chrony/chrony.conf
# Switch to the Alibaba Cloud NTP servers
# Public network
server ntp.aliyun.com minpoll 4 maxpoll 10 iburst
server ntp1.aliyun.com minpoll 4 maxpoll 10 iburst
server ntp2.aliyun.com minpoll 4 maxpoll 10 iburst
server ntp3.aliyun.com minpoll 4 maxpoll 10 iburst
server ntp4.aliyun.com minpoll 4 maxpoll 10 iburst
server ntp5.aliyun.com minpoll 4 maxpoll 10 iburst
server ntp6.aliyun.com minpoll 4 maxpoll 10 iburst
server ntp7.aliyun.com minpoll 4 maxpoll 10 iburst
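# Restart the service and enable it at boot
sudo systemctl restart chrony
sudo systemctl status chrony
sudo systemctl enable chrony
# Check whether the time sources are active
sudo chronyc activity
# Check the clock synchronization status
sudo timedatectl status
# Write the system time to the hardware clock
sudo hwclock -w
# Install Python 2.7
sudo apt install python2.7 -y
sudo ln -sv /usr/bin/python2.7 /usr/bin/python2

Configure the domestic (Tsinghua) Ceph mirror on every server


wget -q -O- 'https://mirrors.tuna.tsinghua.edu.cn/ceph/keys/release.asc' | sudo apt-key add -
# "sudo echo ... >> file" fails for non-root users because the redirect runs without sudo; tee -a avoids that
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ceph/debian-pacific bionic main" | sudo tee -a /etc/apt/sources.list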

3.3 Create a regular user

Create a regular user to run Ceph; it needs sudo privileges

groupadd -r -g 2022 ceph && useradd -r -m -s /bin/bash -u 2022 -g 2022 ceph && echo ceph:123456 | chpasswd
echo "ceph ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

Verify the user

3.4 Configure hosts resolution

root@ceph-mon1-101:~# cat >> /etc/hosts << EOF
172.31.0.110 ceph-deploy.example.local ceph-deploy ceph-deploy-110
172.31.0.101 ceph-mon1.example.local ceph-mon1 ceph-mon1-101
172.31.0.102 ceph-mon2.example.local ceph-mon2 ceph-mon2-102
172.31.0.103 ceph-mon3.example.local ceph-mon3 ceph-mon3-103
172.31.0.104 ceph-mgr1.example.local ceph-mgr1 ceph-mgr1-104
172.31.0.105 ceph-mgr2.example.local ceph-mgr2 ceph-mgr2-105
172.31.0.106 ceph-node1.example.local ceph-node1 ceph-node1-106
172.31.0.107 ceph-node2.example.local ceph-node2 ceph-node2-107
172.31.0.108 ceph-node3.example.local ceph-node3 ceph-node3-108
172.31.0.109 ceph-node4.example.local ceph-node4 ceph-node4-109
EOF

3.5 Generate an SSH key and distribute it to every Ceph node

root@deploy-110:~# su - ceph
ceph@deploy-110:~$ ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/ceph/.ssh/id_rsa): 
Created directory '/home/ceph/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/ceph/.ssh/id_rsa.
Your public key has been saved in /home/ceph/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:8L9ioi0SRA9jT/njmpLqJDc9cyGWHBB6V7EAJOeAfj8 ceph@ceph-mon1-101
The key's randomart image is:
+---[RSA 2048]----+
|+.*o..o.         |
|.*=.oo .         |
|ooo*o.o          |
| o.+oooo         |
| .. *...S        |
|  .o E.. .       |
|..oo+oo   .      |
|o.+.+=. o  .     |
|oo o.o.o ..      |
+----[SHA256]-----+
ceph@deploy-110:~$ cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
ceph@deploy-110:~$ for host in ceph-{deploy-110,mon1-101,mon2-102,mon3-103,mgr1-104,mgr2-105,node1-106,node2-107,node3-108,node4-109};do ssh-copy-id ceph@$host;done

3.6 Install the Ceph deployment tool: ceph-deploy

root@deploy-110:~# apt update
root@deploy-110:~# apt-cache madison ceph-deploy
ceph-deploy |      2.0.1 | https://mirrors.tuna.tsinghua.edu.cn/ceph/debian-pacific bionic/main amd64 Packages
ceph-deploy |      2.0.1 | https://mirrors.tuna.tsinghua.edu.cn/ceph/debian-pacific bionic/main i386 Packages
ceph-deploy | 1.5.38-0ubuntu1 | https://mirrors.bfsu.edu.cn/ubuntu bionic/universe amd64 Packages
ceph-deploy | 1.5.38-0ubuntu1 | https://mirrors.bfsu.edu.cn/ubuntu bionic/universe i386 Packages

root@deploy-110:~# apt install ceph-deploy

3.7 Initialize the cluster

root@deploy-110:~# su - ceph
ceph@deploy-110:~$ mkdir ceph-cluster && cd ceph-cluster

ceph@deploy-110:~/ceph-cluster$ ceph-deploy --help
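new: start deploying a new Ceph storage cluster and generate the CLUSTER.conf configuration file and the keyring authentication file.
install: install the Ceph packages on remote hosts; --release selects the version to install.
rgw: manage RGW daemons (RADOSGW, the object storage gateway).
mgr: manage MGR daemons (ceph-mgr, the Ceph Manager daemon).
mds: manage MDS daemons (Ceph Metadata Server).
mon: manage MON daemons (ceph-mon, the Ceph monitor).
gatherkeys: fetch the authentication keys used to provision new nodes; these keys are used when new MON/OSD/MDS nodes join the cluster.
disk: manage disks on remote hosts.
osd: prepare data disks on remote hosts, i.e. add the given disk of the given host to the cluster as an OSD.
repo: manage repositories on remote hosts.
admin: push the cluster configuration file and the client.admin keyring to remote hosts.
config: push ceph.conf to remote hosts or copy it back from them.
uninstall: remove the Ceph packages from remote hosts.
purgedata: delete the Ceph data under /var/lib/ceph; this also deletes the contents of /etc/ceph.
purge: remove the packages and all data from remote hosts.
forgetkeys: delete all authentication keyrings from the local host, including the client.admin, monitor, and bootstrap keyrings.
pkg: manage packages on remote hosts.
calamari: install and configure a Calamari web node; Calamari is a web monitoring platform.

# Note: the host name given here must match the node's actual hostname, otherwise the command fails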
ceph@deploy-110:~/ceph-cluster$ ceph-deploy new --cluster-network 172.31.0.0/24 --public-network 192.168.3.0/24 ceph-mon1-101

ceph@deploy-110:~$ cat ceph-cluster/ceph.conf 
[global]
fsid = af684849-6e4a-4932-8533-e230542e961c
public_network = 192.168.3.0/24
cluster_network = 172.31.0.0/24
mon_initial_members = ceph-mon1
mon_host = 192.168.3.101
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

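3.8 Initialize the mon node

# Install the ceph-mon package on every mon node
root@ceph-mon1-101:~# apt install ceph-mon -y

# After ceph-deploy is installed, the ceph user is reset automatically and its home directory changes, so the SSH key has to be distributed again and the system rebooted
# Initialize the mon node from the ceph-deploy node
ceph@deploy-110:~$  cd ceph-cluster/
ceph@deploy-110:~/ceph-cluster$ ceph-deploy mon create-initial
# After initialization the ceph user's home directory moves to /var/lib/ceph, so the configuration files have to be copied there
ceph@ceph-deploy-110:~$ mkdir /var/lib/ceph/ceph-cluster
ceph@ceph-deploy-110:~$ cd /home/ceph/
ceph@ceph-deploy-110:/home/ceph$ cp * /var/lib/ceph/ceph-cluster/ -r

# Verify the mon node
root@ceph-mon1-101:~# ps -ef|grep ceph-mon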
root@ceph-mon1-101:~# ps -ef|grep ceph-mon

3.9 Install the Ceph admin client on the deploy node and on all node hosts

root@ceph-deploy:~# apt install ceph-common

3.10 Push the authentication files and initialize the ceph-node hosts

ceph@ceph-deploy-110:~$ cd /var/lib/ceph/ceph-cluster/
ceph-deploy admin ceph-node1-106 ceph-node2-107 ceph-node3-108 ceph-node4-109 ceph-deploy-110

# The keyring file permissions have to be adjusted on every node
root@ceph-node1-106:~# setfacl -m u:ceph:rw /etc/ceph/ceph.client.admin.keyring
root@ceph-node2-107:~# setfacl -m u:ceph:rw /etc/ceph/ceph.client.admin.keyring
root@ceph-node3-108:~# setfacl -m u:ceph:rw /etc/ceph/ceph.client.admin.keyring
root@ceph-node4-109:~# setfacl -m u:ceph:rw /etc/ceph/ceph.client.admin.keyring
root@ceph-deploy-110:~# setfacl -m u:ceph:rw /etc/ceph/ceph.client.admin.keyring

# The cluster status is now visible from the deploy node
ceph@ceph-deploy-110:~$ ceph -s
  cluster:
    id:     e8f03660-3996-49cd-b71e-95367b94ff2e
    health: HEALTH_WARN
            mon is allowing insecure global_id reclaim

  services:
    mon: 1 daemons, quorum ceph-mon1-101 (age 6m)
    mgr: no daemons active
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

3.11 Install and configure the manager nodes

Ceph Luminous and later releases have manager nodes; earlier releases do not.

3.11.1 Install ceph-mgr on every mgr node

root@ceph-mgr1-104:~# apt install ceph-mgr
root@ceph-mgr2-105:~# apt install ceph-mgr

3.11.2 Create the mgr node from the deploy node

root@ceph-deploy-110:~# su - ceph
ceph@ceph-deploy-110:~$ cd ceph-cluster/
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy mgr create ceph-mgr1-104

ceph@ceph-deploy-110:~/ceph-cluster$ ceph -s
  cluster:
    id:     e8f03660-3996-49cd-b71e-95367b94ff2e
    health: HEALTH_WARN
            mon is allowing insecure global_id reclaim

  services:
    mon: 1 daemons, quorum ceph-mon1-101 (age 10m)
    mgr: ceph-mgr1-104(active, since 1.49134s)              # one mgr node is now active
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     

root@ceph-mgr1-104:~# ps -ef|grep mgr
ceph      9814     1  2 14:07 ?        00:00:04 /usr/bin/ceph-mgr -f --cluster ceph --id ceph-mgr1-104 --setuser ceph --setgroup ceph
root      9939  4088  0 14:10 pts/0    00:00:00 grep --color=auto mgr

3.12 Add and initialize the node hosts

3.12.1 Initialize the node hosts

ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy install --no-adjust-repos --nogpgcheck ceph-node1-106
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy install --no-adjust-repos --nogpgcheck ceph-node2-107
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy install --no-adjust-repos --nogpgcheck ceph-node3-108
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy install --no-adjust-repos --nogpgcheck ceph-node4-109
# Disks cannot be zapped on a node that has not been initialized

3.12.2 List the disks on a Ceph node:

root@ceph-deploy-110:~# su - ceph
ceph@ceph-deploy-110:~$ cd ceph-cluster/
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy disk list ceph-node1-106

3.12.3 Initialize the disks

# Take care not to wipe the system disk
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy disk zap ceph-node1-106 /dev/sdb
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy disk zap ceph-node1-106 /dev/sdc
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy disk zap ceph-node1-106 /dev/sdd
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy disk zap ceph-node1-106 /dev/sde
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy disk zap ceph-node1-106 /dev/sdf
... continue wiping the remaining disks
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy disk zap ceph-node4-109 /dev/sde
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy disk zap ceph-node4-109 /dev/sdf
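
Typing twenty zap commands by hand is error-prone, so the full list can be generated with a loop. The sketch below is a dry run: it only prints the commands and executes nothing. The node names and disk letters are taken from the environment table in section 3.1 and are assumptions about your layout; check the printed list (in particular that /dev/sda never appears) before running any of it.

```shell
# Dry run: print the zap command for every data disk on every node.
# Nothing is executed here; review the list first.
zap_commands() {
    for node in ceph-node1-106 ceph-node2-107 ceph-node3-108 ceph-node4-109; do
        for disk in sdb sdc sdd sde sdf; do
            echo "ceph-deploy disk zap $node /dev/$disk"
        done
    done
}

zap_commands
```

Once the output has been reviewed, it could be run from the ceph-cluster directory on the deploy node, for example by piping it to sh.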

3.12.4 Disable the security warning

# Disable the insecure global_id reclaim warning:
    mon is allowing insecure global_id reclaim
ceph@ceph-deploy-110:~/ceph-cluster$ ceph config set mon auth_allow_insecure_global_id_reclaim false
ceph@ceph-deploy-110:~/ceph-cluster$ ceph -s
# After this change, ceph -s no longer shows the insecure global_id reclaim warning

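3.12.5 Add OSDs

# How the stored data is split:
data: the object data stored by Ceph
block-db: the RocksDB data, i.e. the metadata
block-wal: the write-ahead log (WAL) of the database

# In most cases the data and the log do not need to be specified separately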
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy osd create ceph-node1-106 --data /dev/sdb
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy osd create ceph-node1-106 --data /dev/sdc
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy osd create ceph-node1-106 --data /dev/sdd
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy osd create ceph-node1-106 --data /dev/sde
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy osd create ceph-node1-106 --data /dev/sdf
# Continue adding the remaining OSDs
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy osd create ceph-node4-109 --data /dev/sde
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy osd create ceph-node4-109 --data /dev/sdf

# Verify
ceph@ceph-deploy-110:~/ceph-cluster$ ceph -s
  cluster:
    id:     e8f03660-3996-49cd-b71e-95367b94ff2e
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum ceph-mon1-101 (age 68m)
    mgr: ceph-mgr1-104(active, since 58m)
    osd: 20 osds: 20 up (since 19s), 20 in (since 29s)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   152 MiB used, 400 GiB / 400 GiB avail
    pgs:     1 active+clean

ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy disk list ceph-node1-106
……
[ceph-node1-106][INFO  ] Running command: sudo fdisk -l
[ceph-node1-106][INFO  ] Disk /dev/sda: 50 GiB, 53687091200 bytes, 104857600 sectors
[ceph-node1-106][INFO  ] Disk /dev/sdb: 20 GiB, 21474836480 bytes, 41943040 sectors
[ceph-node1-106][INFO  ] Disk /dev/sdc: 20 GiB, 21474836480 bytes, 41943040 sectors
[ceph-node1-106][INFO  ] Disk /dev/sde: 20 GiB, 21474836480 bytes, 41943040 sectors
[ceph-node1-106][INFO  ] Disk /dev/sdd: 20 GiB, 21474836480 bytes, 41943040 sectors
[ceph-node1-106][INFO  ] Disk /dev/sdf: 20 GiB, 21474836480 bytes, 41943040 sectors
[ceph-node1-106][INFO  ] Disk /dev/mapper/ceph--d9eb0280--64bb--4eb0--bd1b--2b695aa21333-osd--block--7f20b02b--0a60--4db9--be9a--da184090b1d9: 20 GiB, 21470642176 bytes, 41934848 sectors
[ceph-node1-106][INFO  ] Disk /dev/mapper/ceph--691480f4--302f--40a3--b3a7--d2fad2b9e9f6-osd--block--5177856b--7703--4d57--ac6a--dd20e5d02e0a: 20 GiB, 21470642176 bytes, 41934848 sectors
[ceph-node1-106][INFO  ] Disk /dev/mapper/ceph--4ed74a7e--8859--4714--b124--fc68c32e102e-osd--block--3de90a0c--baec--4638--9b52--ddc37718e26a: 20 GiB, 21470642176 bytes, 41934848 sectors
[ceph-node1-106][INFO  ] Disk /dev/mapper/ceph--5b7245ba--58a6--4372--b928--57f555232d04-osd--block--22cb4619--9a99--4398--aadd--6b837a15fe00: 20 GiB, 21470642176 bytes, 41934848 sectors
[ceph-node1-106][INFO  ] Disk /dev/mapper/ceph--807f9ec7--8b9d--4dfd--b697--fcd38aad9b37-osd--block--e0d8b07e--dd03--4c33--88ee--3faaa374b623: 20 GiB, 21470642176 bytes, 41934848 sectors

# OSD ids are allocated starting from 0, in the order the disks were added
root@ceph-node1-106:~# ps -ef |grep ceph
root         800       1  0 15:15 ?        00:00:00 /usr/bin/python3.6 /usr/bin/ceph-crash
ceph        4533       1  0 15:34 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
ceph        6335       1  0 15:35 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
ceph        8151       1  0 15:35 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
ceph        9943       1  0 15:35 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph       11750       1  0 15:36 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 4 --setuser ceph --setgroup ceph
root       12480    1103  0 15:38 pts/0    00:00:00 grep --color=auto ceph
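
Because ids are handed out in creation order, the host and disk behind an OSD id can be estimated with simple arithmetic. The sketch below assumes that the OSDs were created strictly node by node and disk by disk (sdb to sdf, five per node) and that no id was ever reused; under that assumption id = node_index * 5 + disk_index. It is only a convenience, and ceph osd tree remains the authoritative source for the host.

```shell
# Assumption: OSDs were created node by node, disk by disk (sdb..sdf),
# and no id was reused, so id = node_index * 5 + disk_index.
osd_location() {
    node_index=$(( $1 / 5 + 1 ))
    disk_index=$(( $1 % 5 + 1 ))
    node=$(echo "ceph-node1-106 ceph-node2-107 ceph-node3-108 ceph-node4-109" | cut -d' ' -f"$node_index")
    disk=$(echo "sdb sdc sdd sde sdf" | cut -d' ' -f"$disk_index")
    echo "$node /dev/$disk"
}

osd_location 0     # first disk of the first node
osd_location 15    # first disk of the fourth node
```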

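4. Maintenance

4.1 Remove an OSD from RADOS

An OSD in a Ceph cluster is a dedicated daemon process on a node and corresponds to one physical disk. When an OSD device fails, or the administrator needs to remove a particular OSD for management reasons, the daemon must be stopped first and the device removed afterwards. On Luminous and later releases the commands are:

1. Mark the device out: ceph osd out {osd-num}
2. Stop the daemon: sudo systemctl stop ceph-osd@{osd-num}
3. Remove the device: ceph osd purge {id} --yes-i-really-mean-it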
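
The three steps can be wrapped in a small helper that prints the commands for a given OSD id. This is a dry-run sketch, not part of the original procedure: the function name is hypothetical and nothing is executed. Note that the first and third commands run on a node with the admin keyring, while the systemctl command has to run on the host that carries the OSD.

```shell
# Dry run: print the Luminous-and-later removal sequence for one OSD id.
osd_removal_plan() {
    echo "ceph osd out $1"
    echo "sudo systemctl stop ceph-osd@$1"
    echo "ceph osd purge $1 --yes-i-really-mean-it"
}

osd_removal_plan 15
```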

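If ceph.conf contains configuration entries for that OSD, the administrator has to delete them manually after removing the OSD.

On releases before Luminous, the administrator instead has to run the following steps by hand, in order:

1. Remove the device from the CRUSH map: ceph osd crush remove {name}
2. Remove the OSD's authentication key: ceph auth del osd.{osd-num}
3. Finally remove the OSD itself: ceph osd rm {osd-num}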

ceph@ceph-deploy-110:~/ceph-cluster$ ceph  osd  out 15
osd.15 is already out. 
ceph@ceph-deploy-110:~/ceph-cluster$ ceph -s
  cluster:
    id:     e8f03660-3996-49cd-b71e-95367b94ff2e
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum ceph-mon1-101 (age 4h)
    mgr: ceph-mgr1-104(active, since 4h)
    osd: 20 osds: 20 up (since 3h), 19 in (since 20s)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   145 MiB used, 380 GiB / 380 GiB avail
    pgs:     1 active+clean
root@ceph-node4-109:~# sudo systemctl stop ceph-osd@15
ceph@ceph-deploy-110:~/ceph-cluster$ ceph -s
  cluster:
    id:     e8f03660-3996-49cd-b71e-95367b94ff2e
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum ceph-mon1-101 (age 4h)
    mgr: ceph-mgr1-104(active, since 4h)
    osd: 20 osds: 19 up (since 13s), 19 in (since 74s)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   146 MiB used, 380 GiB / 380 GiB avail
    pgs:     1 active+clean
ceph@ceph-deploy-110:~/ceph-cluster$ ceph  osd  rm 15
removed osd.15
ceph@ceph-deploy-110:~/ceph-cluster$ ceph -s
  cluster:
    id:     e8f03660-3996-49cd-b71e-95367b94ff2e
    health: HEALTH_WARN
            1 osds exist in the crush map but not in the osdmap

  services:
    mon: 1 daemons, quorum ceph-mon1-101 (age 4h)
    mgr: ceph-mgr1-104(active, since 4h)
    osd: 19 osds: 19 up (since 40s), 19 in (since 101s)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   147 MiB used, 380 GiB / 380 GiB avail
    pgs:     1 active+clean
ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd crush remove osd.15
ceph@ceph-deploy-110:~/ceph-cluster$ ceph auth del osd.15

4.2 Test uploading and downloading data

To read or write data, a client first connects to a pool in the RADOS cluster; the object is then located from its name by the relevant CRUSH rule. To test data access, first create a test pool named mypool with 32 PGs.

4.2.1 Create the pool

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd pool create mypool 32 32      # create a pool named mypool with 32 PGs and 32 PGPs
pool 'mypool' created

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd pool ls   # or:
ceph@ceph-deploy-110:~/ceph-cluster$ rados lspools
device_health_metrics
mypool

4.2.2 Verify the PG and PGP mapping

ceph@ceph-deploy-110:~/ceph-cluster$ ceph pg ls-by-pool mypool|awk '{print $1,$2,$15}'
PG OBJECTS ACTING
2.0 0 [8,10,3]p8            # the replicas of PG 2.0 are on OSDs 8, 10 and 3, with osd.8 as the primary
2.1 0 [15,0,13]p15
2.2 0 [5,1,15]p5
...

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                STATUS  REWEIGHT  PRI-AFF
-1         0.38971  root default                                      
-3         0.09743      host ceph-node1-106                           
 0    hdd  0.01949          osd.0                up   1.00000  1.00000
 1    hdd  0.01949          osd.1                up   1.00000  1.00000
 2    hdd  0.01949          osd.2                up   1.00000  1.00000
 3    hdd  0.01949          osd.3                up   1.00000  1.00000
 4    hdd  0.01949          osd.4                up   1.00000  1.00000
-5         0.09743      host ceph-node2-107                           
 5    hdd  0.01949          osd.5                up   1.00000  1.00000
 6    hdd  0.01949          osd.6                up   1.00000  1.00000
...

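4.2.3 Test file access

The cluster does not yet provide block storage or a file system, and no object storage client is set up, but the rados command can already read and write objects in the cluster:

4.2.3.1 Use dd to create a file larger than 4 MB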

ceph@ceph-deploy-110:~/ceph-cluster$ dd if=/dev/zero of=./test count=2 bs=10M

4.2.3.2 Upload the test file to mypool

ceph@ceph-deploy-110:~/ceph-cluster$ sudo rados put msg1 /var/lib/ceph/ceph-cluster/test --pool=mypool       # upload the test file to mypool with the object id msg1

4.2.3.3 List the objects

ceph@ceph-deploy-110:~/ceph-cluster$ sudo rados ls --pool=mypool
msg1

4.2.3.4 Show object details

The ceph osd map command shows where an object is placed in a pool:

ceph@ceph-deploy-110:~/ceph-cluster$ ceph  osd  map  mypool  msg1
osdmap e122 pool 'mypool' (2) object 'msg1' -> pg 2.c833d430 (2.10) -> up ([15,13,0], p15) acting ([15,13,0], p15)
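# The object is in the pool with id 2. c833d430 is the hash of the object name, and 2.10 is the PG id, i.e. PG 10 (hexadecimal) of pool 2. The up set is OSDs 15, 13 and 0 with osd.15 as the primary, and the acting set is the same three OSDs. Three OSDs mean the data is kept as three replicas; the CRUSH algorithm decides which OSDs hold them.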
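
The PG id in that output can be reproduced by hand from the hash(oid) & mask rule of section 2. The sketch below copies the hash c833d430 from the output above and assumes pg_num is a power of two (mypool has 32 PGs), so the mask is pg_num - 1; the function name is hypothetical.

```shell
# Values copied from the "ceph osd map mypool msg1" output.
name_hash=0xc833d430   # hash of the object name msg1
pg_num=32              # mypool was created with 32 PGs

# pg_num is a power of two, so the mask is pg_num - 1.
pg_hex() {
    printf '%x\n' $(( $1 & ($2 - 1) ))
}

echo "pg 2.$(pg_hex "$name_hash" "$pg_num")"   # prints: pg 2.10
```

The low five bits of 0xc833d430 are 0x10, which is 16 in decimal; Ceph prints PG ids in hexadecimal, hence 2.10.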

4.2.3.5 Download the file

ceph@ceph-deploy-110:~/ceph-cluster$ sudo rados get msg1 --pool=mypool /tmp/test.txt

4.2.3.6 Modify the file

# i.e. upload the file again
ceph@ceph-deploy-110:~/ceph-cluster$ sudo rados put msg1 /etc/passwd --pool=mypool
ceph@ceph-deploy-110:~/ceph-cluster$ sudo rados get msg1 --pool=mypool /tmp/pwd.txt
ceph@ceph-deploy-110:~/ceph-cluster$ ls -lh /tmp/*.txt
-rw-r--r-- 1 root root 1.7K Aug 24 15:32 /tmp/pwd.txt
-rw-r--r-- 1 root root  20M Aug 24 15:26 /tmp/test.txt

4.2.3.7 Delete the file

ceph@ceph-deploy-110:~/ceph-cluster$ sudo rados rm msg1 --pool=mypool
ceph@ceph-deploy-110:~/ceph-cluster$ sudo rados ls --pool=mypool

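4.3 Scale out the cluster for high availability

4.3.1 Add ceph-mon nodes

ceph-mon has built-in leader election for high availability, and the number of mon nodes is normally odd. To add monitors, first install ceph-mon on the new nodes.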
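
The preference for an odd number of monitors follows from majority voting: the monitors can only form a quorum while more than half of them are reachable. The arithmetic below is a minimal sketch of that rule (the function names are hypothetical); it shows that going from 3 to 4 monitors does not increase the number of failures the cluster can survive.

```shell
# Monitors needed for a majority (quorum) out of n monitors.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Monitors that may fail while a majority still survives.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
    echo "mons=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```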

root@ceph-mon2-102:~# apt install ceph-mon
root@ceph-mon3-103:~# apt install ceph-mon

Add the mon2 and mon3 nodes from the deploy node

ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy mon add ceph-mon2-102
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy mon add ceph-mon3-103
ceph@ceph-deploy-110:~/ceph-cluster$ ceph -s
  cluster:
    id:     e8f03660-3996-49cd-b71e-95367b94ff2e
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-mon1-101,ceph-mon2-102,ceph-mon3-103 (age 5s)
    mgr: ceph-mgr1-104(active, since 6h)
    osd: 20 osds: 20 up (since 76m), 20 in (since 76m)

  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   282 MiB used, 400 GiB / 400 GiB avail
    pgs:     33 active+clean

4.3.2 Verify the ceph-mon nodes

ceph@ceph-deploy-110:~/ceph-cluster$ ceph quorum_status --format json-pretty

4.4 Add mgr nodes

4.4.1 Install ceph-mgr on the mgr node

root@ceph-mgr2-105:~# apt install ceph-mgr

4.4.2 Add the ceph-mgr node

ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy mgr create ceph-mgr2-105
ceph@ceph-deploy-110:~/ceph-cluster$ ceph -s
  cluster:
    id:     e8f03660-3996-49cd-b71e-95367b94ff2e
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-mon1-101,ceph-mon2-102,ceph-mon3-103 (age 11m)
    mgr: ceph-mgr1-104(active, since 6h), standbys: ceph-mgr2-105           # active/standby mode
    osd: 20 osds: 20 up (since 88m), 20 in (since 88m)

  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   282 MiB used, 400 GiB / 400 GiB avail
    pgs:     33 active+clean

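5. Basic use of the Ceph cluster

5.1 Block device: RBD

RBD (RADOS Block Device) is Ceph's block storage. RBD talks to the OSDs through the librbd library and provides a high-performance, highly scalable storage backend for virtualization technologies such as KVM and for cloud platforms such as OpenStack and CloudStack, which integrate with RBD through libvirt and QEMU. A client that uses librbd can use the RADOS cluster as a block device. A pool used for RBD must first have the rbd application enabled and then be initialized. The commands below create a pool named myrbd1, enable rbd on it, and initialize it:

5.1.1 Create the RBD pool

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd pool create myrbd1 64 64  # create the pool with the given pg and pgp counts; pgp controls how the PGs are grouped for placement and is normally equal to pg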
pool 'myrbd1' created

5.1.2 Enable the block storage application

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd pool application enable myrbd1 rbd
enabled application 'rbd' on pool 'myrbd1'

5.1.3 Initialize the block storage pool

ceph@ceph-deploy-110:~/ceph-cluster$ rbd pool init -p myrbd1  # initialize the pool with the rbd command

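5.1.4 Create and verify images

An rbd pool cannot be used as a block device directly. Images have to be created in it as needed, and each image is then used as a block device. The rbd command creates, lists, and deletes images, and also handles management operations such as cloning images, creating snapshots, rolling an image back to a snapshot, and listing snapshots. The commands below create the images myimg1 and myimg2:

ceph@ceph-deploy-110:~/ceph-cluster$ rbd create myimg1 --size 5G --pool myrbd1
ceph@ceph-deploy-110:~/ceph-cluster$ rbd create myimg2 --size 3G --pool myrbd1 --image-format 2 --image-feature layering   # myimg2 is created with explicit options; without them the defaults are used. The CentOS kernel is too old to map an image created with the default features, so only a subset of the features is enabled here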

ceph@ceph-deploy-110:~/ceph-cluster$ rbd --image myimg1 --pool myrbd1 info
rbd image 'myimg1':
    size 5 GiB in 1280 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 392db7036fa3
    block_name_prefix: rbd_data.392db7036fa3
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    op_features: 
    flags: 
    create_timestamp: Tue Aug 24 16:12:34 2021
    access_timestamp: Tue Aug 24 16:12:34 2021
    modify_timestamp: Tue Aug 24 16:12:34 2021

ceph@ceph-deploy-110:~/ceph-cluster$ rbd --image myimg2 --pool myrbd1 info
rbd image 'myimg2':
    size 3 GiB in 768 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 85529377417f
    block_name_prefix: rbd_data.85529377417f
    format: 2
    features: layering
    op_features: 
    flags: 
    create_timestamp: Tue Aug 24 16:14:30 2021
    access_timestamp: Tue Aug 24 16:14:30 2021
    modify_timestamp: Tue Aug 24 16:14:30 2021
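
The object counts in the two info outputs follow directly from the image size and the order: order 22 means 2^22 bytes, i.e. 4 MiB per object. A small sketch of that arithmetic, with a hypothetical helper name:

```shell
# Objects in an image = image size / object size, where object size = 2^order bytes.
rbd_object_count() {
    echo $(( ($1 << 30) >> $2 ))   # $1 = size in GiB, $2 = order
}

rbd_object_count 5 22   # myimg1: 1280 objects
rbd_object_count 3 22   # myimg2: 768 objects
```

These objects are allocated lazily, so a freshly created image does not yet consume that much space in the pool.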

5.2 Use block storage from a client

5.2.1 Check the current Ceph status

ceph@ceph-deploy-110:~/ceph-cluster$ ceph df                    # similar to the df command
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    400 GiB  400 GiB  231 MiB   231 MiB       0.06
TOTAL  400 GiB  400 GiB  231 MiB   231 MiB       0.06

--- POOLS ---
POOL                   ID  PGS  STORED  OBJECTS    USED  %USED  MAX AVAIL
device_health_metrics   1    1     0 B        0     0 B      0    126 GiB
mypool                  2   32     0 B        0     0 B      0    126 GiB
myrbd1                  3   64   405 B        7  48 KiB      0    126 GiB

5.2.2 Use block storage on CentOS

5.2.2.1 Configure the yum repositories

[root@ubuntu-client-200 ~]# yum install epel-release
[root@ubuntu-client-200 ~]# yum install https://mirrors.aliyun.com/ceph/rpm-octopus/el7/noarch/ceph-release-1-1.el7.noarch.rpm

5.2.2.2 Install ceph-common

[root@ubuntu-client-200 ~]# yum install ceph-common

5.2.2.3 Copy the authentication files from the deploy server

root@ceph-deploy-110:~# su - ceph
ceph@ceph-deploy-110:~$ cd ceph-cluster/
ceph@ceph-deploy-110:~/ceph-cluster$ scp ceph.conf ceph.client.admin.keyring root@192.168.3.200:/etc/ceph

5.2.2.4 Map the images on the client

[root@ubuntu-client-200 ~]# rbd -p myrbd1 map myimg2
/dev/rbd0

[root@ubuntu-client-200 ~]# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda               8:0    0   50G  0 disk 
├─sda1            8:1    0    1G  0 part /boot
└─sda2            8:2    0   49G  0 part 
  ├─centos-root 253:0    0   47G  0 lvm  /
  └─centos-swap 253:1    0    2G  0 lvm  [SWAP]
sr0              11:0    1 1024M  0 rom  
rbd0            252:0    0    3G  0 disk 
[root@ubuntu-client-200 ~]# fdisk -l
Disk /dev/rbd0: 3221 MB, 3221225472 bytes, 6291456 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4194304 bytes / 4194304 bytes

# Mapping myimg1 fails because CentOS does not support some of its features, so disable them as the error message suggests and map again
[root@ubuntu-client-200 ~]# rbd -p myrbd1 map myimg1
rbd: sysfs write failed
RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable myrbd1/myimg1 object-map fast-diff deep-flatten".
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (6) No such device or address
# Disable the features as suggested
[root@ubuntu-client-200 ~]# rbd feature disable myrbd1/myimg1 object-map fast-diff deep-flatten
# Map again
[root@ubuntu-client-200 ~]# rbd -p myrbd1 map myimg1
/dev/rbd1

5.2.2.5 Format and mount

[root@ubuntu-client-200 ~]# mkfs.ext4 /dev/rbd0
[root@ubuntu-client-200 ~]# mkdir /data/mysql -p
[root@ubuntu-client-200 ~]# mount /dev/rbd0 /data/mysql/
[root@ubuntu-client-200 ~]# df -TH
Filesystem              Type      Size  Used Avail Use% Mounted on
devtmpfs                devtmpfs  952M     0  952M   0% /dev
tmpfs                   tmpfs     964M     0  964M   0% /dev/shm
tmpfs                   tmpfs     964M  9.3M  955M   1% /run
tmpfs                   tmpfs     964M     0  964M   0% /sys/fs/cgroup
/dev/mapper/centos-root xfs        51G  1.8G   49G   4% /
/dev/sda1               xfs       1.1G  157M  907M  15% /boot
tmpfs                   tmpfs     193M     0  193M   0% /run/user/0
/dev/rbd0               ext4      3.2G  9.5M  3.0G   1% /data/mysql

5.2.2.6 Write test data from the client

[root@ubuntu-client-200 ~]# dd if=/dev/zero of=/data/mysql/test.data bs=1MB count=300
300+0 records in
300+0 records out
300000000 bytes (300 MB) copied, 0.335979 s, 893 MB/s

5.2.2.7 Verify on the Ceph side

ceph@ceph-deploy-110:~/ceph-cluster$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    400 GiB  396 GiB  4.1 GiB   4.1 GiB       1.03
TOTAL  400 GiB  396 GiB  4.1 GiB   4.1 GiB       1.03

--- POOLS ---
POOL                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
device_health_metrics   1    1      0 B        0      0 B      0    125 GiB
mypool                  2   32      0 B        0      0 B      0    125 GiB
myrbd1                  3   64  352 MiB      102  1.0 GiB   0.27    125 GiB
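
The gap between STORED and USED for myrbd1 is what replication would predict. The sketch below assumes a replicated pool with size 3 (the pool size is not shown in this walkthrough, but the three-OSD acting sets in section 4.2 are consistent with it) and ignores allocation overhead.

```shell
# Raw usage of a replicated pool is roughly STORED * replica count.
raw_used_mib() {
    echo $(( $1 * $2 ))   # $1 = STORED in MiB, $2 = replica count (assumed 3)
}

raw_used_mib 352 3   # 1056 MiB, roughly the 1.0 GiB shown in the USED column
```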

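5.2.3 Ceph radosgw (RGW) object storage

RGW provides a REST interface; clients talk to it over HTTP to create, delete, update, and query data.
radosgw is used where Ceph data has to be accessed through a RESTful API, so it does not need to be enabled when only RBD block storage or CephFS is used.

5.2.3.1 Deploy the radosgw service

If radosgw is needed, set up one of the mgr nodes as the radosgw host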

root@ceph-mgr1-104:~# apt-cache madison radosgw
root@ceph-mgr1-104:~# apt install radosgw

ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy --overwrite-conf rgw create ceph-mgr1  # create the rgw on mgr1 and overwrite its configuration
...
[ceph_deploy.rgw][INFO  ] The Ceph Object Gateway (RGW) is now running on host ceph-mgr1 and default port 7480

5.2.3.2 Verify the radosgw service

# Open in a browser
http://192.168.3.104:7480/                      # IP of mgr1

8.1 Deploy the MDS service:

To use CephFS, the MDS service has to be deployed.

# Install ceph-mds on the MDS node
root@ceph-mon1-101:~# apt install ceph-mds
# Add the MDS node from the admin node
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy mds create ceph-mon1

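8.2 Create the CephFS metadata and data pools:

Before CephFS can be used, a file system has to be created in the cluster, with one pool assigned for metadata and one for data. The steps below create a test file system named mycephfs that uses cephfs-metadata as the metadata pool and cephfs-data as the data pool: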

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd pool create cephfs-metadata 32 32
pool 'cephfs-metadata' created
ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd pool create cephfs-data 64 64
pool 'cephfs-data' created

8.3 Create the CephFS file system and verify it:

ceph@ceph-deploy-110:~/ceph-cluster$ ceph fs new mycephfs cephfs-metadata cephfs-data
new fs with metadata pool 2 and data pool 3
ceph@ceph-deploy-110:~/ceph-cluster$ ceph fs ls
name: mycephfs, metadata pool: cephfs-metadata, data pools: [cephfs-data ]
# CephFS status
ceph@ceph-deploy-110:~/ceph-cluster$ ceph fs status mycephfs
mycephfs - 0 clients
========
RANK  STATE      MDS        ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  ceph-mon1  Reqs:    0 /s    10     13     12      0   
      POOL         TYPE     USED  AVAIL  
cephfs-metadata  metadata  96.0k   126G  
  cephfs-data      data       0    126G  
MDS version: ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

8.4 Verify the CephFS service status:

ceph@ceph-deploy-110:~/ceph-cluster$ ceph mds stat
mycephfs-1/1/1 up  {0=ceph-mon1=up:active}

8.5 Create a client account:

# Create the account
ceph@ceph-deploy-110:~/ceph-cluster$ ceph auth add client.yanyan mon 'allow r' mds 'allow rw' osd 'allow rwx pool=cephfs-data'
# Verify the account
ceph@ceph-deploy-110:~/ceph-cluster$ ceph auth get client.yanyan
exported keyring for client.yanyan
[client.yanyan]
    key = AQBl67Jh8M1iFBAAASh+5+GSKa+YDZ3CxR+Qxg==
    caps mds = "allow rw"
    caps mon = "allow r"
    caps osd = "allow rwx pool=cephfs-data"

# Create the keyring file
ceph@ceph-deploy-110:~/ceph-cluster$ ceph auth get client.yanyan -o ceph.client.yanyan.keyring
exported keyring for client.yanyan

# Create the key file:
ceph@ceph-deploy-110:~/ceph-cluster$ ceph auth print-key client.yanyan > yanyan.key

# Verify the user's keyring file
ceph@ceph-deploy-110:~/ceph-cluster$ cat ceph.client.yanyan.keyring
[client.yanyan]
    key = AQBl67Jh8M1iFBAAASh+5+GSKa+YDZ3CxR+Qxg==
    caps mds = "allow rw"
    caps mon = "allow r"
    caps osd = "allow rwx pool=cephfs-data"

8.6 Install ceph-common on the client:

[root@centos-client ~]# yum install epel-release
[root@centos-client ~]# yum install ceph-common -y

8.7 Copy the client authentication files:

ceph@ceph-deploy-110:~/ceph-cluster$ scp ceph.conf ceph.client.yanyan.keyring yanyan.key root@192.168.3.201:/etc/ceph/
[root@centos-client ~]# ls /etc/ceph/
ceph.client.yanyan.keyring  ceph.conf  rbdmap  yanyan.key

8.8 Verify the client's permissions:

[root@centos-client ceph]# ceph --user yanyan -s
  cluster:
    id:     82bad486-fa02-4454-a6b7-c6fce6701eb2
    health: HEALTH_WARN
            too few PGs per OSD (14 < min 30)

  services:
    mon: 3 daemons, quorum ceph-mon1-101,ceph-mon2-102,ceph-mon3-103
    mgr: ceph-mgr1-104(active), standbys: ceph-mgr2-105
    mds: mycephfs-1/1/1 up  {0=ceph-mon1=up:active}
    osd: 20 osds: 20 up, 20 in

  data:
    pools:   2 pools, 96 pgs
    objects: 21 objects, 2.19KiB
    usage:   20.2GiB used, 380GiB / 400GiB avail
    pgs:     96 active+clean

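8.9 Mount CephFS in kernel space:

A client can mount CephFS in two ways: in kernel space or in user space. A kernel-space mount needs the ceph kernel module, and a user-space mount needs ceph-fuse installed.

8.9.1 Mount on the client with a key file: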

[root@centos-client ~]# mkdir /data/cephfs -p
[root@centos-client ceph]# mount -t ceph 192.168.3.101:6789,192.168.3.102:6789,192.168.3.103:6789:/ /data/cephfs -o name=yanyan,secretfile=/etc/ceph/yanyan.key
[root@centos-client ceph]# df -TH
Filesystem                                                 Type      Size  Used Avail Use% Mounted on
devtmpfs                                                   devtmpfs  952M     0  952M   0% /dev
tmpfs                                                      tmpfs     964M     0  964M   0% /dev/shm
tmpfs                                                      tmpfs     964M  9.4M  955M   1% /run
tmpfs                                                      tmpfs     964M     0  964M   0% /sys/fs/cgroup
/dev/mapper/centos_centos--template-root                   xfs        51G  2.4G   49G   5% /
/dev/sda1                                                  xfs       1.1G  190M  875M  18% /boot
tmpfs                                                      tmpfs     193M     0  193M   0% /run/user/0
192.168.3.101:6789,192.168.3.102:6789,192.168.3.103:6789:/ ceph      136G     0  136G   0% /data/cephfs

# Verify that data can be written
[root@centos-client ~]# cp /etc/passwd /data/cephfs/
[root@centos-client cephfs]# dd if=/dev/zero of=/data/cephfs//testfile  bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.0919487 s, 1.1 GB/s
[root@centos-client ~]# ls /data/cephfs/
passwd  testfile

8.9.2 Mount on the client using a secret:

# Install ceph-common on the client (Ubuntu is used here)
root@ubuntu-template:~# sudo wget -q -O- 'https://mirrors.tuna.tsinghua.edu.cn/ceph/keys/release.asc' | sudo apt-key add -
root@ubuntu-template:~# sudo echo "deb https://mirrors.tuna.tsinghua.edu.cn/ceph/debian-pacific bionic main" >> /etc/apt/sources.list
root@ubuntu-template:~# apt install ceph-common -y
# Sync the authentication files to the client server
ceph@ceph-deploy-110:~/ceph-cluster$ scp ceph.conf ceph.client.yanyan.keyring yanyan.key root@192.168.3.200:/etc/ceph/
# Get the secret and mount
root@ubuntu-template:/etc/ceph# tail yanyan.key 
AQBtEbNhcAVLCxAA9zbnwCpo9MkVuDbBmHGFhw==
root@ubuntu-template:/etc/ceph# mkdir /data/cephfs -p
root@ubuntu-template:/etc/ceph# mount -t ceph 192.168.3.101:6789,192.168.3.102:6789,192.168.3.103:6789:/ /data/cephfs -o name=yanyan,secret=AQBtEbNhcAVLCxAA9zbnwCpo9MkVuDbBmHGFhw==
root@ubuntu-template:/etc/ceph# ls /data/cephfs/
passwd  testfile                                                # the files created on 192.168.3.201 are visible
root@ubuntu-template:/etc/ceph# df -TH
Filesystem                                                 Type      Size  Used Avail Use% Mounted on
udev                                                       devtmpfs  1.1G     0  1.1G   0% /dev
tmpfs                                                      tmpfs     210M  5.9M  204M   3% /run
/dev/mapper/ubuntu--template--vg-root                      ext4       31G  3.3G   26G  12% /
tmpfs                                                      tmpfs     1.1G     0  1.1G   0% /dev/shm
tmpfs                                                      tmpfs     5.3M     0  5.3M   0% /run/lock
tmpfs                                                      tmpfs     1.1G     0  1.1G   0% /sys/fs/cgroup
tmpfs                                                      tmpfs     210M     0  210M   0% /run/user/0
192.168.3.101:6789,192.168.3.102:6789,192.168.3.103:6789:/ ceph      136G  105M  136G   1% /data/cephfs

8.9.3 Mount at boot

root@ubuntu-template:~# cat /etc/fstab
# Prefer secretfile over an inline secret when mounting
192.168.3.101:6789,192.168.3.102:6789,192.168.3.103:6789:/ /data/cephfs ceph defaults,name=yanyan,secretfile=/etc/ceph/yanyan.key,_netdev 0 0
root@ubuntu-template:~# mount -a
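The mount source and the fstab entry follow a fixed pattern: a comma-separated list of monitor address:port pairs, then ":" and the path inside CephFS. A small Python sketch that assembles the entry from its parts; the helper names are hypothetical and the values are the ones used in this walkthrough:

```python
def cephfs_source(monitors, path="/", port=6789):
    """Build the 'mon1:port,mon2:port,...:/path' device string."""
    return ",".join(f"{m}:{port}" for m in monitors) + ":" + path


def fstab_entry(monitors, mount_point, user, secretfile):
    """Build an /etc/fstab line for a kernel-space CephFS mount."""
    options = f"defaults,name={user},secretfile={secretfile},_netdev"
    return f"{cephfs_source(monitors)} {mount_point} ceph {options} 0 0"


mons = ["192.168.3.101", "192.168.3.102", "192.168.3.103"]
print(fstab_entry(mons, "/data/cephfs", "yanyan", "/etc/ceph/yanyan.key"))
```

Generating the line this way keeps the monitor list in one place if the same entry has to be rolled out to several clients.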

8.9.4 Client kernel module:

The client kernel loads the ceph.ko module to mount the CephFS file system.

root@ubuntu-template:~# lsmod|grep ceph
ceph                  380928  1
libceph               315392  1 ceph
fscache                65536  1 ceph
libcrc32c              16384  2 raid456,libceph
root@ubuntu-template:~# modinfo ceph
filename:       /lib/modules/4.15.0-166-generic/kernel/fs/ceph/ceph.ko
license:        GPL
description:    Ceph filesystem for Linux
author:         Patience Warnick <patience@newdream.net>
author:         Yehuda Sadeh <yehuda@hq.newdream.net>
author:         Sage Weil <sage@newdream.net>
alias:          fs-ceph
srcversion:     6FDE92B51C2FCEF58EC349D
depends:        libceph,fscache
retpoline:      Y
intree:         Y
name:           ceph
vermagic:       4.15.0-166-generic SMP mod_unload modversions 
signat:         PKCS#7
signer:         
sig_key:        
sig_hashalgo:   md4

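8.10 Mount CephFS from user space:

If the kernel version is too old to include the ceph module, ceph-fuse can be installed and used for mounting instead. Mounting with the kernel module is still the recommended approach.

8.10.1 Install ceph-fuse:

# The EPEL and Ceph repositories are required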
[root@centos-client ~]# yum install epel-release -y
[root@centos-client ~]# yum install https://mirrors.aliyun.com/ceph/rpm-octopus/el7/noarch/ceph-release-1-1.el7.noarch.rpm -y
[root@centos-client ~]# sed -i "s#http://download.ceph.com#https://mirrors.tuna.tsinghua.edu.cn/ceph#g" /etc/yum.repos.d/ceph.repo
[root@centos-client ~]# sed -i "s#https://download.ceph.com#https://mirrors.tuna.tsinghua.edu.cn/ceph#g" /etc/yum.repos.d/ceph.repo
# Install ceph-fuse and ceph-common
[root@centos-client ~]# yum install ceph-fuse ceph-common -y

8.10.2 Mount with ceph-fuse:

# Sync the authentication and configuration files:
ceph@ceph-deploy-110:~/ceph-cluster$ scp ceph.conf ceph.client.yanyan.keyring yanyan.key root@192.168.3.201:/etc/ceph/
# Mount CephFS with ceph-fuse
[root@centos-client ~]# mkdir /data/cephfs -p
[root@centos-client ~]# ceph-fuse --name client.yanyan -m 192.168.3.101:6789,192.168.3.102:6789,192.168.3.103:6789 /data/cephfs/
2021-12-08T14:20:16.910+0800 7f4bf3f7df40 -1 init, newargv = 0x558c413c1ed0 newargc=9
ceph-fuse[1732]: starting ceph client
ceph-fuse[1732]: starting fuse
# Verify the mount
[root@centos-client ~]# df -TH
Filesystem                               Type            Size  Used Avail Use% Mounted on
devtmpfs                                 devtmpfs        952M     0  952M   0% /dev
tmpfs                                    tmpfs           964M     0  964M   0% /dev/shm
tmpfs                                    tmpfs           964M  9.4M  955M   1% /run
tmpfs                                    tmpfs           964M     0  964M   0% /sys/fs/cgroup
/dev/mapper/centos_centos--template-root xfs              51G  2.2G   49G   5% /
/dev/sda1                                xfs             1.1G  158M  907M  15% /boot
tmpfs                                    tmpfs           193M     0  193M   0% /run/user/0
ceph-fuse                                fuse.ceph-fuse  136G  105M  136G   1% /data/cephfs
# Verify the data
[root@centos-client ~]# cat /data/cephfs/passwd
[root@centos-client ~]# dd if=/dev/zero of=/data/cephfs/fuse.data bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.749596 s, 140 MB/s
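The throughput figures that dd prints for the kernel mount (8.9.1) and the ceph-fuse mount can be recomputed from the byte counts and elapsed times in the two transcripts. A short Python sketch; note that both runs write through the page cache, so the kernel-mount figure in particular probably reflects buffered writes rather than cluster speed (an assumption, since neither run used dd's oflag=direct or conv=fsync):

```python
def throughput_mb_s(num_bytes, seconds):
    """Throughput in decimal megabytes per second, the unit dd reports."""
    return num_bytes / seconds / 1e6


kernel = throughput_mb_s(104857600, 0.0919487)   # kernel-space mount run
fuse = throughput_mb_s(104857600, 0.749596)      # ceph-fuse mount run
print(f"kernel mount: {kernel:.0f} MB/s, ceph-fuse: {fuse:.0f} MB/s")
```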

8.10.3 Mount with FUSE at boot

[root@centos-client ~]# vi /etc/fstab
none /data/cephfs fuse.ceph ceph.id=yanyan,ceph.conf=/etc/ceph/ceph.conf,_netdev,defaults 0 0
[root@centos-client ~]# mount -a

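8.11 Ceph MDS high availability:

The Ceph MDS (metadata service) is the access entry point for CephFS, so it needs both high performance and redundancy. Suppose 4 MDS daemons are started and 2 ranks are configured: 2 of the daemons are assigned to the two ranks, and the remaining 2 act as standbys for them.
https://docs.ceph.com/en/latest/cephfs/add-remove-mds/

Each rank can be given a standby MDS, so that if the rank's current MDS fails, service switches to another MDS immediately.
There are several ways to configure standbys; the commonly used options are as follows.

mds_standby_replay: true or false. true enables replay mode, in which the data of the active MDS is synchronized to the standby MDS in real time, so the standby can take over quickly if the active MDS goes down. With false, data is synchronized only at failure time, which causes a period of interruption.
mds_standby_for_name: makes the current MDS daemon a standby only for the MDS with the given name.
mds_standby_for_rank: makes the current MDS daemon a standby only for the given rank, usually specified as the rank number. When multiple CephFS file systems exist, mds_standby_for_fscid can additionally be used to target a specific file system.
mds_standby_for_fscid: specifies the CephFS file system ID and works together with mds_standby_for_rank. If mds_standby_for_rank is set, the daemon stands by for that rank of the given file system; if it is not set, the daemon stands by for all ranks of the given file system.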
Note: these mds_standby_* options were declared obsolete in Nautilus (14.x). On newer releases, such as the 16.2.7 cluster shown in this section, standby-replay is enabled per file system with: ceph fs set <fs_name> allow_standby_replay true

8.11.1 Current MDS service status:

# Current MDS service status
ceph@ceph-deploy-110:~/ceph-cluster$ ceph mds stat
mycephfs:1 {0=ceph-mon1=up:active}

8.11.2 Add MDS servers:

# Install the ceph-mds package on the MDS servers
root@ceph-mon2-102:~# apt install ceph-mds -y
root@ceph-mon3-103:~# apt install ceph-mds -y
# Add the MDS servers with ceph-deploy
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy mds create ceph-mon2
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy mds create ceph-mon3
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy mds create ceph-node1
ceph@ceph-deploy-110:~/ceph-cluster$ ceph mds stat
mycephfs:1 {0=ceph-mon1=up:active} 3 up:standby                                 # newly added servers automatically become standbys
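For monitoring scripts it can be handy to turn the one-line output of ceph mds stat into structured data. The following Python sketch assumes the line format shown in this section ("fs:N {rank=daemon=state,...} M up:standby"); other Ceph releases print different formats, and `parse_mds_stat` is a hypothetical helper:

```python
import re


def parse_mds_stat(line):
    """Return (fs_name, {rank: (daemon, state)}, standby_count)."""
    fs_name = line.split(":", 1)[0]
    ranks = {}
    body = re.search(r"\{(.*?)\}", line).group(1)
    for item in body.split(","):
        # Each item looks like "0=ceph-mon1=up:active".
        rank, daemon, state = item.split("=", 2)
        ranks[int(rank)] = (daemon, state)
    standby = re.search(r"(\d+) up:standby", line)
    return fs_name, ranks, int(standby.group(1)) if standby else 0


line = "mycephfs:1 {0=ceph-mon1=up:active} 3 up:standby"
print(parse_mds_stat(line))
```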

8.11.3 Verify the current state of the Ceph cluster:

# One MDS server is currently active and three MDS servers are standby.
ceph@ceph-deploy-110:~/ceph-cluster$ ceph fs status
mycephfs - 2 clients
========
RANK  STATE      MDS        ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  ceph-mon1  Reqs:    0 /s    13     16     12      2   
      POOL         TYPE     USED  AVAIL  
cephfs-metadata  metadata   304k   126G  
  cephfs-data      data     600M   126G  
STANDBY MDS  
 ceph-mon2   
 ceph-mon3 
 ceph-node1
MDS version: ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

8.11.4 Current file system state:

ceph@ceph-deploy-110:~/ceph-cluster$ ceph fs get mycephfs

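8.11.5 Set the number of active MDS daemons:

There are currently four MDS servers, one active and three standby. The layout can be improved by switching to two active and two standby.

ceph@ceph-deploy-110:~/ceph-cluster$ ceph fs set mycephfs max_mds 2                     # set the maximum number of simultaneously active MDS daemons to 2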

ceph@ceph-deploy-110:~/ceph-cluster$ ceph fs status
mycephfs - 0 clients
========
RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  ceph-mon1   Reqs:    0 /s    10     13     12      0   
 1    active  ceph-node1  Reqs:    0 /s    10     13     12      0   
      POOL         TYPE     USED  AVAIL  
cephfs-metadata  metadata  96.0k   126G  
  cephfs-data      data       0    126G  
STANDBY MDS  
 ceph-mon2   
 ceph-mon3   
MDS version: ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

ceph@ceph-deploy-110:~/ceph-cluster$ ceph fs get mycephfs
Filesystem 'mycephfs' (1)
fs_name mycephfs
epoch   11
flags   12
created 2021-12-16T13:57:42.445054+0800
modified    2021-12-16T13:59:08.116455+0800
tableserver 0
root    0
session_timeout 60
session_autoclose   300
max_file_size   1099511627776
required_client_features    {}
last_failure    0
last_failure_osd_epoch  0
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 2
in  0,1
up  {0=44179,1=44218}
failed  
damaged 
stopped 
data_pools  [3]
metadata_pool   2
inline_data disabled
balancer    
standby_count_wanted    1
[mds.ceph-mon1{0:44179} state up:active seq 6 addr [v2:192.168.3.101:6800/1172137117,v1:192.168.3.101:6801/1172137117] compat {c=[1],r=[1],i=[7ff]}]
[mds.ceph-node1{1:44218} state up:active seq 6 addr [v2:192.168.3.106:6820/3402331747,v1:192.168.3.106:6821/3402331747] compat {c=[1],r=[1],i=[7ff]}]

8.11.6 MDS high-availability tuning:

Currently ceph-mon1 and ceph-node1 are active, while ceph-mon2 and ceph-mon3 are standby. ceph-mon2 can now be made the standby for ceph-mon1 and ceph-mon3 the standby for ceph-node1, so that each active MDS has a fixed standby. Modify the configuration file as follows:

ceph@ceph-deploy-110:~/ceph-cluster$ vi ceph.conf
[global]
fsid = 872d69a3-fa66-4e4f-887c-6430bfa2c086
public_network = 192.168.3.0/24
cluster_network = 172.16.3.0/24
mon_initial_members = ceph-mon1-101
mon_host = 192.168.3.101
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

mon clock drift allowed = 2
mon clock drift warn backoff = 30

[mds.ceph-mon2]
#mds_standby_for_fscid = mycephfs
mds_standby_for_name = ceph-mon1                                                          # ceph-mon2 is the standby for ceph-mon1
mds_standby_replay = true

[mds.ceph-mon3]
mds_standby_for_name = ceph-node1                                                         # ceph-mon3 is the standby for ceph-node1
mds_standby_replay = true

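8.11.7 Distribute the configuration file and restart the MDS services:

# Distribute the configuration file so that it takes effect when each MDS service restarts (pushing to the standby nodes alone is enough)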
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy --overwrite-conf config push ceph-mon1
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy --overwrite-conf config push ceph-mon2
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy --overwrite-conf config push ceph-mon3
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy --overwrite-conf config push ceph-node1
# Restart the MDS service on each MDS server
root@ceph-mon1-101:~# systemctl restart ceph-mds@ceph-mon1.service
root@ceph-mon2-102:~# systemctl restart ceph-mds@ceph-mon2.service
root@ceph-mon3-103:~# systemctl restart ceph-mds@ceph-mon3.service 
root@ceph-node1-106:~# systemctl restart ceph-mds@ceph-node1.service

8.11.8 MDS high-availability state of the Ceph cluster:

ceph@ceph-deploy-110:~/ceph-cluster$ ceph fs status
mycephfs - 0 clients
========
RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  ceph-mon1   Reqs:    0 /s    10     13     12      0   
 1    active  ceph-node1  Reqs:    0 /s    10     13     11      0   
      POOL         TYPE     USED  AVAIL  
cephfs-metadata  metadata   168k   126G  
  cephfs-data      data       0    126G  
STANDBY MDS  
 ceph-mon2   
 ceph-mon3   
MDS version: ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

# View the active/standby mapping:
ceph@ceph-deploy-110:~/ceph-cluster$ ceph fs get mycephfs
Filesystem 'mycephfs' (1)
fs_name mycephfs
epoch   11
flags   12
created 2021-12-16T13:57:42.445054+0800
modified    2021-12-16T13:59:08.116455+0800
tableserver 0
root    0
session_timeout 60
session_autoclose   300
max_file_size   1099511627776
required_client_features    {}
last_failure    0
last_failure_osd_epoch  0
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 2
in  0,1
up  {0=44179,1=44218}
failed  
damaged 
stopped 
data_pools  [3]
metadata_pool   2
inline_data disabled
balancer    
standby_count_wanted    1
[mds.ceph-mon1{0:44179} state up:active seq 6 addr [v2:192.168.3.101:6800/1172137117,v1:192.168.3.101:6801/1172137117] compat {c=[1],r=[1],i=[7ff]}]
[mds.ceph-node1{1:44218} state up:active seq 6 addr [v2:192.168.3.106:6820/3402331747,v1:192.168.3.106:6821/3402331747] compat {c=[1],r=[1],i=[7ff]}]

8.11.9 Restart the active MDS and check the standby takeover

# Restart the MDS service on mon1 and check the failover
ceph@ceph-deploy-110:~/ceph-cluster$ ceph mds stat
mycephfs:2 {0=ceph-mon1=up:active,1=ceph-node1=up:active} 2 up:standby              # mon1 is currently active
root@ceph-mon1-101:~# systemctl restart ceph-mds@ceph-mon1.service                  # restart the MDS service on mon1
ceph@ceph-deploy-110:~/ceph-cluster$ ceph fs status
mycephfs - 0 clients
========
RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS  
 0    active  ceph-mon2   Reqs:    0 /s    10     13     12      0                  # mon2 has taken over rank 0 from mon1
 1    active  ceph-node1  Reqs:    0 /s    10     13     11      0   
      POOL         TYPE     USED  AVAIL  
cephfs-metadata  metadata   168k   126G  
  cephfs-data      data       0    126G  
STANDBY MDS  
 ceph-mon3   
 ceph-mon1   
MDS version: ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

8.12 Export CephFS as NFS through Ganesha:

Ganesha shares CephFS over the NFS protocol.

https://www.server-world.info/en/note?os=Ubuntu_20.04&p=ceph15&f=8

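8.12.1 Server-side configuration:

# From the deploy node, copy the Ceph configuration files to the Ganesha server; otherwise the Ganesha service cannot start
ceph@ceph-deploy-110:~/ceph-cluster$ scp ceph.client.admin.keyring  ceph.conf  root@172.16.3.104:/etc/ceph
# Install and configure Ganesha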
root@ceph-mgr1-104:~# apt install nfs-ganesha-ceph
root@ceph-mgr1-104:~# vi /etc/ganesha/ganesha.conf
# create new
NFS_CORE_PARAM {
    # disable NLM
    Enable_NLM = false;
    # disable RQUOTA (not supported on CephFS)
    Enable_RQUOTA = false;
    # NFS protocol
    Protocols = 4;
}

EXPORT_DEFAULTS {
    # default access mode
    Access_Type = RW;
}

EXPORT {
    # uniq ID
    Export_Id = 1;
    # mount path of CephFS
    Path = "/";
    FSAL {
        name = CEPH;
        # hostname or IP address of this Node
        hostname="172.16.3.104";
    }
    # setting for root Squash
    Squash="No_root_squash";
    # NFSv4 Pseudo path
    Pseudo="/magedu";
    # allowed security options
    SecType = "sys";
}
LOG {
    # default log level
    Default_Log_Level = WARN;
}

# Restart the service after modifying the configuration file
root@ceph-mgr1-104:~# systemctl restart nfs-ganesha

8.12.2 Client mount test:

root@ubuntu-template:~# mkdir /data
root@ubuntu-template:~# mount -t nfs 172.16.3.104:/magedu /data
mount.nfs: access denied by server while mounting 172.16.3.104:/magedu

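9 Using the RadosGW object store:

http://docs.ceph.org.cn/radosgw/
An object is the basic unit of data storage in an object storage system. Each object combines data with a set of data attributes, and those attributes can be set according to application needs, including data distribution and quality of service. Each object maintains its own attributes, which simplifies management of the storage system. Objects can vary in size. Object storage is a data storage method without a hierarchy and is commonly used in cloud computing environments. Unlike other storage methods, object-based storage does not use a directory tree:

Data is stored as individual objects
Data is not placed in a directory hierarchy; it lives at the same level in a flat address space
Applications identify each individual data object by a unique address
Each object can carry metadata that helps with retrieval
It is designed to be accessed through an API at the application level (rather than the user level)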

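9.1 Introduction to RadosGW object storage:

RadosGW is one implementation of object storage (OSS, Object Storage Service). The RADOS gateway, also called the Ceph Object Gateway, RADOSGW, or RGW, is a service that lets clients access a Ceph cluster through standard object storage APIs; it supports the AWS S3 and Swift APIs. RGW runs on top of librados. Since Ceph 0.80 it has used the Civetweb web server to answer API requests, and nginx or Apache can be used instead. Clients talk to RGW over HTTP/HTTPS through a RESTful API, while RGW talks to the Ceph cluster through librados. An RGW client authenticates as an RGW user through the S3 or Swift API, and the gateway then authenticates to Ceph storage with cephx on the user's behalf.

S3 was launched by Amazon in 2006; the name stands for Simple Storage Service. S3 defined object storage and is its de facto standard. In a sense S3 is object storage and object storage is S3: it dominates the object storage market, and later object stores have all imitated it.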

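9.2 Characteristics of object storage:

Object storage stores data as objects; each object contains its own metadata in addition to the data.
Objects are retrieved by Object ID. They cannot be accessed directly by file path and file name as in an ordinary file system, only through the API or through third-party clients (which are themselves wrappers around the API).
Objects are not organized into a directory tree but stored in a flat namespace. Amazon S3 calls this flat namespace a bucket, while Swift calls it a container.
Neither buckets nor containers can be nested.
A bucket can be accessed only after authorization; one account can be authorized on multiple buckets, with different permissions on each.
It scales out easily and retrieves data quickly.
It does not support client-side mounting, and the client must specify the file name when accessing data.
It is not well suited to workloads that modify or delete files very frequently.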
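The flat namespace can be illustrated with a plain dictionary: keys merely look like paths, and a "directory listing" is just a prefix filter over those keys. A minimal Python sketch; the object keys are made up for illustration (only the kibana key mirrors the upload later in this document), and `list_prefix` is a hypothetical helper, not an S3 or RGW API:

```python
# A bucket modeled as a flat key -> data mapping; "/" has no special meaning.
objects = {
    "elk/deb/kibana-4.6.6-amd64.deb": b"...",
    "elk/rpm/kibana.rpm": b"...",          # hypothetical key
    "readme.txt": b"...",                  # hypothetical key
}


def list_prefix(store, prefix="", delimiter="/"):
    """Emulate a directory listing over a flat key space."""
    entries = set()
    for key in store:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # Keep only the first path component after the prefix.
        head, sep, _ = rest.partition(delimiter)
        entries.add(prefix + head + sep)
    return sorted(entries)


print(list_prefix(objects))
print(list_prefix(objects, "elk/"))
```

Nothing is nested in the store itself; the apparent folders exist only in how the listing groups keys.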

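Ceph uses buckets as storage containers (storage space) to store object data and isolate users from one another. Data is stored in buckets and user permissions are granted per bucket, so a user can be given different permissions on different buckets to implement access control.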

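9.2.1 Bucket properties:

A bucket (storage space) is the container used to store objects; every object must belong to a bucket. Bucket attributes can be set and modified to control region, access permissions, lifecycle, and so on. These settings apply directly to all objects in the bucket, so different management needs can be met by creating different buckets.
The inside of a bucket is flat; there is no file-system notion of directories, and all objects belong directly to their bucket.
Each user can own multiple buckets.
A bucket name must be globally unique within the OSS scope and cannot be changed after creation.
There is no limit on the number of objects in a bucket.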

9.2.2 Bucket naming rules:

https://docs.aws.amazon.com/zh_cn/zh_cn/AmazonS3/latest/userguide/bucketnamingrules.html

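Names may contain only lowercase letters, digits, and hyphens (-).
Names must begin and end with a lowercase letter or a digit.
Names must be 3 to 63 bytes long.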
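The three rules above translate directly into a regular expression. The sketch below checks only those three rules; the full AWS rule set linked above contains further restrictions that are not covered here, and `is_valid_bucket_name` is a hypothetical helper:

```python
import re

# First and last character: lowercase letter or digit.
# Middle: 1 to 61 lowercase letters, digits, or hyphens, for a total length of 3 to 63.
BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name):
    """Check a bucket name against the three rules listed above."""
    return BUCKET_NAME.fullmatch(name) is not None


for candidate in ["magedu", "My-Bucket", "ab", "-logs", "logs-2021"]:
    print(candidate, is_valid_bucket_name(candidate))
```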

Figure 1: radosgw architecture diagram

Figure 2: radosgw logical diagram

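9.3 Comparison of object storage access models:

Amazon S3: provides user, bucket, and object, representing the user, the storage bucket, and the object. A bucket belongs to a user; access permissions on the namespaces of different buckets can be set per user, and different users may be allowed to access the same bucket.

OpenStack Swift: provides user, container, and object, corresponding to the user, the storage bucket, and the object. It additionally gives the user a parent component, the account, which represents a project or tenant. An account can therefore contain one or more users, which share the same set of containers, and the account provides the namespace for those containers.

RadosGW: provides user, subuser, bucket, and object. Its user corresponds to the S3 user, and its subuser corresponds to the Swift user. Neither user nor subuser provides a namespace for buckets, so buckets belonging to different users cannot share a name. Since the Jewel release, however, RadosGW has offered the tenant, which provides a namespace for users and buckets; it is an optional component. RadosGW uses ACLs to give different users different permissions, for example:
Read: read permission
Write: write permission
Readwrite: read and write permission
full-control: full control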

9.4 Deploy the RadosGW service:

Deploy the ceph-mgr1 and ceph-mgr2 servers as a highly available radosGW service.

9.4.1 Install the radosgw service and initialize it:

Ubuntu:
root@ceph-mgr1-104:~# apt install radosgw
root@ceph-mgr2-105:~# apt install radosgw

CentOS:
[root@ceph-mgr1 ~]# yum install ceph-radosgw
[root@ceph-mgr2 ~]# yum install ceph-radosgw

# On the ceph-deploy server, initialize ceph-mgr1 and ceph-mgr2 as radosGW services:
[ceph@ceph-deploy ~]$ cd ceph-cluster/
[ceph@ceph-deploy ceph-cluster]$ ceph-deploy rgw create ceph-mgr1
[ceph@ceph-deploy ceph-cluster]$ ceph-deploy rgw create ceph-mgr2

9.4.2 Verify the radosgw service status:

ceph@ceph-deploy-110:~/ceph-cluster$ ceph -s
  cluster:
    id:     872d69a3-fa66-4e4f-887c-6430bfa2c086
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-mon1-101,ceph-mon2-102,ceph-mon3-103 (age 5m)
    mgr: ceph-mgr2-105(active, since 21h), standbys: ceph-mgr1-104
    mds: 2/2 daemons up, 2 standby
    osd: 20 osds: 20 up (since 21h), 20 in (since 6d)
    rgw: 2 daemons active (2 hosts, 1 zones)                                # RGW has no active/standby distinction

  data:
    volumes: 1/1 healthy
    pools:   7 pools, 201 pgs
    objects: 230 objects, 25 KiB
    usage:   322 MiB used, 400 GiB / 400 GiB avail
    pgs:     201 active+clean

# List the pools; the RGW service creates its pools automatically
ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd pool ls
device_health_metrics
cephfs-metadata
cephfs-data
.rgw.root
default.rgw.log
default.rgw.control
default.rgw.meta

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd pool get default.rgw.meta crush_rule
crush_rule: replicated_rule                                                 # a replicated pool by default

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd pool get default.rgw.meta size
size: 3                                                                     # 3 replicas by default

9.4.3 Purpose of the RGW pools:

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd lspools
1 device_health_metrics
2 cephfs-metadata
3 cephfs-data
4 .rgw.root
5 default.rgw.log
6 default.rgw.control
7 default.rgw.meta

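.rgw.root                       Holds realm information, such as zones and zonegroups
default.rgw.log                 Stores log information of various kinds
default.rgw.control             System control pool; notifies the other RGW instances to refresh their caches when data is updated
default.rgw.meta                Metadata pool; stores different RADOS objects under different namespaces, including users.uid (user UIDs and their bucket mappings), users.keys (user keys), users.email (user email addresses), users.swift (subusers), and root (buckets)
default.rgw.buckets.index       Stores the bucket-to-object index
default.rgw.buckets.data        Stores the object data
default.rgw.buckets.non-ec      Pool for extra data information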

9.4.4 Verify the RGW zone information

ceph@ceph-deploy-110:~/ceph-cluster$ radosgw-admin zone get --rgw-zone=default
{
    "id": "b5cda766-7dde-48b2-9954-1a1b3de18b4e",
    "name": "default",
    "domain_root": "default.rgw.meta:root",
    "control_pool": "default.rgw.control",
    "gc_pool": "default.rgw.log:gc",
    "lc_pool": "default.rgw.log:lc",
    "log_pool": "default.rgw.log",
    "intent_log_pool": "default.rgw.log:intent",
    "usage_log_pool": "default.rgw.log:usage",
    "roles_pool": "default.rgw.meta:roles",
    "reshard_pool": "default.rgw.log:reshard",
    "user_keys_pool": "default.rgw.meta:users.keys",
    "user_email_pool": "default.rgw.meta:users.email",
    "user_swift_pool": "default.rgw.meta:users.swift",
    "user_uid_pool": "default.rgw.meta:users.uid",
    "otp_pool": "default.rgw.otp",
    "system_key": {
        "access_key": "",
        "secret_key": ""
    },
    "placement_pools": [
        {
            "key": "default-placement",
            "val": {
                "index_pool": "default.rgw.buckets.index",
                "storage_classes": {
                    "STANDARD": {
                        "data_pool": "default.rgw.buckets.data"
                    }
                },
                "data_extra_pool": "default.rgw.buckets.non-ec",
                "index_type": 0
            }
        }
    ],
    "realm_id": "",
    "notif_pool": "default.rgw.log:notif"
}
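Many fields in the zone output use the form "pool:namespace", which matches the namespaces described for default.rgw.meta in 9.4.3. A short Python sketch that groups the namespaces by pool; the JSON is an abridged copy of the output above, and `split_pool_spec` is a hypothetical helper:

```python
import json

# Abridged copy of the "radosgw-admin zone get" output shown above.
zone_json = """
{
    "name": "default",
    "control_pool": "default.rgw.control",
    "gc_pool": "default.rgw.log:gc",
    "user_uid_pool": "default.rgw.meta:users.uid",
    "user_keys_pool": "default.rgw.meta:users.keys"
}
"""


def split_pool_spec(spec):
    """Split 'pool:namespace' into (pool, namespace); namespace may be empty."""
    pool, _, namespace = spec.partition(":")
    return pool, namespace


zone = json.loads(zone_json)
by_pool = {}
for field, spec in zone.items():
    if field.endswith("_pool"):
        pool, namespace = split_pool_spec(spec)
        by_pool.setdefault(pool, []).append(namespace)

for pool, namespaces in sorted(by_pool.items()):
    print(pool, namespaces)
```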

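9.5 Access the radosgw service:

The RGW service listens on port 7480.

# Access with a browser
http://192.168.3.104:7480/

9.5.1 RGW high-availability architecture

LVS, HAProxy, or nginx can be used to make the RGW service highly available.
# Install HAProxy on 104 and configure high availability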
root@ceph-mgr1-104:~# apt install haproxy
root@ceph-mgr1-104:~# vi /etc/haproxy/haproxy.cfg
listen ceph-rgw
  bind 192.168.3.104:80
  mode tcp
  server rgw1 172.16.3.104:7480 check inter 3s fall 2 rise 5
  server rgw2 172.16.3.105:7480 check inter 3s fall 2 rise 5
root@ceph-mgr1-104:~# systemctl restart haproxy.service

# Access port 80 on 104 with a browser
http://192.168.3.104/

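9.5.2 Custom port:

The configuration file can be modified on the ceph-deploy server and pushed out centrally, or edited separately on each radosgw server so that all of them share the same configuration.

https://docs.ceph.com/en/latest/radosgw/frontends/

# Either edit ceph.conf directly on the RGW node, or edit it on the deploy node and push it to the RGW node

ceph@ceph-deploy-110:~$ vi ceph-cluster/ceph.conf

# Append the custom configuration for this node at the end: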
[client.rgw.ceph-mgr2]                                                                                          
rgw_host = ceph-mgr2
rgw_frontends = civetweb port=9900

ceph@ceph-deploy-110:~$ scp ceph-cluster/ceph.conf root@172.16.3.105:/etc/ceph/

# Restart the RGW service on the RGW node
root@ceph-mgr2-105:~# systemctl restart ceph-radosgw@rgw.ceph-mgr2.service
root@ceph-mgr2-105:~# ss -lntp|grep 9900
LISTEN   0         128                 0.0.0.0:9900             0.0.0.0:*        users:(("radosgw",pid=4216,fd=76))  

# Access with a browser
http://192.168.3.105:9900/

# If high availability is configured, remember to update the corresponding backend port
root@ceph-mgr1-104:~# vi /etc/haproxy/haproxy.cfg
listen ceph-rgw
  bind 192.168.3.104:80
  mode tcp
  server rgw1 172.16.3.104:7480 check inter 3s fall 2 rise 5
  server rgw2 172.16.3.105:9900 check inter 3s fall 2 rise 5

9.5.3 Enable SSL:

Generate a self-signed certificate and configure radosgw to enable SSL.

9.5.3.1 Self-signed certificate:

root@ceph-mgr2-105:~# mkdir /etc/ceph/certs
root@ceph-mgr2-105:~# cd /etc/ceph/certs/
root@ceph-mgr2-105:/etc/ceph/certs# openssl genrsa -out civetweb.key 2048
root@ceph-mgr2-105:/etc/ceph/certs# openssl req -new -x509 -key civetweb.key -out civetweb.crt -subj "/CN=rgw.magedu.net"
Can't load /root/.rnd into RNG
140534991716800:error:2406F079:random number generator:RAND_load_file:Cannot open file:../crypto/rand/randfile.c:88:Filename=/root/.rnd
root@ceph-mgr2-105:/etc/ceph/certs# touch /root/.rnd
root@ceph-mgr2-105:/etc/ceph/certs# cat civetweb.key civetweb.crt > civetweb.pem

9.5.3.2 SSL configuration:

# Configure directly on the RGW node, or configure on the deploy node and push to the RGW node
ceph@ceph-deploy-110:~$ vi ceph-cluster/ceph.conf
[client.rgw.ceph-mgr2]                                                                  # add under this RGW node's section
rgw_frontends = "civetweb port=9900+9443s ssl_certificate=/etc/ceph/certs/civetweb.pem"       
ceph@ceph-deploy-110:~$ scp ceph-cluster/ceph.conf root@172.16.3.105:/etc/ceph/

# Restart the RGW service on the RGW node
root@ceph-mgr2-105:/etc/ceph/certs# systemctl restart ceph-radosgw@rgw.ceph-mgr2.service
root@ceph-mgr2-105:/etc/ceph/certs# ss -lntp|grep 9443
LISTEN   0         128                 0.0.0.0:9443             0.0.0.0:*        users:(("radosgw",pid=4949,fd=79)) 

# Access with a browser
https://192.168.3.105:9443/

9.5.3.3 Configure high availability

root@ceph-mgr1-104:~# vi /etc/haproxy/haproxy.cfg
listen ceph-rgw-https
  bind 192.168.3.104:443
  mode tcp
  server rgw1 172.16.3.105:9443 check inter 3s fall 2 rise 5
root@ceph-mgr1-104:~# systemctl restart haproxy.service
root@ceph-mgr1-104:~# ss -lntp|grep 443
LISTEN   0         2000          192.168.3.104:443              0.0.0.0:*        users:(("haproxy",pid=3794,fd=9))  
# Access with a browser
https://192.168.3.104/

9.5.4 Logging and other tuning options

ceph@ceph-deploy-110:~$ vi ceph-cluster/ceph.conf
[client.rgw.ceph-mgr2]                                                     # append the custom configuration for this node at the end:
rgw_host = ceph-mgr2
rgw_frontends = "civetweb port=9900+9443s ssl_certificate=/etc/ceph/certs/civetweb.pem error_log_file=/var/log/radosgw/civetweb.error.log  access_log_file=/var/log/radosgw/civetweb.access.log request_timeout_ms=30000 num_threads=200"
ceph@ceph-deploy-110:~$ scp ceph-cluster/ceph.conf root@172.16.3.105:/etc/ceph/
root@ceph-mgr2-105:~# mkdir /var/log/radosgw
root@ceph-mgr2-105:~# chown ceph.ceph /var/log/radosgw/ -R
root@ceph-mgr2-105:~# systemctl restart ceph-radosgw@rgw.ceph-mgr2.service

# Apply the same configuration and copy the certificate to the other RGW server
ceph@ceph-deploy-110:~$ vi ceph-cluster/ceph.conf
[client.rgw.ceph-mgr1]                                                     # custom configuration for ceph-mgr1, appended at the end
rgw_host = ceph-mgr1
rgw_frontends = "civetweb port=9900+9443s ssl_certificate=/etc/ceph/certs/civetweb.pem error_log_file=/var/log/radosgw/civetweb.error.log  access_log_file=/var/log/radosgw/civetweb.access.log request_timeout_ms=30000 num_threads=200"

[client.rgw.ceph-mgr2]                                                     # custom configuration for ceph-mgr2, appended at the end
rgw_host = ceph-mgr2
rgw_frontends = "civetweb port=9900+9443s ssl_certificate=/etc/ceph/certs/civetweb.pem error_log_file=/var/log/radosgw/civetweb.error.log  access_log_file=/var/log/radosgw/civetweb.access.log request_timeout_ms=30000 num_threads=200"

ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy --overwrite-conf config push ceph-mgr1
ceph@ceph-deploy-110:~/ceph-cluster$ ceph-deploy --overwrite-conf config push ceph-mgr2
root@ceph-mgr1-104:~# mkdir /var/log/radosgw
root@ceph-mgr1-104:~# chown ceph.ceph /var/log/radosgw/ -R
root@ceph-mgr1-104:~# systemctl restart ceph-radosgw@rgw.ceph-mgr1.service
root@ceph-mgr2-105:~# systemctl restart ceph-radosgw@rgw.ceph-mgr2.service

9.6 Test RGW reads and writes

In a real production environment, RGW1 and RGW2 are configured with exactly the same parameters.

9.6.1 Create an RGW account

ceph@ceph-deploy-110:~/ceph-cluster$ radosgw-admin user create --uid="user1" --display-name="ceph-rgw"
{
    "user_id": "user1",
    "display_name": "ceph-rgw",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "user1",
            "access_key": "5L397OV2RCCZFTTJKRXE",
            "secret_key": "ImkiHKCBD8gWoWQV60TuUjOhkjLPiNAaYlmSLWUa"
        }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}
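The access key and secret key needed by S3 clients sit in the "keys" array of this JSON. A small Python sketch that pulls them out, for example to feed a client configuration; the JSON is an abridged copy of the output above, and `s3_credentials` is a hypothetical helper:

```python
import json

# Abridged copy of the "radosgw-admin user create" output shown above.
user_json = """
{
    "user_id": "user1",
    "keys": [
        {
            "user": "user1",
            "access_key": "5L397OV2RCCZFTTJKRXE",
            "secret_key": "ImkiHKCBD8gWoWQV60TuUjOhkjLPiNAaYlmSLWUa"
        }
    ]
}
"""


def s3_credentials(user_info, uid):
    """Return (access_key, secret_key) of the first S3 key owned by uid."""
    for key in user_info["keys"]:
        if key["user"] == uid:
            return key["access_key"], key["secret_key"]
    raise KeyError(f"no S3 key for {uid}")


access_key, secret_key = s3_credentials(json.loads(user_json), "user1")
print(access_key)
```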

9.6.2 Install the s3cmd client

s3cmd is a command-line client for accessing Ceph RGW; it can create buckets and upload, download, and manage data in the store.

root@ubuntu-template:~# apt install s3cmd

9.6.3 Configure the client environment

9.6.3.1 Add name resolution on the server that runs the s3cmd client

root@ubuntu-template:~# echo '192.168.3.104 rgw.magedu.net' >> /etc/hosts

9.6.3.2 Configure the command environment

root@ubuntu-template:~# s3cmd --configure
Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3. Leave them empty for using the env variables.
Access Key:5L397OV2RCCZFTTJKRXE                                     # the access key of the RGW user
Secret Key: ImkiHKCBD8gWoWQV60TuUjOhkjLPiNAaYlmSLWUa                # the secret key of the RGW user
Default Region [US]: 

Use "s3.amazonaws.com" for S3 Endpoint and not modify it to the target Amazon S3.
S3 Endpoint [s3.amazonaws.com]: rgw.magedu.net:9000                 # the VIP is used here; a backend server IP also works

Use "%(bucket)s.s3.amazonaws.com" to the target Amazon S3. "%(bucket)s" and "%(location)s" vars can be used
if the target S3 system supports dns based buckets.
DNS-style bucket+hostname:port template for accessing a bucket [%(bucket)s.s3.amazonaws.com]: rgw.magedu.net:9000/%(bucket)

Encryption password is used to protect your files from reading
by unauthorized persons while in transfer to S3
Encryption password:                                                # press Enter to continue if no encryption password is needed
Path to GPG program [/usr/bin/gpg]:                                 # path to the GPG program; press Enter to continue

When using secure HTTPS protocol all communication with Amazon S3
servers is protected from 3rd party eavesdropping. This method is
slower than plain HTTP, and can only be proxied with Python 2.7 or newer
Use HTTPS protocol [Yes]: No                                        # whether to use HTTPS

On some networks all internet access must go through a HTTP proxy.
Try setting it here if you can't connect to S3 directly
HTTP Proxy server name:                                             # whether to use a proxy

New settings:
  Access Key: 5L397OV2RCCZFTTJKRXE
  Secret Key: ImkiHKCBD8gWoWQV60TuUjOhkjLPiNAaYlmSLWUa
  Default Region: US
  S3 Endpoint: rgw.magedu.net:443
  DNS-style bucket+hostname:port template for accessing a bucket: rgw.magedu.net:443/%(bucket)
  Encryption password: 
  Path to GPG program: /usr/bin/gpg
  Use HTTPS protocol: True
  HTTP Proxy server name: 
  HTTP Proxy server port: 0

Test access with supplied credentials? [Y/n] Y                      # test the configuration

Please wait, attempting to list all buckets...
Success. Your access key and secret key worked fine :-)

Now verifying that encryption works...
Not configured. Never mind.

Save settings? [y/N]                                                # the prompt to save appears only after the test passes
Configuration saved to '/root/.s3cfg'
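The interactive dialog ends by writing an INI-style file, /root/.s3cfg, so the same settings can also be generated non-interactively. The Python sketch below writes only a handful of keys (access_key, secret_key, host_base, host_bucket, use_https) into a [default] section; a real .s3cfg holds many more options, the endpoint is taken from the "New settings" summary above, and `render_s3cfg` is a hypothetical helper. RawConfigParser is used so that the literal %(bucket) placeholder is not treated as interpolation syntax.

```python
import configparser
import io


def render_s3cfg(access_key, secret_key, endpoint, use_https):
    """Render a minimal s3cmd-style configuration as text."""
    cfg = configparser.RawConfigParser()
    cfg["default"] = {
        "access_key": access_key,
        "secret_key": secret_key,
        "host_base": endpoint,
        # Path-style bucket template, as entered in the dialog above.
        "host_bucket": f"{endpoint}/%(bucket)",
        "use_https": str(use_https),
    }
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


print(render_s3cfg("5L397OV2RCCZFTTJKRXE", "<secret key>", "rgw.magedu.net:443", True))
```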

9.6.3.3 Create a bucket

root@ubuntu-template:~# s3cmd mb s3://magedu
Bucket 's3://magedu/' created
root@ubuntu-template:~# s3cmd ls
2021-12-20 08:03  s3://magedu

9.6.3.4 Upload a file and verify it

root@ubuntu-template:~# wget https://mirrors.tuna.tsinghua.edu.cn/ELK/apt/kibana/4.6/pool/k/ki/kibana-4.6.6-amd64.deb
root@ubuntu-template:~# s3cmd put /root/kibana-4.6.6-amd64.deb s3://magedu/elk/deb
upload: '/root/kibana-4.6.6-amd64.deb' -> 's3://magedu/elk/deb'  [part 1 of 3, 15MB] [1 of 1]
 15728640 of 15728640   100% in    2s     5.70 MB/s  done
upload: '/root/kibana-4.6.6-amd64.deb' -> 's3://magedu/elk/deb'  [part 2 of 3, 15MB] [1 of 1]
 15728640 of 15728640   100% in    0s    29.34 MB/s  done
upload: '/root/kibana-4.6.6-amd64.deb' -> 's3://magedu/elk/deb'  [part 3 of 3, 4MB] [1 of 1]
 4684516 of 4684516   100% in    0s     8.79 MB/s  done
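The three parts in the upload output follow from simple arithmetic: the file is cut into 15 MB chunks and the last part carries the remainder. A Python sketch that reproduces the part sizes, using the 36141796-byte file size reported by s3cmd ls in 9.6.3.5; the function name is hypothetical:

```python
def multipart_sizes(total_bytes, chunk_bytes):
    """Sizes of the parts when a file is split into fixed-size chunks."""
    full, rest = divmod(total_bytes, chunk_bytes)
    return [chunk_bytes] * full + ([rest] if rest else [])


CHUNK = 15 * 1024 * 1024   # 15728640 bytes per part, as in the output above
print(multipart_sizes(36141796, CHUNK))
```

Two full parts of 15728640 bytes plus a final part of 4684516 bytes add up to the original 36141796 bytes.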

root@ubuntu-template:~# s3cmd la
                       DIR   s3://magedu/elk/
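
The three-part upload above is consistent with the file being split into 15 MiB chunks (15728640 bytes per full part) plus a smaller final part. A minimal Python sketch of that arithmetic, assuming a fixed chunk size and the 36141796-byte file size that `s3cmd ls` reports for this object; the function name `split_into_parts` is hypothetical and not part of s3cmd:

```python
def split_into_parts(total_bytes, chunk_bytes=15 * 1024 * 1024):
    """Return the byte size of each part of a fixed-chunk multipart upload."""
    full_parts, remainder = divmod(total_bytes, chunk_bytes)
    parts = [chunk_bytes] * full_parts
    if remainder:
        # The last part carries whatever is left over.
        parts.append(remainder)
    return parts

parts = split_into_parts(36141796)
print(parts)  # [15728640, 15728640, 4684516]
```

The part sizes match the byte counts printed for parts 1 to 3 in the upload log.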

# List the storage pools
ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd lspools
1 device_health_metrics
2 .rgw.root
3 default.rgw.log
4 default.rgw.control
5 default.rgw.meta
6 cephfs-metadata
7 cephfs-data
8 default.rgw.buckets.index
9 default.rgw.buckets.non-ec
10 default.rgw.buckets.data

9.6.3.5 Download a file

root@ubuntu-template:~# s3cmd ls s3://magedu/elk/deb/
2021-12-20 08:23  36141796   s3://magedu/elk/deb/kibana-4.6.6-amd64.deb
root@ubuntu-template:~# s3cmd get s3://magedu/elk/deb/kibana-4.6.6-amd64.deb /opt/
download: 's3://magedu/elk/deb/kibana-4.6.6-amd64.deb' -> '/opt/kibana-4.6.6-amd64.deb'  [1 of 1]
 36141796 of 36141796   100% in    0s   176.78 MB/s  done

9.6.3.6 Delete a file

root@ubuntu-template:~# s3cmd rm s3://magedu/elk/deb/kibana-4.6.6-amd64.deb 
delete: 's3://magedu/elk/deb/kibana-4.6.6-amd64.deb'
root@ubuntu-template:~# s3cmd ls s3://magedu/elk/deb/

9.6.3.7 Check the pool's PG-to-OSD mapping

ceph@ceph-deploy-110:~/ceph-cluster$ ceph pg ls-by-pool default.rgw.buckets.data|awk '{print $1,$2,$15}'
PG OBJECTS ACTING
10.0 0 [18,13,0]p18
10.1 2 [18,8,0]p18
10.2 1 [1,18,8]p1
10.3 1 [10,16,8]p10
10.4 1 [12,0,15]p12
10.5 3 [17,4,12]p17
10.6 0 [8,10,16]p8
10.7 1 [19,4,14]p19
10.8 0 [19,12,6]p19
10.9 2 [5,2,18]p5
10.a 0 [17,10,5]p17
10.b 1 [6,4,19]p6
10.c 0 [7,16,0]p7
10.d 3 [6,0,15]p6
10.e 0 [8,2,15]p8
10.f 1 [18,10,7]p18
10.10 2 [16,6,0]p16
10.11 0 [12,0,6]p12
10.12 1 [16,14,7]p16
10.13 1 [1,18,7]p1
10.14 0 [1,15,14]p1
10.15 0 [4,19,5]p4
10.16 0 [6,1,16]p6
10.17 1 [17,13,1]p17
10.18 0 [4,13,6]p4
10.19 0 [16,8,0]p16
10.1a 3 [6,19,12]p6
10.1b 1 [3,6,16]p3
10.1c 4 [6,15,11]p6
10.1d 0 [7,11,18]p7
10.1e 1 [12,18,0]p12
10.1f 1 [7,1,11]p7
ceph@ceph-deploy-110:~$ ceph osd pool get default.rgw.buckets.data size
size: 3
ceph@ceph-deploy-110:~$ ceph osd pool get default.rgw.buckets.data crush_rule
crush_rule: replicated_rule
ceph@ceph-deploy-110:~$ ceph osd pool get default.rgw.buckets.data pg_num
pg_num: 32
ceph@ceph-deploy-110:~$ ceph osd pool get default.rgw.buckets.data pgp_num
pgp_num: 32

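10 Advanced Ceph CRUSH

A Ceph cluster has five cluster maps, all maintained by the mon servers:

Monitor map:                                # monitor map
OSD map:                                    # OSD map
PG map:                                     # PG map
CRUSH map: (Controlled Replication Under Scalable Hashing)      # a controlled, replicated, scalable placement algorithm; when a pool is created, a new list of PG-to-OSD combinations is generated from the OSD map and used to store data
MDS map:                                    # CephFS metadata map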

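obj --> PG: hash(oid) % pg_num = pgid

obj --> OSD: CRUSH uses the current cluster maps from the mon to return the latest OSD set for the PG; data is then written to the primary OSD, which replicates it to the replica OSDs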
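
The object-to-PG step can be sketched in a few lines of Python. This is only an illustration of the hash-then-modulo idea: Ceph uses its own hash function and a masked "stable" modulo, whereas the sketch below substitutes CRC32 from the standard library, so the PG ids it prints will not match a real cluster. The function name `object_to_pg` is hypothetical.

```python
import zlib

def object_to_pg(pool_id, object_name, pg_num):
    """Map an object name to a PG id string such as '10.1f' (illustrative only)."""
    # Stand-in hash: Ceph uses its own hash function, not CRC32.
    h = zlib.crc32(object_name.encode("utf-8"))
    pg_index = h % pg_num             # hash(oid) % pg_num
    return f"{pool_id}.{pg_index:x}"  # PG ids are printed as <pool>.<hex index>

pgid = object_to_pg(10, "kibana-4.6.6-amd64.deb", 32)
print(pgid)
```

The same object name always lands in the same PG as long as pg_num does not change, which is why clients can locate data without asking a central lookup service.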

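How CRUSH selects target nodes:

Five bucket algorithms are currently available for node selection: Uniform, List, Tree, Straw, and Straw2. Early releases used straw, the algorithm invented by the founder of the Ceph project; it has since evolved into straw2.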

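10.1 Adjusting the PG-to-OSD mapping

By default, CRUSH assigns OSDs to the PGs of a new pool on its own, but you can bias where it places data by setting weights manually. For example, a 1T disk gets weight 1 and a 2T disk gets weight 2. Using devices of the same size is recommended.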

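10.1.1 View the current state

weight is the relative capacity of a device: if 1T corresponds to 1.00, a 500G OSD should have a weight of 0.5. The weight determines how many PGs an OSD receives according to its disk space, so CRUSH places more PGs on OSDs with more space and fewer PGs on OSDs with less space.
reweight exists to rebalance the PGs that CRUSH has distributed pseudo-randomly. The default distribution is balanced only in a probabilistic sense, so even when all OSDs have the same disk space, some PGs end up unevenly distributed. Adjusting reweight makes the cluster rebalance the PGs on the affected disks immediately so that data is spread evenly. In short, reweight applies after PGs have already been assigned and redistributes them across the cluster.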
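
The capacity-to-weight convention can be written as a one-line calculation. The sketch below assumes the convention described above, 1 TiB of capacity equals weight 1.0; the helper name `crush_weight` is hypothetical.

```python
TIB = 2 ** 40
GIB = 2 ** 30

def crush_weight(size_bytes):
    """CRUSH weight under the 1 TiB == 1.0 convention."""
    return size_bytes / TIB

print(crush_weight(1 * TIB))              # 1.0
print(crush_weight(512 * GIB))            # 0.5
print(round(crush_weight(20 * GIB), 5))   # 0.01953
```

The 20 GiB data disks in this environment show a weight of 0.01949 in `ceph osd df`, slightly below the 0.01953 computed here, presumably because the usable size of each OSD device is a little under 20 GiB.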

ceph@ceph-deploy-110:~$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS
 0    hdd  0.01949   1.00000   20 GiB   47 MiB   26 MiB    5 KiB   21 MiB   20 GiB  0.23  1.25   38      up
 1    hdd  0.01949   1.00000   20 GiB   26 MiB   14 MiB    6 KiB   12 MiB   20 GiB  0.13  0.69   48      up
 2    hdd  0.01949   1.00000   20 GiB   42 MiB   21 MiB    5 KiB   21 MiB   20 GiB  0.20  1.10   32      up
 3    hdd  0.01949   1.00000   20 GiB   38 MiB   14 MiB   10 KiB   24 MiB   20 GiB  0.18  0.99   47      up
 4    hdd  0.01949   1.00000   20 GiB   41 MiB   21 MiB    2 KiB   20 MiB   20 GiB  0.20  1.07   32      up
 5    hdd  0.01949   1.00000   20 GiB   38 MiB   18 MiB    4 KiB   20 MiB   20 GiB  0.19  1.00   40      up
 6    hdd  0.01949   1.00000   20 GiB   41 MiB   21 MiB   12 KiB   20 MiB   20 GiB  0.20  1.07   42      up
 7    hdd  0.01949   1.00000   20 GiB   42 MiB   18 MiB    5 KiB   25 MiB   20 GiB  0.21  1.12   46      up
 8    hdd  0.01949   1.00000   20 GiB   38 MiB   18 MiB    5 KiB   20 MiB   20 GiB  0.18  1.00   30      up
 9    hdd  0.01949   1.00000   20 GiB   45 MiB   25 MiB    5 KiB   20 MiB   20 GiB  0.22  1.20   38      up
10    hdd  0.01949   1.00000   20 GiB   43 MiB   22 MiB   11 KiB   21 MiB   20 GiB  0.21  1.14   37      up
11    hdd  0.01949   1.00000   20 GiB   27 MiB   14 MiB   19 KiB   13 MiB   20 GiB  0.13  0.72   37      up
12    hdd  0.01949   1.00000   20 GiB   44 MiB   22 MiB    8 KiB   23 MiB   20 GiB  0.22  1.17   45      up
13    hdd  0.01949   1.00000   20 GiB   26 MiB   14 MiB    6 KiB   12 MiB   20 GiB  0.13  0.69   45      up
14    hdd  0.01949   1.00000   20 GiB   25 MiB   14 MiB    3 KiB   12 MiB   20 GiB  0.12  0.67   33      up
15    hdd  0.01949   1.00000   20 GiB   45 MiB   22 MiB    9 KiB   22 MiB   20 GiB  0.22  1.18   44      up
16    hdd  0.01949   1.00000   20 GiB   33 MiB   17 MiB    7 KiB   16 MiB   20 GiB  0.16  0.87   45      up
17    hdd  0.01949   1.00000   20 GiB   42 MiB   18 MiB    9 KiB   24 MiB   20 GiB  0.21  1.11   43      up
18    hdd  0.01949   1.00000   20 GiB   34 MiB   22 MiB   15 KiB   13 MiB   20 GiB  0.17  0.91   46      up
19    hdd  0.01949   1.00000   20 GiB   41 MiB   21 MiB    6 KiB   20 MiB   20 GiB  0.20  1.08   51      up
                       TOTAL  400 GiB  760 MiB  377 MiB  162 KiB  382 MiB  399 GiB  0.19

10.1.2 Change the weight and verify

# Set the weight of osd.10 to 0.05
ceph@ceph-deploy-110:~$ ceph osd crush reweight osd.10 0.05
reweighted item id 10 name 'osd.10' to 0.05 in crush map
# Verify
ceph@ceph-deploy-110:~$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS
 0    hdd  0.01949   1.00000   20 GiB   44 MiB   22 MiB    5 KiB   22 MiB   20 GiB  0.21  1.11   35      up
 1    hdd  0.01949   1.00000   20 GiB   27 MiB   14 MiB    6 KiB   13 MiB   20 GiB  0.13  0.68   48      up
 2    hdd  0.01949   1.00000   20 GiB   42 MiB   21 MiB    5 KiB   21 MiB   20 GiB  0.21  1.06   29      up
 3    hdd  0.01949   1.00000   20 GiB   38 MiB   14 MiB   10 KiB   25 MiB   20 GiB  0.19  0.97   43      up
 4    hdd  0.01949   1.00000   20 GiB   38 MiB   17 MiB    2 KiB   21 MiB   20 GiB  0.18  0.95   30      up
 5    hdd  0.01949   1.00000   20 GiB   43 MiB   18 MiB    4 KiB   25 MiB   20 GiB  0.21  1.08   42      up
 6    hdd  0.01949   1.00000   20 GiB   42 MiB   18 MiB   12 KiB   24 MiB   20 GiB  0.21  1.07   39      up
 7    hdd  0.01949   1.00000   20 GiB   43 MiB   18 MiB    5 KiB   25 MiB   20 GiB  0.21  1.09   44      up
 8    hdd  0.01949   1.00000   20 GiB   42 MiB   18 MiB    5 KiB   25 MiB   20 GiB  0.21  1.07   30      up
 9    hdd  0.01949   1.00000   20 GiB   46 MiB   25 MiB    5 KiB   21 MiB   20 GiB  0.22  1.16   36      up
10    hdd  0.04999   1.00000   20 GiB   46 MiB   22 MiB   11 KiB   24 MiB   20 GiB  0.22  1.16   77      up
11    hdd  0.01949   1.00000   20 GiB   28 MiB   14 MiB   19 KiB   14 MiB   20 GiB  0.14  0.71   30      up
12    hdd  0.01949   1.00000   20 GiB   50 MiB   26 MiB    8 KiB   24 MiB   20 GiB  0.24  1.27   40      up
13    hdd  0.01949   1.00000   20 GiB   30 MiB   17 MiB    6 KiB   13 MiB   20 GiB  0.15  0.75   44      up
14    hdd  0.01949   1.00000   20 GiB   26 MiB   14 MiB    3 KiB   12 MiB   20 GiB  0.13  0.66   25      up
15    hdd  0.01949   1.00000   20 GiB   45 MiB   22 MiB    9 KiB   23 MiB   20 GiB  0.22  1.14   44      up
16    hdd  0.01949   1.00000   20 GiB   37 MiB   20 MiB    7 KiB   17 MiB   20 GiB  0.18  0.93   45      up
17    hdd  0.01949   1.00000   20 GiB   43 MiB   18 MiB    9 KiB   25 MiB   20 GiB  0.21  1.08   45      up
18    hdd  0.01949   1.00000   20 GiB   43 MiB   26 MiB   15 KiB   17 MiB   20 GiB  0.21  1.09   43      up
19    hdd  0.01949   1.00000   20 GiB   39 MiB   18 MiB    6 KiB   21 MiB   20 GiB  0.19  0.98   50      up
                       TOTAL  400 GiB  791 MiB  380 MiB  162 KiB  410 MiB  399 GiB  0.19                
MIN/MAX VAR: 0.66/1.27  STDDEV: 0.03

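10.1.3 Change the reweight value and verify

The reweight of an OSD defaults to 1 and can be set anywhere between 0 and 1; the lower the value, the fewer PGs the OSD holds. Changing the reweight of any OSD immediately triggers rebalancing of its PGs with the other OSDs, that is, data is redistributed. This is useful when an OSD holds comparatively many PGs and its PG count needs to be reduced.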

ceph@ceph-deploy-110:~$ ceph osd reweight 19 0.6
reweighted osd.19 to 0.6 (9999)
ceph@ceph-deploy-110:~$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS
 0    hdd  0.01949   1.00000   20 GiB   45 MiB   23 MiB    5 KiB   22 MiB   20 GiB  0.22  1.10   35      up
 1    hdd  0.01949   1.00000   20 GiB   33 MiB   14 MiB    6 KiB   19 MiB   20 GiB  0.16  0.81   48      up
 2    hdd  0.01949   1.00000   20 GiB   51 MiB   25 MiB    5 KiB   26 MiB   20 GiB  0.25  1.25   30      up
 3    hdd  0.01949   1.00000   20 GiB   40 MiB   14 MiB   10 KiB   26 MiB   20 GiB  0.19  0.97   44      up
 4    hdd  0.01949   1.00000   20 GiB   39 MiB   17 MiB    2 KiB   22 MiB   20 GiB  0.19  0.95   30      up
 5    hdd  0.01949   1.00000   20 GiB   44 MiB   18 MiB    4 KiB   26 MiB   20 GiB  0.21  1.08   43      up
 6    hdd  0.01949   1.00000   20 GiB   44 MiB   18 MiB   12 KiB   25 MiB   20 GiB  0.21  1.07   40      up
 7    hdd  0.01949   1.00000   20 GiB   33 MiB   18 MiB    5 KiB   14 MiB   20 GiB  0.16  0.80   44      up
 8    hdd  0.01949   1.00000   20 GiB   44 MiB   18 MiB    5 KiB   26 MiB   20 GiB  0.21  1.07   30      up
 9    hdd  0.01949   1.00000   20 GiB   51 MiB   26 MiB    5 KiB   25 MiB   20 GiB  0.25  1.25   39      up
10    hdd  0.04999   1.00000   20 GiB   52 MiB   22 MiB   11 KiB   30 MiB   20 GiB  0.26  1.28   80      up
11    hdd  0.01949   1.00000   20 GiB   33 MiB   14 MiB   19 KiB   19 MiB   20 GiB  0.16  0.82   30      up
12    hdd  0.01949   1.00000   20 GiB   40 MiB   26 MiB    8 KiB   13 MiB   20 GiB  0.19  0.97   40      up
13    hdd  0.01949   1.00000   20 GiB   35 MiB   17 MiB    6 KiB   18 MiB   20 GiB  0.17  0.87   44      up
14    hdd  0.01949   1.00000   20 GiB   27 MiB   14 MiB    3 KiB   13 MiB   20 GiB  0.13  0.66   27      up
15    hdd  0.01949   1.00000   20 GiB   46 MiB   23 MiB    9 KiB   24 MiB   20 GiB  0.23  1.14   45      up
16    hdd  0.01949   1.00000   20 GiB   43 MiB   20 MiB    7 KiB   22 MiB   20 GiB  0.21  1.04   49      up
17    hdd  0.01949   1.00000   20 GiB   32 MiB   18 MiB    9 KiB   14 MiB   20 GiB  0.16  0.79   45      up
18    hdd  0.01949   1.00000   20 GiB   45 MiB   26 MiB   15 KiB   19 MiB   20 GiB  0.22  1.10   45      up
19    hdd  0.01949   0.59999   20 GiB   40 MiB   14 MiB    6 KiB   26 MiB   20 GiB  0.20  0.98   31      up
                       TOTAL  400 GiB  816 MiB  386 MiB  162 KiB  430 MiB  399 GiB  0.20                   
MIN/MAX VAR: 0.66/1.28  STDDEV: 0.03
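
The `(9999)` printed by `ceph osd reweight` and the `0.59999` shown in the REWEIGHT column are consistent with the value being stored as a hexadecimal fixed-point number in which 0x10000 stands for 1.0. That encoding is an assumption inferred from the output above, and `encode_reweight` is a hypothetical helper, not a Ceph API:

```python
CEPH_OSD_IN = 0x10000  # assumed fixed-point scale: 1.0 is stored as 0x10000

def encode_reweight(value):
    """Convert a 0.0-1.0 reweight into the assumed fixed-point integer."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("reweight must be between 0 and 1")
    return int(value * CEPH_OSD_IN)

raw = encode_reweight(0.6)
print(format(raw, "x"))             # 9999
print(f"{raw / CEPH_OSD_IN:.5f}")   # 0.59999
```

Truncating to an integer is what turns the requested 0.6 into the displayed 0.59999.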

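10.2 Managing the CRUSH map

Export the CRUSH map with the command-line tools, edit it, and import it again.

10.2.1 Export the CRUSH map

The exported CRUSH map is in binary format and cannot be opened directly in a text editor; it has to be converted to text with crushtool before it can be edited.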

ceph@ceph-deploy-110:~$ sudo mkdir /data/ceph -p
ceph@ceph-deploy-110:~$ sudo ceph osd getcrushmap -o /data/ceph/crushmap
109

10.2.2 Convert the map to text

The exported map cannot be edited directly; convert it to text before viewing or editing it.

ceph@ceph-deploy-110:~$ sudo apt install ceph-base
ceph@ceph-deploy-110:~$ sudo crushtool -d /data/ceph/crushmap -o /data/ceph/crushmap.txt

10.2.3 CRUSH map contents

ceph@ceph-deploy-110:~$ cat /data/ceph/crushmap.txt 
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices   # current device list
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd

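# types                         # bucket types currently supported
type 0 osd                      # an OSD daemon, corresponding to one disk device
type 1 host                     # a host
type 2 chassis                  # the chassis of a blade server
type 3 rack                     # a rack/cabinet holding several servers
type 4 row                      # a row of several racks
type 5 pdu                      # the power distribution unit feeding a rack
type 6 pod                      # one of several small rooms within a machine room
type 7 room                     # a room holding several racks; a data center consists of several such rooms
type 8 datacenter               # a data center or IDC
type 9 zone                     # a zone
type 10 region                  # a region, e.g. the AWS region in Zhongwei, Ningxia
type 11 root                    # the top of the bucket hierarchy, the root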

# buckets
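host ceph-node1-106 {                                       # type host, name ceph-node1-106
    id -3       # do not change unnecessarily               # bucket ID generated by Ceph; do not change unless necessary
    id -4 class hdd     # do not change unnecessarily
    # weight 0.097
    alg straw2                                              # bucket algorithm CRUSH uses to select items from this bucket
    hash 0  # rjenkins1                                     # hash algorithm in use; 0 selects rjenkins1
    item osd.0 weight 0.019                                 # OSD weight; CRUSH derives it from disk capacity, so disks of different sizes get different weights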
    item osd.1 weight 0.019
    item osd.2 weight 0.019
    item osd.3 weight 0.019
    item osd.4 weight 0.019
}
host ceph-node2-107 {
    id -5       # do not change unnecessarily
    id -6 class hdd     # do not change unnecessarily
    # weight 0.097
    alg straw2
    hash 0  # rjenkins1
    item osd.5 weight 0.019
    item osd.6 weight 0.019
    item osd.7 weight 0.019
    item osd.8 weight 0.019
    item osd.9 weight 0.019
}
host ceph-node3-108 {
    id -7       # do not change unnecessarily
    id -8 class hdd     # do not change unnecessarily
    # weight 0.128
    alg straw2
    hash 0  # rjenkins1
    item osd.10 weight 0.050
    item osd.11 weight 0.019
    item osd.12 weight 0.019
    item osd.13 weight 0.019
    item osd.14 weight 0.019
}
host ceph-node4-109 {
    id -9       # do not change unnecessarily
    id -10 class hdd        # do not change unnecessarily
    # weight 0.097
    alg straw2
    hash 0  # rjenkins1
    item osd.16 weight 0.019
    item osd.17 weight 0.019
    item osd.18 weight 0.019
    item osd.19 weight 0.019
    item osd.15 weight 0.019
}
root default {
    id -1       # do not change unnecessarily
    id -2 class hdd     # do not change unnecessarily
    # weight 0.420
    alg straw2
    hash 0  # rjenkins1
    item ceph-node1-106 weight 0.097
    item ceph-node2-107 weight 0.097
    item ceph-node3-108 weight 0.128
    item ceph-node4-109 weight 0.097
}

# rules
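rule replicated_rule {                                      # default rule for replicated pools
    id 0
    type replicated                                         # rule type, replicated by default
    min_size 1                                              # minimum number of replicas
    max_size 10                                             # maximum number of replicas, 10 by default
    step take default                                       # start from the hosts defined under the default root
    step chooseleaf firstn 0 type host                      # choose leaf OSDs across hosts; the failure domain is host
    step emit                                               # emit the result, i.e. return it to the client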
}

# end crush map
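
The `tunable allowed_bucket_algs 54` line at the top of the map reads as a bitmask over bucket algorithms. A small Python sketch of how 54 decodes, assuming each algorithm's bit position equals its numeric id in CRUSH (uniform 1, list 2, tree 3, straw 4, straw2 5); the mapping table and the function name are written here for illustration:

```python
# Assumed numbering of CRUSH bucket algorithms (bit position == algorithm id).
BUCKET_ALGS = {1: "uniform", 2: "list", 3: "tree", 4: "straw", 5: "straw2"}

def decode_allowed_bucket_algs(mask):
    """Return the algorithm names whose bit is set in the tunable's bitmask."""
    return [name for bit, name in sorted(BUCKET_ALGS.items()) if mask & (1 << bit)]

allowed = decode_allowed_bucket_algs(54)
print(allowed)  # ['uniform', 'list', 'straw', 'straw2']
```

Under that assumption, 54 = 2 + 4 + 16 + 32 allows every algorithm except tree, which fits the buckets in this map all using straw2.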

10.2.4 Edit the CRUSH map

ceph@ceph-deploy-110:~$ sudo vi /data/ceph/crushmap.txt 
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices   # current device list
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd

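# types                         # bucket types currently supported
type 0 osd                      # an OSD daemon, corresponding to one disk device
type 1 host                     # a host
type 2 chassis                  # the chassis of a blade server
type 3 rack                     # a rack/cabinet holding several servers
type 4 row                      # a row of several racks
type 5 pdu                      # the power distribution unit feeding a rack
type 6 pod                      # one of several small rooms within a machine room
type 7 room                     # a room holding several racks; a data center consists of several such rooms
type 8 datacenter               # a data center or IDC
type 9 zone                     # a zone
type 10 region                  # a region, e.g. the AWS region in Zhongwei, Ningxia
type 11 root                    # the top of the bucket hierarchy, the root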

# buckets
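host ceph-node1-106 {                                       # type host, name ceph-node1-106
    id -3       # do not change unnecessarily               # bucket ID generated by Ceph; do not change unless necessary
    id -4 class hdd     # do not change unnecessarily
    # weight 0.097
    alg straw2                                              # bucket algorithm CRUSH uses to select items from this bucket
    hash 0  # rjenkins1                                     # hash algorithm in use; 0 selects rjenkins1
    item osd.0 weight 0.019                                 # OSD weight; CRUSH derives it from disk capacity, so disks of different sizes get different weights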
    item osd.1 weight 0.019
    item osd.2 weight 0.019
    item osd.3 weight 0.019
    item osd.4 weight 0.019
}
host ceph-node2-107 {
    id -5       # do not change unnecessarily
    id -6 class hdd     # do not change unnecessarily
    # weight 0.097
    alg straw2
    hash 0  # rjenkins1
    item osd.5 weight 0.019
    item osd.6 weight 0.019
    item osd.7 weight 0.019
    item osd.8 weight 0.019
    item osd.9 weight 0.019
}
host ceph-node3-108 {
    id -7       # do not change unnecessarily
    id -8 class hdd     # do not change unnecessarily
    # weight 0.128
    alg straw2
    hash 0  # rjenkins1
    item osd.10 weight 0.050
    item osd.11 weight 0.019
    item osd.12 weight 0.019
    item osd.13 weight 0.019
    item osd.14 weight 0.019
}
host ceph-node4-109 {
    id -9       # do not change unnecessarily
    id -10 class hdd        # do not change unnecessarily
    # weight 0.097
    alg straw2
    hash 0  # rjenkins1
    item osd.16 weight 0.019
    item osd.17 weight 0.019
    item osd.18 weight 0.019
    item osd.19 weight 0.019
    item osd.15 weight 0.019
}
root default {
    id -1       # do not change unnecessarily
    id -2 class hdd     # do not change unnecessarily
    # weight 0.420
    alg straw2
    hash 0  # rjenkins1
    item ceph-node1-106 weight 0.097
    item ceph-node2-107 weight 0.097
    item ceph-node3-108 weight 0.128
    item ceph-node4-109 weight 0.097
}

# rules
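rule replicated_rule {                                      # default rule for replicated pools
    id 0
    type replicated                                         # rule type, replicated by default
    min_size 1                                              # minimum number of replicas
    max_size 6                                              # maximum number of replicas; default 10, changed to 6
    step take default                                       # start from the hosts defined under the default root
    step chooseleaf firstn 0 type host                      # choose leaf OSDs across hosts; the failure domain is host
    step emit                                               # emit the result, i.e. return it to the client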
}

# end crush map

10.2.5 Compile the text back into the CRUSH binary format

ceph@ceph-deploy-110:~/ceph-cluster$ sudo crushtool -c /data/ceph/crushmap.txt -o /data/ceph/newcrushmap

10.2.6 Import the new CRUSH map

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd setcrushmap -i /data/ceph/newcrushmap

10.2.7 Verify that the CRUSH map took effect

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 6,                                        # change applied
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
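
When the check has to be repeated, the rule dump can be inspected programmatically rather than by eye. A minimal Python sketch, using a trimmed copy of the dump above as sample input (with the annotation removed so that it is valid JSON); the helper name `rule_limits` is hypothetical:

```python
import json

# Trimmed sample of the `ceph osd crush rule dump` output shown above.
sample_dump = """
[
  {"rule_id": 0, "rule_name": "replicated_rule", "type": 1,
   "min_size": 1, "max_size": 6,
   "steps": [{"op": "take", "item": -1, "item_name": "default"},
             {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
             {"op": "emit"}]}
]
"""

def rule_limits(dump_text, rule_name):
    """Return (min_size, max_size) for the named rule, or None if absent."""
    for rule in json.loads(dump_text):
        if rule["rule_name"] == rule_name:
            return rule["min_size"], rule["max_size"]
    return None

limits = rule_limits(sample_dump, "replicated_rule")
print(limits)  # (1, 6)
```

In practice the text would come from the command's output instead of an embedded string.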

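10.3 Classifying data with CRUSH

When CRUSH assigns PGs, it can place the replicas of a PG on OSDs of different hosts, which gives host-level high availability; this is the default behavior. It cannot, however, guarantee that the replicas land on hosts in different racks or machine rooms, which is what high availability at the rack level or at a higher level such as the IDC requires. Nor can it put project A's data on SSDs and project B's data on HDDs. To get either behavior, export the CRUSH map, edit it by hand, and import it to overwrite the existing CRUSH map.

10.3.1 Modify the CRUSH file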

ceph@ceph-deploy-110:~/ceph-cluster$ cat /data/ceph/crushmap.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices   # current device list
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd

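# types                         # bucket types currently supported
type 0 osd                      # an OSD daemon, corresponding to one disk device
type 1 host                     # a host
type 2 chassis                  # the chassis of a blade server
type 3 rack                     # a rack/cabinet holding several servers
type 4 row                      # a row of several racks
type 5 pdu                      # the power distribution unit feeding a rack
type 6 pod                      # one of several small rooms within a machine room
type 7 room                     # a room holding several racks; a data center consists of several such rooms
type 8 datacenter               # a data center or IDC
type 9 zone                     # a zone
type 10 region                  # a region, e.g. the AWS region in Zhongwei, Ningxia
type 11 root                    # the top of the bucket hierarchy, the root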

# buckets
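host ceph-node1-106 {                                       # type host, name ceph-node1-106
    id -3       # do not change unnecessarily               # bucket ID generated by Ceph; do not change unless necessary
    id -4 class hdd     # do not change unnecessarily
    # weight 0.097
    alg straw2                                              # bucket algorithm CRUSH uses to select items from this bucket
    hash 0  # rjenkins1                                     # hash algorithm in use; 0 selects rjenkins1
    item osd.0 weight 0.019                                 # OSD weight; CRUSH derives it from disk capacity, so disks of different sizes get different weights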
    item osd.1 weight 0.019
    item osd.2 weight 0.019
    item osd.3 weight 0.019
    item osd.4 weight 0.019
}
host ceph-node2-107 {
    id -5       # do not change unnecessarily
    id -6 class hdd     # do not change unnecessarily
    # weight 0.097
    alg straw2
    hash 0  # rjenkins1
    item osd.5 weight 0.019
    item osd.6 weight 0.019
    item osd.7 weight 0.019
    item osd.8 weight 0.019
    item osd.9 weight 0.019
}
host ceph-node3-108 {
    id -7       # do not change unnecessarily
    id -8 class hdd     # do not change unnecessarily
    # weight 0.128
    alg straw2
    hash 0  # rjenkins1
    item osd.10 weight 0.050
    item osd.11 weight 0.019
    item osd.12 weight 0.019
    item osd.13 weight 0.019
    item osd.14 weight 0.019
}
host ceph-node4-109 {
    id -9       # do not change unnecessarily
    id -10 class hdd        # do not change unnecessarily
    # weight 0.097
    alg straw2
    hash 0  # rjenkins1
    item osd.16 weight 0.019
    item osd.17 weight 0.019
    item osd.18 weight 0.019
    item osd.19 weight 0.019
    item osd.15 weight 0.019
}

#ssd node                         # create separate buckets for the SSDs
host ceph-ssd-node1 {
        id -103                   # do not change unnecessarily
        id -104 class hdd         # do not change unnecessarily
        # weight 0.097
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 0.019   # ID of the OSD that is backed by an SSD
}
host ceph-ssd-node2 {             
        id -105                   # do not change unnecessarily
        id -106 class hdd         # do not change unnecessarily
        # weight 0.097
        alg straw2
        hash 0  # rjenkins1
        item osd.5 weight 0.019
}
host ceph-ssd-node3 {             
        id -107                   # do not change unnecessarily
        id -108 class hdd         # do not change unnecessarily
        # weight 0.097
        alg straw2
        hash 0  # rjenkins1
        item osd.10 weight 0.019
}
host ceph-ssd-node4 {             
        id -109                   # do not change unnecessarily
        id -110 class hdd         # do not change unnecessarily
        # weight 0.097
        alg straw2
        hash 0  # rjenkins1
        item osd.15 weight 0.019
}

root ssd {                        # create a separate root for the SSDs
        id -127                   # do not change unnecessarily
        id -128 class hdd         # do not change unnecessarily
        # weight 0.097
        alg straw2
        hash 0  # rjenkins1
        item ceph-ssd-node1 weight 0.019      # the buckets that the ssd root contains
        item ceph-ssd-node2 weight 0.019
        item ceph-ssd-node3 weight 0.019
        item ceph-ssd-node4 weight 0.019
}
root default {
    id -1       # do not change unnecessarily
    id -2 class hdd     # do not change unnecessarily
    # weight 0.420
    alg straw2
    hash 0  # rjenkins1
    item ceph-node1-106 weight 0.097
    item ceph-node2-107 weight 0.097
    item ceph-node3-108 weight 0.128
    item ceph-node4-109 weight 0.097
}

# rules
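rule ssd_rule {                                             # the newly created SSD rule
        id 20
        type replicated                                     # rule type, replicated by default
        min_size 1                                          # minimum number of replicas
        max_size 5                                          # maximum number of replicas, 10 by default
        step take ssd                                       # start from the ssd root
        step chooseleaf firstn 0 type host                  # choose leaf OSDs across hosts; the failure domain is host
        step emit                                           # emit the result, i.e. return it to the client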
}

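rule replicated_rule {                                      # default rule for replicated pools
    id 0
    type replicated                                         # rule type, replicated by default
    min_size 1                                              # minimum number of replicas
    max_size 6                                              # maximum number of replicas, 10 by default
    step take default                                       # start from the hosts defined under the default root
    step chooseleaf firstn 0 type host                      # choose leaf OSDs across hosts; the failure domain is host
    step emit                                               # emit the result, i.e. return it to the client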
}

# end crush map

10.3.2 Compile the text file into the CRUSH map binary

ceph@ceph-deploy-110:~/ceph-cluster$ sudo crushtool -c /data/ceph/crushmap.txt -o /data/ceph/newcrushmap

10.3.3 Import the new CRUSH map

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd setcrushmap -i /data/ceph/newcrushmap

10.3.4 Verify that the configuration took effect

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 6,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 20,
        "rule_name": "ssd_rule",
        "ruleset": 20,
        "type": 1,
        "min_size": 1,
        "max_size": 5,
        "steps": [
            {
                "op": "take",
                "item": -127,
                "item_name": "ssd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd tree
ID    CLASS  WEIGHT   TYPE NAME                STATUS  REWEIGHT  PRI-AFF
-127         0.07599  root ssd                                          
-103         0.01900      host ceph-ssd-node1                           
   0    hdd  0.01900          osd.0                up   1.00000  1.00000
-105         0.01900      host ceph-ssd-node2                           
   5    hdd  0.01900          osd.5                up   1.00000  1.00000
-107         0.01900      host ceph-ssd-node3                           
  10    hdd  0.01900          osd.10               up   1.00000  1.00000
-109         0.01900      host ceph-ssd-node4                           
  15    hdd  0.01900          osd.15               up   1.00000  1.00000
  -1         0.41895  root default                                      
  -3         0.09698      host ceph-node1-106                           
   0    hdd  0.01900          osd.0                up   1.00000  1.00000
   1    hdd  0.01900          osd.1                up   1.00000  1.00000
   2    hdd  0.01900          osd.2                up   1.00000  1.00000
   3    hdd  0.01900          osd.3                up   1.00000  1.00000
   4    hdd  0.01900          osd.4                up   1.00000  1.00000
  -5         0.09698      host ceph-node2-107                           
   5    hdd  0.01900          osd.5                up   1.00000  1.00000
   6    hdd  0.01900          osd.6                up   1.00000  1.00000
   7    hdd  0.01900          osd.7                up   1.00000  1.00000
   8    hdd  0.01900          osd.8                up   1.00000  1.00000
   9    hdd  0.01900          osd.9                up   1.00000  1.00000
  -7         0.12799      host ceph-node3-108                           
  10    hdd  0.04999          osd.10               up   1.00000  1.00000
  11    hdd  0.01900          osd.11               up   1.00000  1.00000
  12    hdd  0.01900          osd.12               up   1.00000  1.00000
  13    hdd  0.01900          osd.13               up   1.00000  1.00000
  14    hdd  0.01900          osd.14               up   1.00000  1.00000
  -9         0.09698      host ceph-node4-109                           
  15    hdd  0.01900          osd.15               up   1.00000  1.00000
  16    hdd  0.01900          osd.16               up   1.00000  1.00000
  17    hdd  0.01900          osd.17               up   1.00000  1.00000
  18    hdd  0.01900          osd.18               up   1.00000  1.00000
  19    hdd  0.01900          osd.19               up   1.00000  1.00000

10.3.5 Create a test pool

ceph@ceph-deploy-110:~/ceph-cluster$ ceph osd pool create ssdpool 32 32 ssd_rule        # to use the custom ssd rule, the pool must be created with that rule
pool 'ssdpool' created

10.3.6 Verify the PG-to-OSD mapping

ceph@ceph-deploy-110:~/ceph-cluster$ ceph pg ls-by-pool ssdpool|awk '{print $1,$2,$15}'
PG OBJECTS ACTING
2.0 0 [5,10,0]p5
2.1 0 [10,0,5]p10
2.2 0 [0,10,5]p0
2.3 0 [0,15,5]p0
2.4 0 [0,5,10]p0
2.5 0 [5,10,15]p5
2.6 0 [15,10,0]p15
2.7 0 [10,5,0]p10
2.8 0 [5,15,0]p5
2.9 0 [0,5,10]p0
2.a 0 [10,0,5]p10
2.b 0 [10,0,15]p10
2.c 0 [5,10,0]p5
2.d 0 [15,5,0]p15
2.e 0 [15,0,10]p15
2.f 0 [15,10,0]p15
2.10 0 [10,5,15]p10
2.11 0 [0,5,10]p0
2.12 0 [15,0,10]p15
2.13 0 [15,10,5]p15
2.14 0 [5,15,0]p5
2.15 0 [5,10,15]p5
2.16 0 [15,5,10]p15
2.17 0 [0,5,10]p0
2.18 0 [10,5,15]p10
2.19 0 [5,0,15]p5
2.1a 0 [0,5,15]p0
2.1b 0 [15,0,5]p15
2.1c 0 [5,0,15]p5
2.1d 0 [0,10,5]p0
2.1e 0 [5,0,10]p5
2.1f 0 [10,0,5]p10
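
Every acting set above should contain only the OSDs placed under `root ssd` (osd.0, osd.5, osd.10, osd.15). Because each of those OSDs sits in its own host bucket, distinct OSDs also mean distinct hosts. A minimal Python sketch of that check on a few of the ACTING fields from the listing; the helper names are hypothetical:

```python
import re

SSD_OSDS = {0, 5, 10, 15}  # OSDs placed under `root ssd` in the edited map

def parse_acting(field):
    """Parse an ACTING field such as '[5,10,0]p5' into (osd_list, primary)."""
    match = re.fullmatch(r"\[([\d,]+)\]p(\d+)", field)
    if match is None:
        raise ValueError(f"unexpected ACTING field: {field!r}")
    osds = [int(x) for x in match.group(1).split(",")]
    return osds, int(match.group(2))

def check_acting(field, allowed=SSD_OSDS):
    """True if all OSDs are allowed, distinct, and the primary is listed first."""
    osds, primary = parse_acting(field)
    return set(osds) <= allowed and len(set(osds)) == len(osds) and osds[0] == primary

sample = ["[5,10,0]p5", "[10,0,5]p10", "[0,15,5]p0", "[15,10,0]p15"]
print(all(check_acting(f) for f in sample))  # True
```

An acting set from the default root, such as `[18,13,0]p18`, fails the same check because osd.18 and osd.13 are not under `root ssd`.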

11 Ceph dashboard and monitoring

The Ceph dashboard is a web interface for viewing the state of a running Ceph cluster and configuring it. Earlier, Ceph relied on third-party dashboard components such as:
Calamari:

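Calamari provides a polished web management and monitoring interface together with an improved REST API (different from Ceph's own REST API), which simplifies Ceph management to some extent. Calamari was originally sold as part of Inktank's commercial Ceph enterprise product; after acquiring Inktank in 2014, Red Hat announced that Calamari would be open-sourced in order to better promote Ceph.
https://github.com/ceph/calamari
Pros:
Good management features
Friendly interface
Can be used to deploy and monitor Ceph
Cons:
Unofficial
Depends on some OpenStack packages
[ceph@ceph-deploy ceph-cluster]$ ceph-deploy -h
......
    calamari            Install and configure Calamari nodes. Assumes that a
                        repository with Calamari packages is already
                        configured. Refer to the docs for examples
                        (http://ceph.com/ceph-deploy/docs/conf.html)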

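VSM:

Virtual Storage Manager (VSM) is an open-source Ceph cluster management and monitoring tool developed by Intel. It simplifies some of the Ceph deployment steps and can be operated through a web page.
https://github.com/intel/virtual-storage-manager
Pros:
Easy to deploy
Lightweight
Flexible (custom features can be developed)
Cons:
Few monitoring options
Lacks Ceph management features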

Inkscope：

Inkscope 是一个 Ceph 的管理和监控系统，依赖于 Ceph 提供的 API，使用 MongoDB 来存储实时的监控数据和历史信息。
https://github.com/inkscope/inkscope
Pros:
Easy to deploy
Lightweight
Flexible (custom features can be developed)
Cons:
Few monitoring options
Lacks Ceph management features

Ceph-Dash:

Ceph-Dash is a Ceph monitoring panel written in Python that shows the cluster's running state. It also exposes a REST API for reading the status data.
http://cephdash.crapworks.de/
Pros:
Easy to deploy
Lightweight
Flexible (custom features can be developed)
Cons:
Relatively basic feature set

11.1 Install the dashboard plugin:

https://docs.ceph.com/en/mimic/mgr/
https://docs.ceph.com/en/latest/mgr/dashboard/
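https://packages.debian.org/unstable/ceph-mgr-dashboard # release 15 has dependencies that must be resolved separately

Ceph mgr is a modular, multi-plugin component whose modules can be enabled or disabled individually. Unless the prompt shows otherwise, the commands below are run on the ceph-deploy server.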

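Newer releases require the dashboard to be installed as a separate package, and it must be installed on a mgr node; otherwise the installation fails as follows: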

The following packages have unmet dependencies:
ceph-mgr-dashboard : Depends: ceph-mgr (= 15.2.13-1~bpo10+1) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
root@ceph-mgr1:~# apt-cache madison ceph-mgr-dashboard
root@ceph-mgr1-104:~# apt install ceph-mgr-dashboard

ceph@ceph-deploy-110:~/ceph-cluster$ ceph mgr module ls                     # list all modules
ceph@ceph-deploy-110:~/ceph-cluster$ ceph mgr module ls|grep dashboard
                "config_dashboard": {
                    "name": "config_dashboard",
            "name": "dashboard",
                    "default_value": "osd,host,dashboard,pool,block,nfs,ceph,monitors,gateway,logs,crush,maps",

11.2 Enable the dashboard module

ceph@ceph-deploy-110:~/ceph-cluster$ ceph mgr module enable dashboard
# The dashboard is not reachable right after the module is enabled: first either disable SSL or enable SSL, and set the listen address.

11.3 Configure the dashboard module

The dashboard settings are applied per mgr node, and SSL can be switched on or off, as follows:

ceph@ceph-deploy-110:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ssl false                                    # disable SSL
ceph@ceph-deploy-110:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ceph-mgr1-104/server_addr 192.168.3.104      # set the listen address
ceph@ceph-deploy-110:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ceph-mgr1-104/server_port 9009               # set the listen port
root@ceph-mgr1-104:~# systemctl restart ceph-mgr@ceph-mgr1-104.service                              # if the dashboard does not come up, restart the mgr service

11.4 Set the dashboard user and password

# Older releases
ceph@ceph-deploy-110:~/ceph-cluster$ ceph dashboard set-login-credentials jack 123456

# Newer releases (Ubuntu): the password must be read from a file
ceph@ceph-deploy-110:~/ceph-cluster$ touch pass.txt
ceph@ceph-deploy-110:~/ceph-cluster$ echo "12345678" > pass.txt
ceph@ceph-deploy-110:~/ceph-cluster$ ceph dashboard set-login-credentials jack -i pass.txt
******************************************************************
***          WARNING: this command is deprecated.              ***
*** Please use the ac-user-* related commands to manage users. ***
******************************************************************
Username and password updated
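
The deprecation warning above points at the `ac-user-*` commands. A minimal sketch of the replacement follows, assuming the `ceph dashboard ac-user-create <username> -i <password-file> <role>` form and the built-in `administrator` role found in recent releases (confirm with `ceph dashboard -h` on your version). The wrapper name `create_dashboard_admin` and the `CEPH_BIN` override are made up here so the command can be previewed without touching a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around "ceph dashboard ac-user-create".
# CEPH_BIN defaults to the real ceph CLI; set CEPH_BIN=echo for a dry run.
CEPH_BIN="${CEPH_BIN:-ceph}"

create_dashboard_admin() {
  local user="$1" password_file="$2"
  if [ -z "$user" ] || [ ! -r "$password_file" ]; then
    echo "usage: create_dashboard_admin <user> <readable-password-file>" >&2
    return 1
  fi
  # The password is read from the file, so it never appears on the command line.
  "$CEPH_BIN" dashboard ac-user-create "$user" -i "$password_file" administrator
}
```

With the file from this section, `create_dashboard_admin jack pass.txt` would issue the non-deprecated command, and `CEPH_BIN=echo create_dashboard_admin jack pass.txt` only prints its arguments.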

11.5 Log in to the dashboard

http://192.168.3.104:9009/

11.6 dashboard SSL:

# Generate a self-signed certificate:
ceph@ceph-deploy-110:~/ceph-cluster$ ceph dashboard create-self-signed-cert
Self-signed certificate created
# Enable SSL:
ceph@ceph-deploy-110:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ssl true
# Check the current dashboard status (still HTTP until the mgr is restarted):
ceph@ceph-deploy-110:~/ceph-cluster$ ceph mgr services
{
    "dashboard": "http://192.168.3.104:9009/"
}
# Restart the mgr service:
root@ceph-mgr1-104:~# systemctl restart ceph-mgr@ceph-mgr1-104.service
# Verify the dashboard again:
ceph@ceph-deploy-110:~/ceph-cluster$ ceph mgr services
{
    "dashboard": "https://192.168.3.104:8443/"
}

https://192.168.3.104:8443/
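
When this check is scripted, the dashboard URL can be pulled out of the `ceph mgr services` JSON and its scheme compared before and after the mgr restart. The sketch below uses only `sed` and assumes the one-entry-per-line layout shown above; the function names are illustrative, not part of any Ceph tooling.

```shell
#!/usr/bin/env bash
# Extract the "dashboard" URL from `ceph mgr services` output read on stdin.
dashboard_url() {
  sed -n 's/.*"dashboard"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Print the scheme (http or https) of the URL given as the first argument.
url_scheme() {
  printf '%s\n' "${1%%://*}"
}

# Sample copied from the output above.
services='{
    "dashboard": "https://192.168.3.104:8443/"
}'

url="$(printf '%s\n' "$services" | dashboard_url)"
echo "dashboard at $url uses $(url_scheme "$url")"
```

On a live cluster, `ceph mgr services | dashboard_url` would take the place of the sample variable.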

12 Monitor the Ceph nodes with Prometheus:

12.1 Deploy Prometheus:

# Download and unpack
root@ceph-mgr2-105:~# mkdir /apps
root@ceph-mgr2-105:~# cd /apps
root@ceph-mgr2-105:/apps# wget https://mirrors.tuna.tsinghua.edu.cn/github-release/prometheus/prometheus/LatestRelease/prometheus-2.32.1.linux-amd64.tar.gz
root@ceph-mgr2-105:/apps# tar xf prometheus-2.32.1.linux-amd64.tar.gz -C /apps/
root@ceph-mgr2-105:/apps# ln -s /apps/prometheus-2.32.1.linux-amd64 /apps/prometheus
# Create the systemd unit file
root@ceph-mgr2-105:/apps/prometheus# vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
Restart=on-failure
WorkingDirectory=/apps/prometheus/
ExecStart=/apps/prometheus/prometheus --config.file=/apps/prometheus/prometheus.yml
[Install]
WantedBy=multi-user.target
root@ceph-mgr2-105:/apps/prometheus# systemctl daemon-reload
root@ceph-mgr2-105:/apps/prometheus# systemctl restart prometheus
root@ceph-mgr2-105:/apps/prometheus# systemctl enable prometheus

12.2 Open the Prometheus web UI

http://192.168.3.105:9090/

12.3 Deploy node_exporter:

Install node_exporter on every node host

root@ceph-node1-106:~# mkdir /apps && cd /apps
...
root@ceph-node4-109:~# mkdir /apps && cd /apps
root@ceph-node1-106:/apps# wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
...
root@ceph-node4-109:/apps# wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
root@ceph-node1-106:/apps# tar xf node_exporter-1.3.1.linux-amd64.tar.gz -C /apps
...
root@ceph-node4-109:/apps# tar xf node_exporter-1.3.1.linux-amd64.tar.gz -C /apps
root@ceph-node1-106:/apps# ln -s /apps/node_exporter-1.3.1.linux-amd64 /apps/node_exporter
...
root@ceph-node4-109:/apps# ln -s /apps/node_exporter-1.3.1.linux-amd64 /apps/node_exporter
# Create the systemd unit file on the first node, then copy it to the others
root@ceph-node1-106:/apps# vi /etc/systemd/system/node-exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
ExecStart=/apps/node_exporter/node_exporter
[Install]
WantedBy=multi-user.target
root@ceph-node1-106:/apps# scp /etc/systemd/system/node-exporter.service root@192.168.3.107:/etc/systemd/system/node-exporter.service
...
root@ceph-node1-106:/apps# scp /etc/systemd/system/node-exporter.service root@192.168.3.109:/etc/systemd/system/node-exporter.service
# Start the service
root@ceph-node1-106:/apps# systemctl daemon-reload
root@ceph-node1-106:/apps# systemctl restart node-exporter
root@ceph-node1-106:/apps# systemctl enable node-exporter
...
root@ceph-node4-109:/apps# systemctl daemon-reload
root@ceph-node4-109:/apps# systemctl restart node-exporter
root@ceph-node4-109:/apps# systemctl enable node-exporter
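
The per-node steps above are identical on all four nodes, so they can be driven from one host in a loop. This is a sketch under stated assumptions: passwordless root SSH to each node, node addresses taken from the environment table in 3.1, the `node-exporter.service` unit file already copied to each node, and a hypothetical `RUN` variable that defaults to `ssh` and can be set to `echo` to print the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Assumed node addresses (ceph-node1-106 .. ceph-node4-109 from section 3.1).
NODES="192.168.3.106 192.168.3.107 192.168.3.108 192.168.3.109"
VERSION="1.3.1"
TARBALL="node_exporter-${VERSION}.linux-amd64.tar.gz"
URL="https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/${TARBALL}"
RUN="${RUN:-ssh}"   # set RUN=echo for a dry run

# Run one remote command on every node, stopping at the first failure.
on_all_nodes() {
  local node
  for node in $NODES; do
    "$RUN" "root@${node}" "$1" || return 1
  done
}

# Download, unpack, link and start node_exporter on every node.
install_node_exporter() {
  on_all_nodes "mkdir -p /apps && cd /apps && wget -q ${URL} && tar xf ${TARBALL} -C /apps" &&
  on_all_nodes "ln -sfn /apps/node_exporter-${VERSION}.linux-amd64 /apps/node_exporter" &&
  on_all_nodes "systemctl daemon-reload && systemctl enable --now node-exporter"
}
```

Running `RUN=echo install_node_exporter` first shows exactly what would be sent to each node; `install_node_exporter` on its own then performs the installation.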

Verify the node_exporter data on each node:

http://192.168.3.107:9100/

12.4 Configure the Prometheus server to scrape the nodes and verify:

root@ceph-mgr2-105:~# cd /apps/prometheus
root@ceph-mgr2-105:/apps/prometheus# vi prometheus.yml
# Append at the end of the file, under scrape_configs
  - job_name: 'ceph-node-data'
    static_configs:
    - targets: ['192.168.3.106:9100','192.168.3.107:9100','192.168.3.108:9100','192.168.3.109:9100']
# Restart the service
root@ceph-mgr2-105:/apps/prometheus# systemctl restart prometheus.service
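
Typing the target list by hand is error-prone once more hosts are added. As a hedged illustration, the helper below prints a scrape job in the same layout as the entry above; the function name `scrape_job` is made up here, and the addresses in the example call are assumed from the environment table in 3.1.

```shell
#!/usr/bin/env bash
# Print a scrape job for prometheus.yml.
# Usage: scrape_job <job_name> <port> <host>...
scrape_job() {
  local job="$1" port="$2" targets="" host
  shift 2
  for host in "$@"; do
    # Append 'host:port', comma-separated after the first entry.
    targets="${targets:+${targets},}'${host}:${port}'"
  done
  printf "  - job_name: '%s'\n    static_configs:\n    - targets: [%s]\n" "$job" "$targets"
}

scrape_job ceph-node-data 9100 192.168.3.106 192.168.3.107 192.168.3.108 192.168.3.109
```

The output can be appended to the `scrape_configs` section of `prometheus.yml` before restarting the service.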

Check the targets page in a browser:

http://192.168.3.105:9090/targets

12.5 Monitor the Ceph services with Prometheus:

The Ceph manager ships with a built-in prometheus module that listens on port 9283 of every manager node and serves the collected metrics to Prometheus over HTTP.

https://docs.ceph.com/en/mimic/mgr/prometheus/?highlight=prometheus

12.5.1 Enable the prometheus module:

ceph@ceph-deploy-110:~/ceph-cluster$ ceph mgr module enable prometheus
root@ceph-mgr1-104:~# ss -lntp|grep 9283                                                    # the module listens on port 9283 on each mgr node
LISTEN   0         5                         *:9283                   *:*        users:(("ceph-mgr",pid=3572,fd=30))

12.5.2 Verify the manager metrics:

http://192.168.3.104:9283/metrics
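
The metrics page is plain text in the Prometheus exposition format, so a quick health probe can be scripted against it. This is a minimal sketch, assuming the mgr module exports an unlabeled `ceph_health_status` gauge (0 for HEALTH_OK, 1 for HEALTH_WARN, 2 for HEALTH_ERR); the function name `health_from_metrics` is made up here.

```shell
#!/usr/bin/env bash
# Read Prometheus text-format metrics on stdin and print the cluster health.
health_from_metrics() {
  awk '$1 == "ceph_health_status" { found = 1; v = $2 + 0 }
       END {
         if (!found)      print "UNKNOWN"
         else if (v == 0) print "HEALTH_OK"
         else if (v == 1) print "HEALTH_WARN"
         else             print "HEALTH_ERR"
       }'
}

# Illustrative sample, not captured from the cluster in this document.
sample='# HELP ceph_health_status Cluster health status
# TYPE ceph_health_status untyped
ceph_health_status 1.0'

printf '%s\n' "$sample" | health_from_metrics
```

Against the running manager this would be `curl -s http://192.168.3.104:9283/metrics | health_from_metrics`.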

12.5.3 Configure Prometheus to scrape the manager:

root@ceph-mgr2-105:/apps/prometheus# vi prometheus.yml
  - job_name: 'ceph-cluster-monitor'
    static_configs:
    - targets: ['192.168.3.105:9283']
root@ceph-mgr2-105:/apps/prometheus# systemctl restart prometheus.service

12.6 Display the monitoring data in Grafana:

12.6.1 Install Grafana

root@ceph-mgr2-105:~# cd /apps/
root@ceph-mgr2-105:/apps# wget https://mirrors.tuna.tsinghua.edu.cn/grafana/apt/pool/main/g/grafana/grafana_7.5.9_amd64.deb
root@ceph-mgr2-105:/apps# apt-get install -y adduser libfontconfig
root@ceph-mgr2-105:/apps# dpkg -i grafana_7.5.9_amd64.deb
root@ceph-mgr2-105:/apps# systemctl restart grafana-server
root@ceph-mgr2-105:/apps# ss -lntp |grep 3000