Overview
Ceph has a great many configuration parameters, and a web search turns up plenty of tuning recipes. What is rarely explained is why a parameter is set to a given value and whether that value is reasonable.
This article starts from our current ceph.conf file and explains every option in it, as a basis for future tuning and as a reference for newcomers.
Parameter details
1. Fixed configuration parameters

fsid = 6d529c3d-5745-4fa5-be5f-3962a8e8687c
mon_initial_members = mon1, mon2, mon3
mon_host = 10.10.40.67,10.10.40.68,10.10.40.69
These are usually generated by ceph-deploy. They are all Ceph monitor parameters and do not need to be changed.
2. Network configuration parameters

public_network = 10.10.40.0/24    # default: ""
cluster_network = 10.10.41.0/24    # default: ""
public network: carries monitor-to-OSD, client-to-monitor and client-to-OSD traffic; a high-bandwidth 10 GbE network is recommended.
cluster network: carries traffic between OSDs; it is usually placed on a high-bandwidth 10 GbE network as well.
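A quick sanity check for this section is to confirm that every monitor address falls inside the public network. The sketch below is only an illustration using Python's standard ipaddress module; the helper name hosts_outside_network is hypothetical and not part of any Ceph tool.

```python
import ipaddress

def hosts_outside_network(hosts, cidr):
    """Return the addresses in `hosts` that do not belong to the network `cidr`."""
    network = ipaddress.ip_network(cidr)
    return [h for h in hosts if ipaddress.ip_address(h) not in network]

# Values copied from the mon_host and public_network settings in this article.
mon_host = ["10.10.40.67", "10.10.40.68", "10.10.40.69"]
print(hosts_outside_network(mon_host, "10.10.40.0/24"))  # [] : all monitors are on the public network
```

An empty list means the monitors are reachable on the public network; running the same check against the cluster network (10.10.41.0/24) would list all three monitors, which is expected because monitors do not use the cluster network.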
Reference:
http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
3. Pool size configuration parameters

osd_pool_default_size = 3    # default: 3
osd_pool_default_min_size = 1    # default: 0; 0 means no specific default; ceph will use size-size/2
These two options are the default size parameters applied when a Ceph pool is created. They are usually set to 3 and 1; three replicas are enough to keep the data reliable.
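The note on osd_pool_default_min_size says that 0 means "use size - size/2". Assuming the division is integer division, the fallback can be worked out as follows; effective_min_size is a hypothetical helper written for this illustration only.

```python
def effective_min_size(size, min_size=0):
    """Return the min_size a pool would end up with.

    Assumption: min_size == 0 means "no specific default", in which case
    the value falls back to size - size/2 using integer division.
    """
    if min_size == 0:
        return size - size // 2
    return min_size

print(effective_min_size(3))     # 2 : fallback for a 3-replica pool
print(effective_min_size(3, 1))  # 1 : the explicit setting used in this article
```

Under that assumption, leaving the option at 0 on a 3-replica pool yields a min_size of 2, so the explicit value of 1 used here is more permissive than the fallback.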
4. Authentication configuration parameters

auth_service_required = none    # default: "cephx"
auth_client_required = none    # default: "cephx, none"
auth_cluster_required = none    # default: "cephx"
These are the Ceph authentication options; by default authentication is enabled.
On a cluster that is only used internally they are commonly set to none, meaning no authentication, which can speed up access to the cluster somewhat.
5. OSD down/out configuration parameters

mon_osd_down_out_interval = 864000    # default: 300 (seconds)
mon_osd_min_down_reporters = 2    # default: 2
mon_osd_report_timeout = 900    # default: 900
osd_heartbeat_interval = 15    # default: 6
osd_heartbeat_grace = 60    # default: 20
mon_osd_down_out_interval: the maximum time Ceph waits before marking an OSD as down and out.
mon_osd_min_down_reporters: the minimum number of reporters the monitor needs before marking an OSD down (each other OSD that reports this OSD as down counts as one reporter).
mon_osd_report_timeout: the longest the monitor waits before marking an OSD down.
osd_heartbeat_interval: the interval at which an OSD sends heartbeats to other OSDs (only OSDs that share a PG exchange heartbeats).
osd_heartbeat_grace: the maximum time before an OSD reports another OSD as down. Raising grace has a side effect: if an OSD exits abnormally, the other OSDs must wait a full grace period before reporting it, and during that period I/O to the PGs served by that OSD hangs. So avoid setting grace too high.
Setting these parameters sensibly for the actual environment makes it possible to reduce, or promptly detect, OSDs going down (lowering the duration and likelihood of hung I/O), and to lengthen the time before a down OSD is marked out (so that network jitter does not trigger data recovery).
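To make the trade-off concrete, the settings can be converted into more readable quantities. This is only back-of-the-envelope arithmetic based on the descriptions above; describe_failure_timing and its dictionary keys are hypothetical names, and the "I/O hang" figure simply takes the grace period as the waiting time described in the text.

```python
def describe_failure_timing(heartbeat_interval, heartbeat_grace, down_out_interval):
    """Summarize how long failure detection and the down-to-out transition take."""
    return {
        # How many heartbeat intervals fit into one grace period.
        "heartbeats_per_grace": heartbeat_grace // heartbeat_interval,
        # Per the text, I/O on the affected PGs hangs for about one grace period.
        "approx_io_hang_seconds": heartbeat_grace,
        # Time a down OSD stays in the cluster before it is marked out.
        "down_to_out_days": down_out_interval / 86400,
    }

print(describe_failure_timing(15, 60, 864000))
# {'heartbeats_per_grace': 4, 'approx_io_hang_seconds': 60, 'down_to_out_days': 10.0}
```

With the values used in this article, a failed OSD can stall I/O for roughly a minute, while a down OSD is only marked out after 10 days, compared with 5 minutes by default.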
References:
http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
http://blog.wjin.org/posts/ceph-osd-heartbeat.html
6. Objecter configuration parameters

objecter_inflight_ops = 10240    # default: 1024
objecter_inflight_op_bytes = 1048576000    # default: 100M
These are the throttle settings of the objecter on the OSD client side; they affect the performance of librbd and RGW.
Recommendation:
Increase both values.
7. Ceph RGW configuration parameters

rgw_frontends = "civetweb port=10080 num_threads=2000"    # default: "fastcgi, civetweb port=7480"
rgw_thread_pool_size = 512    # default: 100
rgw_override_bucket_index_max_shards = 20    # default: 0

rgw_max_chunk_size = 1048576    # default: 512 * 1024
rgw_cache_lru_size = 1000000    # default: 10000; num of entries in rgw cache
rgw_bucket_default_quota_max_objects = 10000000    # default: -1; number of objects allowed

rgw_dns_name = object-storage.ffan.com    # default: (not given)
rgw_swift_url = http://object-storage.ffan.com    # default: (not given)
rgw_frontends: the RGW front end, usually the lightweight civetweb. port is the port used to reach RGW and is set as needed; num_threads is the number of civetweb threads.
rgw_thread_pool_size: the number of threads of the RGW front-end web server. It means the same as num_threads in rgw_frontends, but num_threads takes precedence over rgw_thread_pool_size, so only one of the two needs to be set.
rgw_override_bucket_index_max_shards: the maximum number of shards of the RGW bucket index object. Raising it shortens access time to the bucket index object, but lengthens bucket listing (ls).
rgw_max_chunk_size: the maximum RGW chunk size; it can be raised for object storage workloads dominated by large files.
rgw_cache_lru_size: the size of the RGW LRU cache; for read-heavy workloads, raising it speeds up RGW responses.
rgw_bucket_default_quota_max_objects: used to limit the maximum number of objects in a bucket.
References:
http://docs.ceph.com/docs/jewel/install/install-ceph-gateway/
http://ceph-users.ceph.narkive.com/mdB90g7R/rgw-increase-the-first-chunk-size
https://access.redhat.com/solutions/2122231
8. Debug configuration parameters

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_mon = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
All debug output is turned off. This can speed up the cluster to some degree, but some key logs are lost, which makes problems harder to analyze when they occur.
Reference:
http://www.10tiao.com/html/362/201609/2654062487/1.html
9. OSD op configuration parameters

osd_enable_op_tracker = false    # default: true
osd_num_op_tracker_shard = 32    # default: 32
osd_op_threads = 10    # default: 2
osd_disk_threads = 1    # default: 1
osd_op_num_shards = 32    # default: 5
osd_op_num_threads_per_shard = 2    # default: 2
osd_enable_op_tracker: controls tracking of OSD op state; the default is true. Turning it off is not recommended, because the OSD's slow_request, ops_in_flight and historic_ops statistics then stop working:
# ceph daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
# ceph daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops
op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
With the op tracker enabled, if cluster IOPS is very high, osd_num_op_tracker_shard can be raised moderately, because each shard has its own independent mutex:

class OpTracker {
...
  struct ShardedTrackingData {
    Mutex ops_in_flight_lock_sharded;
    xlist<TrackedOp *> ops_in_flight_sharded;
    explicit ShardedTrackingData(string lock_name):
        ops_in_flight_lock_sharded(lock_name.c_str()) {}
  };
  vector<ShardedTrackingData*> sharded_in_flight_list;
  uint32_t num_optracker_shards;
...
};
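The reason more shards help under high IOPS is that each shard carries its own lock, so ops landing on different shards do not contend. The following Python sketch mimics that layout; it is not the Ceph implementation. The class name ShardedTracker and the choice of picking a shard by sequence number modulo the shard count are assumptions made for illustration.

```python
import threading

class ShardedTracker:
    """Toy in-flight op tracker with one lock per shard (illustration only)."""

    def __init__(self, num_shards):
        # Each shard owns an independent lock and its own set of in-flight ops.
        self.shards = [{"lock": threading.Lock(), "ops": set()}
                       for _ in range(num_shards)]

    def _shard(self, seq):
        # Assumption: ops are spread over shards by sequence number.
        return self.shards[seq % len(self.shards)]

    def register(self, seq):
        shard = self._shard(seq)
        with shard["lock"]:          # only this shard is locked
            shard["ops"].add(seq)

    def unregister(self, seq):
        shard = self._shard(seq)
        with shard["lock"]:
            shard["ops"].discard(seq)

    def in_flight(self):
        total = 0
        for shard in self.shards:
            with shard["lock"]:
                total += len(shard["ops"])
        return total

tracker = ShardedTracker(num_shards=32)
for seq in range(100):
    tracker.register(seq)
print(tracker.in_flight())  # 100
```

With a single shard every register/unregister call would serialize on one lock; with 32 shards, two ops only contend when they map to the same shard.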
osd_op_threads: the corresponding work queues are peering_wq (OSD peering requests) and recovery_gen_wq (PG recovery requests).
osd_disk_threads: the corresponding work queue is remove_wq (PG remove requests).
osd_op_num_shards and osd_op_num_threads_per_shard: the corresponding thread pool is osd_op_tp and the work queue is op_shardedwq.
The requests handled include:
OpRequestRef
PGSnapTrim
PGScrub
Raising osd_op_num_shards increases the number of threads that process OSD ops, which increases concurrency and improves OSD performance.
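As a rough illustration of the effect, and assuming the osd_op_tp pool runs osd_op_num_shards * osd_op_num_threads_per_shard worker threads, the thread count for the default and the configured values compares as follows. The helper name is hypothetical.

```python
def osd_op_worker_threads(num_shards, threads_per_shard):
    """Worker threads in the op thread pool, assuming shards * threads per shard."""
    return num_shards * threads_per_shard

print(osd_op_worker_threads(5, 2))   # 10 : default settings
print(osd_op_worker_threads(32, 2))  # 64 : settings used in this article
```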
10. OSD client message configuration parameters

osd_client_message_size_cap = 1048576000    # default: 500*1024L*1024L; client data allowed in-memory (in bytes)
osd_client_message_cap = 10000    # default: 100; num client messages allowed in-memory
These set the capacity for client messages received on the OSD side. Larger values raise the OSD's processing capacity but consume more system memory.
Recommendation:
When the server has plenty of memory, raise both values moderately.
11. OSD scrub configuration parameters

osd_scrub_begin_hour = 2    # default: 0
osd_scrub_end_hour = 6    # default: 24

# The time in seconds that scrubbing sleeps between two consecutive scrubs
osd_scrub_sleep = 2    # default: 0; sleep between [deep]scrub ops

osd_scrub_load_threshold = 5    # default: 0.5

# Minimum/maximum number of objects per chunky scrub; the values below are the defaults
osd_scrub_chunk_min = 5
osd_scrub_chunk_max = 25
Ceph OSD scrub is the mechanism that ensures data consistency in Ceph. Scrub works per PG, but each scrub takes the PG lock, so it can affect normal I/O on the PG.
Ceph later introduced the chunky scrub mode: each pass picks only part of the PG's objects, releases the PG lock when it finishes, and queues the next scrub pass for the PG. This greatly shortens the time scrub holds the PG lock and avoids undue impact on normal PG I/O.
Likewise, the osd_scrub_sleep parameter makes the thread release the PG lock before each scrub pass and then sleep for a while, which also reduces the impact of scrub on normal PG I/O.
Recommendations:
osd_scrub_begin_hour and osd_scrub_end_hour: the start and end time of OSD scrub; set them according to the workload.
osd_scrub_sleep: the time the OSD sleeps on each scrub pass. There is a bug related to this option, so disabling it is recommended.
osd_scrub_load_threshold: the system load threshold below which the OSD starts scrubbing; set it according to the system's load average.
osd_scrub_chunk_min and osd_scrub_chunk_max: set them according to the number of objects in a PG; for RGW workloads made up entirely of small files, both should be raised.
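The begin/end hours define a daily window in which scrubbing may start. A minimal sketch of such a window check is shown below; scrub_allowed is a hypothetical helper, and the handling of a window that wraps past midnight is an assumption about sensible behavior rather than something stated in this article.

```python
def scrub_allowed(hour, begin_hour, end_hour):
    """Return True when `hour` (0-23) falls inside the scrub window."""
    if begin_hour < end_hour:
        # Ordinary window within one day, e.g. 2 to 6.
        return begin_hour <= hour < end_hour
    # Assumed behavior: the window wraps past midnight, e.g. 22 to 4.
    return hour >= begin_hour or hour < end_hour

print(scrub_allowed(3, 2, 6))    # True  : inside the 02:00-06:00 window
print(scrub_allowed(12, 2, 6))   # False : midday is outside the window
print(scrub_allowed(12, 0, 24))  # True  : the defaults allow scrubbing all day
```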
References:
http://www.jianshu.com/p/ea2296e1555c
http://tracker.ceph.com/issues/19497
12. OSD thread timeout configuration parameters

osd_op_thread_timeout = 580    # default: 15
osd_op_thread_suicide_timeout = 600    # default: 150

osd_recovery_thread_timeout = 580    # default: 30
osd_recovery_thread_suicide_timeout = 600    # default: 300
The work queues associated with osd_op_thread_timeout and osd_op_thread_suicide_timeout are:
op_shardedwq - associated requests: OpRequestRef, PGSnapTrim, PGScrub
peering_wq - associated requests: OSD peering
The work queue associated with osd_recovery_thread_timeout and osd_recovery_thread_suicide_timeout is:
recovery_wq - associated requests: PG recovery
All Ceph work queues share the base class WorkQueue_, defined as follows:

/// Pool of threads that share work submitted to multiple work queues.
class ThreadPool : public md_config_obs_t {
...
  /// Basic interface to a work queue used by the worker threads.
  struct WorkQueue_ {
    string name;
    time_t timeout_interval, suicide_interval;
    WorkQueue_(string n, time_t ti, time_t sti)
      : name(n), timeout_interval(ti), suicide_interval(sti)
    { }
...
Here timeout_interval and suicide_interval correspond to the timeout and suicide_timeout options described above.
When a thread processes a request from a work queue, it is bound by these two timeouts:
timeout_interval - when it expires, m_unhealthy_workers is incremented by 1
suicide_interval - when it expires, assert is called and the OSD process crashes
The function that handles this is:

bool HeartbeatMap::_check(const heartbeat_handle_d *h, const char *who, time_t now)
{
  bool healthy = true;
  time_t was;
  was = h->timeout.read();
  if (was && was < now) {
    ldout(m_cct, 1) << who << " '" << h->name << "'"
                    << " had timed out after " << h->grace << dendl;
    healthy = false;
  }
  was = h->suicide_timeout.read();
  if (was && was < now) {
    ldout(m_cct, 1) << who << " '" << h->name << "'"
                    << " had suicide timed out after " << h->suicide_grace << dendl;
    assert(0 == "hit suicide timeout");
  }
  return healthy;
}
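The two-stage behavior of HeartbeatMap::_check can be restated in a few lines of Python. This is a simplified model for reading the C++ above, not Ceph code: check_worker and SuicideTimeout are hypothetical names, deadlines are plain numbers, and a deadline of 0 stands for "not armed", mirroring the `was &&` test.

```python
class SuicideTimeout(Exception):
    """Stands in for the assert that aborts the OSD process."""

def check_worker(timeout_deadline, suicide_deadline, now):
    """Return True if the worker is healthy; raise once the suicide deadline has passed."""
    healthy = True
    if timeout_deadline and timeout_deadline < now:
        # Soft limit exceeded: the worker is only counted as unhealthy.
        healthy = False
    if suicide_deadline and suicide_deadline < now:
        # Hard limit exceeded: the real code asserts and the process dies.
        raise SuicideTimeout("hit suicide timeout")
    return healthy

print(check_worker(0, 0, now=100))     # True  : no deadline armed
print(check_worker(50, 200, now=100))  # False : timed out but still alive
```

This is why the timeout can be kept small for early detection, while the suicide timeout needs enough headroom that a merely slow request does not take the daemon down.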
Currently only RGW exposes worker perf counters, so only RGW can show total/unhealthy worker information through perf dump:
[root@ yangguanjun]# ceph daemon /var/run/ceph/ceph-client.rgw.rgwdaemon.asok perf dump | grep worker
        "total_workers": 32,
        "unhealthy_workers": 0
The corresponding option is:
OPTION(rgw_num_async_rados_threads, OPT_INT, 32) // num of threads to use for async rados operations
Recommendations:
*_thread_timeout: the smaller this value, the sooner slow requests are detected, so a large value is not recommended; on fast devices in particular, lower it.
*_thread_suicide_timeout: a small value makes the OSD crash once the timeout is hit, so raise it; this matters even more after the corresponding throttle has been raised.
13. filestore op thread configuration parameters

filestore_op_threads = 10    # default: 2
filestore_op_thread_timeout = 580    # default: 60
filestore_op_thread_suicide_timeout = 600    # default: 180
filestore_op_threads: the corresponding thread pool is op_tp and the corresponding work queue is op_wq; all filestore requests go through op_wq.
Raising this parameter increases filestore's processing capacity and improves its performance; adjust it together with the filestore throttle.
The work queue associated with filestore_op_thread_timeout and filestore_op_thread_suicide_timeout is op_wq.
Their meaning is the same as that of thread_timeout/thread_suicide_timeout in the previous section.
13. filestore merge/split configuration parameters

filestore_merge_threshold = -1    # default: 10
filestore_split_multiple = 16000    # default: 2
These two parameters govern the splitting and merging of filestore directories. The maximum number of files allowed in each filestore directory is:
filestore_split_multiple * abs(filestore_merge_threshold) * 16
In RGW small-file workloads it is easy to reach the file count of the default configuration (320). If a filestore split is triggered while writes are in progress, filestore performance suffers badly.
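Plugging numbers into the formula above shows how far the configured values move the split point. The helper name max_files_per_dir is hypothetical; the arithmetic is exactly the formula given in the text.

```python
def max_files_per_dir(split_multiple, merge_threshold):
    """filestore_split_multiple * abs(filestore_merge_threshold) * 16"""
    return split_multiple * abs(merge_threshold) * 16

print(max_files_per_dir(2, 10))      # 320    : default configuration
print(max_files_per_dir(16000, -1))  # 256000 : configuration used in this article
```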
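Each time a filestore directory splits, it is divided into multiple directory levels according to the following rule, with 16 subdirectories at the lowest level:
For example, for PG 31.4C0 the hash ends in 4C0; if that directory splits, it becomes DIR_0/DIR_C/DIR_4/{DIR_0, ..., DIR_F}.
Objects in the original directory are moved into the subdirectories by rule. Object names have the form *__head_xxxxX4C0_*; at split time, whatever hex digit X is, the object goes into subdirectory DIR_X. For example, the object *__head_xxxxA4C0_* goes into DIR_0/DIR_C/DIR_4/DIR_A.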
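The placement rule above amounts to reading the hex digits of the object hash from right to left, one directory level per digit. A small sketch of that mapping follows; split_dir_path is a hypothetical helper, and the leading digits "1234" in the sample hash are made up to stand in for the "xxxx" placeholder.

```python
def split_dir_path(hash_hex, depth):
    """Build the nested directory path from the last `depth` hex digits, read right to left."""
    digits = hash_hex.upper()[::-1][:depth]
    return "/".join("DIR_" + d for d in digits)

print(split_dir_path("1234A4C0", 3))  # DIR_0/DIR_C/DIR_4
print(split_dir_path("1234A4C0", 4))  # DIR_0/DIR_C/DIR_4/DIR_A
```

The second call reproduces the example from the text: after one more split, an object whose hash ends in A4C0 lands in DIR_0/DIR_C/DIR_4/DIR_A.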
Solutions:
Raise the merge/split parameters so that a single directory holds more files.
Set filestore_merge_threshold to a negative number. This triggers directory pre-splitting early and avoids a burst of splits concentrated in one period; we have not investigated the detailed mechanism.
Specify expected-num-objects when creating the pool. The split subdirectories are then created at pool creation time according to the split rule, which avoids the impact of directory splitting on filestore performance.
References:
http://docs.ceph.com/docs/master/rados/configuration/filestore-config-ref/
http://docs.ceph.com/docs/jewel/rados/operations/pools/#create-a-pool
http://blog.csdn.net/for_tech/article/details/51251936
http://ivanjobs.github.io/page3/
14. filestore fd cache configuration parameters

filestore_fd_cache_shards = 32    # default: 16; FD number of shards
filestore_fd_cache_size = 32768    # default: 128; FD lru size
The filestore fd cache speeds up access to the files inside filestore. For workloads that are not write-once, raising these values can noticeably improve filestore performance.
15. filestore sync configuration parameters

filestore_wbthrottle_enable = false    # default: true; disabling is recommended on SSD
filestore_min_sync_interval = 5    # default: 0.01 s; minimum sync interval in seconds, syncs fs data to disk, FileStore::sync_entry()
filestore_max_sync_interval = 10    # default: 5 s; maximum sync interval in seconds, syncs fs data to disk, FileStore::sync_entry()
filestore_commit_timeout = 3000    # default: 600 s; new SyncEntryTimeout(m_filestore_commit_timeout) in FileStore::sync_entry()
filestore_wbthrottle_enable configures the filestore writeback throttle, that is, the data-volume threshold at which filestore processes the work queue op_wq. The default is true; when it is enabled, the XFS-related options are:

OPTION(filestore_wbthrottle_xfs_bytes_start_flusher, OPT_U64, 41943040)
OPTION(filestore_wbthrottle_xfs_bytes_hard_limit, OPT_U64, 419430400)
OPTION(filestore_wbthrottle_xfs_ios_start_flusher, OPT_U64, 500)
OPTION(filestore_wbthrottle_xfs_ios_hard_limit, OPT_U64, 5000)
OPTION(filestore_wbthrottle_xfs_inodes_start_flusher, OPT_U64, 500)
OPTION(filestore_wbthrottle_xfs_inodes_hard_limit, OPT_U64, 5000)
With ordinary HDDs it can stay true; for SSDs, turn it off so that the writeback throttle is not used.
filestore_min_sync_interval and filestore_max_sync_interval set the interval at which filestore flushes outstanding I/O to disk. Larger values let the system merge as much I/O as possible and reduce filestore's write pressure on the disk, but they also increase the memory consumed by the page cache and the risk of data loss.
filestore_commit_timeout is the timeout for filestore to sync an entry to disk. When filestore is under heavy load, raising this value helps avoid OSD crashes caused by I/O timeouts.
16. filestore throttle configuration parameters

filestore_expected_throughput_bytes = 536870912    # default: 200MB; Expected filestore throughput in B/s
filestore_expected_throughput_ops = 2500    # default: 200; Expected filestore throughput in ops/s
filestore_queue_max_bytes = 1048576000    # default: 100MB
filestore_queue_max_ops = 5000    # default: 50

# Use above to inject delays intended to keep the op queue between low and high
filestore_queue_low_threshhold = 0.6    # default: 0.3
filestore_queue_high_threshhold = 0.9    # default: 0.9

filestore_queue_high_delay_multiple = 2    # default: 0; Filestore high delay multiple. Defaults to 0 (disabled)
filestore_queue_max_delay_multiple = 10    # default: 0; Filestore max delay multiple. Defaults to 0 (disabled)
The jewel release introduced the dynamic throttle to smooth out the long-tail effect caused by the ordinary throttle.
With ordinary disks the earlier throttle mechanism generally works well, which is why filestore_queue_high_delay_multiple and filestore_queue_max_delay_multiple both default to 0.
For fast disks, run the small tool ceph_smalliobenchfs before deployment to obtain suitable values.

BackoffThrottle is described as follows:
/**
 * BackoffThrottle
 *
 * Creates a throttle which gradually induces delays when get() is called
 * based on params low_threshhold, high_threshhold, expected_throughput,
 * high_multiple, and max_multiple.
 *
 * In [0, low_threshhold), we want no delay.
 *
 * In [low_threshhold, high_threshhold), delays should be injected based
 * on a line from 0 at low_threshhold to
 * high_multiple * (1/expected_throughput) at high_threshhold.
 *
 * In [high_threshhold, 1), we want delays injected based on a line from
 * (high_multiple * (1/expected_throughput)) at high_threshhold to
 * (high_multiple * (1/expected_throughput)) +
 * (max_multiple * (1/expected_throughput)) at 1.
 *
 * Let the current throttle ratio (current/max) be r, low_threshhold be l,
 * high_threshhold be h, high_delay (high_multiple / expected_throughput) be e,
 * and max_delay (max_muliple / expected_throughput) be m.
 *
 * delay = 0, r \in [0, l)
 * delay = (r - l) * (e / (h - l)), r \in [l, h)
 * delay = h + (r - h)((m - e)/(1 - h))
 */
References:
http://docs.ceph.com/docs/jewel/dev/osd_internals/osd_throttles/
http://blog.wjin.org/posts/ceph-dynamic-throttle.html
https://github.com/ceph/ceph/blob/master/src/doc/dynamic-throttle.txt
Ceph BackoffThrottle analysis
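The piecewise-linear delay described in the BackoffThrottle comment can be tried out numerically. The sketch below follows the closed-form expressions with one stated assumption: the third segment is taken to start from e (the delay reached at the high threshold) rather than from h, so that the curve is continuous; under these expressions the delay reaches m at r = 1. The function name backoff_delay is hypothetical, and the parameters are assumed to satisfy 0 <= low < high < 1.

```python
import math

def backoff_delay(r, low, high, expected_throughput, high_multiple, max_multiple):
    """Delay in seconds for throttle ratio r (current/max), per the piecewise formula."""
    e = high_multiple / expected_throughput   # delay reached at the high threshold
    m = max_multiple / expected_throughput    # delay reached when the queue is full
    if r < low:
        return 0.0
    if r < high:
        return (r - low) * (e / (high - low))
    # Assumption: start from e so the curve is continuous at r == high.
    return e + (r - high) * ((m - e) / (1 - high))

# Values from the filestore throttle settings in this article (ops based).
params = dict(low=0.6, high=0.9, expected_throughput=2500,
              high_multiple=2, max_multiple=10)
for ratio in (0.5, 0.75, 0.9, 1.0):
    print(ratio, backoff_delay(ratio, **params))
```

With these settings no delay is injected while the queue is below 60% full, the delay grows to 2/2500 s at 90%, and then climbs more steeply to 10/2500 s when the queue is full. With both multiples left at 0, every branch returns 0, which matches the "disabled" default.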
17. filestore finisher threads configuration parameters

filestore_ondisk_finisher_threads = 2    # default: 1
filestore_apply_finisher_threads = 2    # default: 1
These two parameters define the number of finisher threads for filestore commit/apply. Both default to 1; after any I/O commit/apply completes, it must pass through the corresponding ondisk/apply finisher thread.
With ordinary HDDs the disk is the bottleneck and a single finisher thread copes well.
With fast disks, I/O completes quickly and a single finisher thread cannot keep up with the volume of I/O commit/apply replies, so it becomes the bottleneck. For that reason the jewel release introduced the finisher thread pool settings; a value of 2 is usually enough.
18. Journal configuration parameters

journal_max_write_bytes = 1048576000    # default: 10M
journal_max_write_entries = 5000    # default: 100

journal_throttle_high_multiple = 2    # default: 0; Multiple over expected at high_threshhold. Defaults to 0 (disabled).
journal_throttle_max_multiple = 10    # default: 0; Multiple over expected at max. Defaults to 0 (disabled).
/// Target range for journal fullness
OPTION(journal_throttle_low_threshhold, OPT_DOUBLE, 0.6)
OPTION(journal_throttle_high_threshhold, OPT_DOUBLE, 0.9)
journal_max_write_bytes and journal_max_write_entries limit the amount of data and the number of entries in a single journal write.
When an SSD partition is used as the journal, both values should be raised to increase journal throughput.
journal_throttle_high_multiple and journal_throttle_max_multiple are parameters of JournalThrottle. JournalThrottle is a wrapper class around BackoffThrottle, so it works the same way as the dynamic throttle described in the filestore throttle section.

int FileJournal::set_throttle_params()
{
  stringstream ss;
  bool valid = throttle.set_params(
    g_conf->journal_throttle_low_threshhold,
    g_conf->journal_throttle_high_threshhold,
    g_conf->filestore_expected_throughput_bytes,
    g_conf->journal_throttle_high_multiple,
    g_conf->journal_throttle_max_multiple,
    header.max_size - get_top(),
    &ss);
...
}
As the code above shows, the related configuration parameters are:
journal_throttle_low_threshhold
journal_throttle_high_threshhold
filestore_expected_throughput_bytes
19. rbd cache configuration parameters

[client]
rbd_cache_size = 134217728    # default: 32M; cache size in bytes
rbd_cache_max_dirty = 100663296    # default: 24M; dirty limit in bytes - set to 0 for write-through caching
rbd_cache_target_dirty = 67108864    # default: 16M; target dirty limit in bytes
rbd_cache_writethrough_until_flush = true    # default: true; whether to make writeback caching writethrough until flush is called, to be sure the user of librbd will send flushs so that writeback is safe
rbd_cache_max_dirty_age = 5    # default: 1.0; seconds in cache before writeback starts
rbd_cache_size: the cache size of each rbd image on the client side. It does not need to be large and can be set to 64M; otherwise it consumes a fair amount of client memory.
Following the defaults, adjust rbd_cache_max_dirty and rbd_cache_target_dirty in proportion to rbd_cache_size.
rbd_cache_max_dirty: the maximum number of dirty bytes in the cache in writeback mode; the default is 24MB. A value of 0 selects writethrough mode.
rbd_cache_target_dirty: the dirty-bytes threshold at which the cache starts writing to the Ceph cluster in writeback mode; the default is 16MB. This value must be smaller than rbd_cache_max_dirty.
rbd_cache_writethrough_until_flush: until the kernel triggers a cache flush to the Ceph cluster, the rbd cache stays in writethrough mode; after the first flush it switches to writeback mode.
rbd_cache_max_dirty_age: the maximum time an entry of the OSDC-side ObjectCacher may stay in the cache.
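The advice to scale the dirty limits with the cache size can be made concrete by keeping the default 32M : 24M : 16M ratio. The helper below is a hypothetical illustration of that proportional scaling, not a librbd function.

```python
MIB = 1024 * 1024

def scaled_dirty_limits(cache_size):
    """Scale max_dirty and target_dirty from cache_size using the default 32:24:16 ratio."""
    max_dirty = cache_size * 24 // 32
    target_dirty = cache_size * 16 // 32
    return max_dirty, target_dirty

print(scaled_dirty_limits(128 * MIB))  # (100663296, 67108864)
print(scaled_dirty_limits(64 * MIB))   # (50331648, 33554432)
```

The first result matches the rbd_cache_max_dirty and rbd_cache_target_dirty values configured above for a 128M cache, and in both cases the target stays below the maximum as required.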