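Overview

Ceph has a great many configuration parameters, and a web search turns up plenty of lists of tuning values. Few of them explain why a parameter is set the way it is, or whether the value is reasonable.
This article starts from our current ceph.conf and explains each setting in it, as a basis for future tuning and as learning material for newcomers.

Parameter details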
1. Fixed configuration parameters

fsid = 6d529c3d-5745-4fa5-be5f-3962a8e8687c
mon_initial_members = mon1, mon2, mon3
mon_host = 10.10.40.67,10.10.40.68,10.10.40.69

These are normally generated by ceph-deploy. They all relate to the Ceph monitors and do not need to be changed.
2. Network configuration parameters

public_network = 10.10.40.0/24     # default: ""
cluster_network = 10.10.41.0/24    # default: ""

public network: carries monitor-to-OSD, client-to-monitor and client-to-OSD traffic; preferably a high-bandwidth 10 GbE network.
cluster network: carries traffic between OSDs; usually also a high-bandwidth 10 GbE network.

Reference:
http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
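A quick sanity check of the addressing above can be sketched in Python. The helper name hosts_outside is made up for this illustration; the addresses are the ones from the ceph.conf excerpts. It only confirms that the monitors sit inside the public network and that the two subnets do not overlap.

```python
import ipaddress

# Values copied from the ceph.conf excerpts above.
PUBLIC_NETWORK = ipaddress.ip_network("10.10.40.0/24")
CLUSTER_NETWORK = ipaddress.ip_network("10.10.41.0/24")
MON_HOSTS = ["10.10.40.67", "10.10.40.68", "10.10.40.69"]


def hosts_outside(network, hosts):
    """Return the hosts whose address is not inside `network` (hypothetical helper)."""
    return [h for h in hosts if ipaddress.ip_address(h) not in network]


print(hosts_outside(PUBLIC_NETWORK, MON_HOSTS))   # empty list: all monitors are on the public network
print(PUBLIC_NETWORK.overlaps(CLUSTER_NETWORK))   # False: the two subnets are disjoint
```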
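3. Pool size configuration parameters

osd_pool_default_size = 3        # default: 3
osd_pool_default_min_size = 1    # default: 0; 0 means no specific default, Ceph will use size - size/2

These two set the default size and min_size for newly created Ceph pools; they are usually set to 3 and 1. Three replicas are enough to keep the data reliable. Note that min_size = 1 lets a PG keep serving I/O with only one replica left, which favours availability over safety.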
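The "size - size/2" rule quoted in the comment above uses integer division. A minimal sketch of what it yields for a few pool sizes (the function name is made up for this example):

```python
def default_min_size(size: int) -> int:
    """min_size derived from size when osd_pool_default_min_size is 0."""
    return size - size // 2  # integer division, as in the quoted comment


for size in (2, 3, 4, 5):
    print(size, default_min_size(size))
```

For a 3-replica pool the derived value is 2, so the explicit setting of 1 in this ceph.conf is more permissive than what Ceph would choose on its own.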
4. Authentication configuration parameters

auth_service_required = none    # default: "cephx"
auth_client_required = none     # default: "cephx, none"
auth_cluster_required = none    # default: "cephx"

These are the Ceph authentication settings; by default cephx authentication is enabled.
On a cluster that is only used internally they are usually set to none, i.e. no authentication, which speeds up access to the cluster somewhat.
5. OSD down/out configuration parameters

mon_osd_down_out_interval = 864000    # default: 300 (seconds)
mon_osd_min_down_reporters = 2        # default: 2
mon_osd_report_timeout = 900          # default: 900
osd_heartbeat_interval = 15           # default: 6
osd_heartbeat_grace = 60              # default: 20

mon_osd_down_out_interval: how long Ceph waits after an OSD has been marked down before it also marks it out.
mon_osd_min_down_reporters: the minimum number of reporters the monitor needs before marking an OSD down (each other OSD that reports this OSD as down counts as one reporter).
mon_osd_report_timeout: the longest the monitor waits before marking an OSD down.
osd_heartbeat_interval: the interval at which an OSD sends heartbeats to other OSDs (only OSDs that share a PG exchange heartbeats).
osd_heartbeat_grace: how long an OSD waits for a peer before reporting that peer as down. Raising the grace has a side effect: if an OSD exits abnormally, the other OSDs wait a full grace period before reporting it, and during that time I/O to the PGs it serves hangs. So avoid making the grace too large.

Setting these sensibly for the actual environment reduces how often OSDs are marked down while still detecting a real failure promptly (lowering the time and probability of hung I/O), and lengthens the time before a down OSD is marked out (preventing data recovery caused by network jitter).
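To make the configured values easier to read, here is a rough timeline sketch. It is a simplification that ignores reporter counting and monitor processing time, and failure_timeline is a name invented for this example.

```python
from datetime import timedelta

# Configured values from the ceph.conf excerpt above (seconds).
OSD_HEARTBEAT_GRACE = 60
MON_OSD_DOWN_OUT_INTERVAL = 864000


def failure_timeline(grace_s, down_out_s):
    """Rough bounds after an OSD dies: time until 'down', then time until 'out'."""
    marked_down = timedelta(seconds=grace_s)
    marked_out = marked_down + timedelta(seconds=down_out_s)
    return marked_down, marked_out


down, out = failure_timeline(OSD_HEARTBEAT_GRACE, MON_OSD_DOWN_OUT_INTERVAL)
print(down)  # I/O to the affected PGs can hang for about this long
print(out)   # data is re-replicated elsewhere only after this
```

With 864000 seconds the out step is pushed back by ten days, so in practice an operator decides whether a down OSD should be marked out.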

References:
http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
http://blog.wjin.org/posts/ceph-osd-heartbeat.html
6. Objecter configuration parameters

objecter_inflight_ops = 10240              # default: 1024
objecter_inflight_op_bytes = 1048576000    # default: 100M

These are the throttle settings of the objecter on the OSD client side; they affect the performance of librbd and RGW.

Recommendation:
Increase both values.
7. Ceph RGW configuration parameters

rgw_frontends = "civetweb port=10080 num_threads=2000"    # default: "fastcgi, civetweb port=7480"
rgw_thread_pool_size = 512                                # default: 100
rgw_override_bucket_index_max_shards = 20                 # default: 0
rgw_max_chunk_size = 1048576                              # default: 512 * 1024
rgw_cache_lru_size = 1000000                              # default: 10000; number of entries in the RGW cache
rgw_bucket_default_quota_max_objects = 10000000           # default: -1; number of objects allowed

rgw_dns_name = object-storage.ffan.com                    # default: (none)
rgw_swift_url = http://object-storage.ffan.com            # default: (none)

rgw_frontends: the RGW frontend configuration; usually the lightweight civetweb is used. port is the port on which RGW is reached, set as needed; num_threads is the number of civetweb threads.
rgw_thread_pool_size: the number of threads of the RGW frontend web server. It means the same as num_threads in rgw_frontends, but num_threads takes precedence over rgw_thread_pool_size, so only one of the two needs to be set.
rgw_override_bucket_index_max_shards: the maximum number of shards of an RGW bucket index object. Raising it shortens access time to the bucket index object, but lengthens bucket listing (ls) time.
rgw_max_chunk_size: the maximum RGW chunk size; for object storage workloads with large files it can be raised.
rgw_cache_lru_size: the size of the RGW LRU cache; for read-heavy workloads, raising it speeds up RGW responses.
rgw_bucket_default_quota_max_objects: limits the maximum number of objects in a bucket.

References:
http://docs.ceph.com/docs/jewel/install/install-ceph-gateway/
http://ceph-users.ceph.narkive.com/mdB90g7R/rgw-increase-the-first-chunk-size
https://access.redhat.com/solutions/2122231
8. Debug configuration parameters

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_mon = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

All debug output is turned off. This speeds the cluster up to some extent, but key log messages are lost as well, which makes problems harder to analyse when they occur.
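Each value has the form log-level/memory-level: the first number controls what is written to the log file, the second what is kept in the in-memory log. A small sketch that renders such lines, so that a few subsystems can be kept at a non-zero memory level instead of switching everything off; debug_lines and the short subsystem list are illustrative choices, not part of Ceph.

```python
# A subset of the subsystems listed above; extend as needed.
SUBSYSTEMS = ["osd", "filestore", "journal", "ms", "mon", "rgw"]


def debug_lines(subsystems, log_level=0, memory_level=0):
    """Render one 'debug_<subsystem> = <log>/<memory>' line per subsystem."""
    return [f"debug_{name} = {log_level}/{memory_level}" for name in subsystems]


print("\n".join(debug_lines(SUBSYSTEMS)))      # everything off, as in the list above
print("\n".join(debug_lines(["osd"], 0, 5)))   # nothing in the log file, level 5 kept in memory
```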

Reference:
http://www.10tiao.com/html/362/201609/2654062487/1.html
9. OSD op configuration parameters

osd_enable_op_tracker = false       # default: true
osd_num_op_tracker_shard = 32       # default: 32
osd_op_threads = 10                 # default: 2
osd_disk_threads = 1                # default: 1
osd_op_num_shards = 32              # default: 5
osd_op_num_threads_per_shard = 2    # default: 2

osd_enable_op_tracker: controls tracking of OSD op state; the default is true. Disabling it is not recommended: once it is off, the OSD's slow_request, ops_in_flight and historic_ops statistics no longer work:

# ceph daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
# ceph daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops
op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.

With the op tracker enabled, osd_num_op_tracker_shard can be raised on a cluster with very high IOPS, because each shard has its own mutex:

class OpTracker {
...
  struct ShardedTrackingData {
    Mutex ops_in_flight_lock_sharded;
    xlist<TrackedOp *> ops_in_flight_sharded;
    explicit ShardedTrackingData(string lock_name):
        ops_in_flight_lock_sharded(lock_name.c_str()) {}
  };
  vector<ShardedTrackingData*> sharded_in_flight_list;
  uint32_t num_optracker_shards;
...
};
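Why more shards help can be shown with a toy model in Python. This is not the Ceph implementation: the class ShardedOpTracker and the choice of picking the shard from the op sequence number are assumptions made for this sketch. The point is only that two ops contend for a lock when they land in the same shard, so more shards means less contention.

```python
import threading


class ShardedOpTracker:
    """Toy model: in-flight ops spread over shards, one lock per shard."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.locks = [threading.Lock() for _ in range(num_shards)]
        self.in_flight = [set() for _ in range(num_shards)]

    def _shard(self, seq):
        # Assumption for this sketch: pick the shard from the op sequence number.
        return seq % self.num_shards

    def register(self, seq):
        i = self._shard(seq)
        with self.locks[i]:  # only ops in the same shard contend here
            self.in_flight[i].add(seq)

    def unregister(self, seq):
        i = self._shard(seq)
        with self.locks[i]:
            self.in_flight[i].discard(seq)

    def count(self):
        total = 0
        for lock, ops in zip(self.locks, self.in_flight):
            with lock:
                total += len(ops)
        return total


tracker = ShardedOpTracker(num_shards=32)
for seq in range(100):
    tracker.register(seq)
tracker.unregister(7)
print(tracker.count())  # 99 ops still in flight
```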
osd_op_threads: the corresponding work queues are peering_wq (OSD peering requests) and recovery_gen_wq (PG recovery requests).
osd_disk_threads: the corresponding work queue is remove_wq (PG remove requests).
osd_op_num_shards and osd_op_num_threads_per_shard: the corresponding thread pool is osd_op_tp and the work queue is op_shardedwq.
The requests it handles include:
OpRequestRef
PGSnapTrim
PGScrub
Raising osd_op_num_shards increases the number of threads that process OSD ops, which increases concurrency and improves OSD performance.
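As a worked example, the size of the osd_op_tp pool is the product of the two settings, since every shard gets its own threads. The function name below is made up for the illustration.

```python
def osd_op_tp_threads(num_shards, threads_per_shard):
    """Worker threads in osd_op_tp: shards times threads per shard."""
    return num_shards * threads_per_shard


print(osd_op_tp_threads(5, 2))    # 10 with the defaults quoted above
print(osd_op_tp_threads(32, 2))   # 64 with the values from this ceph.conf
```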
10. OSD client message configuration parameters

osd_client_message_size_cap = 1048576000    # default: 500*1024L*1024L; client data allowed in memory (in bytes)
osd_client_message_cap = 10000              # default: 100; number of client messages allowed in memory

These set the capacity of client messages an OSD holds in memory. Larger values raise the OSD's processing capacity but use more system memory.

Recommendation:
When the server has plenty of memory, raise both values moderately.
11. OSD scrub configuration parameters

osd_scrub_begin_hour = 2        # default: 0
osd_scrub_end_hour = 6          # default: 24

# The time in seconds that scrubbing sleeps between two consecutive scrubs
osd_scrub_sleep = 2             # default: 0; sleep between [deep]scrub ops

osd_scrub_load_threshold = 5    # default: 0.5

# Minimum/maximum number of objects per chunky-scrub chunk; these are the defaults
osd_scrub_chunk_min = 5
osd_scrub_chunk_max = 25

Ceph OSD scrub is the mechanism that keeps Ceph data consistent. Scrub works per PG, but each scrub takes the PG lock, so it can affect normal I/O on the PG.
Ceph later introduced the chunky scrub mode: each pass picks only part of the PG's objects, releases the PG lock when done, and queues the next scrub pass for the PG. This greatly shortens the time the PG lock is held during a scrub and keeps the impact on normal PG I/O small.
Likewise, the osd_scrub_sleep parameter makes the thread release the PG lock before each scrub pass and then sleep for a while, which also reduces the impact of scrub on normal PG I/O.
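The chunk-by-chunk locking described above can be sketched as follows. This is a simplified model, not Ceph code: chunky_scrub, the check callback and the object names are all invented for the example, and a plain threading.Lock stands in for the PG lock.

```python
import threading
import time


def chunky_scrub(objects, pg_lock, chunk_max=25, scrub_sleep=0.0, check=lambda obj: True):
    """Scrub `objects` chunk by chunk, holding `pg_lock` only per chunk.

    Returns the objects for which `check` failed.
    """
    inconsistent = []
    for start in range(0, len(objects), chunk_max):
        chunk = objects[start:start + chunk_max]
        with pg_lock:  # client I/O on this PG waits only for one chunk
            inconsistent.extend(obj for obj in chunk if not check(obj))
        if scrub_sleep:
            time.sleep(scrub_sleep)  # the lock is already released while sleeping
    return inconsistent


pg_lock = threading.Lock()
objects = [f"obj{i}" for i in range(60)]
bad = chunky_scrub(objects, pg_lock, chunk_max=25, check=lambda o: o != "obj42")
print(bad)  # ['obj42']
```

With 60 objects and chunk_max = 25 the lock is taken three times for short periods instead of once for the whole PG.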
Recommendations:
osd_scrub_begin_hour and osd_scrub_end_hour: the start and end hour of the OSD scrub window; set them according to the workload.
osd_scrub_sleep: how long the OSD sleeps on each scrub pass; there is a bug related to this setting, so disabling it is recommended.
osd_scrub_load_threshold: the system load threshold for starting a scrub; set it according to the system's load average.
osd_scrub_chunk_min and osd_scrub_chunk_max: set them according to the number of objects in a PG; for RGW workloads that consist entirely of small files, both need to be raised.

References:
http://www.jianshu.com/p/ea2296e1555c
http://tracker.ceph.com/issues/19497
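The begin/end hour window can be illustrated with a small check. This is a sketch under the assumption that a window whose begin hour is later than its end hour wraps past midnight; in_scrub_window is a made-up name, not a Ceph function.

```python
def in_scrub_window(hour, begin_hour=2, end_hour=6):
    """True if `hour` (0-23) falls inside [begin_hour, end_hour).

    Assumption: a window with begin_hour >= end_hour wraps past midnight.
    """
    if begin_hour < end_hour:
        return begin_hour <= hour < end_hour
    return hour >= begin_hour or hour < end_hour


print(in_scrub_window(3))          # True: inside the configured 02:00-06:00 window
print(in_scrub_window(6))          # False: the end hour itself is excluded
print(in_scrub_window(1, 22, 4))   # True: a 22:00-04:00 window wraps past midnight
```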
12. OSD thread timeout configuration parameters

osd_op_thread_timeout = 580                  # default: 15
osd_op_thread_suicide_timeout = 600          # default: 150

osd_recovery_thread_timeout = 580            # default: 30
osd_recovery_thread_suicide_timeout = 600    # default: 300

The work queues tied to osd_op_thread_timeout and osd_op_thread_suicide_timeout are:
op_shardedwq - requests: OpRequestRef, PGSnapTrim, PGScrub
peering_wq - requests: OSD peering
The work queue tied to osd_recovery_thread_timeout and osd_recovery_thread_suicide_timeout is:
recovery_wq - requests: PG recovery

All Ceph work queues share the base class WorkQueue_, defined as follows:

/// Pool of threads that share work submitted to multiple work queues.
class ThreadPool : public md_config_obs_t {
...
  /// Basic interface to a work queue used by the worker threads.
  struct WorkQueue_ {
    string name;
    time_t timeout_interval, suicide_interval;
    WorkQueue_(string n, time_t ti, time_t sti)
      : name(n), timeout_interval(ti), suicide_interval(sti)
    { }
...

Here timeout_interval and suicide_interval correspond to the timeout and suicide_timeout settings above.
While a thread processes a request from a work queue, it is bound by these two timeouts:
timeout_interval - when it expires, m_unhealthy_workers is incremented by 1
suicide_interval - when it expires, assert is called and the OSD process crashes
The function that handles this is:

bool HeartbeatMap::_check(const heartbeat_handle_d *h, const char *who, time_t now)
{
  bool healthy = true;
  time_t was;

  was = h->timeout.read();
  if (was && was < now) {
    ldout(m_cct, 1) << who << " '" << h->name << "'"
                    << " had timed out after " << h->grace << dendl;
    healthy = false;
  }
  was = h->suicide_timeout.read();
  if (was && was < now) {
    ldout(m_cct, 1) << who << " '" << h->name << "'"
                    << " had suicide timed out after " << h->suicide_grace << dendl;
    assert(0 == "hit suicide timeout");
  }
  return healthy;
}
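The two-stage behaviour of this check can be modelled in a few lines of Python. This is only an illustration of the logic above: check_worker is a made-up name, and a RuntimeError stands in for the assert that kills the OSD process.

```python
def check_worker(timeout_deadline, suicide_deadline, now):
    """Return True if healthy; raise if the suicide deadline has passed.

    A deadline of 0 means 'not armed', mirroring the `was && was < now` test.
    """
    healthy = True
    if timeout_deadline and timeout_deadline < now:
        healthy = False  # counted as an unhealthy worker, process keeps running
    if suicide_deadline and suicide_deadline < now:
        raise RuntimeError("hit suicide timeout")  # stands in for the assert
    return healthy


start = 1000
# Deadlines armed with the configured 580 s / 600 s values.
print(check_worker(start + 580, start + 600, now=start + 100))  # True
print(check_worker(start + 580, start + 600, now=start + 590))  # False
```

With 580 s and 600 s there are only 20 seconds between a worker being reported unhealthy and the process aborting.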

At present only RGW has perf counters for its workers, so only RGW can show the total/unhealthy worker information through perf dump:

[root@ yangguanjun]# ceph daemon /var/run/ceph/ceph-client.rgw.rgwdaemon.asok perf dump | grep worker
        "total_workers": 32,
        "unhealthy_workers": 0

The corresponding configuration option is:

OPTION(rgw_num_async_rados_threads, OPT_INT, 32) // num of threads to use for async rados operations

Recommendations:
*_thread_timeout: the smaller the value, the sooner slow requests are noticed, so a large value is not recommended; on fast devices in particular, lower it.
*_thread_suicide_timeout: too small a value makes the OSD crash after the timeout, so raise it; this matters even more once the corresponding throttles have been raised.

13. FileStore op thread configuration parameters

filestore_op_threads = 10                    # default: 2
filestore_op_thread_timeout = 580            # default: 60
filestore_op_thread_suicide_timeout = 600    # default: 180

filestore_op_threads: the corresponding thread pool is op_tp and the work queue is op_wq; every FileStore request passes through op_wq.
Raising it increases FileStore's processing capacity and performance; tune it together with the FileStore throttles.
The work queue tied to filestore_op_thread_timeout and filestore_op_thread_suicide_timeout is op_wq.
These settings mean the same as thread_timeout/thread_suicide_timeout in the previous section.

13. FileStore merge/split configuration parameters

filestore_merge_threshold = -1      # default: 10
filestore_split_multiple = 16000    # default: 2

These two parameters govern FileStore directory splitting and merging. The maximum number of files allowed in each FileStore directory is:
filestore_split_multiple * abs(filestore_merge_threshold) * 16
In RGW small-file workloads the file count of the default configuration (320) is reached very easily, and if a FileStore split is triggered during writes, FileStore performance suffers badly.
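Plugging numbers into the formula above shows how far the configured values move the split point. The function name is made up for this example.

```python
def max_files_per_dir(split_multiple, merge_threshold):
    """File count at which a FileStore directory splits (formula quoted above)."""
    return split_multiple * abs(merge_threshold) * 16


print(max_files_per_dir(2, 10))       # 320 with the defaults
print(max_files_per_dir(16000, -1))   # 256000 with the values from this ceph.conf
```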
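Each time a FileStore directory splits, it splits into nested directories by the following rule, with 16 subdirectories at the lowest level:
For example, PG 31.4C0 has a hash ending in 4C0; if its directory splits, it becomes DIR_0/DIR_C/DIR_4/{DIR_0 ... DIR_F}.
Objects in the original directory are moved into the subdirectories by rule. Object names have the format *__head_xxxxX4C0_*, and whatever X is at split time decides the subdirectory DIR_X. For example, the object *__head_xxxxA4C0_* goes into DIR_0/DIR_C/DIR_4/DIR_A.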
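The naming rule can be reproduced with a few lines of Python: the directory levels are the hex digits of the object hash read from the last digit backwards. The helper split_dir_path and the sample hash 1F2EA4C0 are made up for this illustration; only the trailing A4C0 matches the example in the text.

```python
def split_dir_path(object_hash, levels):
    """Directory path after splitting: hex digits of the hash, last digit first."""
    digits = object_hash.upper()[::-1][:levels]
    return "/".join(f"DIR_{d}" for d in digits)


print(split_dir_path("1F2EA4C0", 3))  # DIR_0/DIR_C/DIR_4
print(split_dir_path("1F2EA4C0", 4))  # DIR_0/DIR_C/DIR_4/DIR_A
```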
Solutions:
Increase the merge/split parameters so that a single directory holds more files.
Set filestore_merge_threshold to a negative value (a negative value disables directory merging); this triggers pre-splitting of directories ahead of time and avoids a burst of splits within one period. The detailed mechanism has not been investigated here.
Specify expected-num-objects when creating the pool; the split subdirectories are then created at pool creation according to the split rule, which avoids the impact of directory splitting on FileStore performance.

References:
http://docs.ceph.com/docs/master/rados/configuration/filestore-config-ref/
http://docs.ceph.com/docs/jewel/rados/operations/pools/#create-a-pool
http://blog.csdn.net/for_tech/article/details/51251936
http://ivanjobs.github.io/page3/

14. FileStore fd cache configuration parameters

filestore_fd_cache_shards = 32     # default: 16; number of FD cache shards
filestore_fd_cache_size = 32768    # default: 128; FD LRU size

The FileStore fd cache speeds up access to the files inside FileStore. For workloads that are not write-once, raising these values improves FileStore performance noticeably.

15. FileStore sync configuration parameters

filestore_wbthrottle_enable = false    # default: true; disabling is recommended on SSD
filestore_min_sync_interval = 5        # default: 0.01 s; minimum sync interval in seconds, syncs fs data to disk, FileStore::sync_entry()
filestore_max_sync_interval = 10       # default: 5 s; maximum sync interval in seconds, syncs fs data to disk, FileStore::sync_entry()
filestore_commit_timeout = 3000        # default: 600 s; FileStore::sync_entry() creates new SyncEntryTimeout(m_filestore_commit_timeout)

filestore_wbthrottle_enable concerns the FileStore writeback throttle, i.e. the data-volume thresholds FileStore applies when processing the op_wq work queue. The default is true; when it is enabled, the XFS-related options are:

OPTION(filestore_wbthrottle_xfs_bytes_start_flusher, OPT_U64, 41943040)
OPTION(filestore_wbthrottle_xfs_bytes_hard_limit, OPT_U64, 419430400)
OPTION(filestore_wbthrottle_xfs_ios_start_flusher, OPT_U64, 500)
OPTION(filestore_wbthrottle_xfs_ios_hard_limit, OPT_U64, 5000)
OPTION(filestore_wbthrottle_xfs_inodes_start_flusher, OPT_U64, 500)
OPTION(filestore_wbthrottle_xfs_inodes_hard_limit, OPT_U64, 5000)

With ordinary HDDs it can stay true; for SSDs, turn it off and run without the writeback throttle.
filestore_min_sync_interval and filestore_max_sync_interval set the interval at which FileStore flushes outstanding I/O to disk. Larger values let the system merge as much I/O as possible and reduce FileStore's write pressure on the disks, but they also increase the memory used by the page cache and the chance of data loss.
filestore_commit_timeout is the timeout for FileStore's sync entry to disk; under heavy FileStore load, raising it helps avoid OSD crashes caused by I/O timeouts.

16. FileStore throttle configuration parameters

filestore_expected_throughput_bytes = 536870912    # default: 200MB; expected FileStore throughput in B/s
filestore_expected_throughput_ops = 2500           # default: 200; expected FileStore throughput in ops/s
filestore_queue_max_bytes = 1048576000             # default: 100MB
filestore_queue_max_ops = 5000                     # default: 50

# Use above to inject delays intended to keep the op queue between low and high
filestore_queue_low_threshhold = 0.6               # default: 0.3
filestore_queue_high_threshhold = 0.9              # default: 0.9

filestore_queue_high_delay_multiple = 2            # default: 0; FileStore high delay multiple, 0 means disabled
filestore_queue_max_delay_multiple = 10            # default: 0; FileStore max delay multiple, 0 means disabled

The Jewel release introduced the dynamic throttle to smooth out the long-tail latency caused by the plain throttle.
With ordinary disks the previous throttle mechanism generally works well, so filestore_queue_high_delay_multiple and filestore_queue_max_delay_multiple both default to 0.
For fast disks, run the small tool ceph_smalliobenchfs before deployment to find suitable values.

BackoffThrottle is described as follows:

/**
 * BackoffThrottle
 *
 * Creates a throttle which gradually induces delays when get() is called
 * based on params low_threshhold, high_threshhold, expected_throughput,
 * high_multiple, and max_multiple.
 *
 * In [0, low_threshhold), we want no delay.
 *
 * In [low_threshhold, high_threshhold), delays should be injected based
 * on a line from 0 at low_threshhold to
 * high_multiple * (1/expected_throughput) at high_threshhold.
 *
 * In [high_threshhold, 1), we want delays injected based on a line from
 * (high_multiple * (1/expected_throughput)) at high_threshhold to
 * (high_multiple * (1/expected_throughput)) +
 * (max_multiple * (1/expected_throughput)) at 1.
 *
 * Let the current throttle ratio (current/max) be r, low_threshhold be l,
 * high_threshhold be h, high_delay (high_multiple / expected_throughput) be e,
 * and max_delay (max_muliple / expected_throughput) be m.
 *
 * delay = 0, r \in [0, l)
 * delay = (r - l) * (e / (h - l)), r \in [l, h)
 * delay = h + (r - h)((m - e)/(1 - h))
 */
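The piecewise formula in the comment can be tried out with the values from this ceph.conf. The sketch below is not the Ceph implementation; backoff_delay is a made-up name. One deliberate deviation, labelled as an assumption: the last line of the quoted comment starts the third segment at h, while this sketch starts it at e, so that the curve is continuous at r = h and reaches m at r = 1.

```python
def backoff_delay(r, low, high, expected_throughput, high_multiple, max_multiple):
    """Per-op delay in seconds for queue fill ratio r (current/max)."""
    e = high_multiple / expected_throughput   # delay reached at r == high
    m = max_multiple / expected_throughput    # delay reached at r == 1
    if r < low:
        return 0.0
    if r < high:
        return (r - low) * (e / (high - low))
    # Assumption: start from e (not h) so the curve is continuous at r == high.
    return e + (r - high) * ((m - e) / (1 - high))


# filestore_queue_* values from the ceph.conf excerpt above, ops-based throttle.
params = dict(low=0.6, high=0.9, expected_throughput=2500,
              high_multiple=2, max_multiple=10)
for r in (0.5, 0.75, 0.9, 1.0):
    print(r, backoff_delay(r, **params))
```

With these values no delay is injected while the queue is under 60% full, about 0.8 ms per op at 90%, and about 4 ms per op when the queue is full.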

References:
http://docs.ceph.com/docs/jewel/dev/osd_internals/osd_throttles/
http://blog.wjin.org/posts/ceph-dynamic-throttle.html
https://github.com/ceph/ceph/blob/master/src/doc/dynamic-throttle.txt
Ceph BackoffThrottle分析 (Ceph BackoffThrottle analysis)

17. FileStore finisher threads configuration parameters

filestore_ondisk_finisher_threads = 2    # default: 1
filestore_apply_finisher_threads = 2     # default: 1

These two parameters set the number of finisher threads that handle FileStore commit/apply; both default to 1. Every I/O, once committed or applied, has to pass through the corresponding ondisk/apply finisher thread.
With ordinary HDDs the disk is the bottleneck and a single finisher thread copes well.
With fast disks, I/O completes quickly and a single finisher thread cannot keep up with the volume of commit/apply replies, so it becomes the bottleneck. For that reason Jewel introduced a configurable finisher thread pool; a value of 2 is usually enough.

18. Journal configuration parameters

journal_max_write_bytes = 1048576000    # default: 10M
journal_max_write_entries = 5000        # default: 100

journal_throttle_high_multiple = 2      # default: 0; multiple over expected at high_threshhold, 0 means disabled
journal_throttle_max_multiple = 10      # default: 0; multiple over expected at max, 0 means disabled

/// Target range for journal fullness
OPTION(journal_throttle_low_threshhold, OPT_DOUBLE, 0.6)
OPTION(journal_throttle_high_threshhold, OPT_DOUBLE, 0.9)

journal_max_write_bytes and journal_max_write_entries limit the amount of data and the number of entries in a single journal write.
When an SSD partition serves as the journal, raise both values to increase journal throughput.
journal_throttle_high_multiple and journal_throttle_max_multiple are parameters of JournalThrottle. JournalThrottle is a wrapper around BackoffThrottle, so it works the same way as the dynamic throttle described in the FileStore throttle section.

int FileJournal::set_throttle_params()
{
  stringstream ss;
  bool valid = throttle.set_params(
    g_conf->journal_throttle_low_threshhold,
    g_conf->journal_throttle_high_threshhold,
    g_conf->filestore_expected_throughput_bytes,
    g_conf->journal_throttle_high_multiple,
    g_conf->journal_throttle_max_multiple,
    header.max_size - get_top(),
    &ss);
...
}

The code above shows that the related configuration parameters are:
journal_throttle_low_threshhold
journal_throttle_high_threshhold
filestore_expected_throughput_bytes

19. RBD cache configuration parameters

[client]
rbd_cache_size = 134217728                   # default: 32M; cache size in bytes
rbd_cache_max_dirty = 100663296              # default: 24M; dirty limit in bytes, set to 0 for write-through caching
rbd_cache_target_dirty = 67108864            # default: 16M; target dirty limit in bytes
rbd_cache_writethrough_until_flush = true    # default: true; whether to make writeback caching writethrough until flush is called, to be sure the user of librbd will send flushes so that writeback is safe
rbd_cache_max_dirty_age = 5                  # default: 1.0; seconds in cache before writeback starts

rbd_cache_size: the cache size of each RBD image on the client side. It does not need to be large; it can be set to 64M, otherwise it uses a lot of client memory.
Following the defaults, adjust rbd_cache_max_dirty and rbd_cache_target_dirty in proportion to rbd_cache_size.
rbd_cache_max_dirty: the maximum number of dirty bytes in the cache in writeback mode; the default is 24MB. A value of 0 selects writethrough mode.
rbd_cache_target_dirty: the dirty-byte threshold at which the cache starts writing to the Ceph cluster in writeback mode; the default is 16MB. It must be smaller than rbd_cache_max_dirty.
rbd_cache_writethrough_until_flush: until the kernel triggers a flush of the cache to the Ceph cluster, the RBD cache stays in writethrough mode; after the flush it switches to writeback mode.
rbd_cache_max_dirty_age: the longest an entry may stay in the cache of the ObjectCacher on the OSDC side.
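Scaling the dirty limits with the cache size, as suggested above, can be sketched like this. The helper scaled_dirty_limits is made up for the illustration; it keeps the 24/32 and 16/32 ratios of the defaults quoted in the comments.

```python
MIB = 1024 * 1024


def scaled_dirty_limits(cache_size, default_cache=32 * MIB,
                        default_max_dirty=24 * MIB, default_target_dirty=16 * MIB):
    """Scale max_dirty/target_dirty with cache_size, keeping the default ratios."""
    max_dirty = cache_size * default_max_dirty // default_cache
    target_dirty = cache_size * default_target_dirty // default_cache
    assert target_dirty < max_dirty < cache_size
    return max_dirty, target_dirty


print(scaled_dirty_limits(128 * MIB))  # (100663296, 67108864), the values configured above
print(scaled_dirty_limits(64 * MIB))   # (50331648, 33554432) for the suggested 64M cache
```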