Data Loss Problems Encountered with Fluentd

I have run into three main problems so far: two involving buffers and one involving connections. Below I describe each problem in detail along with the measures I have taken. First, the td-agent architecture I use is shown below. (For convenience, td-agent is abbreviated as TD from here on; for the relationship between TD and Fluentd, see my other blog post.)

Note: the buffer settings here were verified working on version 0.14.21; in my own tests they did not take effect on 0.12.20. See the Fluentd official site for version support details.

graph LR;
A(Td-client)-->F(Td-forward)
B(Td-client)-->F(Td-forward)
F-->E(Elasticsearch cluster)
E-->K(Kibana)

Versions:

td-agent 0.14.21
ES Version: 5.0.0, Build: 253032b/2016-10-26T04:37:51.531Z, JVM: 1.8.0_121, lucene_version: 6.2.0
Td-agent Elasticsearch plugin versions:
elasticsearch (1.0.18)
elasticsearch-api (1.0.18)
elasticsearch-transport (1.0.18)
fluent-plugin-elasticsearch (1.8.0)

Q1: Buffer problems on the Td-client side

This problem occurred most often, and the log makes the cause obvious: it mainly came down to parameter tuning.

  • Error log

    2018-03-16 03:25:19 +0000 [warn]: #0 suppressed same stacktrace
    2018-03-16 03:25:19 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:throw_exception
    2018-03-16 03:25:19 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" tag="logics.5013.205"
    2018-03-16 03:25:19 +0000 [warn]: #0 suppressed same stacktrace
    2018-03-16 03:25:19 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:throw_exception
    2018-03-16 03:25:19 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" tag="logics.5073.205"
  • Updated configuration

    <match logics.**>
    type forward
    <buffer>
    @type file
    path /var/log/td-agent/buffer/td-gamex-buffer
    chunk_limit_size 512MB #Default: 8MB (memory) / 256MB (file)
    total_limit_size 32GB #Default: 512MB (memory) / 64GB (file)
    chunk_full_threshold 0.9 #flush the chunk when actual size reaches chunk_limit_size * chunk_full_threshold
    compress text #The option to specify compression of each chunks, during events are buffered
    flush_mode default
    flush_interval 15s #Default: 60s
    flush_thread_count 1 #Default: 1 The number threads used to write chunks in parallel
    delayed_commit_timeout 60 #The timeout seconds decides that async write operation fails
    overflow_action throw_exception
    retry_timeout 10m
    </buffer>
    send_timeout 60s
    recover_wait 10s
    heartbeat_interval 1s
    phi_threshold 16
    hard_timeout 60s
    heartbeat_type tcp
    <server>
    name logics.shard
    host tdagent.test.net
    port 24224
    weight 1
    </server>
    </match>
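As a rough mental model (not Fluentd's actual implementation), the overflow in the log above happens when the total staged buffer would exceed `total_limit_size`; with `overflow_action throw_exception`, the emit is rejected immediately rather than blocking or dropping. A minimal Python sketch, with made-up sizes:

```python
class BufferOverflowError(Exception):
    """Mirrors Fluent::Plugin::Buffer::BufferOverflowError."""

class Buffer:
    # Simplified model: sizes are abstract units, one flat staging area.
    def __init__(self, total_limit_size):
        self.total_limit_size = total_limit_size
        self.staged = 0

    def emit(self, record_size):
        # overflow_action throw_exception: reject the write outright
        # once the buffer would exceed total_limit_size.
        if self.staged + record_size > self.total_limit_size:
            raise BufferOverflowError("buffer space has too many data")
        self.staged += record_size

    def flush(self):
        # A flush (e.g. every flush_interval) frees the staged space.
        self.staged = 0

buf = Buffer(total_limit_size=100)
buf.emit(60)
buf.emit(30)
try:
    buf.emit(20)          # would exceed the limit -> rejected
except BufferOverflowError as e:
    print(e)              # buffer space has too many data
buf.flush()
buf.emit(20)              # after a flush there is room again
```

This is why raising `total_limit_size` (and flushing more often via `flush_interval`) reduces the error: the producer gets more headroom before the write path starts throwing.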

Q2: Buffer problems on the Td-forward side

In principle, configuring the buffer on the forward side the same way as on the client side should be enough, but in practice that configuration raised the error below. The log points to a path problem: under the forest/copy output type, the buffer must be separated per index, so each sub-output needs its own buffer path. Using ${tag} in the buffer path solves this cleanly. Similar issues can be found on Github.

  • Error log

    2018-04-11 02:24:29 +0000 [error]: #0 Cannot output messages with tag 'logics.5022.205'
    2018-04-11 02:24:29 +0000 [error]: #0 failed to configure sub output copy: Other 'elasticsearch' plugin already use same buffer path: type = elasticsearch, buffer path = /var/log/td-agent/buffer/td-gamex-buffer
    2018-04-11 02:24:29 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/buf_file.rb:71:in `configure'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/output.rb:305:in `configure'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/inject.rb:104:in `configure'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/event_emitter.rb:73:in `configure'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/compat/output.rb:504:in `configure'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch.rb:71:in `configure'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin.rb:164:in `configure'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/multi_output.rb:73:in `block in configure'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/multi_output.rb:62:in `each'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/multi_output.rb:62:in `configure'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin.rb:164:in `configure'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:132:in `block in plant'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:128:in `synchronize'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:128:in `plant'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-forest-0.3.3/lib/fluent/plugin/out_forest.rb:169:in `emit'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/compat/output.rb:211:in `process'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/bare_output.rb:53:in `emit_sync'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/event_router.rb:96:in `emit_stream'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:300:in `on_message'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:211:in `block in handle_connection'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:248:in `call'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:248:in `block (3 levels) in read_messages'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:247:in `feed_each'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:247:in `block (2 levels) in read_messages'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:256:in `call'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin/in_forward.rb:256:in `block in read_messages'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/server.rb:576:in `call'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/server.rb:576:in `on_read_without_connection'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/io.rb:123:in `on_readable'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/io.rb:186:in `on_readable'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/loop.rb:88:in `run_once'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.4.6/lib/cool.io/loop.rb:88:in `run'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/event_loop.rb:84:in `block in start'
    /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.13/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
  • Fixed configuration

    <match logics.**>
    type forest
    subtype copy
    <template>
    <store>
    @type elasticsearch
    <buffer>
    @type file
    path /var/log/td-agent/buffer/td-gamex-buffer/${tag}
    chunk_limit_size 512MB #Default: 8MB (memory) / 256MB (file)
    total_limit_size 32GB #Default: 512MB (memory) / 64GB (file)
    chunk_full_threshold 0.9 #flush the chunk when actual size reaches chunk_limit_size * chunk_full_threshold
    compress text #The option to specify compression of each chunks, during events are buffered
    flush_mode default
    flush_interval 15s #Default: 60s
    flush_thread_count 1 #Default: 1 The number threads used to write chunks in parallel
    delayed_commit_timeout 60 #The timeout seconds decides that async write operation fails
    overflow_action throw_exception
    retry_timeout 10m
    </buffer>
    host elasticsearch.test.net
    port 9200
    logstash_format true
    logstash_prefix bilogs
    logstash_dateformat logics-${tag_parts[-1]}.%Y.%m.%d
    time_key time
    request_timeout 60s
    reload_connections false
    reload_on_failure true
    reconnect_on_error true
    </store>
    </template>
    </match>
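The collision itself is easy to picture: every file-buffer output registers its buffer path, and a duplicate registration fails, which is exactly what the "already use same buffer path" error reports. A toy Python sketch of that check (not the actual fluentd code; the function and registry names are made up):

```python
# Each file-buffer output claims a path; claiming the same path twice fails.
registered_paths = set()

def register_buffer(path_template, tag):
    # Expand the ${tag} placeholder the way the forest template does.
    path = path_template.replace("${tag}", tag)
    if path in registered_paths:
        raise ValueError("already use same buffer path: " + path)
    registered_paths.add(path)
    return path

# Without the placeholder, a second tag would collide:
#   register_buffer("/var/log/td-agent/buffer/td-gamex-buffer", "logics.5013.205")
#   register_buffer("/var/log/td-agent/buffer/td-gamex-buffer", "logics.5022.205")  # raises

# With ${tag}, every tag gets its own directory:
p1 = register_buffer("/var/log/td-agent/buffer/td-gamex-buffer/${tag}", "logics.5013.205")
p2 = register_buffer("/var/log/td-agent/buffer/td-gamex-buffer/${tag}", "logics.5022.205")
print(p1)  # /var/log/td-agent/buffer/td-gamex-buffer/logics.5013.205
print(p2)  # /var/log/td-agent/buffer/td-gamex-buffer/logics.5022.205
```

Since forest plants one output instance per tag, the per-tag path guarantees every planted instance lands in a distinct directory, which is why the configuration above stops the error.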

Q3: Connection problems on the Td-forward side

This problem occurred while TD was sending data to ES. At first I suspected the ES cluster had reached its processing limit and could not allocate more connections to TD, but everything returned to normal after a reload, so that explanation is unlikely. More probably, TD or ES has a flaw in its connection-handling logic and does not close or reuse connections correctly. Some research turned up a few leads; the references are included below.

  • References

    A. [Github issues about this problem]

  • Error log

    2018-03-21 04:28:19 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-03-21 04:28:34 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3fe6fced399c"
    2018-03-21 04:28:19 +0000 [warn]: suppressed same stacktrace
    2018-03-21 04:28:35 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-03-21 04:29:08 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3fe6fced399c"
  • Follow-up
    The real fix: upgrade. After moving to plugin version 1.9.1 the problem disappeared. [See the PR with the fix here] [The 1.9.1 release is dated 2016-12-14]

    The mitigations so far mainly involve the following settings:

    • reload_connections false # defaults to true
      You can tune how the elasticsearch-transport host reloading feature works. By default it will reload the host list from the server every 10,000th request to spread the load. This can be an issue if your Elasticsearch cluster is behind a Reverse Proxy, as Fluentd process may not have direct network access to the Elasticsearch nodes.
      In my case the ES cluster is not behind a proxy; it is reached through a DNS name.
    • reload_on_failure true # defaults to false
      Indicates that the elasticsearch-transport will try to reload the node addresses if there is a failure while making a request; this can be useful to quickly remove a dead node from the list of addresses.
      When a request fails, elasticsearch-transport reloads the node addresses and drops dead nodes; I also set this to true.
    • reconnect_on_error true
      Github reports that this helps; in my tests it did not work that well: the problem still occurs, though apparently less often.
  • Current working configuration
    <match logics.**>
    type forest
    subtype copy
    <template>
    <store>
    type elasticsearch
    <buffer>
    @type file
    path /var/log/td-agent/buffer/td-gamex-buffer/${tag}
    chunk_limit_size 512MB #Default: 8MB (memory) / 256MB (file)
    total_limit_size 32GB #Default: 512MB (memory) / 64GB (file)
    chunk_full_threshold 0.9 #flush chunk when size reaches chunk_limit_size * chunk_full_threshold
    compress text #The option to specify compression of each chunks, during events are buffered
    flush_mode default
    flush_interval 15s #Default: 60s
    flush_thread_count 1 #Default: 1 The number threads used to write chunks in parallel
    delayed_commit_timeout 60 #The timeout seconds async write operation fails
    overflow_action throw_exception
    retry_timeout 10m
    </buffer>
    host elasticsearch.yingxiong.net
    port 9200
    logstash_format true
    logstash_prefix bilogs
    logstash_dateformat logics-${tag_parts[-1]}.%Y.%W
    time_key time
    flush_interval 10s
    request_timeout 15s
    num_threads 2
    reload_connections false
    reload_on_failure true
    reconnect_on_error true
    </store>
    </template>
    </match>
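The "Cannot get new connection from pool" error can be pictured with a toy connection pool: if dead connections are leaked rather than replaced, the pool eventually has nothing left to hand out, and only a reload rebuilds it. A simplified Python sketch of that failure mode (class and method names are illustrative, not the elasticsearch-transport API):

```python
class Pool:
    # Toy model of a fixed-size connection pool whose dead
    # connections are never replaced by the (buggy) client.
    def __init__(self, size):
        self.size = size
        self.alive = size

    def get_connection(self):
        if self.alive == 0:
            raise RuntimeError("Cannot get new connection from pool.")
        self.alive -= 1
        return "conn"

    def mark_dead(self):
        # Bug: the dead connection is leaked, not returned or replaced.
        pass

    def reload(self):
        # reload_on_failure / a process reload rebuilds the full pool.
        self.alive = self.size

pool = Pool(size=2)
pool.get_connection(); pool.mark_dead()
pool.get_connection(); pool.mark_dead()
try:
    pool.get_connection()
except RuntimeError as e:
    print(e)            # Cannot get new connection from pool.
pool.reload()
pool.get_connection()   # works again after the reload
```

This matches the symptom above: the flush fails repeatedly until a reload, and it explains why `reload_on_failure true` softens the problem while the plugin upgrade (which fixes the leak itself) removes it.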