Monitoring and Management Guide

Introduction

There are several tools available to monitor and inspect Celery clusters.

This document describes some of these, as well as features related to monitoring, like events and broadcast commands.

Workers

Management Command-line Utilities (inspect/control)

celery can also be used to inspect and manage worker nodes (and to some degree tasks).

To list all the commands available do:

$ celery --help

or to get help for a specific command do:

$ celery <command> --help

Commands

  • shell: Drop into a Python shell.

    The locals will include the celery variable: this is the current app. Also all known tasks will be automatically added to locals (unless the --without-tasks flag is set).

    Uses https://pypi.org/project/Ipython/, https://pypi.org/project/bpython/, or regular python in that order if installed. You can force an implementation using --ipython, --bpython, or --python.

  • status: List active nodes in this cluster

    $ celery -A proj status
    
  • result: Show the result of a task

    $ celery -A proj result -t tasks.add 4e196aa4-0141-4601-8138-7aa33db0f577
    

    Note that you can omit the name of the task as long as the task doesn't use a custom result backend.

  • purge: Purge messages from all configured task queues.

    This command will remove all messages from queues configured in the CELERY_QUEUES setting:

    Warning

    There's no undo for this operation, and messages will be permanently deleted!

    $ celery -A proj purge
    

    You can also specify the queues to purge using the -Q option:

    $ celery -A proj purge -Q celery,foo,bar
    

    and exclude queues from being purged using the -X option:

    $ celery -A proj purge -X celery
    
  • inspect active: List active tasks

    $ celery -A proj inspect active
    

    These are all the tasks that are currently being executed.

  • inspect scheduled: List scheduled ETA tasks

    $ celery -A proj inspect scheduled
    

    These are tasks reserved by the worker when they have an eta or countdown argument set.

  • inspect reserved: List reserved tasks

    $ celery -A proj inspect reserved
    

    This will list all tasks that have been prefetched by the worker and are currently waiting to be executed (this doesn't include tasks with an ETA value set).

  • inspect revoked: List history of revoked tasks

    $ celery -A proj inspect revoked
    
  • inspect registered: List registered tasks

    $ celery -A proj inspect registered
    
  • inspect stats: Show worker statistics (see Statistics)

    $ celery -A proj inspect stats
    
  • inspect query_task: Show information about task(s) by id.

    Any worker that has one of these task ids reserved or active will respond with status and information.

    $ celery -A proj inspect query_task e9f6c8f0-fec9-4ae8-a8c6-cf8c8451d4f8
    

    You can also query for information about multiple tasks:

    $ celery -A proj inspect query_task id1 id2 ... idN
    
  • control enable_events: Enable events

    $ celery -A proj control enable_events
    
  • control disable_events: Disable events

    $ celery -A proj control disable_events
    
  • migrate: Migrate tasks from one broker to another (EXPERIMENTAL).

    $ celery -A proj migrate redis://localhost amqp://localhost
    

    This command will migrate all the tasks on one broker to another. As this command is new and experimental you should be sure to have a backup of the data before proceeding.

Note

All inspect and control commands support a --timeout argument: the number of seconds to wait for responses. You may have to increase this timeout if you're not getting a response due to latency.
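
Most of the inspect and control commands shown above are also available programmatically through app.control. A minimal sketch, assuming a proj app and a local broker (adjust the broker URL to your setup):

from celery import Celery

app = Celery('proj', broker='amqp://guest@localhost//')

# inspect() accepts the same timeout (in seconds) as the command line.
i = app.control.inspect(timeout=5.0)
print(i.active())     # currently executing tasks, per worker
print(i.scheduled())  # tasks held back by an eta/countdown
print(i.reserved())   # prefetched tasks waiting to execute

# Control commands use the same interface.
app.control.enable_events()
app.control.purge()   # same warning applies: purged messages are gone for good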

Specifying destination nodes

By default the inspect and control commands operate on all workers. You can specify a single worker, or a list of workers, using the --destination argument:

$ celery -A proj inspect -d w1@e.com,w2@e.com reserved

$ celery -A proj control -d w1@e.com,w2@e.com enable_events
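
The destination argument has a programmatic equivalent as well; a short sketch, again assuming a proj app and a local broker:

from celery import Celery

app = Celery('proj', broker='amqp://guest@localhost//')

# Restrict inspect to specific workers (mirrors the -d option).
i = app.control.inspect(['w1@e.com', 'w2@e.com'])
print(i.reserved())

# Control commands take a destination argument for the same purpose.
app.control.enable_events(destination=['w1@e.com', 'w2@e.com'])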

Flower: Real-time Celery web-monitor

Flower is a real-time web based monitor and administration tool for Celery. It's under active development, but is already an essential tool. Being the recommended monitor for Celery, it obsoletes the Django-Admin monitor, celerymon and the ncurses based monitor.

Flower is pronounced like "flow", but you can also use the botanical version if you prefer.

Features

  • Real-time monitoring using Celery Events

    • Task progress and history

    • Ability to show task details (arguments, start time, run-time, and more)

    • Graphs and statistics

  • Remote Control

    • View worker status and statistics

    • Shutdown and restart worker instances

    • Control worker pool size and autoscale settings

    • View and modify the queues a worker instance consumes from

    • View currently running tasks

    • View scheduled tasks (ETA/countdown)

    • View reserved and revoked tasks

    • Apply time and rate limits

    • Configuration viewer

    • Revoke or terminate tasks

  • HTTP API (a usage sketch follows this list)

    • List workers

    • Shut down a worker

    • Restart worker’s pool

    • Grow worker’s pool

    • Shrink worker’s pool

    • Autoscale worker pool

    • Start consuming from a queue

    • Stop consuming from a queue

    • List tasks

    • List (seen) task types

    • Get a task info

    • Execute a task

    • Execute a task by name

    • Get a task result

    • Change soft and hard time limits for a task

    • Change rate limit for a task

    • Revoke a task

  • OpenID authentication
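
As a usage sketch for the HTTP API above: the /api/workers and /api/tasks endpoints appear in Flower's documentation, but check the Flower API docs for the exact routes and response shapes in your version. This assumes Flower is running locally on its default port:

import requests

BASE = 'http://localhost:5555'

# List workers known to Flower.
workers = requests.get(f'{BASE}/api/workers').json()
for name, info in workers.items():
    print(name)

# List tasks Flower has seen.
tasks = requests.get(f'{BASE}/api/tasks').json()
print(f'{len(tasks)} tasks seen')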

Screenshots

[Screenshot: the Flower dashboard]

More screenshots are available in the Flower documentation.

Usage

You can use pip to install Flower:

$ pip install flower

Running the flower command will start a web-server that you can visit:

$ celery -A proj flower

The server runs on http://localhost:5555 by default, but you can change the port using the --port argument:

$ celery -A proj flower --port=5555

The broker URL can also be passed through the --broker argument:

$ celery --broker=amqp://guest:guest@localhost:5672// flower
or
$ celery --broker=redis://guest:guest@localhost:6379/0 flower

Then, you can visit Flower in your web browser:

$ open http://localhost:5555

Flower has many more features than are detailed here, including authorization options. Check out the official documentation for more information.
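
One such option is HTTP Basic Authentication; Flower documents a --basic_auth flag for this (the credentials below are placeholders):

$ celery -A proj flower --basic_auth=user1:password1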

celery events: Curses Monitor

Added in version 2.0.

celery events is a simple curses monitor displaying task and worker history. You can inspect the result and traceback of tasks, and it also supports some management commands like rate limiting and shutting down workers. This monitor was started as a proof of concept, and you probably want to use Flower instead.

Starting:

$ celery -A proj events

You should see a screen like:

[Screenshot: the celery events curses monitor]

celery events is also used to start snapshot cameras (see Snapshots):

$ celery -A proj events --camera=<camera-class> --frequency=1.0

and it includes a tool to dump events to stdout:

$ celery -A proj events --dump

For a complete list of options use --help:

$ celery events --help

RabbitMQ

To manage a Celery cluster it is important to know how RabbitMQ can be monitored.

RabbitMQ ships with the rabbitmqctl(1) command, with this you can list queues, exchanges, bindings, queue lengths, the memory usage of each queue, as well as manage users, virtual hosts and their permissions.

Note

The default virtual host ("/") is used in these examples. If you use a custom virtual host you have to add the -p argument to the command, for example: rabbitmqctl list_queues -p my_vhost

Inspecting queues

Finding the number of tasks in a queue:

$ rabbitmqctl list_queues name messages messages_ready \
                        messages_unacknowledged

Here messages_ready is the number of messages ready for delivery (sent but not received), and messages_unacknowledged is the number of messages that have been received by a worker but not acknowledged yet (meaning the task is in progress, or has been reserved). messages is the sum of ready and unacknowledged messages.

Finding the number of workers currently consuming from a queue:

$ rabbitmqctl list_queues name consumers

Finding the amount of memory allocated to a queue:

$ rabbitmqctl list_queues name memory

Tip:

Adding the -q option to rabbitmqctl(1) makes the output easier to parse.
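
If you want to consume that parseable output from a script, here's a minimal sketch, assuming rabbitmqctl is on PATH and you have access to the vhost:

import subprocess

# With -q the informational header is suppressed, so every output line is
# just "<name>\t<messages>".
out = subprocess.run(
    ['rabbitmqctl', 'list_queues', '-q', 'name', 'messages'],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    name, messages = line.split('\t')
    print(f'{name}: {messages} messages')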

Redis

If you're using Redis as the broker, you can monitor the Celery cluster using the redis-cli(1) command to list lengths of queues.

Inspecting queues

Finding the number of tasks in a queue:

$ redis-cli -h HOST -p PORT -n DATABASE_NUMBER llen QUEUE_NAME

The default queue is named celery. To get all available queues, invoke:

$ redis-cli -h HOST -p PORT -n DATABASE_NUMBER keys \*

Note

Queue keys only exist when there are tasks in them, so if a key doesn't exist it simply means there are no messages in that queue. This is because in Redis a list with no elements in it is automatically removed, and hence it won't show up in the keys command output, and llen for that list returns 0.

Also, if you're using Redis for other purposes, the output of the keys command will include unrelated values stored in the database. The recommended way around this is to use a dedicated DATABASE_NUMBER for Celery; you can also use database numbers to separate Celery applications from each other (virtual hosts). Note that this won't affect the monitoring events used by, for example, Flower, as Redis pub/sub commands are global rather than database based.
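
The same queries can be made from Python with the redis-py package; a minimal sketch, assuming the broker runs at redis://localhost:6379/0 (adjust host, port, and database number to your setup):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# llen returns 0 both for an empty queue and for a missing key, since
# Redis removes empty lists (see the note above).
print('celery:', r.llen('celery'))

# Candidate queue keys; may include unrelated keys if the database is
# shared with other applications.
for key in r.keys('*'):
    print(key)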

Munin

This is a list of known Munin plug-ins that can be useful when maintaining a Celery cluster.

Events

The worker has the ability to send a message whenever some event happens. These events are then captured by tools like Flower and celery events to monitor the cluster.

Snapshots

Added in version 2.1.

Even a single worker can produce a huge amount of events, so storing the history of all events on disk may be very expensive.

A sequence of events describes the cluster state in that time period; by taking periodic snapshots of this state you can keep all history, but still only periodically write it to disk.

To take snapshots you need a Camera class; with this you can define what should happen every time the state is captured. You can write it to a database, send it by email, or something else entirely.

celery events is then used to take snapshots with the camera. For example, if you want to capture state every 2 seconds using the camera myapp.Camera, you run celery events with the following arguments:

$ celery -A proj events -c myapp.Camera --frequency=2.0

Custom Camera

Cameras can be useful if you need to capture events and do something with those events at an interval. For real-time event processing you should use app.events.Receiver directly, as in Real-time processing below.

Here is an example camera, dumping the snapshot to screen:

from pprint import pformat

from celery.events.snapshot import Polaroid

class DumpCam(Polaroid):
    clear_after = True  # clear after flush (incl. state.event_count).

    def on_shutter(self, state):
        if not state.event_count:
            # No new events since last snapshot.
            return
        print('Workers: {0}'.format(pformat(state.workers, indent=4)))
        print('Tasks: {0}'.format(pformat(state.tasks, indent=4)))
        print('Total: {0.event_count} events, {0.task_count} tasks'.format(
            state))

See the API reference for celery.events.state to read more about state objects.

Now you can use this camera with celery events by specifying it with the -c option:

$ celery -A proj events -c myapp.DumpCam --frequency=2.0

Or you can use it programmatically like this:

from celery import Celery
from myapp import DumpCam

def main(app, freq=1.0):
    state = app.events.State()
    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={'*': state.event})
        with DumpCam(state, freq=freq):
            recv.capture(limit=None, timeout=None)

if __name__ == '__main__':
    app = Celery(broker='amqp://guest@localhost//')
    main(app)

Real-time processing

To process events in real-time you need the following:

  • An event consumer (this is the Receiver)

  • A set of handlers called when events come in.

    You can have different handlers for each event type, or a catch-all handler can be used ('*')

  • State (optional)

    app.events.State is a convenient in-memory representation of tasks and workers in the cluster that's updated as events come in.

    It encapsulates solutions for many common things, like checking if a worker is still alive (by verifying heartbeats), merging event fields together as events come in, making sure time-stamps are in sync, and so on.

Combining these you can easily process events in real-time:

from celery import Celery


def my_monitor(app):
    state = app.events.State()

    def announce_failed_tasks(event):
        state.event(event)
        # task name is sent only with -received event, and state
        # will keep track of this for us.
        task = state.tasks.get(event['uuid'])

        print('TASK FAILED: %s[%s] %s' % (
            task.name, task.uuid, task.info(),))

    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={
                'task-failed': announce_failed_tasks,
                '*': state.event,
        })
        recv.capture(limit=None, timeout=None, wakeup=True)

if __name__ == '__main__':
    app = Celery(broker='amqp://guest@localhost//')
    my_monitor(app)

Note

The wakeup argument to capture sends a signal to all workers to force them to send a heartbeat. This way you can immediately see workers when the monitor starts.

You can listen to specific events by specifying the handlers:

from celery import Celery

def my_monitor(app):
    state = app.events.State()

    def announce_failed_tasks(event):
        state.event(event)
        # task name is sent only with -received event, and state
        # will keep track of this for us.
        task = state.tasks.get(event['uuid'])

        print('TASK FAILED: %s[%s] %s' % (
            task.name, task.uuid, task.info(),))

    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={
                'task-failed': announce_failed_tasks,
        })
        recv.capture(limit=None, timeout=None, wakeup=True)

if __name__ == '__main__':
    app = Celery(broker='amqp://guest@localhost//')
    my_monitor(app)

Event Reference

This list contains the events sent by the worker, and their arguments.

Task Events

task-sent

signature:

task-sent(uuid, name, args, kwargs, retries, eta, expires, queue, exchange, routing_key, root_id, parent_id)

Sent when a task message is published and the task_send_sent_event setting is enabled.
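
This event is disabled by default. A minimal sketch of turning it on programmatically (the setting can also go in your configuration module):

from celery import Celery

app = Celery('proj', broker='amqp://guest@localhost//')

# Enable task-sent events for this app.
app.conf.task_send_sent_event = True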

task-received

signature:

task-received(uuid, name, args, kwargs, retries, eta, hostname, timestamp, root_id, parent_id)

Sent when the worker receives a task.

task-started

signature:

task-started(uuid, hostname, timestamp, pid)

Sent just before the worker executes the task.

task-succeeded

signature:

task-succeeded(uuid, result, runtime, hostname, timestamp)

Sent if the task executed successfully.

Run-time is the time it took to execute the task using the pool. (Starting from when the task is sent to the worker pool, and ending when the pool result handler callback is called.)

task-failed

signature:

task-failed(uuid, exception, traceback, hostname, timestamp)

Sent if the execution of the task failed.

task-rejected

signature:

task-rejected(uuid, requeue)

The task was rejected by the worker, possibly to be re-queued or moved to a dead letter queue.

task-revoked

signature:

task-revoked(uuid, terminated, signum, expired)

Sent if the task has been revoked (Note that this is likely to be sent by more than one worker).

  • terminated is set to true if the task process was terminated, and the signum field set to the signal used.

  • expired is set to true if the task expired.

task-retried

signature:

task-retried(uuid, exception, traceback, hostname, timestamp)

Sent if the task failed, but will be retried in the future.

Worker Events

worker-online

signature:

worker-online(hostname, timestamp, freq, sw_ident, sw_ver, sw_sys)

The worker has connected to the broker and is online.

  • hostname: Nodename of the worker.

  • timestamp: Event time-stamp.

  • freq: Heartbeat frequency in seconds (float).

  • sw_ident: Name of worker software (e.g., py-celery).

  • sw_ver: Software version (e.g., 2.2.0).

  • sw_sys: Operating System (e.g., Linux/Darwin).

worker-heartbeat

signature:

worker-heartbeat(hostname, timestamp, freq, sw_ident, sw_ver, sw_sys, active, processed)

Sent every minute; if the worker hasn't sent a heartbeat in 2 minutes, it is considered to be offline.

  • hostname: Nodename of the worker.

  • timestamp: Event time-stamp.

  • freq: Heartbeat frequency in seconds (float).

  • sw_ident: Name of worker software (e.g., py-celery).

  • sw_ver: Software version (e.g., 2.2.0).

  • sw_sys: Operating System (e.g., Linux/Darwin).

  • active: Number of currently executing tasks.

  • processed: Total number of tasks processed by this worker.

worker-offline

signature:

worker-offline(hostname, timestamp, freq, sw_ident, sw_ver, sw_sys)

The worker has disconnected from the broker.
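
Tying the worker events together, here's a minimal monitor, modeled on the Receiver example earlier, that prints these events as they arrive; the field names follow the signatures above:

from celery import Celery

def announce_workers(app):
    def on_worker_event(event):
        # Every event carries its type, plus the fields listed above.
        print('{type}: {hostname} at {timestamp}'.format(**event))

    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={
            'worker-online': on_worker_event,
            'worker-heartbeat': on_worker_event,
            'worker-offline': on_worker_event,
        })
        recv.capture(limit=None, timeout=None, wakeup=True)

if __name__ == '__main__':
    app = Celery(broker='amqp://guest@localhost//')
    announce_workers(app)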