Asynchronous filesystem I/O

Trio provides built-in facilities for performing asynchronous filesystem operations like reading or renaming a file. Generally, we recommend that you use these instead of Python's normal synchronous file APIs. But the tradeoffs here are somewhat subtle: sometimes people switch to async I/O, and then they're surprised and confused when they find it doesn't speed up their program. The next section explains the theory behind async file I/O, to help you better understand your code's behavior. Or, if you just want to get started, you can jump down to the API overview.

Background: Why is async file I/O useful? The answer may surprise you

Many people expect that switching from synchronous file I/O to async file I/O will always make their program faster. This is not true! If we just look at total throughput, then async file I/O might be faster, slower, or about the same, and it depends in a complicated way on things like your exact patterns of disk access, or how much RAM you have. The main motivation for async file I/O is not to improve throughput, but to reduce the frequency of latency glitches.

To understand why, you need to know two things.

First, right now no mainstream operating system offers a generic, reliable, native API for async file or filesystem operations, so we have to fake it by using threads (specifically, trio.to_thread.run_sync()). This is cheap but isn't free: on a typical PC, dispatching to a worker thread adds something like ~100 µs of overhead to each operation. ("µs" is pronounced "microseconds", and there are 1,000,000 µs in a second. Note that all the numbers here are going to be rough orders of magnitude to give you a sense of scale; if you need precise numbers for your environment, measure!)

And second, the cost of a disk operation is incredibly bimodal. Sometimes, the data you need is already cached in RAM, and then accessing it is very, very fast – calling io.FileIO's read method on a cached file takes on the order of ~1 µs. But when the data isn't cached, then accessing it is much, much slower: the average is ~100 µs for SSDs and ~10,000 µs for spinning disks, and if you look at tail latencies then for both types of storage you'll see cases where occasionally some operation will be 10x or 100x slower than average. And that's assuming your program is the only thing trying to use that disk – if you're on some oversold cloud VM fighting for I/O with other tenants then who knows what will happen. And some operations can require multiple disk accesses.

Putting these together: if your data is in RAM then it should be clear that using a thread is a terrible idea – if you add 100 µs of overhead to a 1 µs operation, then that's a 100x slowdown! On the other hand, if your data's on a spinning disk, then using a thread is great – instead of blocking the main thread and all tasks for 10,000 µs, we only block them for 100 µs and can spend the rest of that time running other tasks to get useful work done, which can effectively be a 100x speedup.

But here's the problem: for any individual I/O operation, there's no way to know in advance whether it's going to be one of the fast ones or one of the slow ones, so you can't pick and choose. When you switch to async file I/O, it makes all the fast operations slower, and all the slow operations faster. Is that a win? In terms of overall speed, it's hard to say: it depends what kind of disks you're using and your kernel's disk cache hit rate, which in turn depends on your file access patterns, how much spare RAM you have, the load on your service, ... all kinds of things. If the answer is important to you, then there's no substitute for measuring your code's actual behavior in your actual deployment environment. But what we can say is that async disk I/O makes performance much more predictable across a wider range of runtime conditions.

If you're not sure what to do, then we recommend that you use async disk I/O by default, because it makes your code more robust when conditions are bad, especially with regards to tail latencies; this improves the chances that what your users see matches what you saw in testing. Blocking the main thread stops all tasks from running for that time. 10,000 µs is 10 ms, and it doesn't take many 10 ms glitches to start adding up to real money; async disk I/O can help prevent those. Just don't expect it to be magic, and be aware of the tradeoffs.

API overview

If you want to perform general filesystem operations like creating and listing directories, renaming files, or checking file metadata – or if you just want a friendly way to work with filesystem paths – then you want trio.Path. It's an asyncified replacement for the standard library's pathlib.Path, and provides the same comprehensive set of operations.

For reading and writing to files and file-like objects, Trio also provides a mechanism for wrapping any synchronous file-like object into an asynchronous interface. If you have a trio.Path object you can get one of these by calling its open() method; or if you know the file's name you can open it directly with trio.open_file(). Alternatively, if you already have an open file-like object, you can wrap it with trio.wrap_file() – one case where this is especially useful is to wrap io.BytesIO or io.StringIO when writing tests.

Asynchronous path objects

class trio.Path(*args: str | os.PathLike[str])

Bases: PurePath

An async pathlib.Path that executes blocking methods in trio.to_thread.run_sync().

Instantiating Path returns a concrete platform-specific subclass, one of PosixPath or WindowsPath.

await absolute()

Like absolute(), but async.

Return an absolute version of this path by prepending the current working directory. No normalization or symlink resolution is performed.

Use resolve() to get the canonical path to a file.

property anchor

The concatenation of the drive and root, or ''.

as_posix()

Return the string representation of the path with forward (/) slashes.

as_uri()

Return the path as a 'file' URI.

await chmod(mode, *, follow_symlinks=True)

Like chmod(), but async.

Change the permissions of the path, like os.chmod().

classmethod await cwd()

Like cwd(), but async.

Return a new path pointing to the current working directory.

property drive

The drive prefix (letter or UNC path), if any.

await exists(*, follow_symlinks=True)

Like exists(), but async.

Whether this path exists.

This method normally follows symlinks; to check whether a symlink exists, add the argument follow_symlinks=False.

await expanduser()

Like expanduser(), but async.

Return a new path with expanded ~ and ~user constructs (as returned by os.path.expanduser).

await glob(pattern, *, case_sensitive=None)

Like glob(), but async.

Iterate over this subtree and yield all existing files (of any kind, including directories) matching the given relative pattern.

This is an async method that returns a synchronous iterator, so you use it like:

for subpath in await mypath.glob('*.py'):
    ...

Note

The iterator is loaded into memory immediately during the initial call (see issue #501 for discussion).

await group()

Like group(), but async.

Return the group name of the file gid.

await hardlink_to(target)

Like hardlink_to(), but async.

Make this path a hard link pointing to the same file as target.

Note the order of arguments (self, target) is the reverse of os.link's.

classmethod await home()

Like home(), but async.

Return a new path pointing to the user's home directory (as returned by os.path.expanduser('~')).

is_absolute()

True if the path is absolute (has both a root and, if applicable, a drive).

await is_block_device()

Like is_block_device(), but async.

Whether this path is a block device.

await is_char_device()

Like is_char_device(), but async.

Whether this path is a character device.

await is_dir()

Like is_dir(), but async.

Whether this path is a directory.

await is_fifo()

Like is_fifo(), but async.

Whether this path is a FIFO.

await is_file()

Like is_file(), but async.

Whether this path is a regular file (also True for symlinks pointing to regular files).

await is_junction()

Like is_junction(), but async.

Whether this path is a junction.

await is_mount()

Like is_mount(), but async.

Check if this path is a mount point.

is_relative_to(other, /, *_deprecated)

Return True if the path is relative to another path, False otherwise.

is_reserved()

Return True if the path contains one of the special names reserved by the system, if any.

await is_socket()

Like is_socket(), but async.

Whether this path is a socket.

await is_symlink()

Like is_symlink(), but async.

Whether this path is a symbolic link.

await iterdir()

Like iterdir(), but async.

Yield path objects of the directory contents.

The children are yielded in arbitrary order, and the special entries '.' and '..' are not included.

This is an async method that returns a synchronous iterator, so you use it like:

for subpath in await mypath.iterdir():
    ...

Note

The iterator is loaded into memory immediately during the initial call (see issue #501 for discussion).

joinpath(*pathsegments)

Combine this path with one or several arguments, and return a new path representing either a subpath (if all arguments are relative paths) or a totally different path (if one of the arguments is anchored).

await lchmod(mode)

Like lchmod(), but async.

Like chmod(), except if the path points to a symlink, the symlink's permissions are changed, rather than its target's.

await lstat()

Like lstat(), but async.

Like stat(), except if the path points to a symlink, the symlink's status information is returned, rather than its target's.

match(path_pattern, *, case_sensitive=None)

Return True if this path matches the given pattern.

await mkdir(mode=511, parents=False, exist_ok=False)

Like mkdir(), but async.

Create a new directory at this given path.

property name

The final path component, if any.

await open(mode='r', buffering=-1, encoding=None, errors=None, newline=None)

Like open(), but async.

Open the file pointed to by this path and return a file object, as the built-in open() function does.

await owner()

Like owner(), but async.

Return the login name of the file owner.

property parent

The logical parent of the path.

property parents

A sequence of this path's logical parents.

property parts

An object providing sequence-like access to the components in the filesystem path.

await read_bytes()

Like read_bytes(), but async.

Open the file in bytes mode, read it, and close the file.

await read_text(encoding=None, errors=None)

Like read_text(), but async.

Open the file in text mode, read it, and close the file.

await readlink()

Like readlink(), but async.

Return the path to which the symbolic link points.

relative_to(other, /, *_deprecated, walk_up=False)

Return the relative path to another path identified by the passed arguments. If the operation is not possible (because this is not related to the other path), raise ValueError.

The walk_up parameter controls whether .. may be used to resolve the path.

await rename(target)

Like rename(), but async.

Rename this path to the target path.

The target path may be absolute or relative. Relative paths are interpreted relative to the current working directory, not the directory of the Path object.

Returns the new Path instance pointing to the target path.

await replace(target)

Like replace(), but async.

Rename this path to the target path, overwriting if that path exists.

The target path may be absolute or relative. Relative paths are interpreted relative to the current working directory, not the directory of the Path object.

Returns the new Path instance pointing to the target path.

await resolve(strict=False)

Like resolve(), but async.

Make the path absolute, resolving all symlinks on the way and also normalizing it.

await rglob(pattern, *, case_sensitive=None)

Like rglob(), but async.

Recursively yield all existing files (of any kind, including directories) matching the given relative pattern, anywhere in this subtree.

This is an async method that returns a synchronous iterator, so you use it like:

for subpath in await mypath.rglob('*.py'):
    ...

Note

The iterator is loaded into memory immediately during the initial call (see issue #501 for discussion).

await rmdir()

Like rmdir(), but async.

Remove this directory. The directory must be empty.

property root

The root of the path, if any.

await samefile(other_path)

Like samefile(), but async.

Return whether other_path refers to the same file as this path (as returned by os.path.samefile()).

await stat(*, follow_symlinks=True)

Like stat(), but async.

Return the result of the stat() system call on this path, like os.stat() does.

property stem

The final path component, minus its last suffix.

property suffix

The final component's last suffix, if any.

This includes the leading period. For example: '.txt'

property suffixes

A list of the final component's suffixes, if any.

These include the leading periods. For example: ['.tar', '.gz']

await symlink_to(target, target_is_directory=False)

Like symlink_to(), but async.

Make this path a symlink pointing to the target path. Note the order of arguments (link, target) is the reverse of os.symlink.

await touch(mode=438, exist_ok=True)

Like touch(), but async.

Create this file with the given access mode, if it doesn't exist.

await unlink(missing_ok=False)

Like unlink(), but async.

Remove this file or link. If the path is a directory, use rmdir() instead.

with_name(name)

Return a new path with the file name changed.

with_segments(*pathsegments)

Construct a new path object from any number of path-like objects. Subclasses may override this method to customize how new path objects are created from methods like iterdir().

with_stem(stem)

Return a new path with the stem changed.

with_suffix(suffix)

Return a new path with the file suffix changed. If the path has no suffix, add given suffix. If the given suffix is an empty string, remove the suffix from the path.

await write_bytes(data)

Like write_bytes(), but async.

Open the file in bytes mode, write to it, and close the file.

await write_text(data, encoding=None, errors=None, newline=None)

Like write_text(), but async.

Open the file in text mode, write to it, and close the file.

class trio.PosixPath(*args: str | os.PathLike[str])

Bases: Path, PurePosixPath

An async pathlib.PosixPath that executes blocking methods in trio.to_thread.run_sync().

class trio.WindowsPath(*args: str | os.PathLike[str])

Bases: Path, PureWindowsPath

An async pathlib.WindowsPath that executes blocking methods in trio.to_thread.run_sync().

Asynchronous file objects

await trio.open_file(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Asynchronous version of open().

Return type:

AsyncIOWrapper[object]

Returns:

An asynchronous file object

Example:

async with await trio.open_file(filename) as f:
    async for line in f:
        pass

assert f.closed

trio.wrap_file(file)

This wraps any file object in a wrapper that provides an asynchronous file object interface.

Parameters:

file (TypeVar(FileT)) -- a file object

Return type:

AsyncIOWrapper[TypeVar(FileT)]

Returns:

An asynchronous file object that wraps file

Example:

async_file = trio.wrap_file(StringIO('asdf'))

assert await async_file.read() == 'asdf'

Asynchronous file interface

Trio's asynchronous file objects have an interface that automatically adapts to the object being wrapped. Intuitively, you can mostly treat them like a regular file object, except adding an await in front of any of methods that do I/O. The definition of file object is a little vague in Python though, so here are the details:

  • Synchronous attributes/methods: if any of the following attributes or methods are present, then they're re-exported unchanged: closed, encoding, errors, fileno, isatty, newlines, readable, seekable, writable, buffer, raw, line_buffering, closefd, name, mode, getvalue, getbuffer.

  • Async methods: if any of the following methods are present, then they're re-exported as an async method: flush, read, read1, readall, readinto, readline, readlines, seek, tell, truncate, write, writelines, readinto1, peek, detach.

Special notes:

  • Async file objects implement Trio's AsyncResource interface: you close them by calling aclose() instead of close (!!), and they can be used as async context managers. Like all aclose() methods, the aclose method on async file objects is guaranteed to close the file before returning, even if it is cancelled or otherwise raises an error.

  • Using the same async file object from multiple tasks simultaneously: because the async methods on async file objects are implemented using threads, it's only safe to call two of them at the same time from different tasks IF the underlying synchronous file object is thread-safe. You should consult the documentation for the object you're wrapping. For objects returned from trio.open_file() or trio.Path.open(), it depends on whether you open the file in binary mode or text mode: binary mode files are task-safe/thread-safe, text mode files are not.

  • Async file objects can be used as async iterators to iterate over the lines of the file:

async with await trio.open_file(...) as f:
    async for line in f:
        print(line)

  • The detach method, if present, returns an async file object.

This should include all the attributes exposed by classes in io. But if you're wrapping an object that has other attributes that aren't on the list above, then you can access them via the .wrapped attribute:

wrapped

The underlying synchronous file object.