Text-related objects

Paragraph objects

class docx.text.paragraph.Paragraph

Proxy object wrapping a <w:p> element.

add_run(text: str | None = None, style: str | CharacterStyle | None = None) -> Run

Append run containing text and having character-style style.

text can contain tab (\t) characters, which are converted to the appropriate XML form for a tab. text can also include newline (\n) or carriage return (\r) characters, each of which is converted to a line break. When text is None, the new run is empty.

property alignment: WD_PARAGRAPH_ALIGNMENT | None

A member of the WD_PARAGRAPH_ALIGNMENT enumeration specifying the justification setting for this paragraph.

A value of None indicates the paragraph has no directly-applied alignment value and will inherit its alignment value from its style hierarchy. Assigning None to this property removes any directly-applied alignment value.

clear()

Return this same paragraph after removing all its content.

Paragraph-level formatting, such as style, is preserved.

property contains_page_break: bool

True when one or more rendered page-breaks occur in this paragraph.

property hyperlinks: List[Hyperlink]

A Hyperlink instance for each hyperlink in this paragraph.

insert_paragraph_before(text: str | None = None, style: str | ParagraphStyle | None = None) -> Paragraph

Return a newly created paragraph, inserted directly before this paragraph.

If text is supplied, the new paragraph contains that text in a single run. If style is provided, that style is assigned to the new paragraph.

iter_inner_content() -> Iterator[Run | Hyperlink]

Generate the runs and hyperlinks in this paragraph, in the order they appear.

The content in a paragraph consists of both runs and hyperlinks. This method allows accessing each of those separately, in document order, for when the precise position of the hyperlink within the paragraph text is important. Note that a hyperlink itself contains runs.

property paragraph_format

The ParagraphFormat object providing access to the formatting properties for this paragraph, such as line spacing and indentation.

property rendered_page_breaks: List[RenderedPageBreak]

All rendered page-breaks in this paragraph.

Most often an empty list, sometimes contains one page-break, but can contain more than one in rare or contrived cases.

property runs: List[Run]

Sequence of Run instances corresponding to the <w:r> elements in this paragraph.

property style: ParagraphStyle | None

Read/Write.

ParagraphStyle object representing the style assigned to this paragraph. If no explicit style is assigned to this paragraph, its value is the default paragraph style for the document. A paragraph style name can be assigned in lieu of a paragraph style object. Assigning None removes any applied style, making its effective value the default paragraph style for the document.

property text: str

The textual content of this paragraph.

The text includes the visible-text portion of any hyperlinks in the paragraph. Tabs and line breaks in the XML are mapped to \t and \n characters respectively.

Assigning text to this property causes all existing paragraph content to be replaced with a single run containing the assigned text. A \t character in the text is mapped to a <w:tab/> element and each \n or \r character is mapped to a line break. Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed.

ParagraphFormat objects

class docx.text.parfmt.ParagraphFormat

Provides access to paragraph formatting such as justification, indentation, line spacing, space before and after, and widow/orphan control.

property alignment

A member of the WD_PARAGRAPH_ALIGNMENT enumeration specifying the justification setting for this paragraph.

A value of None indicates paragraph alignment is inherited from the style hierarchy.

property first_line_indent

Length value specifying the relative difference in indentation for the first line of the paragraph.

A positive value causes the first line to be indented. A negative value produces a hanging indent. None indicates first line indentation is inherited from the style hierarchy.

property keep_together

True if the paragraph should be kept "in one piece" and not broken across a page boundary when the document is rendered.

None indicates its effective value is inherited from the style hierarchy.

property keep_with_next

True if the paragraph should be kept on the same page as the subsequent paragraph when the document is rendered.

For example, this property could be used to keep a section heading on the same page as its first paragraph. None indicates its effective value is inherited from the style hierarchy.

property left_indent

Length value specifying the space between the left margin and the left side of the paragraph.

None indicates the left indent value is inherited from the style hierarchy. Use an Inches value object as a convenient way to apply indentation in units of inches.

property line_spacing

float or Length value specifying the space between baselines in successive lines of the paragraph.

A value of None indicates line spacing is inherited from the style hierarchy. A float value, e.g. 2.0 or 1.75, indicates spacing is applied in multiples of line heights. A Length value such as Pt(12) indicates spacing is a fixed height. The Pt value class is a convenient way to apply line spacing in units of points. Assigning None resets line spacing to inherit from the style hierarchy.

property line_spacing_rule

A member of the WD_LINE_SPACING enumeration indicating how the value of line_spacing should be interpreted.

Assigning any of the WD_LINE_SPACING members SINGLE, DOUBLE, or ONE_POINT_FIVE will cause the value of line_spacing to be updated to produce the corresponding line spacing.

property page_break_before

True if the paragraph should appear at the top of the page following the prior paragraph.

None indicates its effective value is inherited from the style hierarchy.

property right_indent

Length value specifying the space between the right margin and the right side of the paragraph.

None indicates the right indent value is inherited from the style hierarchy. Use a Cm value object as a convenient way to apply indentation in units of centimeters.

property space_after

Length value specifying the spacing to appear between this paragraph and the subsequent paragraph.

None indicates this value is inherited from the style hierarchy. Length objects provide convenience properties, such as pt and inches, that allow easy conversion to various length units.

property space_before

Length value specifying the spacing to appear between this paragraph and the prior paragraph.

None indicates this value is inherited from the style hierarchy. Length objects provide convenience properties, such as pt and cm, that allow easy conversion to various length units.

property tab_stops

TabStops object providing access to the tab stops defined for this paragraph format.

property widow_control

True if the first and last lines in the paragraph remain on the same page as the rest of the paragraph when Word repaginates the document.

None indicates its effective value is inherited from the style hierarchy.

Run objects

class docx.text.run.Run

Proxy object wrapping <w:r> element.

Several of the properties on Run take a tri-state value, True, False, or None. True and False correspond to on and off respectively. None indicates the property is not specified directly on the run and its effective value is taken from the style hierarchy.

add_break(break_type: WD_BREAK_TYPE = WD_BREAK_TYPE.LINE)

Add a break element of break_type to this run.

break_type can take the values WD_BREAK.LINE, WD_BREAK.PAGE, and WD_BREAK.COLUMN where WD_BREAK is imported from docx.enum.text. break_type defaults to WD_BREAK.LINE.

add_picture(image_path_or_stream: str | IO[bytes], width: int | Length | None = None, height: int | Length | None = None) -> InlineShape

Return InlineShape containing image identified by image_path_or_stream.

The picture is added to the end of this run.

image_path_or_stream can be a path (a string) or a file-like object containing a binary image.

If neither width nor height is specified, the picture appears at its native size. If only one is specified, it is used to compute a scaling factor that is then applied to the unspecified dimension, preserving the aspect ratio of the image. The native size of the picture is calculated using the dots-per-inch (dpi) value specified in the image file, defaulting to 72 dpi if no value is specified, as is often the case.

add_tab() -> None

Add a <w:tab/> element at the end of the run, which Word interprets as a tab character.

add_text(text: str)

Return a newly appended _Text object (corresponding to a new <w:t> child element of this run) containing text.

Compare with the possibly more friendly approach of assigning text to the Run.text property.

property bold: bool | None

Read/write tri-state value.

When True, causes the text of the run to appear in bold face. When False, the text unconditionally appears non-bold. When None the bold setting for this run is inherited from the style hierarchy.

clear()

Return reference to this run after removing all its content.

All run formatting is preserved.

property contains_page_break: bool

True when one or more rendered page-breaks occur in this run.

Note that "hard" page-breaks inserted by the author are not included. A hard page-break gives rise to a rendered page-break in the right position so if those were included that page-break would be "double-counted".

It would be very rare for multiple rendered page-breaks to occur in a single run, but it is possible.

property font: Font

The Font object providing access to the character formatting properties for this run, such as font name and size.

property italic: bool | None

Read/write tri-state value.

When True, causes the text of the run to appear in italics. When False, the text unconditionally appears non-italic. When None the italic setting for this run is inherited from the style hierarchy.

iter_inner_content() -> Iterator[str | Drawing | RenderedPageBreak]

Generate the content-items in this run in the order they appear.

NOTE: only content-types currently supported by python-docx are generated. In this version, that is text and rendered page-breaks. Drawing is included but currently only provides access to its XML element (CT_Drawing) on its ._drawing attribute. Drawing attributes and methods may be expanded in future releases.

There are a number of element-types that can appear inside a run, but most of those (w:br, w:cr, w:noBreakHyphen, w:t, w:tab) have a clear plain-text equivalent. Any contiguous range of such elements is generated as a single str. Rendered page-break and drawing elements are generated individually. Any other elements are ignored.

property style: CharacterStyle

Read/write.

A CharacterStyle object representing the character style applied to this run. The default character style for the document (often Default Paragraph Font) is returned if the run has no directly-applied character style. Setting this property to None removes any directly-applied character style.

property text: str

String formed by concatenating the text equivalent of each content element in this run.

Each <w:t> element adds the text characters it contains. A <w:tab/> element adds a \t character. A <w:cr/> or <w:br> element each adds a \n character. Note that a <w:br> element can indicate a page break or column break as well as a line break. Only line-break <w:br> elements translate to a \n character. Others are ignored. All other content child elements, such as <w:drawing>, are ignored.

Assigning text to this property has the reverse effect, translating each \t character to a <w:tab/> element and each \n or \r character to a <w:cr/> element. Any existing run content is replaced. Run formatting is preserved.

property underline: bool | WD_UNDERLINE | None

The underline style for this Run.

Value is one of None, True, False, or a member of WD_UNDERLINE.

A value of None indicates the run has no directly-applied underline value and so will inherit the underline value of its containing paragraph. Assigning None to this property removes any directly-applied underline value.

A value of False indicates a directly-applied setting of no underline, overriding any inherited value.

A value of True indicates single underline.

The values from WD_UNDERLINE are used to specify other underline styles such as double, wavy, and dotted.

Font objects

class docx.text.run.Font

Proxy object for the parent of a <w:rPr> element, providing access to character properties such as font name, font size, bold, and subscript.

property all_caps: bool | None

Read/write.

Causes text in this font to appear in capital letters.

property bold: bool | None

Read/write.

Causes text in this font to appear in bold.

property color

A ColorFormat object providing a way to get and set the text color for this font.

property complex_script: bool | None

Read/write tri-state value.

When True, causes the characters in the run to be treated as complex script regardless of their Unicode values.

property cs_bold: bool | None

Read/write tri-state value.

When True, causes the complex script characters in the run to be displayed in bold typeface.

property cs_italic: bool | None

Read/write tri-state value.

When True, causes the complex script characters in the run to be displayed in italic typeface.

property double_strike: bool | None

Read/write tri-state value.

When True, causes the text in the run to appear with double strikethrough.

property emboss: bool | None

Read/write tri-state value.

When True, causes the text in the run to appear as if raised off the page in relief.

property hidden: bool | None

Read/write tri-state value.

When True, causes the text in the run to be hidden from display, unless application settings force hidden text to be shown.

property highlight_color: WD_COLOR_INDEX | None

Color of highlighting applied or None if not highlighted.

property imprint: bool | None

Read/write tri-state value.

When True, causes the text in the run to appear as if pressed into the page.

property italic: bool | None

Read/write tri-state value.

When True, causes the text of the run to appear in italics. None indicates the effective value is inherited from the style hierarchy.

property math: bool | None

Read/write tri-state value.

When True, specifies this run contains WML that should be handled as though it was Office Open XML Math.

property name: str | None

The typeface name for this Font.

Causes the text it controls to appear in the named font, if a matching font is found. None indicates the typeface is inherited from the style hierarchy.

property no_proof: bool | None

Read/write tri-state value.

When True, specifies that the contents of this run should not report any errors when the document is scanned for spelling and grammar.

property outline: bool | None

Read/write tri-state value.

When True, causes the characters in the run to appear as if they have an outline, by drawing a one pixel wide border around the inside and outside borders of each character glyph.

property rtl: bool | None

Read/write tri-state value.

When True, causes the text in the run to have right-to-left characteristics.

property shadow: bool | None

Read/write tri-state value.

When True, causes the text in the run to appear as if each character has a shadow.

property size: Length | None

Font height in English Metric Units (EMU).

None indicates the font size should be inherited from the style hierarchy. Length is a subclass of int having properties for convenient conversion into points or other length units. The docx.shared.Pt class allows convenient specification of point values:

>>> from docx.shared import Pt
>>> font.size = Pt(24)
>>> font.size
304800
>>> font.size.pt
24.0

property small_caps: bool | None

Read/write tri-state value.

When True, causes the lowercase characters in the run to appear as capital letters two points smaller than the font size specified for the run.

property snap_to_grid: bool | None

Read/write tri-state value.

When True, causes the run to use the document grid characters per line settings defined in the docGrid element when laying out the characters in this run.

property spec_vanish: bool | None

Read/write tri-state value.

When True, specifies that the given run shall always behave as if it is hidden, even when hidden text is being displayed in the current document. The property has a very narrow, specialized use related to the table of contents. Consult the spec (§17.3.2.36) for more details.

property strike: bool | None

Read/write tri-state value.

When True, causes the text in the run to appear with a single horizontal line through the center of the line.

property subscript: bool | None

Boolean indicating whether the characters in this Font appear as subscript.

None indicates the subscript/superscript value is inherited from the style hierarchy.

property superscript: bool | None

Boolean indicating whether the characters in this Font appear as superscript.

None indicates the subscript/superscript value is inherited from the style hierarchy.

property underline: bool | WD_UNDERLINE | None

The underline style for this Font.

The value is one of None, True, False, or a member of WD_UNDERLINE.

None indicates the font inherits its underline value from the style hierarchy. False indicates no underline. True indicates single underline. The values from WD_UNDERLINE are used to specify other underline styles such as double, wavy, and dotted.

property web_hidden: bool | None

Read/write tri-state value.

When True, specifies that the contents of this run shall be hidden when the document is displayed in web page view.

RenderedPageBreak objects

class docx.text.pagebreak.RenderedPageBreak

A page-break inserted by Word during page-layout for print or display purposes.

This usually does not correspond to a "hard" page-break inserted by the document author, rather just that Word ran out of room on one page and needed to start another. The position of these can change depending on the printer and page-size, as well as margins, etc. They also will change in response to edits, but not until Word loads and saves the document.

Note these are never inserted by python-docx because it has no rendering function. These are generally only useful for text-extraction of existing documents when python-docx is being used solely as a document "reader".

NOTE: a rendered page-break can occur within a hyperlink; consider a multi-word hyperlink like "excellent Wikipedia article on LLMs" that happens to fall close to the end of the last line on a page such that the page breaks between "Wikipedia" and "article". In such a "page-breaks-in-hyperlink" case, THESE METHODS WILL "MOVE" THE PAGE-BREAK to occur after the hyperlink, such that the entire hyperlink appears in the paragraph returned by .preceding_paragraph_fragment. While this places the "tail" text of the hyperlink on the "wrong" page, it avoids having two hyperlinks each with a fragment of the actual text and pointing to the same address.

property following_paragraph_fragment: Paragraph | None

A "loose" paragraph containing the content following this page-break.

HAS POTENTIALLY SURPRISING BEHAVIORS so read carefully to be sure this is what you want. This is primarily targeted toward text-extraction use-cases for which precisely associating text with the page it occurs on is important.

Compare .preceding_paragraph_fragment as these two are intended to be used together.

This value is None when no content follows this page-break. This case is unlikely to occur in practice because Word places even-paragraph-boundary page-breaks on the paragraph following the page-break. Still, it is possible and must be checked for. Returning None for this case avoids "inserting" an extra, non-existent paragraph into the content stream. Note that content can include DrawingML items like images or charts, not just text.

The returned paragraph is divorced from the document body. Any changes made to it will not be reflected in the document. It is intended to provide a container (Paragraph) with familiar properties and methods that can be used to characterize the paragraph content following a mid-paragraph page-break.

Contains no portion of the hyperlink when this break occurs within a hyperlink.

property preceding_paragraph_fragment: Paragraph | None

A "loose" paragraph containing the content preceding this page-break.

Compare .following_paragraph_fragment as these two are intended to be used together.

This value is None when no content precedes this page-break. This case is common and occurs whenever a page breaks on an even paragraph boundary. Returning None for this case avoids "inserting" a non-existent paragraph into the content stream. Note that content can include DrawingML items like images or charts.

Note the returned paragraph is divorced from the document body. Any changes made to it will not be reflected in the document. It is intended to provide a familiar container (Paragraph) to interrogate for the content preceding this page-break in the paragraph in which it occurred.

Contains the entire hyperlink when this break occurs within a hyperlink.

TabStop objects

class docx.text.tabstops.TabStop

An individual tab stop applying to a paragraph or style.

Accessed using list semantics on its containing TabStops object.

property alignment

A member of WD_TAB_ALIGNMENT specifying the alignment setting for this tab stop.

Read/write.

property leader

A member of WD_TAB_LEADER specifying a repeating character used as a "leader", filling in the space spanned by this tab.

Assigning None produces the same result as assigning WD_TAB_LEADER.SPACES. Read/write.

property position

A Length object representing the distance of this tab stop from the inside edge of the paragraph.

May be positive or negative. Read/write.

TabStops objects

class docx.text.tabstops.TabStops

A sequence of TabStop objects providing access to the tab stops of a paragraph or paragraph style.

Supports iteration, indexed access, del, and len(). It is accessed using the tab_stops property of ParagraphFormat; it is not intended to be constructed directly.

add_tab_stop(position, alignment=WD_TAB_ALIGNMENT.LEFT, leader=WD_TAB_LEADER.SPACES)

Add a new tab stop at position, a Length object specifying the location of the tab stop relative to the paragraph edge.

A negative position value is valid and appears in hanging indentation. Tab alignment defaults to left, but may be specified by passing a member of the WD_TAB_ALIGNMENT enumeration as alignment. An optional leader character can be specified by passing a member of the WD_TAB_LEADER enumeration as leader.

clear_all()

Remove all custom tab stops.