Text-related objects
Paragraph objects
- class docx.text.paragraph.Paragraph
Proxy object wrapping a <w:p> element.
- add_run(text: str | None = None, style: str | CharacterStyle | None = None) -> Run
Append a run containing text and having character style style.
text can contain tab (\t) characters, which are converted to the appropriate XML form for a tab. text can also include newline (\n) or carriage return (\r) characters, each of which is converted to a line break. When text is None, the new run is empty.
- property alignment: WD_PARAGRAPH_ALIGNMENT | None
A member of the WD_PARAGRAPH_ALIGNMENT enumeration specifying the justification setting for this paragraph.
A value of None indicates the paragraph has no directly-applied alignment value and will inherit its alignment value from its style hierarchy. Assigning None to this property removes any directly-applied alignment value.
- clear()
Return this same paragraph after removing all its content.
Paragraph-level formatting, such as style, is preserved.
- property contains_page_break: bool
True when one or more rendered page-breaks occur in this paragraph.
- insert_paragraph_before(text: str | None = None, style: str | ParagraphStyle | None = None) -> Paragraph
Return a newly created paragraph, inserted directly before this paragraph.
If text is supplied, the new paragraph contains that text in a single run. If style is provided, that style is assigned to the new paragraph.
- iter_inner_content() -> Iterator[Run | Hyperlink]
Generate the runs and hyperlinks in this paragraph, in the order they appear.
The content in a paragraph consists of both runs and hyperlinks. This method allows accessing each of those separately, in document order, for when the precise position of the hyperlink within the paragraph text is important. Note that a hyperlink itself contains runs.
- property paragraph_format
The ParagraphFormat object providing access to the formatting properties for this paragraph, such as line spacing and indentation.
- property rendered_page_breaks: List[RenderedPageBreak]
All rendered page-breaks in this paragraph.
Most often an empty list, sometimes contains one page-break, but can contain more than one in rare or contrived cases.
- property runs: List[Run]
Sequence of Run instances corresponding to the <w:r> elements in this paragraph.
- property style: ParagraphStyle | None
Read/write.
ParagraphStyle object representing the style assigned to this paragraph. If no explicit style is assigned to this paragraph, its value is the default paragraph style for the document. A paragraph style name can be assigned in lieu of a paragraph style object. Assigning None removes any applied style, making its effective value the default paragraph style for the document.
- property text: str
The textual content of this paragraph.
The text includes the visible-text portion of any hyperlinks in the paragraph. Tabs and line breaks in the XML are mapped to \t and \n characters respectively.
Assigning text to this property causes all existing paragraph content to be replaced with a single run containing the assigned text. A \t character in the text is mapped to a <w:tab/> element and each \n or \r character is mapped to a line break. Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed.
ParagraphFormat objects
- class docx.text.parfmt.ParagraphFormat
Provides access to paragraph formatting such as justification, indentation, line spacing, space before and after, and widow/orphan control.
- property alignment
A member of the WD_PARAGRAPH_ALIGNMENT enumeration specifying the justification setting for this paragraph.
A value of None indicates paragraph alignment is inherited from the style hierarchy.
- property first_line_indent
Length value specifying the relative difference in indentation for the first line of the paragraph.
A positive value causes the first line to be indented. A negative value produces a hanging indent. None indicates first line indentation is inherited from the style hierarchy.
- property keep_together
True if the paragraph should be kept "in one piece" and not broken across a page boundary when the document is rendered.
None indicates its effective value is inherited from the style hierarchy.
- property keep_with_next
True if the paragraph should be kept on the same page as the subsequent paragraph when the document is rendered.
For example, this property could be used to keep a section heading on the same page as its first paragraph. None indicates its effective value is inherited from the style hierarchy.
- property left_indent
Length value specifying the space between the left margin and the left side of the paragraph.
None indicates the left indent value is inherited from the style hierarchy. Use an Inches value object as a convenient way to apply indentation in units of inches.
- property line_spacing
float or Length value specifying the space between baselines in successive lines of the paragraph.
A value of None indicates line spacing is inherited from the style hierarchy. A float value, e.g. 2.0 or 1.75, indicates spacing is applied in multiples of line heights. A Length value such as Pt(12) indicates spacing is a fixed height. The Pt value class is a convenient way to apply line spacing in units of points. Assigning None resets line spacing to inherit from the style hierarchy.
- property line_spacing_rule
A member of the WD_LINE_SPACING enumeration indicating how the value of line_spacing should be interpreted.
Assigning any of the WD_LINE_SPACING members SINGLE, DOUBLE, or ONE_POINT_FIVE will cause the value of line_spacing to be updated to produce the corresponding line spacing.
- property page_break_before
True if the paragraph should appear at the top of the page following the prior paragraph.
None indicates its effective value is inherited from the style hierarchy.
- property right_indent
Length value specifying the space between the right margin and the right side of the paragraph.
None indicates the right indent value is inherited from the style hierarchy. Use a Cm value object as a convenient way to apply indentation in units of centimeters.
- property space_after
Length value specifying the spacing to appear between this paragraph and the subsequent paragraph.
None indicates this value is inherited from the style hierarchy. Length objects provide convenience properties, such as pt and inches, that allow easy conversion to various length units.
- property space_before
Length value specifying the spacing to appear between this paragraph and the prior paragraph.
None indicates this value is inherited from the style hierarchy. Length objects provide convenience properties, such as pt and cm, that allow easy conversion to various length units.
- tab_stops
TabStops object providing access to the tab stops defined for this paragraph format.
- property widow_control
True if the first and last lines in the paragraph remain on the same page as the rest of the paragraph when Word repaginates the document.
None indicates its effective value is inherited from the style hierarchy.
Hyperlink objects
- class docx.text.hyperlink.Hyperlink
Proxy object wrapping a <w:hyperlink> element.
A hyperlink occurs as a child of a paragraph, at the same level as a Run. A hyperlink itself contains runs, which is where the visible text of the hyperlink is stored.
- property address: str
The "URL" of the hyperlink (but not necessarily a web link).
While commonly a web link like "https://google.com" the hyperlink address can take a variety of forms including "internal links" to bookmarked locations within the document. When this hyperlink is an internal "jump" to for example a heading from the table-of-contents (TOC), the address is blank. The bookmark reference (like "_Toc147925734") is stored in the .fragment property.
- property contains_page_break: bool
True when the text of this hyperlink is broken across page boundaries.
This is not uncommon and can happen for example when the hyperlink text is multiple words and occurs in the last line of a page. Theoretically, a hyperlink can contain more than one page break but that would be extremely uncommon in practice. Still, this value should be understood to mean that "one-or-more" rendered page breaks are present.
- property fragment: str
Reference like #glossary at end of URL that refers to a sub-resource.
Note that this value does not include the fragment-separator character ("#").
This value is known as a "named anchor" in an HTML context and "anchor" in the MS API, but an "anchor" element (<a>) represents a full hyperlink in HTML so we avoid confusion by using the more precise RFC 3986 naming "URI fragment".
These are also used to refer to bookmarks within the same document, in which case the .address value will be blank ("") and this property will hold a value like "_Toc147925734".
To reliably get an entire web URL you will need to concatenate this with the .address value, separated by "#" when both are present. Consider using the .url property for that purpose.
Word sometimes stores a fragment in this property (an XML attribute) and sometimes with the address, depending on how the URL is inserted, so don't depend on this field being empty to indicate no fragment is present.
- property runs: list[Run]
List of Run instances in this hyperlink.
Together these define the visible text of the hyperlink. The text of a hyperlink is typically contained in a single run, but it will be broken into multiple runs if, for example, part of the hyperlink is bold or the text was changed after the document was saved.
- property text: str
String formed by concatenating the text of each run in the hyperlink.
Tabs and line breaks in the XML are mapped to \t and \n characters respectively. Note that rendered page-breaks can occur within a hyperlink but they are not reflected in this text.
- property url: str
Convenience property to get web URLs from hyperlinks that contain them.
This value is the empty string ("") when there is no address portion, so its boolean value can also be used to distinguish external URIs from internal "jump" hyperlinks like those found in a table-of-contents.
Note that this value may also be a link to a file, so if you only want web-urls you'll need to check for a protocol prefix like https://.
When both an address and fragment are present, the return value joins the two separated by the fragment-separator hash ("#"). Otherwise this value is the same as that of the .address property.
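The joining rule described above can be written out in plain Python. Both helpers below, join_url and is_web_url, are hypothetical illustrations of the documented behavior and are not part of python-docx.

```python
def join_url(address: str, fragment: str) -> str:
    """Combine an address and a fragment following the rule described above."""
    if address and fragment:
        return f"{address}#{fragment}"
    # With no fragment, or no address, the result is just the address.
    return address


def is_web_url(url: str) -> bool:
    """True when the URL carries a web protocol prefix rather than, say, a file path."""
    return url.startswith(("http://", "https://"))


external = join_url("https://example.com/docs", "glossary")
internal = join_url("", "_Toc147925734")   # internal jump: no address portion
```

Because an internal jump yields the empty string, a plain truth test on the result is enough to separate external links from internal ones.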
Run objects
- class docx.text.run.Run
Proxy object wrapping a <w:r> element.
Several of the properties on Run take a tri-state value, True, False, or None. True and False correspond to on and off respectively. None indicates the property is not specified directly on the run and its effective value is taken from the style hierarchy.
- add_break(break_type: WD_BREAK_TYPE = WD_BREAK_TYPE.LINE)
Add a break element of break_type to this run.
break_type can take the values WD_BREAK.LINE, WD_BREAK.PAGE, and WD_BREAK.COLUMN where WD_BREAK is imported from docx.enum.text. break_type defaults to WD_BREAK.LINE.
- add_picture(image_path_or_stream: str | IO[bytes], width: int | Length | None = None, height: int | Length | None = None) -> InlineShape
Return InlineShape containing image identified by image_path_or_stream.
The picture is added to the end of this run.
image_path_or_stream can be a path (a string) or a file-like object containing a binary image.
If neither width nor height is specified, the picture appears at its native size. If only one is specified, it is used to compute a scaling factor that is then applied to the unspecified dimension, preserving the aspect ratio of the image. The native size of the picture is calculated using the dots-per-inch (dpi) value specified in the image file, defaulting to 72 dpi if no value is specified, as is often the case.
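The sizing rule above amounts to simple arithmetic. The function below is a hypothetical stand-alone sketch of that arithmetic, not the library's implementation. It works in EMU, with 914400 EMU to the inch, and takes pixel dimensions as plain integers.

```python
EMU_PER_INCH = 914400


def scaled_size(px_width, px_height, dpi=72, width=None, height=None):
    """Return (width, height) in EMU, preserving aspect ratio when one is missing."""
    native_width = round(px_width / dpi * EMU_PER_INCH)
    native_height = round(px_height / dpi * EMU_PER_INCH)
    if width is None and height is None:
        return native_width, native_height
    if height is None:
        # Scale the height by the same factor applied to the width.
        return width, round(native_height * width / native_width)
    if width is None:
        return round(native_width * height / native_height), height
    return width, height


native = scaled_size(144, 72)                    # 2 in x 1 in at 72 dpi
half_width = scaled_size(144, 72, width=914400)  # 1 in wide, height follows
```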
- add_tab() -> None
Add a <w:tab/> element at the end of the run, which Word interprets as a tab character.
- add_text(text: str)
Returns a newly appended _Text object (corresponding to a new <w:t> child element) to the run, containing text.
Compare with the possibly more friendly approach of assigning text to the Run.text property.
- property bold: bool | None
Read/write tri-state value.
When True, causes the text of the run to appear in bold face. When False, the text unconditionally appears non-bold. When None, the bold setting for this run is inherited from the style hierarchy.
- clear()
Return reference to this run after removing all its content.
All run formatting is preserved.
- property contains_page_break: bool
True when one or more rendered page-breaks occur in this run.
Note that "hard" page-breaks inserted by the author are not included. A hard page-break gives rise to a rendered page-break in the right position so if those were included that page-break would be "double-counted".
It would be very rare for multiple rendered page-breaks to occur in a single run, but it is possible.
- property font: Font
The Font object providing access to the character formatting properties for this run, such as font name and size.
- property italic: bool | None
Read/write tri-state value.
When True, causes the text of the run to appear in italics. When False, the text unconditionally appears non-italic. When None, the italic setting for this run is inherited from the style hierarchy.
- iter_inner_content() -> Iterator[str | Drawing | RenderedPageBreak]
Generate the content-items in this run in the order they appear.
NOTE: only content-types currently supported by python-docx are generated. In this version, that is text and rendered page-breaks. Drawing is included but currently only provides access to its XML element (CT_Drawing) on its ._drawing attribute. Drawing attributes and methods may be expanded in future releases.
There are a number of element-types that can appear inside a run, but most of those (w:br, w:cr, w:noBreakHyphen, w:t, w:tab) have a clear plain-text equivalent. Any contiguous range of such elements is generated as a single str. Rendered page-break and drawing elements are generated individually. Any other elements are ignored.
- property style: CharacterStyle
Read/write.
A CharacterStyle object representing the character style applied to this run. The default character style for the document (often Default Character Font) is returned if the run has no directly-applied character style. Setting this property to None removes any directly-applied character style.
- property text: str
String formed by concatenating the text equivalent of each content child element of this run.
Each <w:t> element adds the text characters it contains. A <w:tab/> element adds a \t character. A <w:cr/> or <w:br> element each add a \n character. Note that a <w:br> element can indicate a page break or column break as well as a line break. Only line-break <w:br> elements translate to a \n character. Others are ignored. All other content child elements, such as <w:drawing>, are ignored.
Assigning text to this property has the reverse effect, translating each \t character to a <w:tab/> element and each \n or \r character to a <w:cr/> element. Any existing run content is replaced. Run formatting is preserved.
- property underline: bool | WD_UNDERLINE | None
The underline style for this Run.
Value is one of None, True, False, or a member of WD_UNDERLINE.
A value of None indicates the run has no directly-applied underline value and so will inherit the underline value of its containing paragraph. Assigning None to this property removes any directly-applied underline value.
A value of False indicates a directly-applied setting of no underline, overriding any inherited value.
A value of True indicates single underline.
The values from WD_UNDERLINE are used to specify other underline styles such as double, wavy, and dotted.
Font objects
- class docx.text.run.Font
Proxy object for the parent of a <w:rPr> element, providing access to character properties such as font name, font size, bold, and subscript.
- property color
A ColorFormat object providing a way to get and set the text color for this font.
- property complex_script: bool | None
Read/write tri-state value.
When True, causes the characters in the run to be treated as complex script regardless of their Unicode values.
- property cs_bold: bool | None
Read/write tri-state value.
When True, causes the complex script characters in the run to be displayed in bold typeface.
- property cs_italic: bool | None
Read/write tri-state value.
When True, causes the complex script characters in the run to be displayed in italic typeface.
- property double_strike: bool | None
Read/write tri-state value.
When True, causes the text in the run to appear with double strikethrough.
- property emboss: bool | None
Read/write tri-state value.
When True, causes the text in the run to appear as if raised off the page in relief.
- property hidden: bool | None
Read/write tri-state value.
When True, causes the text in the run to be hidden from display, unless application settings force hidden text to be shown.
- property highlight_color: WD_COLOR_INDEX | None
Color of highlighting applied, or None if not highlighted.
- property imprint: bool | None
Read/write tri-state value.
When True, causes the text in the run to appear as if pressed into the page.
- property italic: bool | None
Read/write tri-state value.
When True, causes the text of the run to appear in italics. None indicates the effective value is inherited from the style hierarchy.
- property math: bool | None
Read/write tri-state value.
When True, specifies this run contains WML that should be handled as though it was Office Open XML Math.
- property name: str | None
The typeface name for this Font.
Causes the text it controls to appear in the named font, if a matching font is found. None indicates the typeface is inherited from the style hierarchy.
- property no_proof: bool | None
Read/write tri-state value.
When True, specifies that the contents of this run should not report any errors when the document is scanned for spelling and grammar.
- property outline: bool | None
Read/write tri-state value.
When True, causes the characters in the run to appear as if they have an outline, by drawing a one pixel wide border around the inside and outside borders of each character glyph.
- property rtl: bool | None
Read/write tri-state value.
When True, causes the text in the run to have right-to-left characteristics.
- property shadow: bool | None
Read/write tri-state value.
When True, causes the text in the run to appear as if each character has a shadow.
- property size: Length | None
Font height in English Metric Units (EMU).
None indicates the font size should be inherited from the style hierarchy. Length is a subclass of int having properties for convenient conversion into points or other length units. The docx.shared.Pt class allows convenient specification of point values:
>>> font.size = Pt(24)
>>> font.size
304800
>>> font.size.pt
24.0
- property small_caps: bool | None
Read/write tri-state value.
When True, causes the lowercase characters in the run to appear as capital letters two points smaller than the font size specified for the run.
- property snap_to_grid: bool | None
Read/write tri-state value.
When True, causes the run to use the document grid characters per line settings defined in the docGrid element when laying out the characters in this run.
- property spec_vanish: bool | None
Read/write tri-state value.
When True, specifies that the given run shall always behave as if it is hidden, even when hidden text is being displayed in the current document. The property has a very narrow, specialized use related to the table of contents. Consult the spec (§17.3.2.36) for more details.
- property strike: bool | None
Read/write tri-state value.
When True, causes the text in the run to appear with a single horizontal line through the center of the line.
- property subscript: bool | None
Boolean indicating whether the characters in this Font appear as subscript.
None indicates the subscript/superscript value is inherited from the style hierarchy.
- property superscript: bool | None
Boolean indicating whether the characters in this Font appear as superscript.
None indicates the subscript/superscript value is inherited from the style hierarchy.
- property underline: bool | WD_UNDERLINE | None
The underline style for this Font.
The value is one of None, True, False, or a member of WD_UNDERLINE. None indicates the font inherits its underline value from the style hierarchy. False indicates no underline. True indicates single underline. The values from WD_UNDERLINE are used to specify other underline styles such as double, wavy, and dotted.
- property web_hidden: bool | None
Read/write tri-state value.
When True, specifies that the contents of this run shall be hidden when the document is displayed in web page view.
RenderedPageBreak objects
- class docx.text.pagebreak.RenderedPageBreak
A page-break inserted by Word during page-layout for print or display purposes.
This usually does not correspond to a "hard" page-break inserted by the document author, rather just that Word ran out of room on one page and needed to start another. The position of these can change depending on the printer and page-size, as well as margins, etc. They also will change in response to edits, but not until Word loads and saves the document.
Note these are never inserted by python-docx because it has no rendering function. These are generally only useful for text-extraction of existing documents when python-docx is being used solely as a document "reader".
NOTE: a rendered page-break can occur within a hyperlink; consider a multi-word hyperlink like "excellent Wikipedia article on LLMs" that happens to fall close to the end of the last line on a page such that the page breaks between "Wikipedia" and "article". In such a "page-breaks-in-hyperlink" case, THESE METHODS WILL "MOVE" THE PAGE-BREAK to occur after the hyperlink, such that the entire hyperlink appears in the paragraph returned by .preceding_paragraph_fragment. While this places the "tail" text of the hyperlink on the "wrong" page, it avoids having two hyperlinks each with a fragment of the actual text and pointing to the same address.
- property following_paragraph_fragment: Paragraph | None
A "loose" paragraph containing the content following this page-break.
HAS POTENTIALLY SURPRISING BEHAVIORS so read carefully to be sure this is what you want. This is primarily targeted toward text-extraction use-cases for which precisely associating text with the page it occurs on is important.
Compare .preceding_paragraph_fragment as these two are intended to be used together.
This value is None when no content follows this page-break. This case is unlikely to occur in practice because Word places even-paragraph-boundary page-breaks on the paragraph following the page-break. Still, it is possible and must be checked for. Returning None for this case avoids "inserting" an extra, non-existent paragraph into the content stream. Note that content can include DrawingML items like images or charts, not just text.
The returned paragraph is divorced from the document body. Any changes made to it will not be reflected in the document. It is intended to provide a container (Paragraph) with familiar properties and methods that can be used to characterize the paragraph content following a mid-paragraph page-break.
Contains no portion of the hyperlink when this break occurs within a hyperlink.
- property preceding_paragraph_fragment: Paragraph | None
A "loose" paragraph containing the content preceding this page-break.
Compare .following_paragraph_fragment as these two are intended to be used together.
This value is None when no content precedes this page-break. This case is common and occurs whenever a page breaks on an even paragraph boundary. Returning None for this case avoids "inserting" a non-existent paragraph into the content stream. Note that content can include DrawingML items like images or charts.
Note the returned paragraph is divorced from the document body. Any changes made to it will not be reflected in the document. It is intended to provide a familiar container (Paragraph) to interrogate for the content preceding this page-break in the paragraph in which it occurred.
Contains the entire hyperlink when this break occurs within a hyperlink.
TabStop objects
- class docx.text.tabstops.TabStop
An individual tab stop applying to a paragraph or style.
Accessed using list semantics on its containing TabStops object.
- property alignment
A member of WD_TAB_ALIGNMENT specifying the alignment setting for this tab stop.
Read/write.
- property leader
A member of WD_TAB_LEADER specifying a repeating character used as a "leader", filling in the space spanned by this tab.
Assigning None produces the same result as assigning WD_TAB_LEADER.SPACES. Read/write.
TabStops objects
- class docx.text.tabstops.TabStops
A sequence of TabStop objects providing access to the tab stops of a paragraph or paragraph style.
Supports iteration, indexed access, del, and len(). It is accessed using the tab_stops property of ParagraphFormat; it is not intended to be constructed directly.
- add_tab_stop(position, alignment=WD_TAB_ALIGNMENT.LEFT, leader=WD_TAB_LEADER.SPACES)
Add a new tab stop at position, a Length object specifying the location of the tab stop relative to the paragraph edge.
A negative position value is valid and appears in hanging indentation. Tab alignment defaults to left, but may be specified by passing a member of the WD_TAB_ALIGNMENT enumeration as alignment. An optional leader character can be specified by passing a member of the WD_TAB_LEADER enumeration as leader.