Telegram Markdown Rendering

Research into rendering LLM markdown output as formatted Telegram messages.


Problem

Claude returns markdown-formatted text (bold, code blocks, lists, etc.). Currently, run_legatus_request() in legio/telegram/utils.py passes all LLM output through _safe_html(), which applies html.escape(), and sends the result with parse_mode="HTML". Markdown syntax has no meaning in HTML mode, so users see raw markdown syntax instead of formatted messages.

Current Architecture

Claude response (markdown)
  → _safe_html() escapes everything
  → edit_html() sends with parse_mode="HTML"
  → User sees: **bold** `code` - list item
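
The effect of this pipeline can be reproduced with the standard library alone. The helper below is a stand-in, assuming _safe_html() does nothing beyond html.escape(); the real function in telegram/utils.py may differ.

```python
import html


def safe_html_sketch(text: str) -> str:
    """Stand-in for _safe_html(): assumed to be a thin html.escape() wrapper."""
    return html.escape(text)


llm_output = "**bold** `code` <b>not a tag</b>"
escaped = safe_html_sketch(llm_output)

# Angle brackets are neutralised, but the markdown markers pass through
# untouched, which is why users see literal ** and ` characters.
print(escaped)
```

Escaping is what keeps the output safe; it is also what guarantees the markdown is never interpreted.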

Key Functions

| Function | File | Role |
|---|---|---|
| _safe_html(text) | telegram/utils.py:36 | html.escape() on LLM output |
| reply_html(msg, text) | telegram/utils.py:21 | Send with parse_mode="HTML" |
| edit_html(msg, text) | telegram/utils.py:31 | Edit with parse_mode="HTML" |
| run_legatus_request() | telegram/utils.py:119 | Orchestrates response lifecycle |
| split_message(text) | telegram/utils.py:48 | Splits at \n\n, max 4000 chars |
| render_attribution_header() | rendering.py:85 | Centurio name/role header |

Security Controls

  • Lines 155-156: _safe_html(first) escapes LLM output before sending
  • Lines 101-106: Attribution header escapes name and description separately
  • Line 168: Generic error message on exception prevents leaking internals

Telegram Parse Modes

HTML (current)

Supported tags: <b>, <i>, <u>, <s>, <code>, <pre>, <pre><code class="language-X">, <a href="">, <blockquote>, <tg-spoiler>.

Pros:

  • Already in use throughout the codebase
  • Predictable escaping (just <, >, &)
  • Well-tested infrastructure (reply_html, edit_html)
  • Code blocks support language attribute

Cons:

  • LLM output is markdown, not HTML — requires conversion

MarkdownV2

Syntax: *bold*, _italic_, `code`, ```pre```, ~strike~, >blockquote, ||spoiler||.

Pros:

  • Closer to Claude's natural output format

Cons:

  • 18 characters must be escaped outside formatting: _ * [ ] ( ) ~ ` > # + - = | { } . !
  • Claude's markdown is not MarkdownV2-compliant (different escaping rules)
  • One unescaped character = entire message fails to send
  • Nested formatting has strict ordering rules
  • Code blocks accept a language tag (```python), but ` and \ inside code must still be escaped
  • Would require replacing all existing HTML infrastructure
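
To make the escaping burden concrete, here is a minimal sketch of a plain-text escaper for MarkdownV2. The function name is hypothetical, and it only covers text outside formatting entities; code and link entities have their own rules that are not handled here.

```python
import re

# The 18 characters Telegram requires escaping outside entities in MarkdownV2.
_MDV2_SPECIAL = re.compile(r"([_*\[\]()~`>#+\-=|{}.!])")


def escape_markdown_v2(text: str) -> str:
    """Backslash-escape every MarkdownV2 special character in plain text."""
    return _MDV2_SPECIAL.sub(r"\\\1", text)


plain = escape_markdown_v2("v1.2 (beta)!")
formatted = escape_markdown_v2("**bold**")
print(plain)
print(formatted)
```

Applied blindly, the escaper also neutralises the asterisks in **bold**, so preserving formatting still requires knowing which characters are markup and which are text.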

Markdown (legacy)

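Legacy mode, retained only for backward compatibility. Supports only *bold*, _italic_, `code`, ```pre```, and [inline links](url), with no nested entities. Do not use.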

Implementation Strategies

Strategy A: Convert Markdown to HTML (recommended)

Convert Claude's markdown to Telegram-compatible HTML before sending.

Claude response (markdown)
  → markdown_to_telegram_html() converts formatting
  → split_message() chunks the result
  → edit_html() sends with parse_mode="HTML"
  → User sees: bold text, monospace code, • list item (formatted)

Conversion mapping:

| Markdown | Telegram HTML |
|---|---|
| **bold** / __bold__ | <b>bold</b> |
| *italic* / _italic_ | <i>italic</i> |
| `inline code` | <code>inline code</code> |
| ```lang\nblock\n``` | <pre><code class="language-lang">block</code></pre> |
| > quote | <blockquote>quote</blockquote> |
| ~~strike~~ | <s>strike</s> |
| [text](url) | <a href="url">text</a> |
| - item / * item | • item (Unicode bullet, no HTML tag) |
| 1. item | 1. item (plain text; Telegram has no <ol>) |
| # Heading | <b>Heading</b> (no heading tags in Telegram) |

Approach options:

  1. Library: mistune — Fast, pure Python markdown parser. Write a custom Telegram HTML renderer. Handles edge cases (nested formatting, escaping). ~150 lines for renderer.

  2. Library: markdown-it-py — Port of markdown-it. Token-based, highly configurable. More complex API but more accurate parsing.

  3. Regex-based: simple regex replacements (\*\*(.+?)\*\* → <b>\1</b>). Fast but fragile: fails on nested formatting, code blocks containing markdown syntax, and other edge cases. Not recommended for production.

  4. Custom parser — Hand-rolled state machine. Full control, zero dependencies. ~200-300 lines. Risk of bugs on edge cases.
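
The fragility of the regex option is easy to demonstrate. The sketch below is illustrative only (the helper name is made up): it escapes the text and then applies a single bold substitution with no awareness of code spans.

```python
import html
import re

BOLD = re.compile(r"\*\*(.+?)\*\*")


def naive_bold(text: str) -> str:
    """Escape, then turn **x** into <b>x</b> with no awareness of code spans."""
    return BOLD.sub(r"<b>\1</b>", html.escape(text))


ok = naive_bold("this is **important**")
broken = naive_bold("use `2 ** 3 ** 2` for powers")

print(ok)      # simple input converts as intended
print(broken)  # the ** operators inside the code span are treated as bold markers
```

The second call rewrites Python's exponent operator inside an inline code span, which is exactly the kind of case a real parser tokenizes correctly.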

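Recommendation: mistune with a custom renderer. It is lightweight (a single added dependency), well-maintained, and its parser already deals with the cases that break regex approaches, such as nested formatting and markdown syntax inside code blocks. The custom renderer is roughly 100-150 lines.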

Security: All text nodes in the renderer must call html.escape(). Only recognized markdown constructs produce HTML tags. Unknown input passes through escaped. This preserves the existing security posture.
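
The escape-by-default rule can be sketched independently of any parser. The token shape below, a list of (kind, text) pairs, is a hypothetical simplification chosen for illustration; mistune's actual renderer interface is different and should be taken from its documentation.

```python
import html

# Hypothetical inline token kinds mapped to the Telegram tag they produce.
INLINE_TAGS = {"strong": "b", "emphasis": "i", "codespan": "code", "strikethrough": "s"}


def render_inline(tokens: list[tuple[str, str]]) -> str:
    """Render (kind, text) pairs; unknown kinds fall through as escaped text."""
    parts = []
    for kind, payload in tokens:
        safe = html.escape(payload)  # every payload is escaped, without exception
        tag = INLINE_TAGS.get(kind)
        parts.append(f"<{tag}>{safe}</{tag}>" if tag else safe)
    return "".join(parts)


rendered = render_inline([
    ("text", "a < b "),
    ("strong", "<i>x</i>"),
    ("raw_html", "<script>"),  # unrecognized kind: no tag is emitted
])
print(rendered)
```

Because tags come only from the fixed mapping and never from the payload, LLM output cannot introduce markup of its own.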

Strategy B: Switch to MarkdownV2

Send Claude's output with parse_mode="MarkdownV2" after escaping special characters.

Problems:

  • Must escape 18 characters outside formatting spans — requires parsing markdown structure anyway
  • Claude may produce markdown that doesn't conform to Telegram's MarkdownV2 spec
  • Existing reply_html / edit_html infrastructure must be replaced or duplicated
  • Attribution headers, command responses, and error messages all use HTML — mixed modes add complexity
  • One escaping mistake = message delivery failure

Verdict: More work, more fragile, no benefit over Strategy A.

Strategy C: Plain Text + Selective Formatting

Keep _safe_html() but post-process to add formatting for obvious patterns (code blocks, bullet lists).

Problems:

  • Half-measure — some formatting works, some doesn't
  • Hard to handle code blocks reliably without a real parser
  • Still looks bad for most responses

Verdict: Not recommended.

Interaction with Existing Code

split_message() Compatibility

split_message() splits at \n\n boundaries with a 4000-char limit (buffer for HTML tags). After markdown→HTML conversion, the text will contain HTML tags that increase length. Options:

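  1. Convert first, then split: HTML tags count toward the limit, so each chunk is measured as it will be sent. Simple, provided no split lands inside a tag pair (a <pre> block containing a blank line is the likely case).
  2. Split first, then convert: risks splitting mid-markdown construct (e.g., inside a code block). Dangerous.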

Decision: Convert first, then split. May need to reduce max_len slightly if HTML overhead is significant (4000 already provides 96-char buffer).
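
A small sketch shows why the split has to be sized on the converted text. The splitter below is a simplified stand-in for split_message(), assuming a greedy split at blank lines; the real implementation may behave differently, and the tiny max_len is only there to keep the example readable.

```python
def split_paragraphs(text: str, max_len: int = 4000) -> list[str]:
    """Greedy split at blank lines; a simplified stand-in for split_message()."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # an oversized single paragraph is kept whole here
    if current:
        chunks.append(current)
    return chunks


md_text = "\n\n".join(["**word**"] * 4)       # 38 characters of markdown
html_text = "\n\n".join(["<b>word</b>"] * 4)  # 50 characters after conversion

# The markdown fits a 40-character limit, the converted HTML does not.
chunks = split_paragraphs(html_text, max_len=40)
print(len(md_text), len(html_text), [len(c) for c in chunks])
```

Measuring the markdown would have produced a single chunk that is too long once converted; measuring the HTML yields two chunks that each respect the limit.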

render_attribution_header() Compatibility

Attribution headers already produce HTML (html.escape on name/description). The rendered LLM response is appended after the header. Both produce HTML — fully compatible.

Message Editing

run_legatus_request() sends an initial status via edit_html(), then edits it with the response. The converted HTML response replaces the status cleanly.

Dependency Considerations

Adding mistune to pyproject.toml dependencies:

  • Pure Python, no C extensions
  • Well-maintained (active development)
  • Small footprint (~30KB)
  • Parses the CommonMark-style syntax Claude emits (it is not a strict CommonMark implementation)
  • No conflict with existing dependencies
  • BSD-3-Clause licensed

Risks

  1. Telegram HTML strictness: Telegram rejects messages with unbalanced or unsupported tags. Mitigation: the renderer emits only tags from the supported set, always in matched pairs, and keeps nesting shallow.

  2. Message length inflation — HTML tags add bytes. A 4000-char markdown response might exceed 4096 after conversion. Mitigation: convert before splitting.

  3. Code blocks with HTML — Code blocks may contain <, >, &. Mitigation: escape code block content, only wrap with <pre><code>.

  4. Claude output variation — Claude doesn't always produce clean markdown. Mitigation: the parser handles partial/broken markdown gracefully (passes through as escaped text).
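
Risk 3 is the one most directly addressed in code. The following is a minimal sketch of a code-block renderer, assuming the language tag should be dropped when it contains anything beyond a conservative character set; the function name and the allowed-character rule are illustrative choices, not part of the existing codebase.

```python
import html
import re

# Conservative whitelist for the language tag (an assumption of this sketch).
_LANG_OK = re.compile(r"[A-Za-z0-9_+-]+")


def render_code_block(code: str, lang: str = "") -> str:
    """Escape the code body and wrap it; the tag is used only if it looks safe."""
    body = html.escape(code)
    if lang and _LANG_OK.fullmatch(lang):
        return f'<pre><code class="language-{lang}">{body}</code></pre>'
    return f"<pre>{body}</pre>"


tagged = render_code_block("if a < b && c:\n    pass", "python")
untagged = render_code_block("<b>", 'x" onclick="')
print(tagged)
print(untagged)
```

The body is always escaped, and a suspicious language tag degrades to a plain <pre> block instead of reaching the class attribute.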

Recommendation

Strategy A with mistune. Implementation plan:

  1. Add mistune to pyproject.toml dependencies
  2. Create legio/telegram/markdown_render.py — custom mistune renderer producing Telegram HTML
  3. Replace _safe_html(first) in run_legatus_request() with markdown_to_telegram_html(first)
  4. Keep _safe_html() for non-LLM text (command output, error messages, attribution headers)
  5. Write comprehensive tests (nested formatting, code blocks, edge cases)
  6. Maintain 100% test coverage
