Telegram Markdown Rendering

Research into rendering LLM markdown output as formatted Telegram messages.

Problem

Claude returns markdown-formatted text (bold, code blocks, lists, etc.). Currently, run_legatus_request() in legio/telegram/utils.py escapes all LLM output with html.escape() via _safe_html(), so the markdown syntax reaches Telegram as literal text even though the message is sent with parse_mode="HTML". Users see raw markdown syntax instead of formatted messages.

Current Architecture

Claude response (markdown)
  → _safe_html() escapes everything
  → edit_html() sends with parse_mode="HTML"
  → User sees: **bold** `code` - list item
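
The effect of this pipeline can be reproduced in a few lines. The function below is a hypothetical stand-in for _safe_html(); the real implementation is not shown in this note, so the sketch assumes only that it wraps html.escape().

```python
import html


def safe_html(text: str) -> str:
    """Hypothetical stand-in for _safe_html(): escape &, <, > and quotes."""
    return html.escape(text)


llm_output = "**bold** `code` <b>not a tag</b>"
escaped = safe_html(llm_output)

# Markdown markers mean nothing to an HTML parser, so they survive unchanged,
# while anything that looks like a tag is neutralized.
print(escaped)
```

Escaping is what keeps the output safe, but it is also why the asterisks and backticks are displayed literally.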

Key Functions

Function	File	Role
`_safe_html(text)`	`telegram/utils.py:36`	`html.escape()` on LLM output
`reply_html(msg, text)`	`telegram/utils.py:21`	Send with `parse_mode="HTML"`
`edit_html(msg, text)`	`telegram/utils.py:31`	Edit with `parse_mode="HTML"`
`run_legatus_request()`	`telegram/utils.py:119`	Orchestrates response lifecycle
`split_message(text)`	`telegram/utils.py:48`	Splits at `\n\n`, max 4000 chars
`render_attribution_header()`	`rendering.py:85`	Centurio name/role header

Security Controls

Lines 155-156: _safe_html(first) escapes LLM output before sending
Lines 101-106: Attribution header escapes name and description separately
Generic error message on exception (line 168) prevents leaking internals

Telegram Parse Modes

HTML (current)

Supported tags: <b>, <i>, <u>, <s>, <code>, <pre>, <pre><code class="language-X">, <a href="">, <blockquote>, <tg-spoiler>.

Pros:

Already in use throughout the codebase
Predictable escaping (just <, >, &)
Well-tested infrastructure (reply_html, edit_html)
Code blocks support language attribute

Cons:

LLM output is markdown, not HTML — requires conversion

MarkdownV2

Syntax: *bold*, _italic_, `code`, ```pre```, ~strike~, >blockquote, ||spoiler||.

Pros:

Closer to Claude's natural output format

Cons:

18 characters must be escaped outside formatting: _ * [ ] ( ) ~ ` > # + - = | { } . !
Claude's markdown is not MarkdownV2-compliant (different escaping rules)
One unescaped character = entire message fails to send
Nested formatting has strict ordering rules
Even inside code blocks (which do accept a language tag, e.g. ```python), ` and \ must still be escaped
Would require replacing all existing HTML infrastructure
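
To make the escaping burden concrete, here is a minimal sketch of a blanket MarkdownV2 escaper. The names MDV2_SPECIAL and escape_markdown_v2 are illustrative and not part of the codebase; the character set is the 18-character list given above.

```python
# The 18 characters that must be escaped in MarkdownV2 text outside formatting.
MDV2_SPECIAL = "_*[]()~`>#+-=|{}.!"


def escape_markdown_v2(text: str) -> str:
    """Prefix every MarkdownV2 special character with a backslash."""
    return "".join("\\" + ch if ch in MDV2_SPECIAL else ch for ch in text)


# Ordinary punctuation needs escaping before Telegram will accept it.
print(escape_markdown_v2("Version 1.5 - done!"))

# Blanket escaping also neutralizes the formatting markers themselves.
print(escape_markdown_v2("*bold*"))
```

Escaping everything makes a message safe to send but strips all formatting. Escaping only outside formatting spans means locating those spans first, which in turn requires parsing the markdown.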

Markdown (legacy)

Legacy mode, retained only for backward compatibility. Supports only *bold*, _italic_, inline links, `code`, and ```pre```. Do not use.

Implementation Strategies

Strategy A: Markdown → Telegram HTML (recommended)

Convert Claude's markdown to Telegram-compatible HTML before sending.

Claude response (markdown)
  → markdown_to_telegram_html() converts formatting
  → split_message() chunks the result
  → edit_html() sends with parse_mode="HTML"
  → User sees: bold, code, and list items rendered with formatting

Conversion mapping:

Markdown	Telegram HTML
`**bold**` / `__bold__`	`<b>bold</b>`
`*italic*` / `_italic_`	`<i>italic</i>`
`` `inline code` ``	`<code>inline code</code>`
```lang\nblock\n```	`<pre><code class="language-lang">block</code></pre>`
`> quote`	`<blockquote>quote</blockquote>`
`~~strike~~`	`<s>strike</s>`
`[text](url)`	`<a href="url">text</a>`
`- item` / `* item`	`• item` (Unicode bullet, no HTML tag)
`1. item`	`1. item` (plain text, Telegram has no `<ol>`)
`# Heading`	`<b>Heading</b>` (no heading tags in Telegram)
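
The block-level rows of this mapping (headings, bullets, ordered items) can be illustrated with a small line-oriented sketch. It is for illustration only and is not the proposed renderer: convert_block_line is a hypothetical name, and inline constructs such as bold or code spans are deliberately left out.

```python
import html
import re

_HEADING = re.compile(r"^#{1,6}\s+(.*)$")
_BULLET = re.compile(r"^[-*]\s+(.*)$")


def convert_block_line(line: str) -> str:
    """Map one markdown line to Telegram HTML (block-level rules only)."""
    heading = _HEADING.match(line)
    if heading:
        # Telegram has no heading tags, so headings become bold text.
        return f"<b>{html.escape(heading.group(1))}</b>"
    bullet = _BULLET.match(line)
    if bullet:
        # No <ul>/<li> in Telegram either: use a Unicode bullet.
        return f"• {html.escape(bullet.group(1))}"
    # Ordered list items and ordinary text stay as plain, escaped text.
    return html.escape(line)


for sample in ["# Heading", "- item <x>", "1. item"]:
    print(convert_block_line(sample))
```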

Approach options:

Library (mistune): Fast, pure-Python markdown parser. Write a custom Telegram HTML renderer. Handles edge cases (nested formatting, escaping). ~100-150 lines for the renderer.
Library (markdown-it-py): Port of markdown-it. Token-based, highly configurable. More complex API but more accurate parsing.
Regex-based: Simple regex replacements (\*\*(.+?)\*\* → <b>\1</b>). Fast but fragile: fails on nested formatting, on code blocks containing markdown syntax, and on other edge cases. Not recommended for production.
Custom parser: Hand-rolled state machine. Full control, zero dependencies. ~200-300 lines. Risk of bugs on edge cases.
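
The fragility of the regex-based option is easy to demonstrate. The sketch below applies a single bold substitution (naive_bold is an illustrative name) to ordinary prose and then to text containing an inline code span.

```python
import re

_NAIVE_BOLD = re.compile(r"\*\*(.+?)\*\*")


def naive_bold(text: str) -> str:
    """Regex-only conversion of **bold**, with no awareness of code spans."""
    return _NAIVE_BOLD.sub(r"<b>\1</b>", text)


# Works for the simple case.
print(naive_bold("This is **important**."))

# Corrupts code: the exponent operators inside the backticks are
# mistaken for bold markers.
print(naive_bold("Python: `2**3**2` is 512"))
```

A real parser avoids this because it tokenizes code spans before looking for emphasis markers.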

Recommendation: mistune with custom renderer. Lightweight (single dependency), well-maintained, and robust on the edge cases that break regex approaches (nested formatting, markdown syntax inside code blocks). Custom renderer is ~100-150 lines.

Security: All text nodes in the renderer must call html.escape(). Only recognized markdown constructs produce HTML tags. Unknown input passes through escaped. This preserves the existing security posture.
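
The escaping rule can be shown without any parser library. The following is a simplified, stdlib-only sketch of the principle, not the mistune renderer itself: render_inline and its helper are hypothetical names, and only inline code and **bold** are recognized.

```python
import html
import re

_CODE_SPAN = re.compile(r"`([^`\n]+)`")
_BOLD = re.compile(r"\*\*(.+?)\*\*")


def _render_text(text: str) -> str:
    """Escape a text node first, then apply the one recognized construct."""
    escaped = html.escape(text)
    return _BOLD.sub(r"<b>\1</b>", escaped)


def render_inline(markdown_text: str) -> str:
    """Convert inline code and bold; everything else passes through escaped."""
    parts = []
    pos = 0
    for match in _CODE_SPAN.finditer(markdown_text):
        # Text between code spans may contain bold markers.
        parts.append(_render_text(markdown_text[pos:match.start()]))
        # Code content is escaped but never scanned for markdown.
        parts.append(f"<code>{html.escape(match.group(1))}</code>")
        pos = match.end()
    parts.append(_render_text(markdown_text[pos:]))
    return "".join(parts)


print(render_inline("**hi** <script> `a<b`"))
```

Every tag in the output is one the renderer wrote itself; any tag-like text in the input has already been escaped by the time tags are added.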

Strategy B: Switch to MarkdownV2

Send Claude's output with parse_mode="MarkdownV2" after escaping special characters.

Problems:

Must escape 18 characters outside formatting spans — requires parsing markdown structure anyway
Claude may produce markdown that doesn't conform to Telegram's MarkdownV2 spec
Existing reply_html / edit_html infrastructure must be replaced or duplicated
Attribution headers, command responses, and error messages all use HTML — mixed modes add complexity
One escaping mistake = message delivery failure

Verdict: More work, more fragile, no benefit over Strategy A.

Strategy C: Plain Text + Selective Formatting

Keep _safe_html() but post-process to add formatting for obvious patterns (code blocks, bullet lists).

Problems:

Half-measure — some formatting works, some doesn't
Hard to handle code blocks reliably without a real parser
Still looks bad for most responses

Verdict: Not recommended.

Interaction with Existing Code

`split_message()` Compatibility

split_message() splits at \n\n boundaries with a 4000-char limit (buffer for HTML tags). After markdown→HTML conversion, the text will contain HTML tags that increase length. Options:

Convert first, then split — HTML tags counted toward limit. Simple, correct.
Split first, then convert — Risk of splitting mid-markdown construct (e.g., code block). Dangerous.

Decision: Convert first, then split. May need to reduce max_len slightly if HTML overhead is significant (4000 already leaves a 96-char buffer below Telegram's 4096-char limit).
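
The convert-then-split ordering might look like the sketch below. The split_message here is a hypothetical stand-in written for this example (the real function in telegram/utils.py is not reproduced in this note), and the converted string stands for the output of the markdown converter.

```python
def split_message(text: str, max_len: int = 4000) -> list[str]:
    """Stand-in splitter: pack blank-line-separated paragraphs into chunks."""
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # This sketch leaves an oversized single paragraph intact.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


# Already-converted HTML: the tags count toward the limit because
# splitting happens after conversion.
converted = "<b>Title</b>\n\n" + "<code>x</code> " * 5 + "\n\nlast paragraph"
chunks = split_message(converted, max_len=100)
print([len(chunk) for chunk in chunks])
```

One caveat the sketch does not address: a converted <pre> block that contains blank lines could still be cut at a paragraph boundary, separating the opening tag from its closing tag. A production splitter would likely need to treat such blocks as atomic.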

`render_attribution_header()` Compatibility

Attribution headers already produce HTML (html.escape on name/description). The rendered LLM response is appended after the header. Both produce HTML — fully compatible.

Message Editing

run_legatus_request() sends an initial ⏳ status via edit_html(), then edits it with the response. The converted HTML response replaces the status cleanly.

Dependency Considerations

Adding mistune to pyproject.toml dependencies:

Pure Python, no C extensions
Well-maintained (active development)
Small footprint (~30KB)
Already handles CommonMark spec
No conflict with existing dependencies
MIT licensed

Risks

Telegram tag limit: Telegram may reject messages with deeply nested or malformed HTML. Mitigation: the renderer produces flat, non-nested tags.
Message length inflation: HTML tags add characters. A 4000-char markdown response might exceed 4096 after conversion. Mitigation: convert before splitting, so tags count toward max_len. This is conservative, since Telegram documents the 4096-character limit as applying after entity parsing.
Code blocks with HTML: Code blocks may contain <, >, &. Mitigation: escape code block content, only wrap with <pre><code>.
Claude output variation: Claude doesn't always produce clean markdown. Mitigation: the parser handles partial/broken markdown gracefully (passes through as escaped text).
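
The code-block mitigation can be sketched as a single helper. The name render_code_block is illustrative; the tag shapes are the ones listed under the HTML parse mode above.

```python
import html


def render_code_block(code: str, lang: str = "") -> str:
    """Escape code content and wrap it in Telegram's <pre><code> tags."""
    body = html.escape(code)
    if lang:
        # The language name goes into an attribute, so it is escaped too.
        return f'<pre><code class="language-{html.escape(lang)}">{body}</code></pre>'
    return f"<pre>{body}</pre>"


print(render_code_block("if a < b && c:", "python"))
```

Only the wrapper tags come from the renderer; the block's content is always escaped, so HTML-like code samples cannot break the message.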

Recommendation

Strategy A with mistune. Implementation plan:

Add mistune to pyproject.toml dependencies
Create legio/telegram/markdown_render.py — custom mistune renderer producing Telegram HTML
Replace _safe_html(first) in run_legatus_request() with markdown_to_telegram_html(first)
Keep _safe_html() for non-LLM text (command output, error messages, attribution headers)
Write comprehensive tests (nested formatting, code blocks, edge cases)
Maintain 100% test coverage
