Fix character counting in document statistics to use graphemes

- Add unicode-segmentation dependency for proper grapheme cluster support
- Replace chars() iteration with graphemes(true) for accurate character counting
- Fix counting of complex Unicode characters like emojis, combining characters, and other multi-codepoint sequences
- Resolves TODO: 'do graphemes?' in document_statistics function

This change provides more accurate character counts for international text,
emojis with skin tones, combined characters, and other multi-codepoint graphemes.

Examples of improved accuracy:
- 👍🏾 now counts as 1 character instead of 2
- é (e + combining acute) counts as 1 character instead of 2
- 🧑‍💻 (person technologist: person + zero-width joiner + laptop) counts as 1 character instead of 3
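
The old behaviour can be reproduced with the standard library alone. The following is a minimal sketch, not the code from this commit: `scalar_count` is a hypothetical helper name, and the grapheme-based count is described only in a comment because it needs the external unicode-segmentation crate.

```rust
/// Counts Unicode scalar values, which is what `str::chars()` iterates over.
/// (Hypothetical helper for illustration; not taken from the commit.)
fn scalar_count(text: &str) -> usize {
    text.chars().count()
}

fn main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    let accented = "e\u{301}";
    // U+1F44D THUMBS UP SIGN followed by the U+1F3FE skin tone modifier.
    let thumbs_up = "\u{1F44D}\u{1F3FE}";

    // Both strings render as a single character but hold two scalar values,
    // so a chars()-based count reports 2 for each.
    println!("e + combining acute: {}", scalar_count(accented));
    println!("thumbs up + skin tone: {}", scalar_count(thumbs_up));

    // With the unicode-segmentation crate in scope, the grapheme count would be
    // `UnicodeSegmentation::graphemes(text, true).count()`, which is expected
    // to treat each of these strings as one extended grapheme cluster.
}
```

The argument `true` in `graphemes(true)` selects extended grapheme clusters rather than legacy ones.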

aquiles

2025-10-05 06:39:31 +00:00

• committed by

Jeremy Soller

parent 0d84055362

commit 801c7fa68c

4 changed files with 127 additions and 3 deletions

Cargo.lock (generated)

@@ -1460,6 +1460,7 @@ dependencies = [
  "syntect",
  "tokio",
  "two-face",
+ "unicode-segmentation",
 ]
 
 [[package]]
