Fix character counting in document statistics to use graphemes
- Add unicode-segmentation dependency for proper grapheme cluster support - Replace chars() iteration with graphemes(true) for accurate character counting - Fix counting of complex Unicode characters like emojis, combining characters, and multi-byte sequences - Resolves TODO: 'do graphemes?' in document_statistics function This change provides more accurate character counts for international text, emojis with skin tones, combined characters, and other multi-codepoint graphemes. Examples of improved accuracy: - 👍🏾 now counts as 1 character instead of 2 - é (e + combining acute) counts as 1 character instead of 2 - 🧑💻 (person technologist) counts as 1 character instead of 4
This commit is contained in:
parent
0d84055362
commit
801c7fa68c
4 changed files with 127 additions and 3 deletions
1
Cargo.lock
generated
1
Cargo.lock
generated
|
|
@ -1460,6 +1460,7 @@ dependencies = [
|
|||
"syntect",
|
||||
"tokio",
|
||||
"two-face",
|
||||
"unicode-segmentation",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue