feat: add plugin schema system, tokenizer cache, and config validation

- Add plugin schema types and runtime discovery for meta/filter plugins
- Rewrite --generate-config to use schema system instead of hardcoded types
- Add Settings::validate_config() for startup validation
- Cache tokenizer instances via static Lazy to avoid repeated BPE loading
- Add split_by_token_iter() and count_bounded() to Tokenizer
- Fix double-counting bug in TokensMetaPlugin when buffer < max_buffer_size
- Eliminate unnecessary allocations in token count methods
- Refactor token filters: remove Option<Tokenizer>, use iterator API
- Fix TailTokensFilter correctness: unbounded buffer instead of ring buffer
- Add encoding option to all token filters
- Add description() to MetaPlugin and FilterPlugin traits
- Fix unused_mut warning in compression engine (feature-gated code)
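
For reviewers, the tokenizer caching item above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `Tokenizer`, `Tokenizer::load`, `cached_tokenizer`, and `load_count` are hypothetical names, the encoding strings used with it are only examples, and it swaps in the standard library's `OnceLock` for the `Lazy` static mentioned in the message so that it builds without extra crates.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

/// Counts how many times the simulated expensive load has run.
static LOADS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for the real tokenizer type.
pub struct Tokenizer {
    pub encoding: String,
}

impl Tokenizer {
    /// Simulates the expensive BPE table load (assumption: the real
    /// load is costly enough that repeating it is worth avoiding).
    fn load(encoding: &str) -> Self {
        LOADS.fetch_add(1, Ordering::SeqCst);
        Tokenizer {
            encoding: encoding.to_string(),
        }
    }
}

/// Process-wide cache keyed by encoding name, initialized on first access.
fn cache() -> &'static Mutex<HashMap<String, Arc<Tokenizer>>> {
    static CACHE: OnceLock<Mutex<HashMap<String, Arc<Tokenizer>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Returns the shared tokenizer for `encoding`, loading it only on first use.
pub fn cached_tokenizer(encoding: &str) -> Arc<Tokenizer> {
    let mut map = cache().lock().unwrap();
    Arc::clone(
        map.entry(encoding.to_string())
            .or_insert_with(|| Arc::new(Tokenizer::load(encoding))),
    )
}

/// Number of loads performed so far (for inspection only).
pub fn load_count() -> usize {
    LOADS.load(Ordering::SeqCst)
}
```

Under these assumptions, two calls to `cached_tokenizer` with the same encoding name return the same `Arc`, and the load runs once per distinct encoding.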

Co-Authored-By: code-review-bot <noreply@anthropic.com>
2026-03-13 20:23:17 -03:00
parent 914190e119
commit e7d8a83369
16 changed files with 831 additions and 420 deletions

@@ -3,5 +3,8 @@ pub mod is_binary;
/// Detects if data is binary or text based on signatures and printable ratios.
pub mod status;
/// Plugin schema types and discovery functions.
pub mod schema;
/// Standard buffer size for I/O operations (8KB)
pub const PIPESIZE: usize = 8192;