Releases: kaivid-labs/evret
v0.0.3
Added
- Added LLM-assisted evaluation dataset generation.
- integration: Haystack integration and Elasticsearch retriever support.
- example: Streamlit evaluation dashboard example.
- Improved retriever, dataset loading, and metric logging.
- Updated the token-overlap judge defaults.
Evaluation Dataset Improvements
- QueryExample now supports expected_doc_ids.
- When expected_doc_ids are present, Evaluator compares retrieved doc_ids directly instead of relying on answer-text judging. JSON and CSV dataset loading both support this field.
Aliases supported:
- expected_doc_ids
- relevant_doc_ids
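The direct doc-ID comparison described above can be sketched roughly as follows. This is an illustration only, not Evret's actual code: the helper names `normalize_example` and `doc_id_hits`, and the `query` key, are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence


def normalize_example(raw: Dict) -> Dict:
    """Accept either alias and store the expected IDs under a single key.

    Hypothetical helper: the real loader's field handling may differ.
    """
    ids = raw.get("expected_doc_ids", raw.get("relevant_doc_ids", []))
    return {"query": raw["query"], "expected_doc_ids": list(ids)}


def doc_id_hits(
    retrieved_doc_ids: Sequence[str],
    expected_doc_ids: Iterable[str],
    k: int,
) -> List[bool]:
    """Flag each of the top-k retrieved IDs that appears in the expected set.

    No answer-text judging is involved; relevance is plain set membership.
    """
    expected = set(expected_doc_ids)
    return [doc_id in expected for doc_id in retrieved_doc_ids[:k]]


example = normalize_example({"query": "what is rag?", "relevant_doc_ids": ["d2", "d9"]})
hits = doc_id_hits(["d1", "d2", "d3"], example["expected_doc_ids"], k=2)
```

Here `hits` marks the second retrieved document as relevant because its ID is in the expected set, regardless of its text.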
Examples And UI
Added a Streamlit dashboard example for running and visualizing evaluations:
- examples/evals-streamlit-dashboard/index.py
- examples/evals-streamlit-dashboard/run_evals_ui.py
Dependencies
Updated pyproject.toml:
- Version bumped from 0.0.2 to 0.0.3.
- Added runtime dependency:
  - tqdm>=4.67.0
- Added optional extras:
  - elasticsearch
  - haystack
Full Changelog: v0.0.2...v0.0.3
v0.0.2
Evret 0.0.2
Added
- [new-metric] Added ERR@k metric for cascade-style graded relevance evaluation.
- [new-metric] Added RBP@k metric with tunable persistence/user-patience weighting.
- Structured logging utilities: get_logger, configure_logging, and JSON log formatting.
- Added a tracing and monitoring notebook.
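For readers unfamiliar with the two new metrics, the sketch below follows their commonly published definitions (ERR from Chapelle et al., RBP from Moffat and Zobel). It is not Evret's implementation: the function names, the `max_grade` parameter, and the default persistence of 0.8 are assumptions made for illustration.

```python
from typing import Sequence


def err_at_k(grades: Sequence[int], k: int, max_grade: int) -> float:
    """Expected Reciprocal Rank over the top-k graded results (cascade model)."""
    score = 0.0
    p_reach = 1.0  # probability the user gets as far as the current rank
    for rank, grade in enumerate(grades[:k], start=1):
        # Probability that a document with this grade satisfies the user.
        p_stop = (2 ** grade - 1) / (2 ** max_grade)
        score += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return score


def rbp_at_k(relevances: Sequence[float], k: int, persistence: float = 0.8) -> float:
    """Rank-Biased Precision over the top-k results.

    `persistence` is the probability that the user moves on to the next rank;
    higher values model a more patient user.
    """
    if not 0.0 <= persistence < 1.0:
        raise ValueError("persistence must be in [0, 1)")
    weighted = sum(rel * persistence ** i for i, rel in enumerate(relevances[:k]))
    return (1.0 - persistence) * weighted
```

With binary grades and `max_grade=1`, a single relevant document at rank 2 contributes less to ERR than the same document at rank 1, which is the cascade behaviour the release note refers to.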
Changed
- Design change: evaluation dataset semantics moved from relevant_doc_ids toward expected_answers.
- Improved TokenOverlapJudge matching logic, including negation handling and better overlap scoring.
- Reworked quickstart, architecture, dataset-format, metrics, and judge docs.
Full Changelog: v0.0.1...v0.0.2
v0.0.1
Evret 0.0.1
Added
- Added a pluggable relevance judge system for text-based evaluation.
- Added TokenOverlapJudge as the default judge.
- Added semantic and LLM judge support with optional extras:
  - semantic
  - llm-openai
  - llm-anthropic
  - llm-google
  - judges
- Added support for expected_answers in evaluation datasets, alongside classic relevant_doc_ids.
- Added LangChain adapter support in both directions:
  - Evret retriever as a LangChain retriever
  - LangChain retriever as an Evret retriever
- Added metric helper internals for ranking, DCG, set operations, and validation.
- Added full MkDocs documentation site with quickstart, architecture, API docs, metric docs, retriever docs, judge docs, and integration docs.
- Added new examples:
  - Qdrant demo
  - LangChain integration demo
  - Evaluation dataset creation example
Changed
- Updated evaluator logic to use judges for matching retrieved content against expected answers or relevant labels.
- Improved metric behavior for empty inputs, invalid inputs, top-k handling, and score clamping.
- Revamped EvaluationDataset.
- Updated judge usage, integration install commands, and current API examples.
- Updated package metadata to point to kaivid-labs/evret.
Full Changelog: v0.0.1b...v0.0.1
v0.0.1b
Evret v0.0.1b0
Initial beta release of Evret, a lightweight framework for evaluating retrievers in RAG and search systems.
What's Changed
- Core IR metrics: Hit Rate, Recall, Precision, MRR, nDCG, Average Precision
- Evaluation pipeline with EvaluationDataset, Evaluator, and result exports
- Retriever adapters for Qdrant, Chroma, Weaviate, and Milvus
- LangChain and LlamaIndex integrations
- JSON/CSV dataset loading
- Basic examples and test coverage
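As a rough reference for some of the core metrics listed above, here is a minimal sketch of Hit Rate, Precision, Recall, and reciprocal rank (the per-query term of MRR) on doc IDs. The function names and conventions (for example dividing precision by k rather than by the number of results returned) are assumptions for illustration and may not match Evret's own definitions.

```python
from typing import Sequence, Set


def _hits(retrieved: Sequence[str], relevant: Set[str], k: int) -> int:
    """Count relevant documents among the top-k retrieved IDs."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant)


def hit_rate_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if _hits(retrieved, relevant, k) > 0 else 0.0


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled with relevant documents."""
    return _hits(retrieved, relevant, k) / k if k > 0 else 0.0


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    return _hits(retrieved, relevant, k) / len(relevant) if relevant else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document; 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}
scores = {
    "hit_rate@3": hit_rate_at_k(retrieved, relevant, 3),
    "precision@3": precision_at_k(retrieved, relevant, 3),
    "recall@3": recall_at_k(retrieved, relevant, 3),
    "rr": reciprocal_rank(retrieved, relevant),
}
```

MRR is then the mean of `reciprocal_rank` across all queries in a dataset.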
New Contributors
- @lucifertrj made the repo public.
Full Changelog: https://github.com/kaivid-labs/evret/commits/ff250dd