Releases: kaivid-labs/evret

v0.0.3

18 Jun 12:41

Added

  • Added LLM-assisted evaluation dataset generation.
  • integration: Haystack integration and Elasticsearch retriever support.
  • example: Streamlit evaluation dashboard example.
  • Improved retriever, dataset loading, and metric logging.
  • Updated token-overlap judge defaults.

Evaluation Dataset Improvements

  • QueryExample now supports expected_doc_ids.
  • When expected_doc_ids are present, Evaluator compares retrieved doc_ids directly instead of relying on answer-text judging. JSON and CSV dataset loading both support this field.

Aliases supported:

  • expected_doc_ids
  • relevant_doc_ids
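
The direct doc-ID comparison described above can be sketched roughly as follows. This is a minimal illustration and not Evret's actual implementation: only the field names expected_doc_ids and relevant_doc_ids come from these notes, while the helper names read_expected_doc_ids and hit_and_recall_at_k are hypothetical. It also assumes the IDs have already been parsed into a list (a CSV loader would have to split the cell first).

```python
from __future__ import annotations

from typing import Iterable, Mapping, Sequence

# Field names from the release notes; everything else in this sketch is hypothetical.
DOC_ID_ALIASES = ("expected_doc_ids", "relevant_doc_ids")


def read_expected_doc_ids(record: Mapping[str, object]) -> list[str]:
    """Return expected doc IDs from a raw dataset record, accepting either alias."""
    for key in DOC_ID_ALIASES:
        value = record.get(key)
        if value:
            # Assumes the value is already a list-like of IDs, not a raw string.
            return [str(doc_id) for doc_id in value]
    return []


def hit_and_recall_at_k(
    retrieved: Sequence[str], expected: Iterable[str], k: int
) -> tuple[bool, float]:
    """Compare the top-k retrieved IDs against the expected IDs by set overlap."""
    expected_set = set(expected)
    if not expected_set or k <= 0:
        return False, 0.0
    matched = set(retrieved[:k]) & expected_set
    return bool(matched), len(matched) / len(expected_set)


record = {"query": "what is RBP?", "relevant_doc_ids": ["d1", "d3"]}
expected = read_expected_doc_ids(record)
hit, recall = hit_and_recall_at_k(["d2", "d3", "d9"], expected, k=2)
```

Because the comparison is a plain set intersection, no judge or answer text is involved once document IDs are available.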

Examples And UI

Added a Streamlit dashboard example for running and visualizing evaluations:

  • examples/evals-streamlit-dashboard/index.py
  • examples/evals-streamlit-dashboard/run_evals_ui.py

Dependencies

Updated pyproject.toml:

  • Version bumped from 0.0.2 to 0.0.3.

  • Added runtime dependency:

    • tqdm>=4.67.0
  • Added optional extras:

    • elasticsearch
    • haystack

Full Changelog: v0.0.2...v0.0.3

v0.0.2

10 May 22:27

Evret 0.0.2

Added

  • [new-metric] Added ERR@k metric for cascade-style graded relevance evaluation.
  • [new-metric] Added RBP@k metric with tunable persistence/user-patience weighting.
  • Structured logging utilities: get_logger, configure_logging, and JSON log formatting.
  • Added a tracing and monitoring notebook.
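
For readers unfamiliar with the two new metrics, here is a short sketch of their textbook definitions. It is not taken from Evret's source: the function names, parameter names, and the default persistence of 0.8 are assumptions made for illustration. ERR uses the common mapping from an integer grade g to a satisfaction probability (2^g - 1) / 2^max_grade, and RBP is truncated at k.

```python
from __future__ import annotations

from typing import Sequence


def err_at_k(grades: Sequence[int], k: int, max_grade: int) -> float:
    """Expected Reciprocal Rank over the top-k graded results (cascade user model)."""
    score = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades[:k], start=1):
        p_satisfied = (2 ** grade - 1) / (2 ** max_grade)
        score += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return score


def rbp_at_k(gains: Sequence[float], k: int, persistence: float = 0.8) -> float:
    """Rank-Biased Precision truncated at k; gains are relevance values in [0, 1]."""
    if not 0.0 <= persistence < 1.0:
        raise ValueError("persistence must be in [0, 1)")
    return (1.0 - persistence) * sum(
        gain * persistence ** i for i, gain in enumerate(gains[:k])
    )


err = err_at_k([1, 1], k=2, max_grade=1)
rbp = rbp_at_k([1.0, 0.0, 1.0], k=3, persistence=0.5)
```

A higher persistence models a more patient user who reads deeper into the ranking, so lower-ranked results contribute more to the score.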

Changed

  • Changed evaluation dataset semantics, moving from relevant_doc_ids toward expected_answers.
  • Improved TokenOverlapJudge matching logic, including negation handling and better overlap scoring.
  • Reworked quickstart, architecture, dataset-format, metrics, and judge docs.

Full Changelog: v0.0.1...v0.0.2

v0.0.1

05 May 11:30

Evret 0.0.1

Added

Added a pluggable relevance judge system for text-based evaluation.

  • Added TokenOverlapJudge as the default judge.
  • Added semantic and LLM judge support with optional extras:
    • semantic
    • llm-openai
    • llm-anthropic
    • llm-google
    • judges
  • Added support for expected_answers in evaluation datasets, alongside classic relevant_doc_ids.
  • Added LangChain adapter support in both directions:
    • Evret retriever as a LangChain retriever
    • LangChain retriever as an Evret retriever
  • Added metric helper internals for ranking, DCG, set operations, and validation.
  • Added full MkDocs documentation site with quickstart, architecture, API docs, metric docs, retriever docs, judge docs, and integration docs.
  • Added new examples:
    • Qdrant demo
    • LangChain integration demo
    • Evaluation dataset creation example

Changed

  • Updated evaluator logic to use judges for matching retrieved content against expected answers or relevant labels.
  • Improved metric behavior for empty inputs, invalid inputs, top-k handling, and score clamping.
  • Revamped EvaluationDataset.
  • Updated judge usage, integration install commands, and current API examples.
  • Updated package metadata to point to kaivid-labs/evret.

Full Changelog: v0.0.1b...v0.0.1

v0.0.1b

05 May 09:26
ae93992

v0.0.1b Pre-release
Evret v0.0.1b0

Initial beta release of Evret, a lightweight framework for evaluating retrievers in RAG and search systems.

What's Changed

  • Core IR metrics: Hit Rate, Recall, Precision, MRR, nDCG, Average Precision
  • Evaluation pipeline with EvaluationDataset, Evaluator, and result exports
  • Retriever adapters for Qdrant, Chroma, Weaviate, and Milvus
  • LangChain and LlamaIndex integrations
  • JSON/CSV dataset loading
  • Basic examples and test coverage
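
As a reminder of what two of the listed metrics compute, the sketch below gives standard binary-relevance definitions of reciprocal rank (the per-query term averaged by MRR) and nDCG@k. The function names are hypothetical and the code is not Evret's implementation; it also assumes the retrieved list contains no duplicate IDs.

```python
from __future__ import annotations

import math
from typing import Sequence


def reciprocal_rank(retrieved: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


rr = reciprocal_rank(["a", "b", "c"], {"b"})
ndcg = ndcg_at_k(["x", "a"], {"a"}, k=2)
```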

Full Changelog: https://github.com/kaivid-labs/evret/commits/ff250dd