6.4.1

@danilojsl released this 25 May 21:01
a03e59c

📢 Spark NLP 6.4.1: Late Chunking, Multimodal Embeddings, and Smarter Pretrained Workflows

Spark NLP 6.4.1 is a feature-rich follow-up release that expands retrieval and multimodal capabilities while improving pretrained model loading and production robustness. This release introduces two powerful new annotators: LateChunkEmbeddings for context-aware chunk embeddings and BiEncoderMultimodalEmbeddings for dual-encoder text-image retrieval workflows.

In addition, VectorDBConnector now supports multimodal image indexing, pretrained model resolution becomes smarter with preferred engine selection, and document ingestion workflows gain important reliability fixes for HTML readers and cloud-backed pretrained caches.

For a deeper walkthrough of the major additions in this release, see our Medium post: Spark NLP 6.4.1: Context-Aware Retrieval, Engine Control, and Multimodal Indexing.

🔥 Highlights

  • LateChunkEmbeddings: A new annotator that applies the late chunking technique so chunk embeddings are computed from token embeddings generated over the full document context rather than embedding each chunk independently.
  • BiEncoderMultimodalEmbeddings: A new multimodal dual-encoder annotator for generating aligned text and image embeddings for retrieval, search, and indexing workflows.
  • Multimodal VectorDBConnector: VectorDBConnector now supports indexing image embeddings in addition to text embeddings, enabling multimodal vector database pipelines.
  • Smarter .pretrained() resolution: Pretrained model loading now supports a preferred engine parameter and resolves duplicate model names more reliably by filtering first by annotator class.
  • Improved production robustness: Fixes for cloud-backed pretrained caches and HTML reader header handling make Spark NLP more reliable in serverless and cloud-storage-based workflows.

🚀 New Features & Enhancements

LateChunkEmbeddings

Spark NLP 6.4.1 introduces LateChunkEmbeddings, a new annotator based on the Late Chunking approach. Unlike traditional chunk embedding approaches that embed each chunk in isolation, this annotator assumes the upstream embedding model has already encoded the full document in a single pass. It then pools token embeddings corresponding to each chunk span, allowing every chunk embedding to retain broader document context.
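
The pooling step described above can be illustrated with a small, library-free sketch. It is not the annotator's implementation: mean pooling, inclusive character offsets, and the helper names `mean_pool` and `late_chunk_embeddings` are all assumptions made for illustration. The only point it carries over from the text is that the token vectors come from one pass over the whole document and are pooled per chunk afterwards.

```python
from typing import List, Tuple

Vector = List[float]
Span = Tuple[int, int]  # (begin, end) character offsets, end inclusive (assumption)


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    count = len(vectors)
    return [sum(dim) / count for dim in zip(*vectors)]


def late_chunk_embeddings(
    token_spans: List[Span],
    token_vectors: List[Vector],
    chunk_spans: List[Span],
) -> List[Vector]:
    """Pool full-document token vectors over each chunk span.

    A token belongs to a chunk when it lies entirely inside the chunk span.
    Mean pooling is an assumption of this sketch, not a documented default.
    """
    pooled = []
    for chunk_begin, chunk_end in chunk_spans:
        members = [
            vec
            for (tok_begin, tok_end), vec in zip(token_spans, token_vectors)
            if tok_begin >= chunk_begin and tok_end <= chunk_end
        ]
        if not members:
            raise ValueError(f"no tokens inside chunk span ({chunk_begin}, {chunk_end})")
        pooled.append(mean_pool(members))
    return pooled


# Toy document: four tokens whose vectors were produced in ONE pass over the full text.
token_spans = [(0, 3), (5, 8), (10, 13), (15, 18)]
token_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 6.0]]
chunk_spans = [(0, 8), (10, 18)]

chunk_vectors = late_chunk_embeddings(token_spans, token_vectors, chunk_spans)
print(chunk_vectors)
```

In a chunk-first pipeline each chunk would be encoded separately, so the token vectors themselves would differ; here only the pooling boundaries change, which is what lets each chunk vector keep document-level context.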

This is especially valuable for:

  • Retrieval-augmented generation (RAG)
  • Semantic search over long documents
  • Chunk retrieval where local spans depend on earlier context

This release also adds the sentenceAwareFiltering option, which is enabled by default. It restricts pooling to token embeddings that fall within the chunk’s sentence boundary, which reduces noise from tokens that overlap adjacent sentences.

late_chunk = LateChunkEmbeddings() \
    .setInputCols(["document", "chunk", "token", "embeddings"]) \
    .setOutputCol("chunk_embeddings")
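
To make the effect of sentenceAwareFiltering more concrete, here is a hedged, library-free sketch of the token selection it describes. The function names, the inclusive offsets, and the rule that a chunk's sentence is the one containing its first character are assumptions of this sketch; the annotator may determine the sentence differently.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end inclusive (assumption)


def overlaps(a: Span, b: Span) -> bool:
    """True when two inclusive spans share at least one character."""
    return a[0] <= b[1] and b[0] <= a[1]


def contains(outer: Span, inner: Span) -> bool:
    """True when `inner` lies entirely inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def select_token_indices(
    chunk: Span,
    token_spans: List[Span],
    sentence_spans: List[Span],
    sentence_aware: bool = True,
) -> List[int]:
    """Return the indices of tokens whose vectors would be pooled for `chunk`."""
    candidates = [i for i, tok in enumerate(token_spans) if overlaps(tok, chunk)]
    if not sentence_aware:
        return candidates
    # Assumption: the chunk's sentence is the one containing the chunk's first character.
    home = next((s for s in sentence_spans if s[0] <= chunk[0] <= s[1]), None)
    if home is None:
        return candidates
    return [i for i in candidates if contains(home, token_spans[i])]


# Two sentences: "Cats purr." (offsets 0-9) and "Dogs bark." (offsets 11-20).
sentence_spans = [(0, 9), (11, 20)]
token_spans = [(0, 3), (5, 8), (9, 9), (11, 14), (16, 19), (20, 20)]
# A chunk that starts in the first sentence and spills into the second.
chunk = (5, 14)

filtered = select_token_indices(chunk, token_spans, sentence_spans)
unfiltered = select_token_indices(chunk, token_spans, sentence_spans, sentence_aware=False)
print(filtered, unfiltered)
```

With filtering on, the token "Dogs" from the neighbouring sentence is dropped from the pool even though the chunk span touches it.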

See the LateChunkEmbeddings example notebook for a full walkthrough and practical usage examples. You can also read our Medium article for a broader explanation of why late chunking improves retrieval quality over naive chunk-first embedding pipelines.

BiEncoderMultimodalEmbeddings

This release adds BiEncoderMultimodalEmbeddings, a new annotator for multimodal dual-encoder workflows. It accepts aligned DOCUMENT and IMAGE annotations and emits two embedding outputs:

  • <outputCol>_doc_embeddings
  • <outputCol>_image_embeddings

This enables use cases such as:

  • text-to-image retrieval
  • image-to-text retrieval
  • multimodal document search
  • image/document semantic matching
  • multimodal indexing pipelines

The first supported implementation targets dual-encoder ONNX exports in the Ops-MM / Qwen2VL-style architecture.

mm = BiEncoderMultimodalEmbeddings.pretrained() \
    .setInputCols(["vision_pair_doc", "vision_pair_image"]) \
    .setOutputCol("mm")
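
Because a dual encoder places text and images in a shared vector space, text-to-image retrieval reduces to a nearest-neighbour search over the two embedding outputs. The sketch below shows that idea with plain Python lists. The vectors, file names, and the choice of cosine similarity are illustrative assumptions, not values or behaviour taken from the annotator.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_images(
    query_vec: List[float], image_vecs: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score every image against the text query and sort best match first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in image_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings standing in for the text-side and image-side outputs.
query_embedding = [1.0, 0.0, 0.0]
image_embeddings = {
    "cat.png": [0.9, 0.1, 0.0],
    "car.png": [0.0, 1.0, 0.0],
    "dog.png": [0.5, 0.5, 0.0],
}

ranking = rank_images(query_embedding, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Image-to-text retrieval is the same computation with the roles swapped: an image embedding becomes the query and the text embeddings become the candidates.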

To see this in action, check out the BiEncoderMultimodalEmbeddings + Pinecone RAG notebook, which demonstrates multimodal retrieval and indexing in a complete end-to-end workflow. Our Medium post also covers how these multimodal embeddings fit into retrieval and vector search pipelines.

Multimodal Support in VectorDBConnector

VectorDBConnector now supports multimodal image indexing through a new modalityMode parameter.

Supported modes:

  • text (default): expects DOCUMENT + SENTENCE_EMBEDDINGS
  • image: expects IMAGE + SENTENCE_EMBEDDINGS

In image mode, Spark NLP augments metadata with image-specific fields such as origin, width, height, and channel count, and generates deterministic vector IDs derived from the image origin path for stable re-indexing.
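
The value of deterministic IDs is easiest to see in a small sketch. The release notes do not state how the ID is derived from the origin path, so the SHA-256 hash, the 32-character truncation, and the metadata key names below are all assumptions chosen for illustration.

```python
import hashlib
from typing import Dict, Union


def vector_id_from_origin(origin: str) -> str:
    """Derive a stable ID from an image origin path.

    The hash function and truncation length are assumptions of this sketch.
    """
    digest = hashlib.sha256(origin.encode("utf-8")).hexdigest()
    return digest[:32]


def image_metadata(
    origin: str, width: int, height: int, channels: int
) -> Dict[str, Union[str, int]]:
    """Bundle the image-specific fields named above (key names are hypothetical)."""
    return {"origin": origin, "width": width, "height": height, "channels": channels}


id_a = vector_id_from_origin("s3://my-bucket/images/cat.png")
id_a_again = vector_id_from_origin("s3://my-bucket/images/cat.png")
id_b = vector_id_from_origin("s3://my-bucket/images/dog.png")

print(id_a == id_a_again, id_a == id_b)
print(image_metadata("s3://my-bucket/images/cat.png", 640, 480, 3))
```

Since the same origin always maps to the same ID, re-running an indexing job over unchanged images would overwrite existing entries in an upsert-style store rather than create duplicates.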

This makes it possible to build end-to-end pipelines that extract images, compute embeddings, and store them directly in vector databases such as Pinecone.

vectorDB = VectorDBConnector() \
    .setInputCols(["image_assembler", "image_embeddings"]) \
    .setOutputCol("vectordb_result") \
    .setProvider("pinecone") \
    .setIndexName("my-multimodal-index") \
    .setModalityMode("image")

For a complete multimodal retrieval example, see the BiEncoderMultimodalEmbeddings + Pinecone RAG notebook, which demonstrates how multimodal embeddings and vector indexing work together in an end-to-end RAG pipeline.

Smarter Pretrained Model Resolution

Spark NLP 6.4.1 improves .pretrained() behavior in two important ways:

  • Pretrained models are now filtered by annotator class first, avoiding ambiguity when different annotators share the same model name.
  • .pretrained() now accepts an engine parameter, allowing users to prefer a specific engine when multiple versions of a model are available.

If no engine is specified, Spark NLP uses this priority order, highest first:

  • ONNX
  • TensorFlow
  • OpenVINO

This makes model loading more predictable and helps users better control runtime behavior across supported backends.
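
The resolution rules above can be sketched as a small lookup function. This is not the ResourceDownloader code: the `ModelEntry` type, the `resolve` function, the catalog contents, and the fallback to the default order when the preferred engine is unavailable are assumptions made for illustration. The priority list is read as highest first.

```python
from dataclasses import dataclass
from typing import List, Optional

# Default order from the release notes, read here as highest priority first.
DEFAULT_ENGINE_PRIORITY = ["onnx", "tensorflow", "openvino"]


@dataclass(frozen=True)
class ModelEntry:
    """Hypothetical catalog record: one published variant of a model."""
    name: str
    annotator: str
    engine: str


def resolve(
    catalog: List[ModelEntry],
    name: str,
    annotator: str,
    engine: Optional[str] = None,
) -> ModelEntry:
    """Pick one catalog entry for (annotator, name), honouring a preferred engine."""
    # Step 1: filter by annotator class first, so shared names are not ambiguous.
    candidates = [m for m in catalog if m.annotator == annotator and m.name == name]
    if not candidates:
        raise LookupError(f"no model named {name!r} for annotator {annotator!r}")
    # Step 2: use the preferred engine when a matching variant exists.
    if engine is not None:
        for entry in candidates:
            if entry.engine == engine.lower():
                return entry
    # Step 3: otherwise fall back to the default priority order (assumed behaviour).
    for default_engine in DEFAULT_ENGINE_PRIORITY:
        for entry in candidates:
            if entry.engine == default_engine:
                return entry
    return candidates[0]


# Made-up catalog: two annotators share the model name "demo_model".
catalog = [
    ModelEntry("demo_model", "AnnotatorA", "tensorflow"),
    ModelEntry("demo_model", "AnnotatorA", "onnx"),
    ModelEntry("demo_model", "AnnotatorB", "openvino"),
]

print(resolve(catalog, "demo_model", "AnnotatorA").engine)
print(resolve(catalog, "demo_model", "AnnotatorA", engine="tensorflow").engine)
print(resolve(catalog, "demo_model", "AnnotatorB").engine)
```

Filtering by annotator before anything else is what keeps the `AnnotatorB` entry from ever being returned for an `AnnotatorA` request, even though both use the same model name.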

Cloud Pretrained Cache Reliability

This release fixes several issues affecting pretrained resource caching when cache_pretrained points to cloud-backed storage such as S3, GCS, or Azure Blob Storage.

Improvements include:

  • More reliable extraction of large pretrained resources into cloud caches
  • A completion marker to prevent incomplete cache directories from being reused
  • Correct handling of Azure wasbs:// cache paths through Hadoop FS
  • Explicit propagation of spark.hadoop.* params into Hadoop configuration
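
The completion-marker idea in the list above can be shown with a short sketch. It uses the local filesystem as a stand-in for cloud storage, and the marker file name `_EXTRACTION_COMPLETE` and both helper functions are hypothetical; the release notes do not say what the marker is called or how it is written.

```python
import os
import tempfile
from typing import Dict

MARKER = "_EXTRACTION_COMPLETE"  # hypothetical marker file name


def is_cache_usable(cache_dir: str) -> bool:
    """A cache directory is reusable only if the completion marker is present."""
    return os.path.isfile(os.path.join(cache_dir, MARKER))


def populate_cache(cache_dir: str, files: Dict[str, bytes]) -> None:
    """Write all resource files, then write the marker as the very last step."""
    os.makedirs(cache_dir, exist_ok=True)
    for relative_name, data in files.items():
        with open(os.path.join(cache_dir, relative_name), "wb") as handle:
            handle.write(data)
    # If extraction is interrupted before this point, no marker exists,
    # so the half-written directory is never mistaken for a valid cache.
    with open(os.path.join(cache_dir, MARKER), "w") as handle:
        handle.write("ok")


root = tempfile.mkdtemp()

# Simulate an interrupted extraction: files exist but the marker was never written.
partial_dir = os.path.join(root, "model_partial")
os.makedirs(partial_dir)
with open(os.path.join(partial_dir, "weights.bin"), "wb") as handle:
    handle.write(b"\x00")

# Simulate a successful extraction.
complete_dir = os.path.join(root, "model_complete")
populate_cache(complete_dir, {"weights.bin": b"\x00\x01"})

print(is_cache_usable(partial_dir), is_cache_usable(complete_dir))
```

Writing the marker last means a reader only needs a single existence check to decide whether to reuse the directory or extract the resource again.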

Additionally, sparknlp.start() now includes a skip_sparknlp_maven option for developer workflows that need to run Spark NLP using a custom local JAR.

HTMLReader Improvements

HTMLReader receives important reliability and metadata improvements in this release:

  • Fixes failures caused by header serialization mismatches between Java and Scala map types
  • Adds paragraph-level metadata fields:
    • paragraph_index
    • paragraph_y
    • page_y

These additions improve downstream layout-aware document understanding and help preserve ordering and spatial context in HTML ingestion workflows.

LLMEntityExtractor Improvements

This release also improves LLMEntityExtractor usability and robustness:

  • Updated example notebook to reflect parameter name changes
  • Improved getFewShotExamples handling for multiple input cases, particularly in notebook and Colab-style environments

Model Card Tag Cleanup

A metadata cleanup pass aligns the engine tags in Markdown model cards with each card's declared engine field, where available, so tags are used more consistently.

🐛 Bug Fixes

  • Fixed pretrained engine resolution edge cases in ResourceDownloader
  • Fixed cloud-cache extraction and incomplete cache reuse for pretrained resources
  • Fixed Azure wasbs:// cache handling
  • Fixed HTML reader header serialization issues
  • Improved paragraph metadata emission in HTMLReader
  • Improved LLMEntityExtractor few-shot example handling
  • Updated example notebooks and documentation consistency

❤️ Community Support

  • Slack: real-time discussion with the Spark NLP community and team
  • GitHub: issue tracking, feature requests, and contributions
  • Discussions: community ideas and showcases
  • Medium: latest Spark NLP articles and tutorials
  • YouTube: educational videos and demos

💻 Installation

Python

pip install spark-nlp==6.4.1

Spark Packages

CPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.4.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.4.1

Apple Silicon

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.4.1

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.4.1

Maven

spark-nlp

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.4.1</version>
</dependency>

spark-nlp-gpu

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.4.1</version>
</dependency>

spark-nlp-silicon

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.4.1</version>
</dependency>

spark-nlp-aarch64

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.4.1</version>
</dependency>

FAT JARs

  • CPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.4.1.jar
  • GPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.4.1.jar
  • Apple Silicon: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.4.1.jar
  • AArch64: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.4.1.jar

What's Changed

Full Changelog: 6.4.0...6.4.1