📢 Spark NLP 6.4.1: Late Chunking, Multimodal Embeddings, and Smarter Pretrained Workflows

Spark NLP 6.4.1 is a feature-rich follow-up release that expands retrieval and multimodal capabilities while improving pretrained model loading and production robustness. This release introduces two powerful new annotators: LateChunkEmbeddings for context-aware chunk embeddings and BiEncoderMultimodalEmbeddings for dual-encoder text-image retrieval workflows.

In addition, VectorDBConnector now supports multimodal image indexing, pretrained model resolution becomes smarter with preferred engine selection, and document ingestion workflows gain important reliability fixes for HTML readers and cloud-backed pretrained caches.

For a deeper walkthrough of the major additions in this release, see our Medium post: Spark NLP 6.4.1: Context-Aware Retrieval, Engine Control, and Multimodal Indexing.

🔥 Highlights

LateChunkEmbeddings: A new annotator that applies the late chunking technique so chunk embeddings are computed from token embeddings generated over the full document context rather than embedding each chunk independently.
BiEncoderMultimodalEmbeddings: A new multimodal dual-encoder annotator for generating aligned text and image embeddings for retrieval, search, and indexing workflows.
Multimodal VectorDBConnector: VectorDBConnector now supports indexing image embeddings in addition to text embeddings, enabling multimodal vector database pipelines.
Smarter .pretrained() resolution: Pretrained model loading now supports a preferred engine parameter and resolves duplicate model names more reliably by filtering first by annotator class.
Improved production robustness: Fixes for cloud-backed pretrained caches and HTML reader header handling make Spark NLP more reliable in serverless and cloud-storage-based workflows.

🚀 New Features & Enhancements

LateChunkEmbeddings

Spark NLP 6.4.1 introduces LateChunkEmbeddings, a new annotator based on the Late Chunking approach. Unlike traditional chunk embedding approaches that embed each chunk in isolation, this annotator assumes the upstream embedding model has already encoded the full document in a single pass. It then pools token embeddings corresponding to each chunk span, allowing every chunk embedding to retain broader document context.

This is especially valuable for:

Retrieval-augmented generation (RAG)
Semantic search over long documents
Chunk retrieval where local spans depend on earlier context

A follow-up improvement in this same release adds sentenceAwareFiltering, which is enabled by default. This restricts pooling to token embeddings within the chunk’s sentence boundary, reducing noise from overlapping tokens across adjacent sentences.

late_chunk = LateChunkEmbeddings() \
    .setInputCols(["document", "chunk", "token", "embeddings"]) \
    .setOutputCol("chunk_embeddings")
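
To make the pooling step concrete, here is a minimal pure-Python sketch of the idea, not the annotator's actual implementation. It assumes mean pooling over character-offset token spans. The names `pool_chunk`, `TokenEmbedding`, and the `sentence` argument are hypothetical stand-ins for illustration only.

```python
from typing import List, Optional, Tuple

# (begin, end, vector): character offsets of a token plus its embedding, which
# is assumed to come from one forward pass over the whole document.
TokenEmbedding = Tuple[int, int, List[float]]
Span = Tuple[int, int]


def pool_chunk(tokens: List[TokenEmbedding], chunk: Span,
               sentence: Optional[Span] = None) -> List[float]:
    """Mean-pool the token vectors that fall inside a chunk span."""
    begin, end = chunk
    if sentence is not None:
        # Hypothetical stand-in for sentence-aware filtering: clip the pooling
        # window to the sentence that contains the chunk.
        begin, end = max(begin, sentence[0]), min(end, sentence[1])
    selected = [vec for (b, e, vec) in tokens if b >= begin and e <= end]
    if not selected:
        return []  # no token lies inside the window
    dim = len(selected[0])
    return [sum(vec[i] for vec in selected) / len(selected) for i in range(dim)]


tokens = [(0, 2, [1.0, 0.0]), (4, 6, [3.0, 2.0]), (8, 10, [5.0, 4.0])]
whole = pool_chunk(tokens, (0, 10))                      # pools all three tokens
clipped = pool_chunk(tokens, (0, 10), sentence=(0, 6))   # pools the first two only
```

Because each token vector was produced with the full document in view, the pooled chunk vector carries context from outside the chunk, even though only the chunk's own tokens are averaged.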

See the LateChunkEmbeddings example notebook for a full walkthrough and practical usage examples. You can also read our Medium article for a broader explanation of why late chunking improves retrieval quality over naive chunk-first embedding pipelines.

BiEncoderMultimodalEmbeddings

This release adds BiEncoderMultimodalEmbeddings, a new annotator for multimodal dual-encoder workflows. It accepts aligned DOCUMENT and IMAGE annotations and emits two embedding outputs:

<outputCol>_doc_embeddings
<outputCol>_image_embeddings

This enables use cases such as:

text-to-image retrieval
image-to-text retrieval
multimodal document search
image/document semantic matching
multimodal indexing pipelines

The first supported implementation targets dual-encoder ONNX exports in the Ops-MM / Qwen2VL-style architecture.

mm = BiEncoderMultimodalEmbeddings.pretrained() \
    .setInputCols(["vision_pair_doc", "vision_pair_image"]) \
    .setOutputCol("mm")
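
Once text and image embeddings share one vector space, text-to-image retrieval reduces to a nearest-neighbor search. The sketch below illustrates that step in plain Python with cosine similarity. The vectors, file names, and the `rank_images` helper are made up for illustration and are not produced by or part of Spark NLP.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_images(query: List[float], images: Dict[str, List[float]]) -> List[str]:
    """Return image names ordered from most to least similar to the query."""
    return sorted(images, key=lambda name: cosine(query, images[name]), reverse=True)


# Toy 2-dimensional embeddings; real embeddings have far more dimensions.
query = [1.0, 0.0]
images = {"cat.png": [0.9, 0.1], "car.png": [0.0, 1.0], "dog.png": [0.5, 0.5]}
ranking = rank_images(query, images)
```

In a production pipeline the ranking would normally be delegated to a vector database rather than computed in memory.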

To see this in action, check out the BiEncoderMultimodalEmbeddings + Pinecone RAG notebook, which demonstrates multimodal retrieval and indexing in a complete end-to-end workflow. Our Medium post also covers how these multimodal embeddings fit into retrieval and vector search pipelines.

Multimodal Support in VectorDBConnector

VectorDBConnector now supports multimodal image indexing through a new modalityMode parameter.

Supported modes:

text (default): expects DOCUMENT + SENTENCE_EMBEDDINGS
image: expects IMAGE + SENTENCE_EMBEDDINGS

In image mode, Spark NLP augments metadata with image-specific fields such as origin, width, height, and channel count, and generates deterministic vector IDs derived from the image origin path for stable re-indexing.

This makes it possible to build end-to-end pipelines that extract images, compute embeddings, and store them directly in vector databases such as Pinecone.

vectorDB = VectorDBConnector() \
    .setInputCols(["image_assembler", "image_embeddings"]) \
    .setOutputCol("vectordb_result") \
    .setProvider("pinecone") \
    .setIndexName("my-multimodal-index") \
    .setModalityMode("image")
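
The value of deterministic IDs is that re-indexing the same image overwrites its existing vector instead of creating a duplicate. The release notes do not say how the ID is derived from the origin path, so the following is only an illustrative sketch that assumes a SHA-256 hash. The `vector_id` helper and the example paths are hypothetical.

```python
import hashlib


def vector_id(origin: str) -> str:
    """Derive a stable ID from an image origin path (assumed scheme: SHA-256)."""
    return hashlib.sha256(origin.encode("utf-8")).hexdigest()[:32]


first_run = vector_id("s3://my-bucket/images/cat.png")
second_run = vector_id("s3://my-bucket/images/cat.png")
other = vector_id("s3://my-bucket/images/dog.png")
```

The same path always maps to the same ID, so an upsert keyed on it is idempotent across pipeline runs.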

For a complete multimodal retrieval example, see the BiEncoderMultimodalEmbeddings + Pinecone RAG notebook, which demonstrates how multimodal embeddings and vector indexing work together in an end-to-end RAG pipeline.

Smarter Pretrained Model Resolution

Spark NLP 6.4.1 improves .pretrained() behavior in two important ways:

Pretrained models are now filtered by annotator class first, avoiding ambiguity when different annotators share the same model name.
.pretrained() now accepts an engine parameter, allowing users to prefer a specific engine when multiple versions of a model are available.

If no engine is specified, Spark NLP uses this priority order:

ONNX
TensorFlow
OpenVINO

This makes model loading more predictable and helps users better control runtime behavior across supported backends.
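
The resolution rules above can be sketched in plain Python as follows. This is an illustration of the described behavior, not the code in ResourceDownloader. The `ModelEntry` type, the `resolve` function, and the model names are hypothetical, and falling back to the default order when the preferred engine is unavailable is an assumption.

```python
from typing import List, NamedTuple, Optional


class ModelEntry(NamedTuple):
    name: str
    annotator: str
    engine: str


# Default priority stated in the release notes, highest first.
DEFAULT_PRIORITY = ["onnx", "tensorflow", "openvino"]


def resolve(entries: List[ModelEntry], name: str, annotator: str,
            engine: Optional[str] = None) -> Optional[ModelEntry]:
    # Step 1: filter by annotator class first, then by model name.
    candidates = [e for e in entries if e.annotator == annotator and e.name == name]
    if not candidates:
        return None
    # Step 2: honor the preferred engine when a matching entry exists.
    if engine is not None:
        preferred = [e for e in candidates if e.engine == engine]
        if preferred:
            return preferred[0]
    # Step 3 (assumed fallback): pick by default priority; unknown engines rank last.
    def rank(entry: ModelEntry) -> int:
        if entry.engine in DEFAULT_PRIORITY:
            return DEFAULT_PRIORITY.index(entry.engine)
        return len(DEFAULT_PRIORITY)
    return min(candidates, key=rank)


entries = [
    ModelEntry("small_bert", "BertEmbeddings", "tensorflow"),
    ModelEntry("small_bert", "BertEmbeddings", "onnx"),
    ModelEntry("small_bert", "BertSentenceEmbeddings", "openvino"),
]
default_pick = resolve(entries, "small_bert", "BertEmbeddings")
tf_pick = resolve(entries, "small_bert", "BertEmbeddings", engine="tensorflow")
```

Filtering by annotator class before anything else is what keeps the `BertSentenceEmbeddings` entry from being returned for a `BertEmbeddings` request, even though both share a model name.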

Cloud Pretrained Cache Reliability

This release fixes several issues affecting pretrained resource caching when cache_pretrained points to cloud-backed storage such as S3, GCS, or Azure Blob Storage.

Improvements include:

More reliable extraction of large pretrained resources into cloud caches
A completion marker to prevent incomplete cache directories from being reused
Correct handling of Azure wasbs:// cache paths through Hadoop FS
Explicit propagation of spark.hadoop.* params into Hadoop configuration
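
The completion-marker idea can be illustrated with a small local-filesystem sketch. It is not Spark NLP's implementation: the marker name `_EXTRACTED` and both helper functions are hypothetical, and a real cloud cache would go through Hadoop FS rather than `os`.

```python
import os
import tempfile

MARKER = "_EXTRACTED"  # hypothetical marker file name


def is_cache_complete(cache_dir: str) -> bool:
    """A cache directory is reusable only if the marker file exists."""
    return os.path.isfile(os.path.join(cache_dir, MARKER))


def populate_cache(cache_dir: str, files: dict) -> None:
    """Write all resource files, then write the marker as the very last step."""
    os.makedirs(cache_dir, exist_ok=True)
    for name, data in files.items():
        with open(os.path.join(cache_dir, name), "wb") as handle:
            handle.write(data)
    # If extraction is interrupted before this point, no marker is written,
    # so the partial directory is never mistaken for a complete cache.
    with open(os.path.join(cache_dir, MARKER), "w") as handle:
        handle.write("ok")


model_dir = os.path.join(tempfile.mkdtemp(), "model")
os.makedirs(model_dir)                      # simulates a leftover partial directory
before = is_cache_complete(model_dir)       # False: directory exists, marker does not
populate_cache(model_dir, {"weights.bin": b"\x00\x01"})
after = is_cache_complete(model_dir)        # True: marker written after all files
```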

Additionally, sparknlp.start() now includes a skip_sparknlp_maven option for developer workflows that need to run Spark NLP using a custom local JAR.

HTMLReader Improvements

HTMLReader receives important reliability and metadata improvements in this release:

Fixes failures caused by header serialization mismatches between Java and Scala map types
Adds paragraph-level metadata fields:
- paragraph_index
- paragraph_y
- page_y

These additions improve downstream layout-aware document understanding and help preserve ordering and spatial context in HTML ingestion workflows.
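
As a rough illustration of how the new field could be used downstream, the sketch below restores reading order from `paragraph_index`. It assumes the field is a numeric position stored as a string, as annotation metadata values are strings. The sample elements and the `in_reading_order` helper are made up for this example.

```python
from typing import Dict, List


def in_reading_order(elements: List[Dict]) -> List[Dict]:
    """Sort elements by paragraph_index, converting the string value to int."""
    return sorted(elements, key=lambda el: int(el["metadata"]["paragraph_index"]))


elements = [
    {"content": "Third paragraph", "metadata": {"paragraph_index": "2"}},
    {"content": "Eleventh paragraph", "metadata": {"paragraph_index": "10"}},
    {"content": "First paragraph", "metadata": {"paragraph_index": "0"}},
]
ordered = [el["content"] for el in in_reading_order(elements)]
```

The `int` conversion matters: sorting the raw strings would place "10" before "2".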

LLMEntityExtractor Improvements

This release also improves LLMEntityExtractor usability and robustness:

Updated example notebook to reflect parameter name changes
Improved getFewShotExamples handling for multiple input cases, particularly in notebook and Colab-style environments

Model Card Tag Cleanup

A metadata cleanup pass aligns the engine tags in Markdown model cards with each card's declared engine field, where one is available, so the tags are used consistently.

🐛 Bug Fixes

Fixed pretrained engine resolution edge cases in ResourceDownloader
Fixed cloud-cache extraction and incomplete cache reuse for pretrained resources
Fixed Azure wasbs:// cache handling
Fixed HTML reader header serialization issues
Improved paragraph metadata emission in HTMLReader
Improved LLMEntityExtractor few-shot example handling
Updated example notebooks and documentation consistency

❤️ Community Support

Slack: real-time discussion with the Spark NLP community and team
GitHub: issue tracking, feature requests, and contributions
Discussions: community ideas and showcases
Medium: latest Spark NLP articles and tutorials
YouTube: educational videos and demos

💻 Installation

Python

pip install spark-nlp==6.4.1

Spark Packages

CPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.4.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.4.1

Apple Silicon

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.4.1

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.4.1

Maven

spark-nlp

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.4.1</version>
</dependency>

spark-nlp-gpu

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.4.1</version>
</dependency>

spark-nlp-silicon

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.4.1</version>
</dependency>

spark-nlp-aarch64

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.4.1</version>
</dependency>

FAT JARs

CPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.4.1.jar
GPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.4.1.jar
Apple Silicon: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.4.1.jar
AArch64: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.4.1.jar

What's Changed

Adding preferred engine to pretrained and minor fixes #14689 by @ahmedlone127
Fix tags #14695 by @AbdullahMubeenAnwar
Update llama_cpp_in_Spark_NLP_LLMEntityExtractor.ipynb #14756 by @AbdullahMubeenAnwar
Add multimodal image support to VectorDBConnector #14760 by @AbdullahMubeenAnwar
Update GetFewShotExamples handling in LLMEntityExtractor #14762 by @ahmedlone127
[SPARKNLP-1369] Fix cloud pretrained cache handling for large models #14763 by @danilojsl
Add LateChunkEmbeddings annotator #14764 by @AbdullahMubeenAnwar
[SPARKNLP-1359] Implement BiEncoderMultimodalEmbeddings #14767 by @danilojsl
Fix pretrained engine parameter behavior #14769 by @ahmedlone127
[SPARKNLP-1378] HTMLReader Default Headers Error #14770 by @danilojsl
[SPARKNLP-1385] Add sentenceAwareFiltering param #14772 by @danilojsl

Full Changelog: 6.4.0...6.4.1
