📢 Spark NLP 6.4.1: Late Chunking, Multimodal Embeddings, and Smarter Pretrained Workflows
Spark NLP 6.4.1 is a feature-rich follow-up release that expands retrieval and multimodal capabilities while improving pretrained model loading and production robustness. This release introduces two powerful new annotators: LateChunkEmbeddings for context-aware chunk embeddings and BiEncoderMultimodalEmbeddings for dual-encoder text-image retrieval workflows.
In addition, VectorDBConnector now supports multimodal image indexing, pretrained model resolution becomes smarter with preferred engine selection, and document ingestion workflows gain important reliability fixes for HTML readers and cloud-backed pretrained caches.
For a deeper walkthrough of the major additions in this release, see our Medium post: Spark NLP 6.4.1: Context-Aware Retrieval, Engine Control, and Multimodal Indexing.
🔥 Highlights
- LateChunkEmbeddings: A new annotator that applies the late chunking technique so chunk embeddings are computed from token embeddings generated over the full document context rather than embedding each chunk independently.
- BiEncoderMultimodalEmbeddings: A new multimodal dual-encoder annotator for generating aligned text and image embeddings for retrieval, search, and indexing workflows.
- Multimodal VectorDBConnector: VectorDBConnector now supports indexing image embeddings in addition to text embeddings, enabling multimodal vector database pipelines.
- Smarter .pretrained() resolution: Pretrained model loading now supports a preferred engine parameter and resolves duplicate model names more reliably by filtering first by annotator class.
- Improved production robustness: Fixes for cloud-backed pretrained caches and HTML reader header handling make Spark NLP more reliable in serverless and cloud-storage-based workflows.
🚀 New Features & Enhancements
LateChunkEmbeddings
Spark NLP 6.4.1 introduces LateChunkEmbeddings, a new annotator based on the Late Chunking approach. Unlike traditional chunk embedding approaches that embed each chunk in isolation, this annotator assumes the upstream embedding model has already encoded the full document in a single pass. It then pools token embeddings corresponding to each chunk span, allowing every chunk embedding to retain broader document context.
This is especially valuable for:
- Retrieval-augmented generation (RAG)
- Semantic search over long documents
- Chunk retrieval where local spans depend on earlier context
A follow-up improvement in this same release adds sentenceAwareFiltering, which is enabled by default. This restricts pooling to token embeddings within the chunk’s sentence boundary, reducing noise from overlapping tokens across adjacent sentences.
late_chunk = LateChunkEmbeddings() \
.setInputCols(["document", "chunk", "token", "embeddings"]) \
.setOutputCol("chunk_embeddings")
See the LateChunkEmbeddings example notebook for a full walkthrough and practical usage examples. You can also read our Medium article for a broader explanation of why late chunking improves retrieval quality over naive chunk-first embedding pipelines.
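To make the pooling step concrete, here is a minimal pure-Python sketch of late chunking. It is not the annotator's implementation: mean pooling, the inclusive character-offset spans, and the way sentence_spans clips the pooling window are all assumptions made for illustration, and late_chunk_embeddings and mean_pool are hypothetical helper names.

```python
from typing import List, Optional, Tuple

Vector = List[float]
Span = Tuple[int, int]  # (begin, end) character offsets, end inclusive (assumed)


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk_embeddings(
    token_spans: List[Span],
    token_vectors: List[Vector],
    chunk_spans: List[Span],
    sentence_spans: Optional[List[Span]] = None,
) -> List[Vector]:
    """Pool full-document token embeddings over each chunk span.

    token_vectors are assumed to come from a single pass over the whole
    document, so each one already carries document-level context.
    """
    pooled = []
    for c_begin, c_end in chunk_spans:
        lo, hi = c_begin, c_end
        if sentence_spans is not None:
            # Hypothetical reading of sentence-aware filtering: clip the
            # pooling window to the sentence that contains the chunk start.
            for s_begin, s_end in sentence_spans:
                if s_begin <= c_begin <= s_end:
                    hi = min(c_end, s_end)
                    break
        selected = [
            vec
            for (t_begin, t_end), vec in zip(token_spans, token_vectors)
            if t_begin >= lo and t_end <= hi
        ]
        # A chunk that covers no token yields an empty vector in this sketch.
        pooled.append(mean_pool(selected) if selected else [])
    return pooled


# Four tokens in two sentences; the second chunk straddles the sentence break.
token_spans = [(0, 3), (5, 8), (10, 13), (15, 18)]
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
chunk_spans = [(0, 8), (5, 18)]
sentence_spans = [(0, 8), (10, 18)]

print(late_chunk_embeddings(token_spans, token_vectors, chunk_spans))
print(late_chunk_embeddings(token_spans, token_vectors, chunk_spans, sentence_spans))
```

In this toy input the second chunk pools three tokens without filtering and only one with it, which is the kind of cross-sentence noise the filtering is described as reducing.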
BiEncoderMultimodalEmbeddings
This release adds BiEncoderMultimodalEmbeddings, a new annotator for multimodal dual-encoder workflows. It accepts aligned DOCUMENT and IMAGE annotations and emits two embedding outputs:
- <outputCol>_doc_embeddings
- <outputCol>_image_embeddings
This enables use cases such as:
- text-to-image retrieval
- image-to-text retrieval
- multimodal document search
- image/document semantic matching
- multimodal indexing pipelines
The first supported implementation targets dual-encoder ONNX exports in the Ops-MM / Qwen2VL-style architecture.
mm = BiEncoderMultimodalEmbeddings.pretrained() \
.setInputCols(["vision_pair_doc", "vision_pair_image"]) \
.setOutputCol("mm")
To see this in action, check out the BiEncoderMultimodalEmbeddings + Pinecone RAG notebook, which demonstrates multimodal retrieval and indexing in a complete end-to-end workflow. Our Medium post also covers how these multimodal embeddings fit into retrieval and vector search pipelines.
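As a rough illustration of how aligned text and image embeddings can be used for text-to-image retrieval, the sketch below ranks image vectors by cosine similarity to a text vector. It assumes both vectors live in the same space, which is what a dual encoder is meant to provide. The vectors, file names, and the rank_images helper are made up for the example and are not part of Spark NLP.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(
    text_vector: List[float], image_vectors: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Return (image name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(text_vector, vec)) for name, vec in image_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors standing in for real embedding outputs.
query = [1.0, 0.0]
images = {
    "a.png": [0.0, 1.0],
    "b.png": [2.0, 0.0],
    "c.png": [1.0, 1.0],
}
for name, score in rank_images(query, images):
    print(f"{name}: {score:.3f}")
```

Swapping the roles of the two inputs (an image vector as the query, text vectors as candidates) gives image-to-text retrieval with the same scoring.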
Multimodal Support in VectorDBConnector
VectorDBConnector now supports multimodal image indexing through a new modalityMode parameter.
Supported modes:
- text (default): expects DOCUMENT + SENTENCE_EMBEDDINGS
- image: expects IMAGE + SENTENCE_EMBEDDINGS
In image mode, Spark NLP augments metadata with image-specific fields such as origin, width, height, and channel count, and generates deterministic vector IDs derived from the image origin path for stable re-indexing.
This makes it possible to build end-to-end pipelines that extract images, compute embeddings, and store them directly in vector databases such as Pinecone.
vectorDB = VectorDBConnector() \
.setInputCols(["image_assembler", "image_embeddings"]) \
.setOutputCol("vectordb_result") \
.setProvider("pinecone") \
.setIndexName("my-multimodal-index") \
.setModalityMode("image")
For a complete multimodal retrieval example, see the BiEncoderMultimodalEmbeddings + Pinecone RAG notebook, which demonstrates how multimodal embeddings and vector indexing work together in an end-to-end RAG pipeline.
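The idea behind deterministic vector IDs can be sketched in a few lines. The release notes do not say how the ID is derived from the origin path, so the SHA-256 digest below is an assumption, as are the record layout, the metadata key names, and the helper names image_vector_id and build_image_record.

```python
import hashlib
from typing import Dict, List


def image_vector_id(origin: str) -> str:
    """Derive a stable ID from the image origin path (SHA-256 is an assumption)."""
    return hashlib.sha256(origin.encode("utf-8")).hexdigest()


def build_image_record(
    origin: str, width: int, height: int, channels: int, embedding: List[float]
) -> Dict[str, object]:
    """Bundle an embedding with image metadata (key names are hypothetical)."""
    return {
        "id": image_vector_id(origin),
        "values": list(embedding),
        "metadata": {
            "origin": origin,
            "width": width,
            "height": height,
            "channels": channels,
        },
    }


first = build_image_record("s3://bucket/images/cat.png", 640, 480, 3, [0.1, 0.2])
again = build_image_record("s3://bucket/images/cat.png", 640, 480, 3, [0.3, 0.4])

# The same origin yields the same ID, so re-indexing overwrites the earlier
# vector instead of creating a duplicate entry.
print(first["id"] == again["id"])
```

Because the ID depends only on the origin path, moving or renaming a file would produce a new ID under this scheme.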
Smarter Pretrained Model Resolution
Spark NLP 6.4.1 improves .pretrained() behavior in two important ways:
- Pretrained models are now filtered by annotator class first, avoiding ambiguity when different annotators share the same model name.
- .pretrained() now accepts an engine parameter, allowing users to prefer a specific engine when multiple versions of a model are available.
If no engine is specified, Spark NLP uses this priority order:
- ONNX
- TensorFlow
- OpenVINO
This makes model loading more predictable and helps users better control runtime behavior across supported backends.
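The resolution order described above can be modelled with a small stand-alone function. This is only a sketch: the catalog structure, its field names, and the model and annotator names are hypothetical, and the fallback to the default priority when the preferred engine is unavailable is an assumption rather than documented behavior.

```python
from typing import Dict, List, Optional

# Default priority stated in the release notes, highest first.
DEFAULT_ENGINE_PRIORITY = ["onnx", "tensorflow", "openvino"]


def resolve_pretrained(
    catalog: List[Dict[str, str]],
    name: str,
    annotator_class: str,
    engine: Optional[str] = None,
) -> Optional[Dict[str, str]]:
    """Pick one catalog entry: annotator class first, then engine preference."""
    # Step 1: keep only entries for this annotator class, so that a model
    # name shared by several annotators is no longer ambiguous.
    candidates = [
        entry
        for entry in catalog
        if entry["name"] == name and entry["annotator"] == annotator_class
    ]
    if not candidates:
        return None

    # Step 2: honor the preferred engine when a matching entry exists.
    if engine is not None:
        for entry in candidates:
            if entry["engine"] == engine.lower():
                return entry
        # Assumption: otherwise fall through to the default priority.

    # Step 3: apply the default priority order.
    for preferred in DEFAULT_ENGINE_PRIORITY:
        for entry in candidates:
            if entry["engine"] == preferred:
                return entry
    return candidates[0]


catalog = [
    {"name": "shared_model", "annotator": "AnnotatorA", "engine": "tensorflow"},
    {"name": "shared_model", "annotator": "AnnotatorA", "engine": "onnx"},
    {"name": "shared_model", "annotator": "AnnotatorB", "engine": "openvino"},
]

print(resolve_pretrained(catalog, "shared_model", "AnnotatorA"))
print(resolve_pretrained(catalog, "shared_model", "AnnotatorA", engine="tensorflow"))
print(resolve_pretrained(catalog, "shared_model", "AnnotatorB"))
```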
Cloud Pretrained Cache Reliability
This release fixes several issues affecting pretrained resource caching when cache_pretrained points to cloud-backed storage such as S3, GCS, or Azure Blob Storage.
Improvements include:
- More reliable extraction of large pretrained resources into cloud caches
- A completion marker to prevent incomplete cache directories from being reused
- Correct handling of Azure wasbs:// cache paths through Hadoop FS
- Explicit propagation of spark.hadoop.* params into Hadoop configuration
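Two of the items above follow common patterns that are easy to sketch. The first function pair shows a completion marker: the marker file is written only after extraction finishes, so a directory left behind by an interrupted run is never treated as a valid cache. The last function shows the usual Spark convention of copying spark.hadoop.* entries into the Hadoop configuration with the prefix removed. This is an illustration on the local filesystem, not Spark NLP's code; the marker file name and all function names are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Dict

COMPLETION_MARKER = "_EXTRACTION_COMPLETE"  # hypothetical marker file name


def is_cache_complete(model_dir: Path) -> bool:
    """A cache directory is reusable only if the marker file exists."""
    return (model_dir / COMPLETION_MARKER).is_file()


def extract_to_cache(archive: Path, model_dir: Path) -> Path:
    """Extract an archive, writing the marker only after extraction succeeds."""
    if is_cache_complete(model_dir):
        return model_dir  # a previous run finished, so reuse its output
    model_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(model_dir)
    # Written last: a crash before this line leaves no marker behind.
    (model_dir / COMPLETION_MARKER).touch()
    return model_dir


def hadoop_conf_from_spark(spark_conf: Dict[str, str]) -> Dict[str, str]:
    """Copy spark.hadoop.* entries into a Hadoop-style dict, dropping the prefix."""
    prefix = "spark.hadoop."
    return {
        key[len(prefix):]: value
        for key, value in spark_conf.items()
        if key.startswith(prefix)
    }


# A throwaway local directory stands in for cloud storage here.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "model.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("weights.bin", b"\x00\x01")
    cache_dir = Path(tmp) / "cache" / "model"
    print(is_cache_complete(cache_dir))  # checked before extraction
    extract_to_cache(archive, cache_dir)
    print(is_cache_complete(cache_dir))  # checked after extraction

print(hadoop_conf_from_spark({"spark.hadoop.fs.s3a.endpoint": "http://localhost:9000"}))
```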
Additionally, sparknlp.start() now includes a skip_sparknlp_maven option for developer workflows that need to run Spark NLP using a custom local JAR.
HTMLReader Improvements
HTMLReader receives important reliability and metadata improvements in this release:
- Fixes failures caused by header serialization mismatches between Java and Scala map types
- Adds paragraph-level metadata fields: paragraph_index, paragraph_y, page_y
These additions improve downstream layout-aware document understanding and help preserve ordering and spatial context in HTML ingestion workflows.
LLMEntityExtractor Improvements
This release also improves LLMEntityExtractor usability and robustness:
- Updated example notebook to reflect parameter name changes
- Improved getFewShotExamples handling for multiple input cases, particularly in notebook and Colab-style environments
Model Card Tag Cleanup
A metadata cleanup pass aligns the engine tags in Markdown model cards with each card's declared engine field, where one is available, so the tags are used more consistently.
🐛 Bug Fixes
- Fixed pretrained engine resolution edge cases in ResourceDownloader
- Fixed cloud-cache extraction and incomplete cache reuse for pretrained resources
- Fixed Azure wasbs:// cache handling
- Fixed HTML reader header serialization issues
- Improved paragraph metadata emission in HTMLReader
- Improved LLMEntityExtractor few-shot example handling
- Updated example notebooks and documentation consistency
❤️ Community Support
- Slack: real-time discussion with the Spark NLP community and team
- GitHub: issue tracking, feature requests, and contributions
- Discussions: community ideas and showcases
- Medium: latest Spark NLP articles and tutorials
- YouTube: educational videos and demos
💻 Installation
Python
pip install spark-nlp==6.4.1
Spark Packages
CPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.4.1
Apple Silicon
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.4.1
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.4.1
Maven
spark-nlp
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.4.1</version>
</dependency>
spark-nlp-gpu
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.4.1</version>
</dependency>
spark-nlp-silicon
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.4.1</version>
</dependency>
spark-nlp-aarch64
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.4.1</version>
</dependency>
FAT JARs
- CPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.4.1.jar
- GPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.4.1.jar
- Apple Silicon: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.4.1.jar
- AArch64: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.4.1.jar
What's Changed
- Adding preferred engine to pretrained and minor fixes #14689 by @ahmedlone127
- Fix tags #14695 by @AbdullahMubeenAnwar
- Update llama_cpp_in_Spark_NLP_LLMEntityExtractor.ipynb #14756 by @AbdullahMubeenAnwar
- Add multimodal image support to VectorDBConnector #14760 by @AbdullahMubeenAnwar
- Update GetFewShotExamples handling in LLMEntityExtractor #14762 by @ahmedlone127
- [SPARKNLP-1369] Fix cloud pretrained cache handling for large models #14763 by @danilojsl
- Add LateChunkEmbeddings annotator #14764 by @AbdullahMubeenAnwar
- [SPARKNLP-1359] Implement BiEncoderMultimodalEmbeddings #14767 by @danilojsl
- Fix pretrained engine parameter behavior #14769 by @ahmedlone127
- [SPARKNLP-1378] HTMLReader Default Headers Error #14770 by @danilojsl
- [SPARKNLP-1385] Add sentenceAwareFiltering param #14772 by @danilojsl
Full Changelog: 6.4.0...6.4.1