This document explains how to integrate OpenLineage with Knowledge Catalog (formerly Dataplex Universal Catalog) to visualize lineage data from various systems.
Overview
OpenLineage is an open platform for collecting and analyzing data lineage information. Using an open standard for lineage data, OpenLineage captures lineage events from data pipeline components which use an OpenLineage API to report on runs, jobs, and datasets.
Through the Data Lineage API, you can import OpenLineage events to display in the Knowledge Catalog web interface alongside lineage information from Google Cloud services, such as BigQuery, Managed Service for Apache Airflow, Cloud Data Fusion, and Managed Service for Apache Spark.
To import OpenLineage events that use the OpenLineage specification, use the ProcessOpenLineageRunEvent REST API method, and map OpenLineage facets to Data Lineage API attributes.
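As a rough illustration of what such an event contains, the following Python sketch assembles a minimal OpenLineage run event as a plain dictionary. The helper name `make_run_event`, the example producer URI, and the choice of a random UUID for the run ID are assumptions made for this sketch; they are not part of the Data Lineage API or of any client library.

```python
import uuid
from datetime import datetime, timezone


def make_run_event(job_namespace, job_name, inputs, outputs,
                   event_type="COMPLETE",
                   producer="https://example.com/my-producer"):
    """Build a minimal OpenLineage run event as a plain dict.

    `inputs` and `outputs` are lists of (namespace, name) tuples.
    The producer URI default is a placeholder, not a real producer.
    """
    return {
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "eventType": event_type,
        "producer": producer,
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent",
        # Assumption: a freshly generated UUID identifies this run.
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": n} for ns, n in inputs],
        "outputs": [{"namespace": ns, "name": n} for ns, n in outputs],
    }


event = make_run_event(
    "customnamespace", "somename",
    inputs=[("customnamespace", "source_table")],
    outputs=[("customnamespace", "target_table")],
)
```

The resulting dictionary can be serialized with `json.dumps` and used as the request body for ProcessOpenLineageRunEvent.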
Limitations
The Data Lineage API supports OpenLineage major version 1.
The Data Lineage API endpoint ProcessOpenLineageRunEvent only acts as a consumer of OpenLineage messages, not a producer. The API lets you send lineage information generated by any OpenLineage-compliant tool or system into Knowledge Catalog. Some Google Cloud services, such as Managed Service for Apache Spark and Managed Airflow, include built-in OpenLineage producers that can send events to this endpoint, automating lineage capture from those services.
The Data Lineage API doesn't support the following:
- Any subsequent OpenLineage release with message format changes
- DatasetEvent
- JobEvent
Maximum size of a single message is 5 MB.
Length of each Fully Qualified Name in inputs and outputs is limited to 4000 characters.
Links are grouped by events, with a maximum of 100 links per event. The maximum aggregate number of table-level links is 1000. If a message contains more than 1500 column-level links, the column-level information is skipped.
Knowledge Catalog displays a lineage graph for each job run, showing the inputs and outputs of lineage events. It doesn't support lower-level processes such as Spark stages.
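Some of these limits can be checked on the client before an event is sent. The following Python sketch is one possible pre-flight check, not an official validator. The function name `preflight` is hypothetical, "5 MB" is assumed to mean 5 * 1024 * 1024 bytes, and the name-length check only approximates the limit by adding the lengths of the raw namespace and name, because the exact form of the fully qualified name is not described here. Link-count limits are not checked.

```python
import json

MAX_MESSAGE_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" read as 5 MiB
MAX_NAME_CHARS = 4000


def preflight(event):
    """Return a list of problems found in an event dict; empty if none."""
    problems = []

    # Only run events are accepted; a run event carries a "run" object.
    if "run" not in event:
        problems.append("event has no 'run' object, so it is not a run event")

    # Check the serialized size of the whole message.
    size = len(json.dumps(event).encode("utf-8"))
    if size > MAX_MESSAGE_BYTES:
        problems.append(f"message is {size} bytes, limit is {MAX_MESSAGE_BYTES}")

    # Approximate the fully qualified name length from namespace + name.
    for side in ("inputs", "outputs"):
        for dataset in event.get(side, []):
            length = len(dataset.get("namespace", "")) + len(dataset.get("name", ""))
            if length > MAX_NAME_CHARS:
                problems.append(
                    f"{side} dataset name is about {length} characters, "
                    f"limit is {MAX_NAME_CHARS}"
                )
    return problems
```

An empty list means that none of the checked limits was exceeded; the API can still reject the event for other reasons.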
OpenLineage mapping
For information about OpenLineage mapping, see OpenLineage mapping.
Import an OpenLineage event
If you haven't yet set up OpenLineage, see Getting started.
To import an OpenLineage event into Knowledge Catalog, call the API method ProcessOpenLineageRunEvent.
C#
Before trying this sample, follow the C# setup instructions in the Knowledge Catalog quickstart using client libraries. For more information, see the Knowledge Catalog C# API reference documentation.
To authenticate to Knowledge Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Go
Before trying this sample, follow the Go setup instructions in the Knowledge Catalog quickstart using client libraries. For more information, see the Knowledge Catalog Go API reference documentation.
To authenticate to Knowledge Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Java
Before trying this sample, follow the Java setup instructions in the Knowledge Catalog quickstart using client libraries. For more information, see the Knowledge Catalog Java API reference documentation.
To authenticate to Knowledge Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
Before trying this sample, follow the Python setup instructions in the Knowledge Catalog quickstart using client libraries. For more information, see the Knowledge Catalog Python API reference documentation.
To authenticate to Knowledge Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Ruby
Before trying this sample, follow the Ruby setup instructions in the Knowledge Catalog quickstart using client libraries. For more information, see the Knowledge Catalog Ruby API reference documentation.
To authenticate to Knowledge Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
REST
To import an OpenLineage event, use the processOpenLineageRunEvent method.
Before using any of the request data, make the following replacements:
- PROJECT_ID: your Google Cloud project ID.
- LOCATION_ID: the Google Cloud location, such as us-central1.
HTTP method and URL:
POST https://datalineage.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID:processOpenLineageRunEvent
Request JSON body:
{
  "eventTime": "2023-04-04T13:21:16.098Z",
  "eventType": "COMPLETE",
  "inputs": [
    {
      "name": "somename",
      "namespace": "customnamespace"
    }
  ],
  "job": {
    "name": "somename",
    "namespace": "customnamespace"
  },
  "outputs": [
    {
      "name": "somename",
      "namespace": "customnamespace"
    }
  ],
  "producer": "someproducer",
  "run": {
    "runId": "somerunid"
  },
  "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent"
}
You should receive a JSON response similar to the following:
{
  "process": "projects/my-project/locations/us-central1/processes/my-process",
  "run": "projects/my-project/locations/us-central1/processes/my-process/runs/my-run",
  "lineageEvents": [
    "projects/my-project/locations/us-central1/processes/my-process/runs/my-run/lineageEvents/my-lineage-event"
  ]
}
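The REST call can also be sketched with only the Python standard library. This is a minimal illustration rather than the client-library sample: the helper names `build_request` and `send` are hypothetical, and the sketch assumes that you already hold an OAuth 2.0 access token (for example, one printed by `gcloud auth print-access-token`) and pass it as a bearer token.

```python
import json
import urllib.request


def build_request(project_id, location_id, event, access_token):
    """Build the POST request for processOpenLineageRunEvent (not sent yet)."""
    url = (
        "https://datalineage.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location_id}:processOpenLineageRunEvent"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={
            # Assumption: the caller supplies a valid access token.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_request("my-project", "us-central1", event, token))` returns the parsed response, so the created resources can be read from its `process`, `run`, and `lineageEvents` keys. Separating request construction from sending makes it possible to inspect the URL and body without making a network call.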
Tools for sending OpenLineage messages
To simplify sending events to the Data Lineage API, you can use various tools and libraries:
- Google Cloud Java Producer Library: Google provides an open-source Java library to help construct and send OpenLineage events to the Data Lineage API. For more information, see the blog post Producer java library for Data Lineage is now open source. The library is available on GitHub and Maven.
- OpenLineage GCP Transport: For Java-based OpenLineage producers, a dedicated GcpLineage Transport is available. It simplifies integration with the Data Lineage API by minimizing the code needed to send events to it. The GcpLineageTransport can be configured as the event sink for any existing OpenLineage producer such as Airflow, Spark, and Flink. For more information and examples, see GcpLineage.
Analyze information from OpenLineage
To analyze the imported OpenLineage events, see View lineage graphs in Knowledge Catalog UI.
Stored data
The Data Lineage API doesn't store all facet data from OpenLineage messages. It stores the following facet fields:
- spark_version
  - openlineage-spark-version
  - spark-version
- all spark.logicalPlan.*
- environment-properties (custom Google Cloud lineage facet)
  - origin.sourcetype and origin.name
  - spark.app.id
  - spark.app.name
  - spark.batch.id
  - spark.batch.uuid
  - spark.cluster.name
  - spark.cluster.region
  - spark.job.id
  - spark.job.uuid
  - spark.project.id
  - spark.query.node.name
  - spark.session.id
  - spark.session.uuid
The Data Lineage API stores the following information:
- eventTime
- run.runId
- job.namespace
- job.name
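To see which top-level values of an event fall under this list, a small helper can read the four dotted paths from an event dictionary. This is only an illustrative sketch: the name `stored_fields` is hypothetical, the helper runs entirely on the client, and it ignores the facet fields listed earlier.

```python
STORED_PATHS = ("eventTime", "run.runId", "job.namespace", "job.name")


def stored_fields(event):
    """Return {dotted_path: value} for the four event fields listed above.

    A path that is missing from the event maps to None.
    """
    result = {}
    for path in STORED_PATHS:
        value = event
        for key in path.split("."):
            # Walk one level down; stop yielding values once a level is missing.
            value = value.get(key) if isinstance(value, dict) else None
        result[path] = value
    return result
```

Applied to the request body shown in the REST example, the helper returns the event time, the run ID, and the job namespace and name, and leaves out fields such as `producer` and `schemaURL`.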
What's next
- Learn more about data lineage with Managed Service for Apache Spark and Hive data lineage integrations.
- Try it in an interactive lab: Capture and Explore Data Updates With Data Lineage and OpenLineage