A data profile scan for unstructured data in Knowledge Catalog transforms dark data or unstructured files such as PDFs in Cloud Storage into structured, queryable assets in BigQuery. While standard discovery tools are limited to file-level metadata such as size and type, a data profile scan for unstructured data powered by Vertex AI Gemini models analyzes file contents. It automatically extracts the business context required to ground AI agents and power advanced analytics.
This automation eliminates the need for manual document parsing and custom ETL code, letting you discover, classify, and use data that was previously inaccessible.
A data profile scan for unstructured data analyzes the content of unstructured files to extract information and infer schemas. This is different from the data insights for structured data feature, which generates descriptions and SQL queries based on the metadata of existing structured tables, and from standard statistical data profiling, which calculates metrics like null counts and value distributions.
Automated discovery and semantic profiling
You can perform unstructured data profiling using two different workflows, depending on your starting point:
During a Cloud Storage discovery scan: A discovery scan automatically locates your unstructured files in Cloud Storage and catalogs them into one or more object tables in BigQuery for analysis. An object table is a read-only table over unstructured data objects that reside in Cloud Storage. When you run a discovery scan with the Enable semantic inference option selected, the discovery scan serves as the automated entry point for unstructured data profiling.
As a standalone data profile scan for unstructured data: If you already have existing BigQuery object tables, you can run a data profile scan for unstructured data directly on those tables. In this standalone workflow, you can also guide the extraction by providing a customized prompt in the DataScan specification.
When unstructured data profiling is performed (either automatically during a discovery scan or as a standalone scan), the system registers the object tables as entries in Knowledge Catalog. An entry represents a data asset for which you capture metadata. When a discovery scan creates multiple tables, each table gets its own entry with its own Insights tab. You can open an entry to explore the data insights generated for it. The system performs these actions:
Identifies and groups files (Discovery scan only). Automatically identifies and organizes unstructured files in Cloud Storage into object tables. These object tables are read-only tables that provide a structured interface to your unstructured data.
Performs a data profile scan for unstructured data. Uses Vertex AI Gemini models to analyze the content within the files to understand their meaning and structure. This includes entity inference, which uses generative AI to extract specific attributes, for example, `Company`, `Product`, or `Serial Number`, from the file content. It also includes relationship extraction, which identifies how these entities connect, for example, `Component is_part_of Product`, to create a semantic graph. If you are running a standalone profile scan, you can guide this extraction by providing a customized prompt in the DataScan specification.
Generates schemas and graph profiles. Provides an AI-suggested relational schema and attaches a `Graph Profile` aspect (`dataplex-types.global.graph-profile`) to the catalog entry representing the object table. Aspects let you capture metadata within entries. This metadata aspect contains the inferred schemas for the entities (`NodeType`) and relationships (`EdgeType`).
Enriches metadata. Automatically populates the Knowledge Catalog with AI-generated metadata. This makes the data searchable and ready for extraction.
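To make the shape of a graph profile easier to picture, the following is a minimal Python sketch of entity and relationship schemas. The class and field names (`properties`, `source`, `label`, `target`, `dangling_edges`) are illustrative assumptions; only the `NodeType` and `EdgeType` concepts and the `Component is_part_of Product` example come from the description above, and the actual aspect schema may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeType:
    """An inferred entity schema (hypothetical field layout)."""
    name: str
    properties: tuple[str, ...] = ()


@dataclass(frozen=True)
class EdgeType:
    """An inferred relationship between two entity types (hypothetical layout)."""
    source: str
    label: str
    target: str


@dataclass
class GraphProfile:
    node_types: list[NodeType] = field(default_factory=list)
    edge_types: list[EdgeType] = field(default_factory=list)

    def triples(self) -> list[str]:
        """Render each relationship as 'Source label Target'."""
        return [f"{e.source} {e.label} {e.target}" for e in self.edge_types]

    def dangling_edges(self) -> list[EdgeType]:
        """Return edges whose source or target is not a declared node type."""
        known = {n.name for n in self.node_types}
        return [e for e in self.edge_types
                if e.source not in known or e.target not in known]


profile = GraphProfile(
    node_types=[
        NodeType("Component", ("serial_number",)),
        NodeType("Product", ("company",)),
    ],
    edge_types=[EdgeType("Component", "is_part_of", "Product")],
)

print(profile.triples())
print(profile.dangling_edges())
```

A check such as `dangling_edges` is one way a data steward might sanity-check an inferred profile before materializing it.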
Instead of manually designing database schemas, you can perform data extraction using one-click SQL or pipeline orchestration. This process materializes inferred entities and relationships into structured formats, such as physical BigQuery tables or views.
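As a rough illustration of what materializing an inferred entity into a physical BigQuery table could look like, the sketch below renders a `CREATE TABLE` statement from an entity name and a column-to-type mapping. The helper `create_table_ddl`, the dataset name, and the column names are hypothetical; the SQL that the one-click option actually generates is not shown in this document and is likely to differ.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def create_table_ddl(dataset: str, entity: str, columns: dict[str, str]) -> str:
    """Render a BigQuery CREATE TABLE statement for one inferred entity.

    Hypothetical helper: `columns` maps column names to BigQuery type names.
    """
    for identifier in (dataset, entity, *columns):
        if not _IDENTIFIER.fullmatch(identifier):
            raise ValueError(f"unsupported identifier: {identifier!r}")
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns.items())
    return f"CREATE TABLE IF NOT EXISTS `{dataset}.{entity}` (\n  {cols}\n);"


# Example: a Product entity with two inferred attributes (illustrative names).
print(create_table_ddl("docs", "Product", {"company": "STRING", "serial_number": "STRING"}))
```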
API methods
You can configure, run, and manage data profile scans for unstructured data and their resulting catalog entries using the following REST API methods:
| API method | Description |
|---|---|
| `projects.locations.dataScans.create` | Creates a discovery scan (using `dataDiscoverySpec`) or a standalone data profile scan for unstructured data (using `unstructuredDataProfileSpec`). |
| `projects.locations.dataScans.run` | Triggers an on-demand data profile scan or discovery scan job to analyze unstructured files and generate semantic insights. |
| `projects.locations.dataScans.get` | Retrieves the configuration details and latest job results of an existing data profile scan. |
| `projects.locations.dataScans.jobs.list` | Lists historical scan jobs for a specific data profile scan or discovery scan. |
| `projects.locations.dataScans.jobs.get` | Retrieves detailed execution results and logs for a specific data profile scan job. |
| `projects.locations.entryGroups.entries.get` | Retrieves a catalog entry representing an object table, including its attached AI-generated metadata aspects (such as GraphProfile). |
| `projects.locations.entryGroups.entries.patch` | Updates a catalog entry to attach, modify, or curate metadata aspects (such as `dataplex-types.global.graph-profile`). |
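The following Python sketch shows how the `dataScans` methods listed above might map to REST requests. It only builds the method, URL, and body, and sends nothing. The base URL `https://dataplex.googleapis.com/v1`, the `dataScanId` query parameter, and the `data.resource` format are assumptions to verify against the API reference. The `unstructuredDataProfileSpec` body is left empty because this document doesn't list its fields (for example, the name of the custom prompt field).

```python
from urllib.parse import urlencode

# Assumption: the service endpoint; confirm in the API reference.
BASE_URL = "https://dataplex.googleapis.com/v1"


def scan_name(project: str, location: str, scan_id: str) -> str:
    """Resource name of a data scan."""
    return f"projects/{project}/locations/{location}/dataScans/{scan_id}"


def create_request(project: str, location: str, scan_id: str, object_table: str):
    """Build a dataScans.create request for a standalone profile scan."""
    query = urlencode({"dataScanId": scan_id})
    url = f"{BASE_URL}/projects/{project}/locations/{location}/dataScans?{query}"
    body = {
        # Assumed resource format for a BigQuery object table.
        "data": {"resource": object_table},
        # Spec fields are intentionally omitted; they are not documented here.
        "unstructuredDataProfileSpec": {},
    }
    return "POST", url, body


def run_request(project: str, location: str, scan_id: str):
    """Build a dataScans.run request that triggers an on-demand job."""
    return "POST", f"{BASE_URL}/{scan_name(project, location, scan_id)}:run", {}


def list_jobs_request(project: str, location: str, scan_id: str):
    """Build a dataScans.jobs.list request for historical jobs."""
    return "GET", f"{BASE_URL}/{scan_name(project, location, scan_id)}/jobs", None


# Placeholder project, dataset, and table names.
table = "//bigquery.googleapis.com/projects/my-project/datasets/docs/tables/invoices"
for request in (
    create_request("my-project", "us-central1", "invoice-profile", table),
    run_request("my-project", "us-central1", "invoice-profile"),
    list_jobs_request("my-project", "us-central1", "invoice-profile"),
):
    print(request[0], request[1])
```

To send these requests, you would attach an OAuth 2.0 access token and use an HTTP client or the official client library. That part is omitted here to keep the sketch free of network calls.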
Use cases
You can use data profile scans for unstructured data for various purposes across different industry domains, including the following:
Pipeline setup and zero-ETL normalization. Ease data extraction from Cloud Storage to BigQuery by replacing custom parsers with automated schema suggestion and one-click deployment to materialize data into BigQuery tables, views, or semantic graphs.
For example, in ecommerce and retail, a marketplace can automatically normalize supplier invoices and purchase orders that arrive in hundreds of differing PDF layouts into a single, unified BigQuery schema (mapping `Unit Pr.`, `Price/Pkg`, and `Item Cost` to a single `Unit_Price` column) without writing custom parsing code. In healthcare, biostatisticians can ingest multi-center clinical trial protocols and case report forms (CRFs) into structured tables for rapid cohort analysis.
Content classification and validation. Automatically group dark data into searchable assets enriched with AI-generated metadata, which lets data stewards perform human-in-the-loop validation and monitoring of extracted entities at scale.
For example, in financial services, an investment bank conducting M&A due diligence can automatically classify large repositories of historical contracts and credit agreements, extracting complex legal entities (`Contracting_Parties`, `Indemnity_Cap`, `Governing_Law`). Data stewards can explore the visual knowledge graph on the Insights tab to identify high-risk liabilities before exporting data to executive reports.
AI agent grounding. Ground retrieval-augmented generation (RAG) agents with verified graphs. This provides a clear "chain of traceability" connecting raw files to structured business logic, which reduces hallucination and helps AI agents navigate multi-table joins unambiguously.
For example, in manufacturing and industrial operations, a heavy machinery company can extract equipment relationships from decades of unstructured field maintenance logs and incident reports. When an on-site technician asks a conversational AI agent how to resolve an uncharacteristic hydraulic pressure drop, the agent uses the verified relationship graph (`Error_Code indicates_failure Hydraulic_Valve`) to deliver an accurate, step-by-step repair plan citing the exact historical incident report.
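To make the normalization in the retail example concrete, the sketch below shows the end result of mapping differing header spellings to one `Unit_Price` column. The scan infers this mapping from file contents. The hand-written alias table and the helper names here are hypothetical and only illustrate the outcome, not how the service works internally.

```python
import re

# Hypothetical alias table; a scan would infer the equivalent mapping.
ALIASES = {
    "unit pr": "Unit_Price",
    "price pkg": "Unit_Price",
    "item cost": "Unit_Price",
}


def canonical_key(header: str) -> str:
    """Lowercase a header and collapse punctuation and spacing to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", header.lower()).strip()


def normalize_header(header: str) -> str:
    """Map a known header variant to its unified column name, else keep it."""
    return ALIASES.get(canonical_key(header), header)


def normalize_row(row: dict) -> dict:
    """Apply header normalization to every key of an extracted row."""
    return {normalize_header(key): value for key, value in row.items()}


invoices = [
    {"SKU": "A-100", "Unit Pr.": 12.5},
    {"SKU": "B-200", "Price/Pkg": 30.0},
    {"SKU": "C-300", "Item Cost": 7.25},
]
for invoice in invoices:
    print(normalize_row(invoice))
```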
Limitations
Review the following limitations before using data profile scans for unstructured data:
Supported formats. While discovery scans automatically identify and group various unstructured file types into BigQuery object tables, the semantic inference engine for data profile scans for unstructured data is optimized primarily for PDF documents.
Locations. Data profile scans for unstructured data are only available in locations that support Vertex AI Gemini 2.5 Pro models (for example, `us-central1`, `europe-west1`, `asia-southeast1`). For a list of supported regions, see the Supported regions section in Gemini 2.5 Pro. Scans created in unsupported regions return validation or execution errors.
Resource scope. Data profile scans for unstructured data operate exclusively on BigQuery object tables. They don't support standard BigQuery structured tables, external tables over structured data, or BigQuery views.
Pricing
During the Public Preview phase, data profile scans for unstructured data are available for experimentation and testing under specialized promotional terms:
Semantic inference. There is no charge for using Vertex AI Gemini models to extract semantic information and infer graph profiles during discovery scans throughout the preview period.
Underlying resource costs. Standard charges apply for the resources required to store and process your data:
Knowledge Catalog
Discovery scans are billed based on Knowledge Catalog Premium processing SKUs (DCU hours) for the baseline scanning and grouping of unstructured files. For more information, see Knowledge Catalog pricing.
AI-generated metadata aspects, including graph profiles, incur standard Knowledge Catalog catalog storage charges.
BigQuery and Dataform
If using the pipeline extraction method, standard charges for Dataform execution and BigQuery jobs apply.
If using the SQL method, standard BigQuery ML charges (`ML.PROCESS_DOCUMENT`) and BigQuery query processing fees apply.
Any data materialized into BigQuery, including object tables, inferred metadata, and extracted entities, incurs standard BigQuery storage and query charges. For more information, see BigQuery pricing.
Official dedicated billing structures for data profile scans for unstructured data and semantic inference start upon General Availability (GA).
Quotas
Standard DataScan resource and API quotas apply to each individual discovery scan or data profile scan job. A specific quota governs semantic inference volume: data profile scans for unstructured data on BigQuery object tables are limited to 140 executions per project per day.
When unstructured data profiling is performed during a discovery scan, the limits for how many tables a discovery scan supports also apply. For more information, see BigQuery quotas and limits.
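When planning scan schedules, a quick calculation can show whether a schedule fits within the daily limit. The sketch below assumes that each run of a scan against one object table counts as one execution. This document doesn't state the counting rule, so treat that as an assumption. The function names are illustrative.

```python
# Daily limit stated in the Quotas section above.
DAILY_EXECUTION_LIMIT = 140


def daily_executions(object_tables: int, runs_per_table_per_day: int) -> int:
    """Executions per day, assuming one execution per table per run."""
    return object_tables * runs_per_table_per_day


def remaining_budget(object_tables: int, runs_per_table_per_day: int,
                     limit: int = DAILY_EXECUTION_LIMIT) -> int:
    """Executions left under the limit; a negative value means over quota."""
    return limit - daily_executions(object_tables, runs_per_table_per_day)


print(remaining_budget(35, 4))  # 35 * 4 = 140, so 0 executions remain
print(remaining_budget(50, 3))  # 50 * 3 = 150, so -10 (over the limit)
```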
What's next
- Learn how to use discovery scan for unstructured data.
- Learn how to use data profile for unstructured data.
- Learn more about Discovering data.
- Read About data profiling.