To read your data, Tecton needs the correct permissions and configuration. Both can vary by data source.
Supported Data Sources
Tecton on GCP supports the following data sources:
- Files stored in Google Cloud Storage. The supported file formats are CSV, Parquet, and JSON.
- BigQuery tables
- Other GCP data sources that have a Spark Connector
- Kafka
- Pub/Sub
Connecting to a file in GCP
To connect to a file in GCP, follow these steps.
1. Add the bucket and give the service account permission to access the bucket
Give the service account you created for the data plane access to the GCP bucket for your data source.
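One way to script this step is with the google-cloud-storage client library. The following is a minimal sketch, not Tecton's required procedure. It assumes that read-only object access (roles/storage.objectViewer) is sufficient for your data source, which this guide does not specify. The helper names reader_binding and grant_bucket_read are hypothetical.

```python
from typing import Dict

# Assumption: read-only object access is enough for Tecton to read source files.
READ_ROLE = "roles/storage.objectViewer"


def reader_binding(service_account_email: str, role: str = READ_ROLE) -> Dict:
    """Build the IAM binding that grants `role` to the data plane service account."""
    return {"role": role, "members": {f"serviceAccount:{service_account_email}"}}


def grant_bucket_read(bucket_name: str, service_account_email: str) -> None:
    """Append the binding to the bucket's IAM policy (hypothetical helper)."""
    # Imported lazily so the module loads without google-cloud-storage installed.
    from google.cloud import storage

    bucket = storage.Client().bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(reader_binding(service_account_email))
    bucket.set_iam_policy(policy)
```

Calling grant_bucket_read with your bucket name and the service account's email address applies the binding. The same grant can also be made in the Google Cloud console.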
2. Register the GCP data source
Once Tecton has access, register the data source with Tecton by defining it in your feature repository, in the file(s) that contain your data source objects, as shown below.
Create a config object using FileConfig and place it in a BatchSource object. For example:
from tecton import BatchSource, FileConfig

sample_data_config = FileConfig(uri="gs://{YOUR-BUCKET-NAME-HERE}/{YOUR-FILENAME}.pq", file_format="parquet")
sample_data_vds = BatchSource(name="sample_data", batch_config=sample_data_config, ...)
After you have created these objects in your local feature repository, run tecton apply to submit them to the production Feature Store.
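Before applying, it can help to sanity-check the values you pass to FileConfig. The sketch below is illustrative only: gcs_uri and check_file_format are hypothetical helpers, not part of the Tecton SDK, and the set of formats is taken from the supported list at the top of this page.

```python
SUPPORTED_FORMATS = {"csv", "parquet", "json"}  # formats listed in this guide


def gcs_uri(bucket: str, object_path: str) -> str:
    """Join a bucket name and an object path into a gs:// URI."""
    bucket = bucket.strip("/")
    if not bucket or bucket.startswith("gs:"):
        raise ValueError("pass the bare bucket name, without the gs:// prefix")
    return f"gs://{bucket}/{object_path.lstrip('/')}"


def check_file_format(file_format: str) -> str:
    """Return the lowercase format name, or raise if it is not supported."""
    normalized = file_format.lower()
    if normalized not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported file format: {file_format!r}")
    return normalized
```

The results can then be passed as the uri and file_format arguments of FileConfig.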
3. Test the GCP data source
To test that the connection to the GCP data source has been made correctly, open the interactive notebook that you use for Tecton development and preview the data:
import tecton

ds = tecton.get_data_source("sample_data")
ds.get_dataframe().to_pandas().head(10)
If you get a 403 ERROR when calling the get_dataframe command, Tecton does not have permission to access the data. Check the bucket permissions. If you continue to get errors, contact Tecton support.
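If you run this check often, you can wrap the preview so that a permission failure produces a clearer message. This is a rough sketch: preview is a hypothetical helper, and it assumes the failure surfaces as an exception whose message contains "403", which may not match how your environment reports the error.

```python
from typing import Any, Callable


def preview(load: Callable[[], Any], rows: int = 10) -> Any:
    """Call `load` and return the first `rows` rows of the resulting frame."""
    try:
        frame = load()
    except Exception as exc:
        # Assumption: the permission failure mentions the HTTP status code.
        if "403" in str(exc):
            raise PermissionError(
                "Tecton could not read the data; check the bucket permissions "
                "for the data plane service account"
            ) from exc
        raise
    return frame.head(rows)
```

In a notebook, this could be called as preview(lambda: ds.get_dataframe().to_pandas()).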
Connecting to Kafka
Follow these instructions.
Connecting to BigQuery
1. Create your data source
Grant access to the table to the service account configured for your Spark jobs. Then, create a batch config function that reads from your BigQuery table.
from tecton import BatchSource, spark_batch_config


@spark_batch_config()
def bigquery_config(spark):
    # Read the table through the Spark BigQuery connector.
    df = (
        spark.read.format("com.google.cloud.spark.bigquery")
        .option("table", "bigquery-public-data.google_trends.international_top_terms")
        .load()
    )
    return df


data_source = BatchSource(name="bigquery_source", batch_config=bigquery_config)
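The read logic inside the batch config function can be factored out so the table reference is built in one place. The sketch below is a possible variation, not the form Tecton prescribes. bigquery_table_id and read_bigquery are hypothetical helpers, and the optional row filter relies on the Spark BigQuery connector's "filter" option, so confirm that your connector version supports it.

```python
from typing import Any, Optional


def bigquery_table_id(project: str, dataset: str, table: str) -> str:
    """Build a fully qualified table reference of the form project.dataset.table."""
    if not project or not dataset or not table:
        raise ValueError("project, dataset, and table must all be non-empty")
    if "." in dataset or "." in table:
        raise ValueError("dataset and table names must not contain dots")
    return f"{project}.{dataset}.{table}"


def read_bigquery(spark: Any, table_id: str, row_filter: Optional[str] = None) -> Any:
    """Load a BigQuery table with the given Spark session (hypothetical helper)."""
    reader = spark.read.format("com.google.cloud.spark.bigquery").option("table", table_id)
    if row_filter:
        # Assumption: the connector pushes this predicate down to BigQuery.
        reader = reader.option("filter", row_filter)
    return reader.load()
```

Inside bigquery_config, the body would then reduce to a single read_bigquery(spark, ...) call that returns the DataFrame.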
2. Test your data source
To test that the connection to the BigQuery data source has been made correctly, open the interactive notebook that you use for Tecton development and preview the data:
import tecton

ds = tecton.get_data_source("bigquery_source")
ds.get_dataframe().to_pandas().head(10)