8 minute read · August 6, 2024

Getting Hands-on with Snowflake Managed Polaris

Alex Merced · Senior Tech Evangelist, Dremio

In previous blogs, we've covered Polaris Catalog's architecture and getting hands-on with the self-managed open source version of Polaris Catalog; in this article, I'll show you how to get hands-on with the Snowflake-managed version of Polaris Catalog, which is currently in public preview.

Getting Started

To get started, you'll need a Snowflake account; if you don't already have one, you can create a trial account for free at snowflake.com.

Once you have an account, head over to the "Admin" section; when you go to add another account, you'll see the option "Create Polaris Account".

The account will be added to the list of accounts, and you'll want to copy the locator URL associated with it somewhere you can access later; you'll find it under the "Locator" column in the list of accounts.

Take that URL and open it in another browser tab to access the Polaris management console, logging in with the credentials you created when you set up the Polaris account. Then click on "Catalogs" and create a new catalog, which will open a dialog box for configuring it.

Then, fill out the fields: under "Default base location," enter the S3 path where you want data stored, and provide the ARN of an IAM role with read/write access to that bucket.
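If you haven't already set up that role, here is a minimal sketch, assuming a boto3-based setup, of an S3 read/write policy you could attach to it. The bucket name, policy name, and exact permission set below are illustrative placeholders rather than Polaris's official requirements, so check the Snowflake documentation for the authoritative list.

import json
import boto3

BUCKET = "my-polaris-bucket"  # placeholder: your S3 bucket name

## Read/write permissions on the bucket where Polaris will store table data
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="polaris-s3-access",  # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)

You would then attach this policy to the IAM role whose ARN you enter in the catalog dialog.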

Then, you'll want to click on the catalog in the list of catalogs, create a catalog role, and assign that role the CATALOG_MANAGE_CONTENT privilege. Next, head over to the "Roles" section under "Connections" and create a principal role, then head back to the catalog and grant the principal role the catalog role you created earlier.

Once this is done, create a new connection/principal and assign the principal role you created to the new connection (fine for this demo, but not a production-grade setup). Afterwards, you'll be shown the credentials for this principal; make sure to copy them somewhere you can access later.
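If you want to sanity-check those credentials before wiring up an engine, here is a minimal sketch that requests a bearer token the way an Iceberg REST catalog client would, assuming the standard v1/oauth/tokens endpoint with a client_credentials grant (this is what Spark does under the hood with the credential and scope settings shown later). The URI and credential values are placeholders.

import requests

POLARIS_URI = "https://xxxxxxxxxx.snowflakecomputing.com/polaris/api/catalog"  # locator URL + /polaris/api/catalog
CLIENT_ID, CLIENT_SECRET = "xxxxxxxxxxxx", "yyyyyy"  # principal credentials

## Exchange the principal credentials for a short-lived access token
resp = requests.post(
    f"{POLARIS_URI}/v1/oauth/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)
resp.raise_for_status()
print("Got token:", resp.json()["access_token"][:20], "...")

A 200 response with an access token tells you the principal, its role, and the catalog endpoint are all wired up correctly.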

Now, the catalog is set up and can be used from Snowflake as an external catalog, as documented here. It is also possible to use the Polaris catalog with other engines that support the Apache Iceberg REST catalog API, like Spark.

Trying Out Our Catalog in Spark

You can spin up Spark on your laptop with the following command.

docker run -d \
  --platform linux/x86_64 \
  --name spark \
  -p 8080:8080 \
  -p 7077:7077 \
  -p 8888:8888 \
  -e AWS_REGION=us-east-1 \
  -e AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  -e AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  alexmerced/spark35notebook:latest

Keep an eye out for the URL to access the Python notebook in the container output (you can view it with docker logs spark); it should look like this:

http://127.0.0.1:8888/lab?token=5ca0733a9874062c80bc0227c66011f121bd287279d5093a

Then, in PySpark, you should be able to run code like the following:

import pyspark
from pyspark.sql import SparkSession
import os

## DEFINE SENSITIVE VARIABLES
POLARIS_URI = 'https://xxxxxxxxxx.snowflakecomputing.com/polaris/api/catalog' # Locator URL From Snowflake
POLARIS_CATALOG_NAME = 'polaris'
POLARIS_CREDENTIALS = 'xxxxxxxxxxxx:yyyyyy' #Principal Credentials accesskey:secret
POLARIS_SCOPE = 'PRINCIPAL_ROLE:ALL'

conf = (
    pyspark.SparkConf()
        .setAppName('app_name')
        # packages
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.hadoop:hadoop-aws:3.4.0')
        # SQL Extensions
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
        # Configuring Catalog
        .set('spark.sql.catalog.polaris', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.polaris.warehouse', POLARIS_CATALOG_NAME)
        .set('spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation', 'true')
        .set('spark.sql.catalog.polaris.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog')
        .set('spark.sql.catalog.polaris.uri', POLARIS_URI)
        .set('spark.sql.catalog.polaris.credential', POLARIS_CREDENTIALS)
        .set('spark.sql.catalog.polaris.scope', POLARIS_SCOPE)
        .set('spark.sql.catalog.polaris.token-refresh-enabled', 'true')
)

## Start Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")

## Run a Query
spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.db").show()
spark.sql("CREATE TABLE polaris.db.names (name STRING) USING iceberg").show()
spark.sql("INSERT INTO polaris.db.names VALUES ('Alex Merced'), ('Andrew Madson')").show()
spark.sql("SELECT * FROM polaris.db.names").show()

Conclusion

Keep in mind that Polaris is in the early stages of public preview, so there may be rough edges and some troubleshooting ahead as it gets refined. But hopefully, this will help you on your journey to getting started with Polaris at these early stages.

As mentioned in this Datanami article, some of the open-source Nessie catalog code may find its way into Polaris. Below are some exercises to get hands-on with Nessie and learn about what may be in store for Polaris's future.

Here are Some Exercises for you to See Nessie’s Features at Work on Your Laptop
