Prithvi S
Why Polaris Never Touches Your Cloud Credentials: Storage Config Internals

Every data engineer has a nightmare: discovering that a credential spreadsheet with AWS keys got committed to git. Or worse, finding that production credentials are sitting in a YAML file on 50 developer laptops.

Most data platforms solve this by asking you to trust them with your cloud credentials. Snowflake stores them. Hive stores them. Glue stores them. Then they promise really hard not to leak them.

Apache Polaris takes a different approach entirely. It never asks for your cloud credentials at all.

Instead, it does something cleverly different: it establishes trust relationships with your cloud provider, then mints temporary, scoped credentials on-the-fly whenever an engine needs to read or write data. You set it up once. Then Polaris handles the rest.

This is the foundation of Polaris's entire security model, and it's worth understanding deeply. Not just because it's clever, but because it fundamentally changes what's possible in multi-tenant, regulated, or security-conscious environments.

Let's dig into how it works.

The Traditional Problem: Credential Storage

When you set up Snowflake to read from S3 using a credential-based stage, you hand over your AWS keys. Snowflake stores them (encrypted, they promise). When a query runs, Snowflake uses those stored credentials to access S3.

This creates several problems:

1. Long-lived credentials in the system. If Snowflake's database gets compromised, those credentials are exposed for months or years until someone notices and rotates them.

2. One set of credentials for many operations. The same credential can be used to read, write, delete, or modify anything its IAM policies allow across your AWS account. There's no per-table granularity.

3. Difficult audit trails. When suspicious S3 access happens, you can't pinpoint which Snowflake query or which user triggered it. The logs just show "snowflake_service_account accessed this bucket."

4. Compliance friction. Regulated organizations (healthcare, finance) have strict rules about where credentials can live. Storing them in Snowflake often violates those policies.

5. Credential rotation is manual and risky. You have to update credentials in Snowflake, hope nothing breaks mid-rotation, and coordinate with other systems.

Polaris was designed to solve all of these at once.

How Polaris Does It: The Trust Model

Instead of storing credentials, Polaris stores a configuration that establishes trust with your cloud provider. Let's walk through S3 as the example.

Step 1: Register Your Cloud Storage

When you create a catalog in Polaris, you provide a storage configuration. For S3, that looks like this:

{
  "storageType": "S3",
  "config": {
    "externalId": "polaris-prod-7f92ac",
    "roleArn": "arn:aws:iam::123456789012:role/polaris-catalog-role",
    "bucket": "my-company-data-lake"
  }
}

Notice what's not here: no AWS access key. No secret key. No credentials of any kind.

What is here is a reference to an IAM role that you've already created in AWS, plus an external ID that makes the trust relationship unique to this Polaris instance.

Step 2: Set Up the Trust Relationship in AWS

Before Polaris can mint credentials, you need to create that IAM role and configure it to trust Polaris. Here's the trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AWS_ACCOUNT_ID:user/polaris-service"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "polaris-prod-7f92ac"
        }
      }
    }
  ]
}

This says: "The Polaris service can assume this role, but only if it provides the external ID polaris-prod-7f92ac."

The external ID is crucial. It prevents confused deputy attacks. Even if an attacker compromises Polaris, they can't assume random IAM roles in other AWS accounts without the correct external ID.

Step 3: Attach Policies to the IAM Role

You then attach an S3 policy to that IAM role that limits what it can do:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-company-data-lake",
        "arn:aws:s3:::my-company-data-lake/*"
      ]
    }
  ]
}

This role can only read from your data lake bucket. It can't write, delete, or access anything else.

Now Polaris is set up. It has a configuration (not credentials) that points to this IAM role. It has an external ID. And the trust relationship is wired up in AWS.

The Credential Vending Flow

Here's where the magic happens. When a Spark engine wants to read data from a Polaris-managed table:

Request Phase

The Spark engine calls the Polaris REST API:

GET /v1/my_catalog/namespaces/raw/tables/customers
X-Iceberg-Access-Delegation: vended-credentials

Polaris receives this request and extracts the context: who is asking, what are they trying to do, and what table do they want to access.

Authorization Phase

Polaris checks its RBAC model. Does this principal have TABLE_READ_DATA permission on the customers table? It consults its role hierarchy:

  • The user's identity is bound to a principal role (e.g., "analytics_engineers")
  • That principal role is granted a catalog role (e.g., "read_raw_data")
  • That catalog role has TABLE_READ_DATA on the customers table

If the authorization check passes, Polaris moves to the next phase.
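The two-hop chain above can be sketched in a few lines. This is an illustrative model, not Polaris's actual internals; the role and privilege names mirror the example in the text:

```python
# Hypothetical sketch of the Polaris-style RBAC walk:
# user -> principal role -> catalog role -> privilege on a table.
# Grant tables mirror the example above (names are illustrative).
PRINCIPAL_ROLES = {"alice": {"analytics_engineers"}}
CATALOG_ROLE_GRANTS = {"analytics_engineers": {"read_raw_data"}}
CATALOG_ROLE_PRIVILEGES = {"read_raw_data": {("raw.customers", "TABLE_READ_DATA")}}

def is_authorized(user: str, table: str, privilege: str) -> bool:
    """Access is granted if any catalog role reachable from the user's
    principal roles carries the requested privilege on this table."""
    for principal_role in PRINCIPAL_ROLES.get(user, set()):
        for catalog_role in CATALOG_ROLE_GRANTS.get(principal_role, set()):
            if (table, privilege) in CATALOG_ROLE_PRIVILEGES.get(catalog_role, set()):
                return True
    return False

print(is_authorized("alice", "raw.customers", "TABLE_READ_DATA"))   # True
print(is_authorized("alice", "raw.customers", "TABLE_WRITE_DATA"))  # False
```

The key property is that privileges attach to catalog roles, not users, so revoking one grant in the chain immediately cuts off every downstream credential mint.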

Credential Minting Phase

Polaris looks up the storage configuration for this table. It sees:

roleArn: arn:aws:iam::123456789012:role/polaris-catalog-role
externalId: polaris-prod-7f92ac

It then calls AWS STS AssumeRole:

aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/polaris-catalog-role \
  --role-session-name polaris-vend-customers \
  --external-id polaris-prod-7f92ac \
  --duration-seconds 900  # 15 minutes

AWS validates the external ID, checks the trust policy, and returns:

{
  "Credentials": {
    "AccessKeyId": "ASIA...",
    "SecretAccessKey": "...",
    "SessionToken": "...",
    "Expiration": "2026-04-16T09:28:00Z"
  }
}

These are temporary credentials. They expire in 15 minutes. They can only do what the IAM role allows (in this case, read S3).

Scope Restriction Phase

Polaris could stop here, but it doesn't. It further restricts these credentials to just the table being accessed. It uses S3 path restrictions or additional policy layers to ensure the credential can only touch s3://my-company-data-lake/raw/customers/, not s3://my-company-data-lake/sensitive/.
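One way to implement this narrowing (a sketch under the assumption that inline STS session policies are used; Polaris's exact mechanism may differ) is to pass a `Policy` document to `AssumeRole`. A session policy can only further restrict the role's permissions, never expand them:

```python
import json

def scoped_session_policy(bucket: str, table_prefix: str) -> str:
    """Build an inline session policy narrowing access to one table's prefix.
    Passed as the `Policy` parameter to STS AssumeRole, the effective
    permissions become the *intersection* of this document and the role's
    attached policies, so it can only subtract, never add."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{table_prefix}/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                # Listing is confined to the table's prefix as well.
                "Condition": {"StringLike": {"s3:prefix": [f"{table_prefix}/*"]}},
            },
        ],
    }
    return json.dumps(policy)

print(scoped_session_policy("my-company-data-lake", "raw/customers"))
```

With this in place, even a leaked session token is useless outside `raw/customers/` for its 15-minute lifetime.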

Response Phase

Polaris returns the temporary credentials to Spark:

{
  "credentials": {
    "aws_access_key_id": "ASIA...",
    "aws_secret_access_key": "...",
    "aws_session_token": "...",
    "expires_at": "2026-04-16T09:28:00Z"
  },
  "path": "s3://my-company-data-lake/raw/customers/"
}

Spark now has everything it needs. It can read data for 15 minutes. After that, the credential is useless.
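On the engine side, this flow is typically triggered through Iceberg's REST catalog support. A sketch of the Spark configuration (property names come from Apache Iceberg's REST catalog integration; the Polaris URI here is a placeholder):

```properties
# Register Polaris as an Iceberg REST catalog. The access-delegation header
# asks the server to vend temporary storage credentials with each table load.
spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
spark.sql.catalog.polaris.uri=https://polaris.example.com/api/catalog
spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials
```

Notice that the Spark job itself is configured with zero S3 credentials; everything it needs arrives with the table metadata.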

Why This Matters: The Security Benefits

This design has profound implications:

1. No long-lived credentials stored anywhere. Polaris doesn't store AWS keys. Your laptop doesn't have them. They're generated on-demand and expire quickly.

2. Instant revocation. If you need to immediately revoke a user's access, you update their Polaris role. The next credential mint fails. There's no delay.

3. Audit trails. AWS logs show exactly which Polaris instance, with which external ID, assumed the role. You can trace every data access back to a specific user and query.

4. Fine-grained access control. Different tables can have different IAM roles with different permissions. Read-only tables get read-only roles. Write-enabled tables get write roles. A user's access to each table is independently controlled.

5. Multi-cloud compatibility. Polaris supports the same pattern for GCS (using service account tokens) and Azure (using managed identities). The mechanism changes, but the principle is the same: temporary, scoped credentials.

6. Compliance-friendly. Regulated organizations can enforce policies like "credentials must expire in under 30 minutes" or "all access must be auditable." Polaris handles both automatically.

The GCS and Azure Equivalents

The S3 pattern generalizes to other clouds.

Google Cloud Storage

With GCS, you don't provide a credential. Instead, you provide a service account:

{
  "storageType": "GCS",
  "config": {
    "projectId": "my-gcp-project",
    "serviceAccount": "polaris@my-gcp-project.iam.gserviceaccount.com"
  }
}

You configure GCS IAM so that Polaris's service account can impersonate a restricted role. When a credential is needed, Polaris calls GCS APIs to get a short-lived access token. Same pattern, different mechanism.

Azure

With Azure, you use managed identities:

{
  "storageType": "AZURE",
  "config": {
    "tenantId": "12345678-...",
    "storageAccount": "mycompanydatalake",
    "containerId": "raw"
  }
}

Polaris (running in a managed identity or service principal) gets a short-lived token from Azure AD. Again, the principle is identical: temporary, scoped, revocable credentials.

Credential Caching and Performance

One question you might have: doesn't minting a new credential for every request add latency?

Yes, but Polaris optimizes for this. It caches credentials locally. If the same user asks for credentials for the same table within a few minutes, Polaris returns the cached credential instead of calling the cloud provider again. This reduces latency to under 10ms in most cases.

The tradeoff is acceptable: an extra 100-200ms on the first request for a credential is well worth the security benefits of never storing cloud credentials.
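The caching idea can be sketched as a small TTL cache keyed by principal and table. This is illustrative, not Polaris's actual implementation; the safety margin ensures a caller never receives a token that is about to expire:

```python
import time

class CredentialCache:
    """Sketch of a vended-credential cache (hypothetical design). Entries are
    keyed by (principal, table) and treated as expired a safety margin before
    the real STS expiry, so callers never get a nearly-dead token."""

    def __init__(self, safety_margin_s: float = 60.0):
        self._cache = {}
        self._margin = safety_margin_s

    def get(self, principal: str, table: str):
        entry = self._cache.get((principal, table))
        if entry and entry["expires_at"] - self._margin > time.time():
            return entry
        return None  # miss, or too close to expiry: caller re-mints via STS

    def put(self, principal: str, table: str, creds: dict, ttl_s: float = 900.0):
        entry = dict(creds, expires_at=time.time() + ttl_s)
        self._cache[(principal, table)] = entry
        return entry

cache = CredentialCache()
cache.put("alice", "raw.customers", {"AccessKeyId": "ASIA..."})
assert cache.get("alice", "raw.customers") is not None   # cache hit
assert cache.get("bob", "raw.customers") is None         # different principal: miss
```

Keying on the principal (not just the table) matters: two users with different privileges must never share a vended credential.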

Deployment Implications

How do you actually deploy this? Polaris itself needs to run somewhere, and it needs to be able to call AWS STS (or GCS, or Azure AD).

Typically, you run Polaris in a Kubernetes cluster with a Kubernetes service account. You configure IRSA (IAM Roles for Service Accounts) to bind that service account to an IAM role. Polaris then inherits permissions to call STS.

The configuration looks like:

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/polaris-service-role

This means Polaris gets credentials to call AWS APIs, but those credentials are also temporary and scoped. You've just nested trust relationships.

What Gets Stored in Polaris

Since Polaris doesn't store cloud credentials, what does it store?

  1. Storage configurations (roleArn, externalId, bucket, project, etc.)
  2. Entity metadata (table names, schemas, partitions)
  3. RBAC definitions (which roles have which privileges)
  4. Audit logs (who accessed what, when)

All of this lives in Polaris's metadata store, typically a PostgreSQL database. The metadata store itself should be encrypted at rest and in transit, but it doesn't contain cloud credentials. Even if the metadata store is compromised, an attacker can't access your data lake.

Real-World Example: A Multi-Tenant SaaS

Imagine you're building a data platform SaaS. You have 100 customers, each with their own S3 bucket. You can't ask each customer for their AWS credentials (security nightmare for them). Instead:

  1. Each customer creates an IAM role in their AWS account and trusts your Polaris instance
  2. They register that role ARN in Polaris during onboarding
  3. Your single Polaris instance now manages access to 100 buckets securely
  4. Each customer's queries get credentials scoped to their bucket only
  5. You can audit which customer accessed what, when

This is impossible with traditional credential storage.
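The onboarding flow above boils down to a per-tenant trust registry. A minimal sketch (function and variable names are hypothetical, not Polaris's API):

```python
import secrets

# Illustrative per-tenant trust registry for a multi-tenant catalog service.
TENANT_CONFIGS: dict[str, dict] = {}

def onboard_tenant(tenant_id: str, role_arn: str) -> dict:
    """Record a tenant's IAM role and mint a unique external ID for it.
    The tenant embeds this external ID in their role's trust policy, so a
    request on behalf of one tenant can never satisfy another tenant's
    trust policy (confused-deputy protection)."""
    config = {
        "roleArn": role_arn,
        "externalId": f"polaris-{tenant_id}-{secrets.token_hex(4)}",
    }
    TENANT_CONFIGS[tenant_id] = config
    return config

cfg = onboard_tenant("acme", "arn:aws:iam::111122223333:role/acme-polaris-role")
assert cfg["externalId"].startswith("polaris-acme-")
```

Every subsequent credential mint for a tenant looks up this config, assumes that tenant's role with that tenant's external ID, and nothing else.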

Looking Forward: Polaris v1.3.0 Enhancements

Polaris v1.3.0 extends this pattern to federated catalogs. You can now register external catalogs (Snowflake, AWS Glue, Databricks) with Polaris, and Polaris will vend credentials for them too.

This means you could have a single Polaris instance managing access across Iceberg catalogs, Glue catalogs, and Snowflake, all without storing credentials for any of them.

Conclusion

The reason Polaris never touches your cloud credentials is because it doesn't have to. By establishing trust relationships upfront and minting temporary credentials on-demand, Polaris achieves something that traditional data platforms can't: security without storing secrets.

This is why enterprises are moving to Polaris. Not just because Iceberg is open-source, but because the entire access control model is built for environments where credentials are liabilities, not assets.

If you're building data infrastructure at scale, this pattern is worth understanding. It might change how you think about credential management in your own systems.


Want to learn more?


I'm Prithvi S, Staff Software Engineer at Cloudera and open-source enthusiast. Follow my work on GitHub: https://github.com/iprithv
