Palantir Foundry Developer Handbook
A production-oriented handbook for engineers, FDEs, analytics leads, and application builders working across Foundry's data integration, transforms, Ontology, and operational application stack.
Table of Contents
This handbook follows the same mental model Palantir uses in the documentation: data first lands as datasets, is shaped into reliable pipelines, becomes business-native through the Ontology, and is then activated in analytics and operational applications.
- Module 1: Foundry Architecture and Core Concepts. Platform philosophy, Compass, Projects, folders, RIDs, data branching, and Data Lineage.
- Module 2: Data Integration and Ingestion. How Foundry connects to JDBC systems, APIs, file stores, and operational systems, then lands raw data as governed datasets.
- Module 3: Data Transformation in Code Repositories. Foundry-native Python transforms, incremental pipelines, and when to choose Code Repositories over Pipeline Builder.
- Module 4: The Ontology. Why Foundry models the business as an operational graph instead of leaving teams with raw tables and joins.
- Module 5: Analytics and Operational Applications. How analysts, data scientists, and operational teams consume and act on data without breaking governance.
- Module 6: Security and Governance. Mandatory controls, project roles, policy enforcement, and how security propagates with data rather than relying on ad hoc dashboards.
- Module 7: Real-World Use Cases. End-to-end scenarios tying ingestion, transforms, Ontology, ML, analytics, and actions into operational systems.
Module 1: Foundry Architecture and Core Concepts
The Platform Philosophy
Foundry is not trying to be a prettier data lake. It is trying to close the gap between data engineering and operations. The core journey is: Integration -> Transformation -> Ontology -> Applications. Each step adds more structure, more accountability, and more operational usefulness.
A useful analogy is an industrial refinery. Raw crude oil is valuable, but nobody wants to run a logistics operation directly on crude oil. You refine it into diesel, jet fuel, and lubricants with known quality and governance. Foundry does the same for enterprise data: raw ERP extracts and API payloads are refined into reusable datasets, then elevated into business-native entities like Factory, Part, Shipment, or Transaction.
Compass and the Filesystem
Compass is the shared file-and-resource layer of Foundry. Public documentation describes Projects and resources as the basic building blocks of the platform. A Project is the collaboration boundary. A resource is the thing inside it: dataset, repository, analysis, application, report, or other artifact.
- Projects are the main collaboration and security boundary. They carry project-level roles, discovery settings, and inherited access controls.
- Folders are organizational structure inside a project. They keep a project navigable but do not replace the project as the primary governance boundary.
- Resources are the actual working assets. In Foundry terms, a dataset is a resource, a code repository is a resource, and a Workshop application is a resource.
- RIDs are globally unique resource identifiers. They matter because names and folder locations can change, while the RID remains the canonical machine identity.
Think of a Project like a secured building, folders like rooms, and resources like the equipment in those rooms. The label on the door may change, but the serial number engraved on the machine stays fixed. That serial number is the RID.
| Concept | What it is | Why it matters |
|---|---|---|
| Project | Primary collaboration and permission boundary | Standardizes access, ownership, and discoverability for related work |
| Folder | Organizational container inside a Project | Keeps complex delivery programs navigable |
| Resource | Dataset, repo, dashboard, app, report, or other asset | Common security, metadata, comments, sharing, and auditing model |
| RID | Stable unique identifier for a resource | Decouples references from fragile human-readable paths |
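The path-versus-RID distinction can be sketched in plain Python. This is a toy in-memory catalog, not the Foundry API; the `Catalog` class and its methods are illustrative only, and the RID format shown is only RID-like.

```python
import uuid

class Catalog:
    """Toy resource catalog: paths are mutable labels, RIDs are stable identity."""

    def __init__(self):
        self._by_rid = {}   # rid -> resource metadata
        self._by_path = {}  # current path -> rid

    def register(self, path, kind):
        rid = f"ri.compass.main.{kind}.{uuid.uuid4()}"  # illustrative RID-like format
        self._by_rid[rid] = {"path": path, "kind": kind}
        self._by_path[path] = rid
        return rid

    def move(self, rid, new_path):
        # Renames and moves change only the human-readable path.
        old_path = self._by_rid[rid]["path"]
        del self._by_path[old_path]
        self._by_rid[rid]["path"] = new_path
        self._by_path[new_path] = rid

    def resolve(self, rid):
        return self._by_rid[rid]

catalog = Catalog()
rid = catalog.register("/Acme/SupplyChain/raw/erp_shipments", "dataset")
catalog.move(rid, "/Acme/SupplyChain/archive/erp_shipments_v1")

# The RID still resolves after the move; a hard-coded path would not.
print(catalog.resolve(rid)["path"])  # /Acme/SupplyChain/archive/erp_shipments_v1
```

This is why integrations and pipelines should reference RIDs rather than paths: reorganizing a Project does not break machine references.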
Datasets themselves are wrappers around files stored in a backing filesystem, often cloud object storage. The value of the dataset abstraction is not the bytes alone. It is the managed metadata around them: schema, transactions, branches, permissions, lineage, and build semantics.
Branching and Data Lineage
This is where Foundry usually clicks for engineering teams. Most platforms version code. Foundry versions code and data together. Public documentation explicitly describes dataset transactions and dataset branches as the basis of Foundry's "Git for data" behavior.
- Dataset transactions are atomic changes to a dataset's contents. Foundry supports `SNAPSHOT`, `APPEND`, `UPDATE`, and `DELETE` transactions.
- Dataset branches are pointers to transaction histories on a dataset. They are conceptually similar to Git branches, but the docs note that dataset branches themselves are not merged like Git branches.
- Build branches tie Git-like code branching to dataset branching. When you build on a feature branch, Foundry compiles the branch's JobSpecs and writes outputs only to that branch.
- Fallback chains let a branch read branch-local logic or data where present and fall back to `master` where nothing changed.
Traditional data stacks usually force teams into one of two bad choices: either test against production-like data outside the main pipeline, or run risky changes directly in shared production tables. Foundry's branch-aware build system provides a third option: a safe rehearsal environment where both transformation logic and downstream datasets can evolve together.
```
Code branch:      feature/late-shipments
Dataset branches: raw_orders@master, curated_orders@feature, alerts@feature
Build fallback:   feature -> master
```

Result:
- unchanged upstream inputs can still be read from master
- changed transforms publish branch-specific JobSpecs
- changed outputs materialize only on the feature branch
- downstream users on master see no disruption
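The fallback behavior above can be sketched in a few lines of plain Python. This is a toy model of branch resolution, not a Foundry API; `resolve_branch` and the dict layout are illustrative.

```python
def resolve_branch(dataset_branches, dataset, branch, fallback="master"):
    """Return the branch whose data should be read: the feature branch if the
    dataset was rebuilt there, otherwise the fallback branch."""
    branches = dataset_branches.get(dataset, set())
    if branch in branches:
        return branch
    return fallback

# Which datasets have branch-specific outputs (toy state for one feature branch).
dataset_branches = {
    "raw_orders": {"master"},                                 # unchanged upstream input
    "curated_orders": {"master", "feature/late-shipments"},   # rebuilt on the branch
    "alerts": {"master", "feature/late-shipments"},
}

branch = "feature/late-shipments"
print(resolve_branch(dataset_branches, "raw_orders", branch))      # master
print(resolve_branch(dataset_branches, "curated_orders", branch))  # feature/late-shipments
```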
Data Lineage is the explainer surface for all of this. The docs describe it as an interactive tool for holistically viewing how data flows through the platform. In practice, it is the control tower for understanding:
- where a dataset came from,
- which transforms produced it,
- which downstream datasets and applications depend on it,
- what code generated it,
- whether it is stale or out of date, and
- what security requirements were inherited along the way.
The best analogy is software dependency tracing plus change impact analysis, but for data products. Instead of asking "What services call this API?", you ask "If I change this transform or this upstream source, which datasets, object types, dashboards, and Workshop modules become affected?"
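The impact question can be sketched as a breadth-first walk over a toy lineage graph. Node names and the dict layout are illustrative; this is not the Data Lineage API, only the shape of the traversal.

```python
from collections import deque

def downstream_impact(edges, changed):
    """Walk the lineage graph breadth-first and collect everything downstream
    of a changed node. `edges` maps node -> list of direct consumers."""
    affected, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for consumer in edges.get(node, []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected

# Toy lineage: raw dataset -> curated dataset -> object type -> app surfaces.
edges = {
    "raw_erp_shipments": ["curated_orders"],
    "curated_orders": ["shipment_object_type", "late_shipments_dashboard"],
    "shipment_object_type": ["workshop_planner_app"],
}

print(sorted(downstream_impact(edges, "raw_erp_shipments")))
# ['curated_orders', 'late_shipments_dashboard', 'shipment_object_type', 'workshop_planner_app']
```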
| Traditional stack problem | Foundry answer | Enterprise value |
|---|---|---|
| Opaque ETL jobs | Lineage graph across datasets, code, and builds | Faster root-cause analysis and safer change review |
| Shared production tables make experimentation risky | Code branches plus dataset branches | Parallel development without corrupting shared outputs |
| Security reviews happen after data is copied around | Lineage-aware security inheritance | Access control moves with the data automatically |
Module 2: Data Integration and Ingestion
Data Connection
Foundry's public documentation frames ingestion around the Data Connection framework. That framework is designed to manage source connections over time using dataset transactions, branching, granular security, and synchronized metadata. In field language, practitioners often refer to the underlying connector and orchestration pattern as Magritte and source agents; in the product docs, the supported surface is the Data Connection application and its source-specific sync capabilities.
At a practical level, Data Connection gives Foundry a standard contract for pulling from very different sources: JDBC systems, APIs, file stores, and operational systems.
A good analogy is a managed loading dock for the enterprise. The loading dock does not care whether the incoming goods arrived by truck, ship, or rail. It standardizes intake, manifests, timestamps, security checks, and hand-off into the warehouse. Data Connection plays that role for source systems.
Syncs and Schedules
Foundry distinguishes between getting data in once and getting data in reliably over time. That sounds obvious, but it is where many data platforms quietly fail. A one-off extract is not a data product. A repeatable sync with lineage, permissions, and clear transaction semantics is.
| Pattern | How it works | When to use it |
|---|---|---|
| Direct or manual ingestion | Upload or one-time import into a dataset resource | Bootstrapping, prototypes, ad hoc investigations |
| Scheduled sync | Recurring ingestion that lands new dataset transactions over time | Operational production pipelines |
| Virtual access | Expose source data without full replication in some cases | Latency-sensitive or governance-constrained access patterns |
| External transforms | Code-driven scheduled API interaction using Code Repositories | Custom REST ingestion and outbound integration workflows |
What lands in the Foundry filesystem is usually a raw dataset. That raw dataset is intentionally close to source reality. You do not want business logic hidden in the loading step. The source system should remain auditable, and the refinement should happen downstream in transformations.
Transaction-Aware Ingestion
The public dataset documentation is explicit that ingestion style matters because it determines downstream pipeline behavior:
- SNAPSHOT means the sync replaces the full current view. Simple, but expensive at scale.
- APPEND means only new files are added. This is the foundation for performant incremental pipelines.
- UPDATE means new files may arrive and old files may be overwritten. Useful when the source mutates records, but it breaks append-only assumptions.
- DELETE supports retention and controlled removal from the current dataset view.
If you are building a serious production pipeline, the ingestion mode is not just a connector setting. It is an architectural decision that determines cost profile, latency, and whether downstream pipelines can remain incremental.
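The four transaction types can be modeled in a few lines of plain Python to make the downstream implications concrete. This is a toy simulation of the current-view semantics, not Foundry's implementation.

```python
def apply_transaction(current_files, txn_type, files):
    """Toy model of how each transaction type changes a dataset's current view.
    `current_files` and `files` map filename -> contents."""
    if txn_type == "SNAPSHOT":
        return dict(files)                 # replace the full current view
    if txn_type == "APPEND":
        assert not set(files) & set(current_files), "APPEND may not overwrite"
        return {**current_files, **files}  # add new files only
    if txn_type == "UPDATE":
        return {**current_files, **files}  # add new files and overwrite existing
    if txn_type == "DELETE":
        return {k: v for k, v in current_files.items() if k not in files}
    raise ValueError(txn_type)

view = apply_transaction({}, "SNAPSHOT", {"day1.csv": "a"})
view = apply_transaction(view, "APPEND", {"day2.csv": "b"})
view = apply_transaction(view, "UPDATE", {"day1.csv": "a-corrected"})
view = apply_transaction(view, "DELETE", {"day1.csv": None})
print(sorted(view))  # ['day2.csv']
```

Notice that only APPEND preserves the invariant incremental pipelines depend on: existing files are never rewritten, so downstream logic can safely process just the new arrivals.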
Module 3: Data Transformation in Code Repositories
The Python Transforms API
Foundry's transforms.api is the contract between your Python code and Foundry's build system. This is what turns a PySpark function into a governed pipeline step with declared inputs, outputs, lineage, checks, preview support, branching, and scheduling.
That declaration step is the important difference from generic PySpark. In open Spark, you can read anything and write anywhere as long as the cluster permits it. In Foundry, you declare the data contract up front so the platform can reason about lineage, impact, permissions, and builds.
Copy-Pasteable PySpark Transform
```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Acme/SupplyChain/curated/factory_part_demand"),
    shipments=Input("/Acme/SupplyChain/raw/erp_shipments"),
    part_master=Input("/Acme/SupplyChain/master/parts"),
)
def compute(shipments, part_master):
    return (
        shipments
        .join(part_master, on="part_id", how="left")
        .withColumn("required_date", F.to_date("required_timestamp"))
        .withColumn("late_flag", F.col("promised_timestamp") > F.col("required_timestamp"))
        .withColumn("open_value_usd", F.round(F.col("quantity") * F.col("unit_cost_usd"), 2))
        .select(
            "shipment_id",
            "factory_id",
            "part_id",
            "part_description",
            "required_date",
            "quantity",
            "open_value_usd",
            "late_flag",
        )
    )
```
This snippet uses the standard Foundry wrapper imports: `from transforms.api import transform_df, Input, Output`. From the declaration alone, the platform knows:
- which datasets are read,
- which dataset is produced,
- how to render the node in lineage,
- what changes a pull request may impact, and
- which branch-specific output should be built during development.
Why Incremental Processing Matters
Official Foundry documentation is direct here: incremental pipelines avoid recomputing unchanged data and are often necessary when input scale is high. If your transaction logs are growing by millions of rows per day, full snapshots are a tax you will keep paying forever.
The analogy is simple. A nightly batch rebuild is like recalculating every bank account in the country because one customer made a deposit. Incremental processing instead says: process the new deposit, update the affected state, and move on.
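The deposit analogy in miniature, in plain Python with no Foundry APIs:

```python
def apply_deposits(balances, new_deposits):
    """Incremental update: touch only the accounts that appear in the new batch,
    instead of recomputing every balance from full history."""
    for account, amount in new_deposits:
        balances[account] = balances.get(account, 0) + amount
    return balances

balances = {"acct-1": 100, "acct-2": 250}
apply_deposits(balances, [("acct-1", 40)])  # one new deposit arrives
print(balances)  # {'acct-1': 140, 'acct-2': 250}
```

The work done is proportional to the size of the new batch, not to the total number of accounts. Foundry's incremental transforms apply the same principle at dataset scale.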
Incremental PySpark Transform
```python
from pyspark.sql import functions as F
from transforms.api import (
    incremental,
    transform,
    Input,
    Output,
    IncrementalTransformInput,
)


@incremental()
@transform(
    risk_scores=Output("/Acme/AML/curated/transaction_risk_scores"),
    transactions=Input("/Acme/AML/raw/daily_transactions"),
    customers=Input("/Acme/AML/master/customers"),
)
def compute(ctx, risk_scores, transactions: IncrementalTransformInput, customers):
    new_transactions = transactions.dataframe("added")
    if ctx.is_incremental and new_transactions.rdd.isEmpty():
        return

    customer_df = customers.dataframe()
    scored = (
        new_transactions
        .join(customer_df, on="customer_id", how="left")
        .withColumn(
            "risk_score",
            F.when(F.col("amount_usd") >= 10000, F.lit(0.60)).otherwise(F.lit(0.10))
            + F.when(F.col("high_risk_country") == F.lit(True), F.lit(0.25)).otherwise(F.lit(0.00))
            + F.when(F.col("pep_flag") == F.lit(True), F.lit(0.15)).otherwise(F.lit(0.00)),
        )
        .withColumn(
            "risk_bucket",
            F.when(F.col("risk_score") >= 0.75, F.lit("HIGH")).otherwise(F.lit("STANDARD")),
        )
        .select(
            "transaction_id",
            "customer_id",
            "booking_date",
            "amount_usd",
            "risk_score",
            "risk_bucket",
        )
    )
    risk_scores.write_dataframe(scored)
```
This example uses IncrementalTransformInput directly and reads only the added window from the transactions input, which is the exact capability documented in the API reference. That is what keeps the transform proportional to new data instead of proportional to total historical data.
Incremental transforms generally assume append-style inputs; a SNAPSHOT or UPDATE transaction on an input can force the build to fall back to full recomputation. Use incremental logic when the volume justifies the added complexity.
Code Repositories vs Pipeline Builder
Both are first-class. The wrong move is treating one as "for engineers" and the other as "for non-engineers." The real decision is about control, complexity, and maintainability.
| Choose this | Best for | Trade-off |
|---|---|---|
| Code Repositories | Complex business logic, PySpark, custom libraries, tests, code review, reusable engineering standards | Higher engineering overhead, slower for simple mappings |
| Pipeline Builder | Fast delivery, visual composition, common joins/filters, streaming and batch patterns, lower-code delivery | Less expressive for specialized logic or heavy software-engineering workflows |
A useful heuristic:
- Use Pipeline Builder when the data flow is legible as a pipeline diagram and your transformations are primarily declarative.
- Use Code Repositories when you need software-engineering discipline: abstractions, libraries, advanced logic, unit tests, branch review, or specialized Spark behavior.
Module 4: The Ontology
Objects, Links, and Properties
The Ontology is the heart of Foundry because it changes the question from "What tables do we have?" to "What parts of the business are we representing, and how do they relate?" Public documentation describes the Ontology as an operational layer sitting on top of datasets, models, and other digital assets, connecting them to their real-world counterparts.
The conceptual shift is from row-oriented thinking to domain-oriented thinking. In a warehouse, a user may need to remember that fact_shipments.factory_id = dim_factory.id. In the Ontology, a user works with a Factory object that already knows its related Shipment, Supplier, or Part objects. The join logic becomes part of the platform's semantic contract rather than tribal knowledge in SQL.
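A toy illustration of that shift, using plain dataclasses rather than the Ontology SDK (the class and property names are hypothetical):

```python
from dataclasses import dataclass, field

# In the object model, a Factory "knows" its shipments, so consumers traverse a
# link instead of remembering fact_shipments.factory_id = dim_factory.id.
@dataclass
class Shipment:
    shipment_id: str
    late: bool

@dataclass
class Factory:
    factory_id: str
    shipments: list = field(default_factory=list)

    def late_shipments(self):
        return [s for s in self.shipments if s.late]

factory = Factory("F-01", [Shipment("S-1", late=False), Shipment("S-2", late=True)])
print([s.shipment_id for s in factory.late_shipments()])  # ['S-2']
```

The join condition lives once, in the link definition, instead of being re-typed in every downstream query.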
Why Foundry Pushes the Ontology So Hard
- Reuse: multiple applications can rely on the same semantic layer rather than each dashboard re-implementing business joins and definitions.
- Operational consistency: the same object and action definitions can be exposed in Workshop, Quiver, SDK-driven apps, and search.
- Governance: security and change control apply at the same layer where business users actually work.
- Decision capture: the platform does not stop at analytics. Actions and functions let users change operational state in a governed way.
Actions and Functions
Foundry distinguishes between action types and functions. Action types are the user-facing, governed transaction surface. Functions are server-side logic units that can compute values, return object sets, or generate Ontology edits. When an edit function is wired into an action type, users can safely write decisions back into the Ontology.
That is why Foundry applications are more than dashboards. A planner can change the state of a shipment. An investigator can escalate a transaction. An operations lead can assign ownership. The logic is centralized, audited, and permissioned.
Function-Backed Ontology Action Example
```typescript
import { Shipment } from "@ontology/sdk";
import { Client, Osdk } from "@osdk/client";
import { createEditBatch, Edits } from "@osdk/functions";

type ShipmentEdit = Edits.Object<Shipment>;

export default function requestExpedite(
  client: Client,
  shipment: Osdk.Instance<Shipment>,
  requestedBy: string,
  reason: string,
): ShipmentEdit[] {
  const batch = createEditBatch<ShipmentEdit>(client);
  batch.update(shipment, {
    status: "EXPEDITE_REQUESTED",
    expediteRequestedBy: requestedBy,
    expediteReason: reason,
    expediteRequestedAt: new Date().toISOString(),
  });
  return batch.getEdits();
}
```
This snippet follows the official TypeScript v2 functions pattern for Ontology edits: define an Edits type, create an edit batch with createEditBatch, update the object, and return the edits. Per the docs, the edits only take effect when the function is configured as a function-backed action.
Read-Oriented Python Function Example
```python
from functions.api import function
from ontology_sdk import FoundryClient
from ontology_sdk.ontology.objects import Transaction
from ontology_sdk.ontology.object_sets import TransactionObjectSet


@function
def high_risk_transactions(min_score: float) -> TransactionObjectSet:
    client = FoundryClient()
    return client.ontology.objects.Transaction.where(
        Transaction.object_type.riskScore >= min_score
    )
```
Use read-oriented functions like this when you need server-side logic for Workshop, Quiver, or other operational interfaces. Use edit-returning functions when the workflow needs governed writeback.
Module 5: Analytics and Operational Applications
Code Workspaces
Code Workspaces gives users managed JupyterLab, RStudio, and VS Code environments inside Foundry. The public docs emphasize that these workspaces inherit Foundry's security, permissions, branching, scheduling, and repository infrastructure.
For data scientists, the value is straightforward: work in a familiar notebook or IDE, but on the same governed datasets and object model as the rest of the platform. No shadow copy. No side channel. No unsecured export just to train a model.
- Use Code Workspaces for model exploration, feature engineering, evaluation, and research workflows.
- Use Code Repositories or Pipeline Builder for large-scale production transforms, because the docs explicitly note that Code Workspaces is single-node while the other tools leverage Spark-oriented infrastructure.
- Use the same repository and branching discipline to move successful exploratory work into production pipelines.
Contour and Quiver
Contour and Quiver are both analysis tools, but they sit on different mental models.
| Tool | Best mental model | Best use case |
|---|---|---|
| Contour | Table-centric, point-and-click analysis at scale | Large tabular analysis, dataset derivation, low-code transformations, dashboards over tables |
| Quiver | Object-aware and time-series-aware analytics | Ontology-driven analysis, linked-object exploration, operational dashboards, time-series workflows |
A simple analogy: Contour is closer to a governed, scalable spreadsheet-plus-query environment for tables. Quiver is closer to an operational analytics canvas where objects and signals are native citizens.
- Use Contour when your data is still mostly tabular, some data is not yet mapped into the Ontology, or you need large joins and transformations without writing code.
- Use Quiver when your data is mapped in the Ontology, relationships matter, time series matters, and the result should plug directly into operational applications.
Workshop
Workshop is Foundry's operational application builder. The docs describe it as a flexible, object-oriented application building tool that uses Ontology objects, links, actions, and functions as first-class building blocks. That is the key distinction: it is not merely a dashboard builder.
Think of Workshop as the last mile between a governed digital twin and the humans who need to operate the business. A CRM, an alert triage desk, a maintenance queue, a parts shortage cockpit, and a fraud review inbox are all natural Workshop workloads.
Why Foundry does this differently: in a traditional stack, the BI layer is read-only and the operational app is a separate engineering program. Foundry tries to collapse that gap so that analytics and operations run on the same semantic and governance substrate.
Module 6: Security and Governance
Markings and Mandatory Controls
Foundry's security model is built around the idea that access control should travel with the data, not be bolted onto the final dashboard. The docs frame this as a combination of mandatory controls and discretionary controls.
- Mandatory controls include Organizations and Markings. If a user does not meet them, roles do not help.
- Discretionary controls are roles like Owner, Editor, Viewer, and Discoverer on Projects and resources.
- Markings are conjunctive. A user must satisfy all applied markings to access the resource.
- Project roles determine what a user can do once they are allowed through the mandatory gate.
The public docs are especially clear that markings inherit both through the file hierarchy and through data dependencies. That means a sensitive upstream dataset can automatically impose additional data requirements on downstream derivatives.
If the raw dataset carries a PII marking, that constraint propagates unless it is deliberately and correctly removed as part of an approved transformation stage. This is exactly why compliance teams like the platform. You do not have to hope every downstream analyst remembered the sensitivity level. The platform enforces it.
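The conjunctive check and lineage inheritance can be sketched in plain Python. This is a toy model; the marking names, dict layout, and function names are illustrative, not Foundry's implementation.

```python
def effective_markings(lineage, own_markings, dataset):
    """Markings inherit along data dependencies: a dataset's effective markings
    are its own plus everything inherited from its upstream inputs."""
    inherited = set(own_markings.get(dataset, set()))
    for upstream in lineage.get(dataset, []):
        inherited |= effective_markings(lineage, own_markings, upstream)
    return inherited

def can_access(user_markings, required):
    # Markings are conjunctive: the user must satisfy every applied marking.
    return required <= user_markings

lineage = {"curated_customers": ["raw_customers"], "churn_report": ["curated_customers"]}
own_markings = {"raw_customers": {"PII"}, "churn_report": {"FINANCE"}}

required = effective_markings(lineage, own_markings, "churn_report")
print(sorted(required))                          # ['FINANCE', 'PII']
print(can_access({"FINANCE"}, required))         # False: PII marking not satisfied
print(can_access({"FINANCE", "PII"}, required))  # True
```

The report never had a PII marking applied directly; it inherited one through lineage, which is exactly the propagation behavior described above.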
CBAC and PBAC in Practice
In customer conversations you will often hear CBAC and PBAC. Foundry's public docs emphasize markings, organizations, roles, restricted views, and additional data requirements more than those acronyms, but the enterprise interpretation is usually:
| Model | How to think about it in Foundry | Typical implementation surface |
|---|---|---|
| CBAC | Classification-based access control. Access depends on the sensitivity classification attached to data. | Markings, organizations, inherited data requirements, project boundaries |
| PBAC | Purpose-based access control. Access is constrained to approved workflows and legitimate business purpose. | Project roles, application-specific access, action permissions, functions, restricted views, policy-driven workflow design |
In other words, CBAC answers "What classification is this data?" PBAC answers "Even if I can see it, what am I allowed to do with it in this workflow?" Foundry's advantage is that both questions can be enforced inside the same lineage-aware platform.
Automatic Propagation Through Lineage
The docs explicitly state that restricting access to a dataset restricts access to downstream derived data because markings inherit along data dependencies. That propagation is one of the most important reasons Foundry commands enterprise budget:
- It reduces accidental oversharing of sensitive derivatives.
- It makes data lineage a live control surface for security review.
- It allows teams to reason about the impact of expanding or removing access.
- It makes downstream application builders inherit governance rather than recreate it badly.
Security review becomes much more legible too. The access checker and Data Lineage views let teams inspect not only whether a user has access to a resource, but whether they meet additional data requirements inherited from lineage.
Module 7: Real-World Use Cases
Scenario 1: Supply Chain Command Center
Goal: ingest ERP and logistics data, produce operational Factory and Part objects, and let planners trigger an Expedite Shipment action from a Workshop application.
How the pieces combine
- Integration: Data Connection ingests ERP purchase orders, inventory balances, supplier confirmations, and shipment events from databases, S3 drops, or APIs.
- Transformation: Code Repositories standardize plant IDs, deduplicate part masters, calculate shortage risk, and produce curated datasets for supply planning.
- Ontology: Curated datasets are mapped into
Factory,Part,Supplier, andShipmentobjects with links likeFactory consumes PartandSupplier ships Shipment. - Applications: Workshop shows shortages, at-risk shipments, and planner work queues. Quiver charts lead-time deterioration over time. Actions capture interventions.
Why Foundry is strong here
In a conventional stack, the command center is often a fragile front-end project sitting on top of replicated warehouse views and custom service endpoints. In Foundry, the app can directly use object-aware search, links, actions, and security on top of the Ontology.
Representative action
A Workshop button calls a function-backed action similar to the requestExpedite example above. That action can require planner permissions, enforce that only at-risk shipments are eligible, and write the decision into the shipment object's writeback dataset so the whole organization sees the new state.
Scenario 2: Anti-Money Laundering Alerting
Goal: process daily transaction logs incrementally, score transactions in a model workflow, and surface high-risk Transaction objects in an investigator inbox.
How the pieces combine
- Integration: daily or near-real-time transaction files land as append-oriented datasets so the pipeline can stay incremental.
- Transformation: Foundry incremental transforms process only new transactions, enrich them with customer and sanctions context, and calculate base risk features.
- Code Workspaces: investigators and data scientists use JupyterLab or RStudio in Code Workspaces to train and validate models on the same governed data foundation.
- Ontology: scored records become
Transaction,Account,Customer, andCaseobjects, linked for search and triage. - Applications: Workshop provides an inbox for high-risk transactions, Quiver provides trend and time-series views, and actions let investigators escalate, dismiss, or open cases.
Where governance matters most
AML is a textbook case for lineage-aware security. Case data may require investigation-specific markings so one case team cannot casually inspect another case's evidence. The documentation's markings examples explicitly highlight case-based access control as a strong use case.
Representative model workflow
1. Land daily transactions as APPEND dataset transactions.
2. Run an incremental transform to derive features only for newly added records.
3. Score the new feature set from Code Workspaces or a model integration workflow.
4. Publish high-risk records to a curated dataset and map them into Transaction objects.
5. Surface those objects in a Workshop inbox.
6. Let investigators trigger actions such as Open Case, Escalate, or Dismiss with Reason.
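The scoring rule used in steps 2 through 4 can be unit-checked in plain Python before it is committed to a transform. The thresholds mirror the incremental example in Module 3; the function itself is a sketch, not production scoring logic.

```python
def score_transaction(txn):
    """Plain-Python mirror of the incremental transform's scoring rule:
    base score from amount, plus country and PEP adjustments."""
    score = 0.60 if txn["amount_usd"] >= 10000 else 0.10
    score += 0.25 if txn.get("high_risk_country") else 0.00
    score += 0.15 if txn.get("pep_flag") else 0.00
    bucket = "HIGH" if score >= 0.75 else "STANDARD"
    return round(score, 2), bucket

print(score_transaction({"amount_usd": 15000, "high_risk_country": True}))  # (0.85, 'HIGH')
print(score_transaction({"amount_usd": 500}))                               # (0.1, 'STANDARD')
```

Keeping the rule testable in isolation makes it easy to review threshold changes with compliance before they reach the pipeline.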
The result is not just "a fraud dashboard." It is a governed operational system that unifies ingestion, scoring, review, and decision capture.
Foundry Patterns to Remember
- Version code and data together: develop on branches, let builds write branch-specific outputs, and rely on fallback to master for unchanged inputs.
- Treat the ingestion transaction type (SNAPSHOT, APPEND, UPDATE, DELETE) as an architectural decision, not a connector setting.
- Keep raw datasets close to source reality and put business logic in downstream transforms.
- Go incremental once input volume makes full recomputation expensive.
- Model the business once in the Ontology and reuse it across Workshop, Quiver, and SDK-driven applications.
- Let markings and lineage-aware inheritance carry security with the data instead of bolting it onto the final dashboard.
Official Docs Used
- Overview: Foundry documentation home
- Projects: Projects and resources
- Datasets: Datasets and transactions
- Branching: Branching
- Lineage: Data Lineage
- Integration: Data connectivity and integration
- Connect: Connecting to data
- Pipelines: Building pipelines
- Transforms: transforms.api reference
- Transforms: Python transforms getting started
- Incremental: Incremental transforms
- Ontology: Ontology overview
- Actions: Action types overview
- Functions: Functions overview
- TS edits: TypeScript v2 ontology edits
- Py objects: Python functions on objects
- Code WS: Code Workspaces
- Analytics: Analytics overview
- Contour: Contour overview
- Quiver: Quiver overview
- Workshop: Workshop overview
- Security: Security and governance
- Glossary: Security glossary
- Markings: Markings
- Access: Checking permissions