tabular-ml

1 post

google

Graph foundation models for relational data

Google researchers have introduced Graph Foundation Models (GFMs) to address a key limitation of traditional tabular machine learning: it ignores the rich connectivity of relational databases. By representing tables as interconnected graphs, where rows are nodes and foreign keys are edges, a single model can generalize across entirely different schemas and feature sets. This shift enables transferable graph representations that can run inference on unseen tasks without costly domain-specific retraining.

### Transforming Relational Schemas into Graphs

The core methodology is a scalable data preparation step that converts a standard relational database into a single heterogeneous graph. The process preserves the underlying logic of the data while making it compatible with graph-based learning:

* **Node Mapping:** Each table is treated as a node type, and every individual row within that table becomes a node of that type.
* **Edge Creation:** Foreign key relationships are transformed into typed edges that connect nodes across different tables.
* **Feature Integration:** Columns containing numerical or categorical data are converted into node features, while temporal data can be preserved as features on either nodes or edges.

### Overcoming the Generalization Gap

A primary hurdle in developing GFMs is the lack of a universal tokenization method akin to the word pieces used in language models or the patches used in vision models. Traditional Graph Neural Networks (GNNs) are typically locked to the specific graph they were trained on; GFMs overcome this through several technical innovations:

* **Schema Agnosticism:** The model avoids hard-coded embedding tables for specific node types, allowing it to interpret database schemas it never encountered during training.
* **Feature Interaction Learning:** Instead of training on "absolute" features (such as specific price distributions), the model captures how different features interact with one another across diverse tasks.
* **Generalizable Encoders:** The architecture uses transferable methods to derive fixed-size representations for nodes, whether a node carries three continuous float features or dozens of categorical values.

### Scaling and Real-World Application

To handle the demands of enterprise-scale data, the GFM framework is built on Google's specialized infrastructure:

* **Massive Throughput:** The system uses JAX and TPU infrastructure to process graphs containing billions of nodes and edges.
* **Internal Validation:** The model has been tested on complex internal Google tasks, such as spam detection in advertisements, which requires analyzing dozens of interconnected relational tables simultaneously.
* **Performance Benefits:** By considering the connections between rows, a factor that traditional tabular baselines such as decision trees ignore, the GFM delivers superior downstream performance in high-stakes prediction services.

Transitioning from domain-specific models to Graph Foundation Models allows organizations to use relational data more holistically. By focusing on the connectivity of the data rather than isolated table features, GFMs offer a path toward a single, generalist model capable of handling diverse enterprise tasks.
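The table-to-graph conversion described above (rows become typed nodes, foreign keys become typed edges, remaining columns become features) can be sketched roughly as follows. This is a minimal, hypothetical illustration, not Google's actual pipeline; the `tables_to_graph` helper and its input layout are assumptions made for the example:

```python
def tables_to_graph(tables, foreign_keys):
    """tables: {table_name: [row dicts, each with an 'id' key]}
    foreign_keys: [(src_table, src_column, dst_table), ...]"""
    fk_cols = {(t, c) for t, c, _ in foreign_keys}
    nodes = {}  # (table_name, row_id) -> features; the table name acts as the node type
    edges = []  # (edge_type, src_node_key, dst_node_key)

    for table, rows in tables.items():
        for row in rows:
            # Non-key columns become node features.
            nodes[(table, row["id"])] = {
                k: v for k, v in row.items()
                if k != "id" and (table, k) not in fk_cols
            }

    for src_table, col, dst_table in foreign_keys:
        for row in tables[src_table]:
            # Each foreign-key reference becomes one typed edge.
            edges.append((f"{src_table}.{col}",
                          (src_table, row["id"]),
                          (dst_table, row[col])))
    return nodes, edges


nodes, edges = tables_to_graph(
    {"user": [{"id": 1, "age": 34}, {"id": 2, "age": 27}],
     "order": [{"id": 10, "user_id": 1, "amount": 9.5},
               {"id": 11, "user_id": 1, "amount": 3.0}]},
    [("order", "user_id", "user")],
)
```

The resulting single heterogeneous graph keeps the relational logic intact: each `order` row is linked to its `user` row by a typed `order.user_id` edge, while columns like `age` and `amount` survive as node features.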
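The "generalizable encoders" idea, deriving a fixed-size node representation regardless of how many columns a table has, can likewise be illustrated with a toy sketch. Everything here is an assumption for demonstration: real GFM encoders are learned, whereas this uses a fixed hash-seeded column embedding and handles only numeric values:

```python
import hashlib
import numpy as np

DIM = 8  # output embedding size, fixed across all schemas

def column_embedding(name, dim=DIM):
    # Deterministic stand-in for a learned column-name embedding.
    seed = int.from_bytes(hashlib.sha256(name.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)

def encode_node(features, dim=DIM):
    """features: {column_name: numeric value} -> vector of shape (dim,)."""
    if not features:
        return np.zeros(dim)
    # Embed each (column, value) pair, then pool: the pooled mean is
    # permutation-invariant and independent of the number of columns.
    vecs = [column_embedding(col, dim) * float(val)
            for col, val in features.items()]
    return np.mean(vecs, axis=0)

# Nodes from entirely different schemas land in the same representation space.
a = encode_node({"age": 34.0, "spend": 9.5})  # two features
b = encode_node({"clicks": 3.0})              # one feature, same output size
```

Because no per-schema embedding table is baked in, the same encoder can be applied to tables the model never saw during training, which is the property the post attributes to schema-agnostic GFMs.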