MIDS Capstone Project Spring 2026

GroupMatch AI

Problem and Motivation

Making meaningful friendships in a new city is challenging. Existing platforms focus either on one-on-one matching (e.g., dating-style apps) or on large, unstructured group events; none effectively forms small, compatible friend groups.

Platforms that support group interactions are typically event-based or self-organized, meaning users must find compatible people themselves. This creates a clear gap: no system currently exists to form cohesive friend groups based on shared interests and personality.

GroupMatch AI addresses this gap by learning a notion of “group compatibility” and using it to algorithmically form groups of four individuals likely to be socially compatible.

Data Sources and Data Science Approach

Data Sources

A central challenge of this project is the lack of ground truth data. No publicly available dataset of “ideal friend groups” exists.

We approached this in two stages:

1. Real-World Proxy Data as a Feasibility Study

We used the Facebook SNAP Dataset[1] as a proxy for real-world friendships to approximate compatibility signals between users. Our first goal was to answer a key question: how predictable is group compatibility?

The dataset consists of labeled social circles, which are not immediately suitable for supervised learning. We restructured the data by: 

  • Sampling users from social circles
  • Constructing positive, negative, and partially positive group examples

This allowed us to train models to recognize varying levels of compatibility.
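As a rough illustration of this restructuring step, the sketch below samples four-person groups in which a chosen number of members share a circle, then labels each group by its largest overlap. The function names, the assumption that circles are disjoint, and the labeling rule are all illustrative choices, not a description of the project's actual pipeline.

```python
import random
from collections import Counter

def sample_group(circles, n_same, rng, size=4):
    """Draw `size` users so that exactly `n_same` of them share one circle.

    `circles` maps a circle id to a list of user ids. Circles are assumed to
    be disjoint here, which is a simplification for illustration.
    """
    circle_ids = sorted(circles)
    if n_same <= 1:
        # Negative example: every member comes from a different circle.
        return [rng.choice(circles[c]) for c in rng.sample(circle_ids, size)]
    # Pick a "home" circle large enough to supply the shared members.
    home = rng.choice([c for c in circle_ids if len(circles[c]) >= n_same])
    members = rng.sample(circles[home], n_same)
    # Fill the remaining seats from distinct other circles.
    others = [c for c in circle_ids if c != home]
    members += [rng.choice(circles[c]) for c in rng.sample(others, size - n_same)]
    return members

def overlap_label(group, user_circle):
    """Return 0, 2, 3, or 4: the largest number of members sharing a circle."""
    largest = max(Counter(user_circle[u] for u in group).values())
    return 0 if largest == 1 else largest
```

Calling sample_group with n_same set to 0, 2, 3, or 4 yields negative, partially positive, and fully positive examples respectively.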

2. Synthetic Data Generation

The Facebook SNAP dataset helped us validate feasibility, but its users and attributes are anonymized, which limits its direct usefulness for our application.

To address this, we generated synthetic user profiles using LLMs, incorporating attributes like interests, personality traits, and location. The synthetic profiles enabled us to simulate the act of creating a profile and matching with other users in our app. Importantly, we applied the patterns observed in the SNAP data to structure the synthetic dataset.

Modeling Approach

We modeled compatibility as a function over sets of four users and framed group formation as an optimization problem. We trained models to predict compatibility scores for candidate groups, which we then used to rank and select the best groupings.
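One simple way to turn ranked scores into a set of groupings is a greedy pass that accepts the highest-scoring group first and skips any later group that reuses a user. The sketch below assumes this greedy strategy purely for illustration; the selection procedure actually used is not specified here, and the name select_groups is hypothetical.

```python
def select_groups(scored_groups):
    """Greedily pick non-overlapping groups in descending score order.

    `scored_groups` is a list of (group, score) pairs, where each group is a
    tuple of user ids. Greedy selection is an assumed strategy, not
    necessarily the one used in the project.
    """
    assigned = set()
    selected = []
    for group, score in sorted(scored_groups, key=lambda gs: gs[1], reverse=True):
        # Skip any group that contains a user who is already placed.
        if assigned.isdisjoint(group):
            selected.append(group)
            assigned.update(group)
    return selected
```

Greedy selection is fast but not guaranteed to maximize the total score; an exact approach would treat the choice as a set-packing problem.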

To address scalability challenges from the combinatorial number of possible groups, we incorporated a KNN-based recaller that narrows the candidate set before applying the ranker model.
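The recall-then-rank idea can be sketched as follows: for an anchor user, keep only the k nearest neighbors in feature space, enumerate groups of four from that shortlist, and let a scoring function pick the best one. The Euclidean distance, the feature vectors, and the score_group callable are assumptions standing in for the project's recaller and ranker.

```python
import math
from itertools import combinations

def knn_recall(anchor, features, k):
    """Return the ids of the k users closest to `anchor` (Euclidean distance)."""
    dists = [
        (math.dist(features[anchor], vec), uid)
        for uid, vec in features.items()
        if uid != anchor
    ]
    return [uid for _, uid in sorted(dists)[:k]]

def best_group(anchor, features, k, score_group):
    """Score only groups built from the anchor's k nearest neighbors."""
    neighbors = knn_recall(anchor, features, k)
    # C(k, 3) candidates instead of C(n - 1, 3) over the full user base.
    candidates = [(anchor, *trio) for trio in combinations(neighbors, 3)]
    return max(candidates, key=score_group)

def toy_score(features):
    """Placeholder ranker: tighter groups (smaller pairwise distances) score higher."""
    def score(group):
        return -sum(math.dist(features[a], features[b]) for a, b in combinations(group, 2))
    return score
```

With k = 20, each anchor requires scoring 1,140 candidate groups regardless of how many users are on the platform.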

Evaluation

We formulated group compatibility as a multi-class classification problem, where each group is labeled based on how many members belong to the same true group:

  • [Class 1] 0/4 (no overlap)
  • [Class 2] 2/4 (partial overlap)
  • [Class 3] 3/4
  • [Class 4] 4/4 (fully compatible group)

We evaluated performance using the following metrics:

  • Accuracy
  • Within +/- 1 Accuracy
  • Ordinal RMSE

We included within +/- 1 accuracy to credit near-correct predictions, and ordinal RMSE to reflect the ordered nature of the labels by penalizing larger mistakes more heavily than smaller ones.
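The three metrics can be sketched in a few lines. This version assumes the classes are encoded as the ordinal indices 1 through 4 and that ordinal RMSE is the root mean squared difference between those indices; the exact encoding used in the project is an assumption.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions at most one class away from the truth."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)

def ordinal_rmse(y_true, y_pred):
    """Root mean squared error over ordinal class indices."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

For example, predicting Class 1 for a true Class 4 group counts as one error for accuracy but contributes a squared error of 9 to the RMSE, while an off-by-one mistake contributes only 1.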

To evaluate our model under realistic deployment conditions, we used three distinct train/test split strategies:

1. Seen Users

All users in the test set appear in the training set, but the specific group combinations are new (Figure 2). This simulates forming new groups among existing users.

2. Cold Start

None of the users in the test set appear in the training set (Figure 3). This mimics new users joining the platform and matching amongst themselves.

3. Partial Cold Start

Test set groups contain a mix of known and unseen users, emulating the formation of groups with both new and existing users (Figure 4).
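The three scenarios differ only in how many members of a test group were seen during training, which the sketch below makes explicit. The function names and scenario strings are hypothetical; this illustrates the definitions above rather than the project's splitting code.

```python
def split_scenario(group, train_users):
    """Classify a test group by how many of its members appear in training."""
    known = sum(user in train_users for user in group)
    if known == len(group):
        return "seen"
    if known == 0:
        return "cold_start"
    return "partial_cold_start"

def bucket_test_groups(test_groups, train_users):
    """Group candidate test groups by evaluation scenario."""
    buckets = {"seen": [], "cold_start": [], "partial_cold_start": []}
    for group in test_groups:
        buckets[split_scenario(group, train_users)].append(group)
    return buckets
```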

Model Results

Our best-scoring model (XGBoost) performed as follows:

SNAP FACEBOOK DATASET
Scenario             RMSE     Accuracy   +/-1 Accuracy
Seen Users           0.7123   0.6877     0.9450
Cold Start           0.8665   0.5889     0.9081
Partial Cold Start   0.8255   0.6092     0.9157

As expected, performance drops noticeably in the cold start setting: accuracy falls from 0.6877 for seen users to 0.5889. While additional feature engineering and model tuning could likely improve these results, the primary goals of using SNAP were to validate feasibility and develop the modeling framework.

SYNTHETIC DATASET
Scenario             RMSE     Accuracy   +/-1 Accuracy
Seen Users           0.2978   0.9113     1.0000
Cold Start           0.5286   0.7706     0.9875
Partial Cold Start   0.3240   0.8965     0.999

The model performs much better on the synthetic data than on the SNAP dataset; for example, seen-users accuracy rises from 0.6877 to 0.9113. We attribute this improvement to two main factors:

  1. LLM-generated data is less noisy and follows more structured patterns, and

  2. Unlike SNAP, the synthetic dataset includes detailed, non-anonymized attributes, enabling stronger predictions.

Key Learnings

1. Problem Formulation Matters More Than Modeling

The most challenging part of this project was not model selection or designing a fancy architecture, but defining the problem itself. Unlike many ML tasks, there was no existing dataset or clear mapping from data to labels. We had to design a framework that translated “group compatibility” into a learnable objective.

2. Complexity of Group-Level Modeling

Moving from pairwise matching to set-based (group) modeling significantly increases complexity due to combinatorial growth and interaction effects. This required careful dataset construction and system design (candidate recall + ranking) to make the problem tractable.

3. Evaluation

We designed our evaluation scenarios (seen, cold start, partial cold start) to understand how the model would perform in practice. With our graph-based data, this required careful handling of train/test splits. For example, in the partial cold start scenario, we severed connections between new users, while preserving connections between new and existing users to simulate realistic onboarding conditions without introducing information leakage.
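The edge handling described for the partial cold start scenario can be sketched as a filter over an edge list: an edge is dropped only when both endpoints are new users. Representing the graph as a list of (u, v) pairs and the name sever_new_user_edges are assumptions made for illustration.

```python
def sever_new_user_edges(edges, new_users):
    """Drop edges between two new users; keep every other edge.

    `edges` is a list of (u, v) user-id pairs and `new_users` is a set of
    user ids treated as unseen at training time.
    """
    return [
        (u, v)
        for u, v in edges
        if not (u in new_users and v in new_users)
    ]
```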

Impact

GroupMatch AI demonstrates the feasibility of algorithmic friend group formation, a largely unexplored area compared to pairwise recommendation systems. We provide a foundation for building a system that can form small, compatible social groups. 

Acknowledgments 

Thank you to our instructors, Puya Vahabi and Daniel Aranki, for their guidance, support, and continuous feedback throughout our capstone project.

Sources and Citations

[1] J. McAuley and J. Leskovec. Learning to Discover Social Circles in Ego Networks. NIPS, 2012.

Last updated: April 14, 2026