DressSense
Fashion That Cares — For You and the Planet
Motivation
Getting dressed should be simple—but for many, it’s a daily source of stress and wasted time. On average, people spend over 4.3 days a year [1] just deciding what to wear. At the same time, the fashion industry is one of the largest contributors to global waste, with millions of tons of clothing discarded each year.
Even more telling: 80% of people buy clothes they never wear [2], contributing to the millions of tons of textile waste generated every year. That’s money wasted, space cluttered, and resources consumed—without adding real value.
We saw an opportunity to solve all of this with one smart solution. What if people could make better use of what they already own, while saving time and reducing their environmental impact?
DressSense was born from this idea: to help people unlock the full potential of their closets. By making outfit planning faster, easier, and more sustainable, DressSense empowers users to dress with confidence—without the need to constantly buy more.
Our Solution
DressSense is a smart wardrobe assistant that helps you get more out of the clothes you already own. Instead of buying something new, just snap photos of what’s in your closet, and DressSense does the rest. Using AI, it generates outfit combinations tailored to your personal style, the weather, and the occasion—so you always have something to wear.
Our solution eliminates daily decision fatigue by making getting dressed effortless. It also encourages more mindful fashion habits by helping you see new potential in your existing wardrobe. Whether you're prepping for work, a date, or a trip, DressSense gives you quick, personalized outfit ideas that save time, reduce clutter, and cut down on unnecessary shopping.
How It Works
- Upload photos of your existing clothes.
- Input prompt for occasion, weather, mood, etc.
- Get AI-powered outfit suggestions.
- Track what you’ve worn to keep outfits fresh (see the sketch below).
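As a rough illustration, this flow could map onto a small REST API. The endpoints and payloads in the following Python sketch are hypothetical stand-ins, not DressSense's actual interface:

```python
import requests

API = "https://api.dresssense.example"  # hypothetical base URL

# 1. Upload a wardrobe photo for recognition.
with open("blue_oxford_shirt.jpg", "rb") as f:
    item = requests.post(f"{API}/wardrobe/items", files={"image": f}).json()

# 2. Request an outfit with a free-form prompt.
outfit = requests.post(
    f"{API}/outfits/suggest",
    json={"prompt": "casual dinner, light rain, relaxed mood"},
).json()
print(outfit["items"])  # suggested wardrobe item IDs

# 3. Mark the outfit as worn so future suggestions stay fresh.
requests.post(f"{API}/outfits/{outfit['id']}/worn")
```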
Architecture Summary
The DressSense application is built on two data pipelines: the Recognition Pipeline and the Recommendation Pipeline. In the recognition pipeline, users upload wardrobe images through a React.js frontend hosted on AWS Amplify. These images are stored in Amazon S3 and processed by a Flask backend deployed on AWS Elastic Beanstalk. A Qwen VL model classifies each item, and the results are saved to Amazon DynamoDB for future use.
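A condensed sketch of that recognition endpoint is below. The bucket name, table name, and the classify_item() stub standing in for the Qwen VL call are illustrative assumptions, not the production code:

```python
import uuid

import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("WardrobeItems")  # assumed table name

def classify_item(bucket: str, key: str) -> dict:
    """Placeholder for the Qwen VL attribute-extraction call (see Models)."""
    raise NotImplementedError

@app.post("/wardrobe/items")
def upload_item():
    image = request.files["image"]
    item_id = str(uuid.uuid4())

    # Store the raw image in S3.
    s3.upload_fileobj(image, "dresssense-wardrobe", f"{item_id}.jpg")

    # Classify the item (category, color, fit, ...) with the vision model.
    attributes = classify_item(bucket="dresssense-wardrobe", key=f"{item_id}.jpg")

    # Persist the attributes to DynamoDB for the recommendation pipeline.
    table.put_item(Item={"item_id": item_id, **attributes})
    return jsonify({"item_id": item_id, "attributes": attributes})
```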
The recommendation pipeline uses these stored clothing attributes to generate outfit suggestions. When a user submits a prompt, the backend passes it, together with the stored attributes, to a DeepSeek model that generates tailored outfit combinations, and the final suggestions are returned to the user through the website. This architecture streamlines wardrobe management by automating clothing classification and outfit generation in a scalable, cloud-based environment.
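The recommendation step could then look roughly like this, assuming the distilled DeepSeek model is served behind an OpenAI-compatible endpoint (the URL, model name, and prompt wording are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="https://llm.dresssense.example/v1", api_key="...")

def suggest_outfit(prompt: str, wardrobe: list[dict]) -> str:
    # Stored clothing attributes from DynamoDB are passed in as context.
    messages = [
        {"role": "system", "content": "You are a fashion stylist. Build outfits "
                                      "only from the wardrobe items provided."},
        {"role": "user", "content": f"Wardrobe: {wardrobe}\nRequest: {prompt}"},
    ]
    response = client.chat.completions.create(
        model="deepseek-r1-distill",  # assumed deployment name
        messages=messages,
    )
    return response.choices[0].message.content
```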
Data
We explored four datasets for this endeavor:
- HM Fashion Caption (Hugging Face), N = 20k
- Fashion Product Images (Kaggle), N = 44k
- Fashionpedia, N = 48k
- DeepFashion, N = 290k
We faced two limitations with these datasets: the attributes available per image, and dataset bias.
We identified 12 attributes helpful for outfit generation, including clothing category, color, etc. However, even the best-labeled dataset (Fashion Product Images) provided only 7 of the 12 attributes per image.
When exploring the Fashion Product Images dataset, we discovered a heavy bias toward men's casual wear.
This posed two important considerations:
1. The limited attributes available per image would restrict the ability to generate relevant outfits.
2. Bias in the dataset would skew our evaluations toward men's casual wear.
Due to these considerations, we chose to implement the recognition model to bolster our available attributes and to augment the dataset with other image sets that lacked the appropriate labeling.
Models
We used two main models: one for clothing feature recognition and one for outfit recommendation.
For clothing feature recognition, we chose the Qwen VL Instruct model, as it rates highly on visual question answering benchmarks.
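For illustration, attribute extraction can be framed as a single visual-question-answering call that asks the model for structured JSON. The key list and the vqa() helper below are assumptions; the report names category, color, fit, pattern, style, functional features, and brand among the 12 attributes:

```python
import json

# Prompt sketch for Qwen VL Instruct; the key list is illustrative, not the
# exact 12-attribute schema used in the project.
ATTRIBUTE_PROMPT = """Describe this clothing item as JSON with keys such as:
category, primary_color, pattern, fit, style, functional_features, brand.
Use "unknown" for any attribute that is not visible."""

def extract_attributes(image_path: str) -> dict:
    raw = vqa(image=image_path, question=ATTRIBUTE_PROMPT)  # assumed VQA helper
    return json.loads(raw)
```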
We chose a distilled version of DeepSeek for outfit recommendations because it's one of the most powerful open-source reasoning models available, while also offering low inference costs. Compared to the Anthropic reasoning model, DeepSeek was one-third the cost per 1,000 input tokens and half the cost per 1,000 output tokens. This balance allows us to retain strong reasoning capabilities—important for interpreting the implications of an outfit prompt—without sacrificing cost efficiency.
Model Evaluations: Recognition
Ground truth existed for only 7 of the 12 attributes in one dataset and for none of them in the others, so we evaluated model performance using an LLM as a judge, assessing accuracy across 18 key clothing attributes for outfit recommendation. This ensured no recommendation relied on unevaluated attributes and kept the results highly interpretable.
LLM as a Judge: The LLM achieved near-perfect accuracy (≥0.98) for attributes like fit, pattern, primary color, and style, with lower accuracy for functional features (0.82) and brand (0.32). Classification of clothing into topwear, bottomwear, outerwear, and footwear showed strong results, with particularly high precision for bottoms and footwear. One expected pitfall of this recognition evaluation is that brands are usually hard to see on the exterior of clothing, difficult to read in low-resolution images, or simply not deducible from the image at all. However, since brand is not critical for creating a visually cohesive outfit, we determined it was acceptable to proceed.
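A minimal sketch of the judging loop is below, assuming an OpenAI-compatible judge model; the model name and prompt wording are illustrative, not the exact harness we ran:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a clothing-attribute prediction.
Attribute: {attribute}
Predicted value: {predicted}
Image description: {description}
Answer with exactly one word: CORRECT or INCORRECT."""

def judge(attribute: str, predicted: str, description: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            attribute=attribute, predicted=predicted, description=description)}],
    )
    return response.choices[0].message.content.strip().upper() == "CORRECT"
```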
Semantic Similarity vs Ground Truth: To further evaluate textual alignment between predicted and ground-truth outputs, we used BERTScore and LLaMA-based cosine similarity. BERTScore averaged 0.92 across all labels, and LLaMA embeddings outperformed vector-based similarity across every category, indicating LLM embeddings are more semantically robust for fashion features. This is especially evident in the detailed results, where semantically similar options like "Top" and "Topwear" receive a similarity of 0 under vector-based similarity.
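The contrast is easy to reproduce: token-count vectors share no tokens between "Top" and "Topwear" and so score 0, while BERTScore captures the similarity. The snippet uses the real bert-score package; a sentence-embedding model would stand in for the LLaMA embeddings we used:

```python
from bert_score import score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pred, ref = "Top", "Topwear"

# BERTScore between the predicted and ground-truth labels.
P, R, F1 = score([pred], [ref], lang="en")
print(f"BERTScore F1: {F1.item():.2f}")

# Token-count vectors treat "Top" and "Topwear" as unrelated words,
# so their cosine similarity is 0 -- the failure mode described above.
vecs = CountVectorizer().fit_transform([pred, ref])
print(cosine_similarity(vecs[0], vecs[1])[0, 0])  # 0.0
```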
Vs Other Feature Extraction: For comparison, we also extracted features from raw images using CNNs (ResNet, DenseNet, EfficientNet), vision-language models (CLIP), and transformers (Swin). Among these, ResNet produced the most accurate matches (63.48%), while the Swin Transformer underperformed. However, matching was limited by missing metadata and a lack of documentation about feature extraction in the original dataset.
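As an example of the CNN baseline, a pretrained ResNet from torchvision can serve as the feature extractor (the matching step against dataset labels is omitted here):

```python
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.fc = torch.nn.Identity()  # drop the classifier; keep 2048-d features
model.eval()

preprocess = weights.transforms()  # resize, crop, normalize as ResNet expects

def embed(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(image).squeeze(0)  # embedding for nearest-neighbor matching
```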
Model Evaluations: Recommendation
To evaluate the quality of outfit recommendations, we used both automated NLP-based scoring and human judgment through Mechanical Turk.
Human Evaluation: AWS Mechanical Turk allowed human labelers to determine whether generated outfits were good or bad based on appropriateness, coordination, and practicality. Overall, ~60% of outfits were rated positively, with the strongest approval (71%) for occasion-based prompts. Style-based prompts saw lower agreement, likely due to subjectivity and diverse fashion preferences. The evaluation covered N = 500 different prompts and hundreds of wardrobe items across multiple datasets to provide broad coverage.
BLEU/ROUGE: For the NLP-based evaluation, we manually constructed reference outfits for various prompts, then assessed the model’s generated combinations using BLEU and ROUGE scores against those references. Results varied by context: recommendations for “Men’s Beach Outfit” achieved strong alignment (BLEU: 0.72, ROUGE-L: 0.84), while “Women’s Winter” outfits were less aligned (BLEU: 0.35, ROUGE-L: 0.52).
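For reference, the scoring itself can be reproduced with standard packages (nltk for BLEU, rouge-score for ROUGE); the outfit strings below are made up for illustration:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "white linen shirt, tan shorts, leather sandals, straw hat"
generated = "white linen shirt, navy shorts, leather sandals, sunglasses"

# BLEU compares n-gram overlap between the generated and reference outfits.
bleu = sentence_bleu(
    [reference.split()], generated.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L measures the longest common subsequence between the two texts.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, generated)["rougeL"]
print(f"BLEU: {bleu:.2f}, ROUGE-L F1: {rouge_l.fmeasure:.2f}")
```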
Key Learnings
DressSense demonstrates the potential of large language models (LLMs) to act as a fashion stylist.
Using an LLM as a judge, we achieved over 98% accuracy across 18 core clothing attributes such as fit, pattern, and style, with the exception of brand recognition (32%), which is often visually unavailable or illegible in images. LLM-based semantic similarity metrics, such as BERTScore (mean: 0.92), consistently outperformed traditional vector similarity methods, offering a more nuanced understanding of fashion concepts—for example, recognizing that "Top" and "Topwear" are functionally equivalent.
Human evaluation via Mechanical Turk revealed that around 60% of generated outfits were judged as appropriate, with occasion-based prompts performing best (71% approval) and style-based prompts showing higher disagreement (potentially due to subjective fashion preferences). While many of the generated outfits still fall short of being reliably fashionable or personalized, these results represent a promising step toward assistive fashion technology. By combining interpretable attribute modeling, prompt-driven recommendations, and scalable human-in-the-loop feedback, DressSense offers a solid foundation for building systems that help users cut down on both time and material waste.
Future Roadmap
Looking ahead, our medium-term roadmap focuses on enhancing accessibility, precision, and overall user experience through three key improvements:
Voice Chatbot
To support users with visual impairments, we plan to integrate voice chatbot functionality (potentially through OpenAI Whisper) for voice-based interaction. This would enable seamless access to clothing recommendations without relying on a visual interface.
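A first cut could lean on the open-source openai-whisper package to turn a spoken request into the text prompt the stylist already accepts; the sketch below assumes that package and omits the API wiring:

```python
import whisper

model = whisper.load_model("base")

def transcribe_request(audio_path: str) -> str:
    # Convert a spoken request into the text prompt the stylist expects.
    result = model.transcribe(audio_path)
    return result["text"]  # e.g. "what should I wear to a rainy outdoor wedding?"
```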
Targeted Clothing Recommendation
To reduce waste at the point of purchase, we aim to integrate store APIs with the recommendation model. This will allow smarter, targeted suggestions for purchases that make sense with what you already have, reducing the potential for wasteful purchases.
Improved Recognition/Recommendation Models
To minimize false positives and boost recommendation accuracy, we plan to address two major gaps:
- Integrating a comprehensive brand database, addressing the model’s weakest attribute (currently 32% accuracy).
- Finetuning on fabric types, which will help correct subtle but important misclassifications and strengthen overall feature granularity.
References:
1. de Klerk, A. (2016, June 6). Women spend 17 minutes each morning choosing an outfit. Harper's Bazaar. https://www.harpersbazaar.com/uk/people-parties/bazaar-at-work/news/a37318/women-average-time-choosing-outfit/
2. Wolstenholme, H. (2018, November 30). More than 80 per cent of shoppers are buying clothes they never wear, study shows. Evening Standard. https://www.standard.co.uk/news/uk/more-than-80-per-cent-of-shoppers-are-buying-clothes-they-never-wear-study-shows-a4004996.html