HomeMe
Introduction
About 80% of Singapore residents live in public housing managed by the Housing and Development Board (HDB). However, as with any government-managed resource, the many policies designed to safeguard it have produced a complex web of eligibility criteria for housing schemes and grants. This makes the process of buying, selling, and living in HDB flats challenging and stressful.
Singaporeans have two main sources of information when it comes to HDB policies. They can refer to HDB's website, where they face a maze of hyperlinks, information overload, and deeply nested rabbit holes - a stressful experience. Alternatively, they can hire a real estate agent for up to 1% of the property price - typically between USD$3,000 and USD$12,000.
Figure 1: Navigating HDB's Webpage Can Be Tough
Finding a flat can also be a challenge - HDB's resale portal is limited in functionality, supporting basic filters by address and flat type but missing geospatial filters such as distance to schools, parks, and other amenities.
HomeMe AI seeks to fill this gap. Our solution? A powerful agentic Question and Answer (QA) system that empowers non-technical users to easily query policies and find their dream homes via a conversational natural language interface. Employing a unique HTML RAG architecture and equipped with a custom geospatially rich database, HomeMe AI outperforms existing solutions in the market. In doing so, we are one step closer to our mission: democratizing access to housing information.
Data Sources
We used a combination of structured, unstructured, and geospatial datasets to power the HomeMe project. All data sources are public and originate from Singapore’s open government data platforms such as data.gov.sg and hdb.gov.sg.
Unstructured Data (for RAG QA system):
- Housing and Development Board Website: Web-scraped content covering housing policies, eligibility criteria, and FAQs. This data was chunked and processed for retrieval-augmented generation (RAG) to answer conceptual and policy-related queries.
Structured Data (for SQL analysis):
- HDB Resale Flat Prices (multiple years): Detailed records of flat resale transactions, including attributes like flat type, block, lease commencement, and resale price.
- HDB Property Information: Building-level attributes including residential/commercial classification, number of units, and amenities.
Geospatial Data (for POI feature engineering):
- HDB Existing Building Dataset: Geocoded data of HDB building locations.
- LTA MRT Station Exit Locations: Geospatial coordinates of MRT exits across Singapore.
Figure 2: Data Sources and Pipelines
QA System Approach
Figure 3: HomeMe's QA HTML RAG System
Our QA system architecture features two core innovations that enable our system's high performance. These include:
- Bespoke Context-Aware Extractions for Vector DB Context Retrieval
- LLM HTML Comprehension
Bespoke Context-Aware Extractions
Figure 4: QA Data Pipeline
The HTML files our team encountered had irregular and inconsistent internal structures. Before working with them, we ran them through a custom-coded HTML clean-up script, which removed redundant HTML elements (e.g., headers, navigation bars). This reduced token counts and prevented context overload in subsequent hand-offs to the LLMs.
Figure 5: HTML Cleaning
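A minimal, stdlib-only sketch of this kind of clean-up step follows. The tag list and implementation are illustrative - the actual script is custom-built for HDB's pages:

```python
from html.parser import HTMLParser

# Illustrative boilerplate tags; the real script targets HDB-specific elements.
STRIP_TAGS = {"header", "nav", "footer", "script", "style", "aside"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # no closing tag

class HTMLCleaner(HTMLParser):
    """Re-emits HTML with boilerplate subtrees removed to cut token counts."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a stripped subtree

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if tag in STRIP_TAGS:
            self.skip_depth = 1
            return
        attr_str = "".join(f' {k}="{v or ""}"' for k, v in attrs)
        self.out.append(f"<{tag}{attr_str}>")

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            attr_str = "".join(f' {k}="{v or ""}"' for k, v in attrs)
            self.out.append(f"<{tag}{attr_str}/>")

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

def clean_html(raw: str) -> str:
    cleaner = HTMLCleaner()
    cleaner.feed(raw)
    return "".join(cleaner.out)
```

The key design point is that entire boilerplate subtrees are dropped, not just the tags themselves, so navigation text never reaches the LLM.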
To support context retrieval, we used a methodology we call Context-Aware Extractions. Vector databases require natural sentence structures to maximize similarity search performance, but HTML tables and structures do not easily lend themselves to such formats. At the same time, given the complex way HDB policy information is presented, naive chunking strips out highly important nuances on the page. For example, a single vital statement at the top of a webpage indicating that a policy applies only to "Couples & Families" would be completely missed. Context-Aware Extractions repeatedly fold these vital nuances back into every chunk stored in the vector DB.
Figure 6: Context-Aware Chunk Extraction
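The folding step can be sketched as follows (the banner string, chunk contents, and chunk format below are hypothetical examples, not the project's actual extraction rules):

```python
def contextualize_chunks(page_context: str, chunks: list[str]) -> list[str]:
    """Prepend hand-extracted page-level context to every chunk so that
    vital nuances (e.g. an applicability banner) survive chunking."""
    return [f"[Context: {page_context}]\n{chunk}" for chunk in chunks]

# Hypothetical banner that plain chunking would strip away.
banner = "CPF Housing Grant - applicable to Couples & Families only"
chunks = [
    "Income ceiling: details of the household income assessment...",
    "Grant amounts vary by flat type and citizenship status...",
]
enriched = contextualize_chunks(banner, chunks)
```

Each enriched chunk is then embedded and stored in the vector DB, so a similarity hit on any fragment of the page still carries the page-level caveat.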
LLM HTML Comprehension
Once relevant chunks have been retrieved by our system, a mapper extracts the corresponding cleaned HTML. Borrowing inspiration from the paper "HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems", we designed our system to capitalize on the ability of modern LLMs to interpret HTML content. As the paper argues, modern LLMs have seen more than enough HTML in their training data to thoroughly understand its structure. Moreover, web-native content is best represented as HTML, which retains structured content the way website designers intended it to be interpreted. In our system, the cleaned HTML files are self-contained policy documents that fit within acceptable context token limits. These are passed asynchronously to LLMs instructed to carefully inspect the structure of each HTML page and surgically extract information relevant to the question. These surgical HTML comprehensions are then compiled into a precision summary that is passed to the final LLM for formatting a response to the posed question.
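The asynchronous fan-out can be sketched with `asyncio.gather` (the comprehension call is stubbed here; in the real system each call would prompt an LLM API with an HTML-inspection instruction):

```python
import asyncio

async def comprehend_html(question: str, html_doc: str) -> str:
    """One 'surgical' extraction over a single cleaned HTML document.
    Stubbed: a real implementation would await an LLM API call here."""
    await asyncio.sleep(0)  # stands in for the network round-trip
    return f"[extract from {len(html_doc)}-char doc for: {question}]"

async def precision_summary(question: str, html_docs: list[str]) -> str:
    """Fan out one comprehension call per document concurrently,
    then compile the extracts for the final answering LLM."""
    extracts = await asyncio.gather(
        *(comprehend_html(question, doc) for doc in html_docs)
    )
    return "\n".join(extracts)

summary = asyncio.run(precision_summary(
    "Who is eligible for the grant?",
    ["<html>doc one</html>", "<html>doc two</html>"],
))
```

Because each document is summarized by its own smaller LLM call, no single call has to hold every retrieved page in context at once.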
Based on our metrics (discussed below), our QA system with the HTML RAG architecture outperformed all other architectures we tested. Our base architecture - retrieval of context-aware extractions from a Qdrant vector database, fed directly to the final LLM - achieved a score of 5.1 on our QA Double Filter Score (QADF), a composite of a content capture score (CC) and a relevance and faithfulness score. On content capture alone it scored 5.08, meaning roughly half of all gold answer statements were captured. Notably, our base model already outperformed all competing solutions in the market. With hyperparameter and prompt tuning, the base architecture's performance rose to a peak of 6.12 on QADF and 6.64 on CC. When the architecture was further enhanced with our mapper and LLM HTML Comprehension system, scores were boosted significantly over the base: the final system achieved 7.59 on QADF and 8.25 on CC.
Figure 7: QA Experiment Performance Scores
Compared to existing solutions on the market, our system significantly outperforms on our Content Capture (CC) metric. We could not use the full QA Double Filter (QADF) Score here, as it requires the retrieved context, which is unavailable from these production systems. (Note that EP Buddy imposes character limits on its input text that rendered it incapable of answering longer and more complex queries - this prevented it from properly answering 10% of our gold answer set, which comprises 30 gold answers.)
Figure 8: QA Performance Comparisons
Figure 9: QA Sample Outputs
SQL Agent System Approach
To create a rich repository of potential addresses that the public may wish to look up for their future home, we joined several structured datasets from data.gov.sg, including HDB property information, flat attributes, and past resale transaction prices. For spatial datasets, we first performed coordinate system projections to ensure that all datasets shared the same projected coordinate system (PCS). We then performed a series of spatial feature engineering steps in the form of Point-of-Interest (POI) distance calculations and spatial joins. We used GeoPandas for all spatial data engineering, with the final output being a rich spatial repository of HDB property information for our SQL agent to address flat lookup queries. (Note: We did not include new flat sales, as these are launched exclusively by HDB, with information only made available at the point of announcement.)
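At its core, each POI distance feature is a nearest-neighbour distance computation (in GeoPandas this is typically done with `sjoin_nearest` on projected geometries). A plain-Python haversine sketch illustrates the idea; the coordinates below are hypothetical:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6_371_000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearest_poi_m(block, pois):
    """Distance (m) from one HDB block to its closest POI (e.g. an MRT exit)."""
    return min(haversine_m(block[0], block[1], p[0], p[1]) for p in pois)

# Hypothetical HDB block and MRT exit coordinates, as (lat, lon).
block = (1.3521, 103.8198)
exits = [(1.3500, 103.8210), (1.3600, 103.8300)]
```

In practice, projecting to a local PCS first (as we did) lets the distance be computed with fast planar geometry instead of spherical trigonometry.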
We implemented a SQL lookup agent using the LangGraph framework, with the goal of retrieving accurate housing property information from our custom database. The tasks of the SQL lookup agent are as follows:
Figure 10: SQL Agent Architecture
- Rewrite the user query to improve clarity for the LLM
- Fetch available tables from the database and decide which are relevant to the question
- Fetch the schemas for the relevant tables
- Generate a SQL query based on the user query and the fetched schemas
- Check the SQL query for common mistakes
- Execute the SQL query on the database and return the results
- Formulate a response based on the results
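The steps above can be sketched as a linear pipeline in plain Python. The `llm` and `db` interfaces below are stubs standing in for the LangGraph nodes and the real database; every name and value is illustrative:

```python
def run_sql_agent(user_query, db, llm):
    """Linear sketch of the SQL lookup agent's node sequence."""
    query = llm.rewrite(user_query)                     # 1. clarify the query
    tables = llm.pick_tables(db.list_tables(), query)   # 2. choose relevant tables
    schemas = {t: db.get_schema(t) for t in tables}     # 3. fetch their schemas
    sql = llm.write_sql(query, schemas)                 # 4. generate SQL
    sql = llm.check_sql(sql)                            # 5. lint common mistakes
    rows = db.execute(sql)                              # 6. run the query
    return llm.answer(query, rows)                      # 7. formulate a response

class StubLLM:
    """Deterministic stand-in for the LLM-backed steps."""
    def rewrite(self, q): return q.strip()
    def pick_tables(self, tables, q): return [t for t in tables if "resale" in t]
    def write_sql(self, q, schemas):
        return f"SELECT AVG(resale_price) FROM {next(iter(schemas))}"
    def check_sql(self, sql): return sql  # no-op lint in this sketch
    def answer(self, q, rows): return f"Average resale price: {rows[0][0]}"

class StubDB:
    """Stand-in for the structured HDB database."""
    def list_tables(self): return ["resale_prices", "mrt_exits"]
    def get_schema(self, t): return "(town TEXT, resale_price REAL)"
    def execute(self, sql): return [(520000.0,)]
```

In the actual system these steps are wired as LangGraph nodes with conditional edges (e.g. looping back on a failed SQL check) rather than a fixed sequence.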
Multi-Agent Assembly
Figure 11: Multi-Agent System
With our two specialized systems in place (the QA system and the SQL lookup system), we created a supervisor agent to link them together. The supervisor agent receives the user query and decides whether it should be passed to the QA system or the SQL lookup system. If it determines that both systems are required, it passes the task to each system sequentially. Once the supervisor receives the final responses from the systems, it returns the consolidated response to the user.
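A minimal sketch of the routing logic follows. The keyword router here is a toy stand-in: in the real system the supervisor is an LLM making this decision, not keyword rules:

```python
def route(query: str) -> list[str]:
    """Toy router standing in for the supervisor LLM's routing decision."""
    q = query.lower()
    targets = []
    if any(w in q for w in ("eligible", "policy", "grant")):
        targets.append("qa")   # conceptual / policy question
    if any(w in q for w in ("price", "flat", "near")):
        targets.append("sql")  # structured / geospatial lookup
    return targets or ["qa"]

def supervise(query, qa_agent, sql_agent):
    """Dispatch to one or both specialized systems, then consolidate."""
    targets = route(query)
    responses = []
    if "qa" in targets:
        responses.append(qa_agent(query))
    if "sql" in targets:
        responses.append(sql_agent(query))
    return "\n\n".join(responses)
```

A query touching both policy and property data thus triggers both sub-systems sequentially, and their answers are consolidated into one reply.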
QA System Evaluation
We custom-designed our metrics for our domain-specific system with 3 principles in mind:
- Capture: Gold answer (expert-crafted, n=30) content is key and must all be included
- Reward: The system is rewarded if relevant and faithful content not found in gold is included
- Penalize: Knock off points from the system for noise
Figure 12: QA Metrics Survey
Taking reference from our survey of existing metrics and metric architectures, we designed our metrics to cover the following:
- Content Capture: An evaluation of model responses against gold answers. In our approach, we want the strength of RAGAS' statement decomposition and observability, but with NVIDIA's inclusion of the user prompt for better gold answer assessment.
- Relevance & Faithfulness: In theory, content capture covers the majority of what we need. However, we wanted an additional way to reward the system if relevant and faithful content was generated that was not found in the gold answers. In this case, we crafted our own relevance and faithfulness metric prompts, focusing on enriching the metric prompts and architecture for HDB context and providing more specific illustrations of relevance and faithfulness.
While a number of off-the-shelf evaluation frameworks were already available, such as RAGAS and LangChain OpenEvals, we found them lacking in documentation. At the time of writing, frequent changes were being made to RAGAS' production code that rendered its documentation and code walkthroughs unusable. Our previous attempts at using RAGAS also failed to scale to our domain's specific evaluation needs. For example, RAGAS' decision not to include the user input prompt in its correctness metric architecture led to many wrongful dismissals of generated response statements when compared against gold statements. Given these, and other reasons we do not list here, we opted for a custom implementation.
Our custom implementation consists of the following approach:
- We borrow the statement decomposition methodology from Ragas, which we found helpful for more thorough and cohesive evaluations.
- Content Capture: We created a modified blend of RAGAS' and NVIDIA's approaches to factual correctness - we include the input prompt in the metric architecture as per NVIDIA's design, while ensuring explainability as in RAGAS' approach. After decomposing gold answers and generated responses into statements, we determine the number of gold answer statements fully captured by generated statements, using our own custom prompts and chains.
- Relevance and Faithfulness: We then take the additional generated statements not found in the gold answers and test them for relevance and faithfulness, so that plausible answers missing from the gold answers are still rewarded - statements failing this test penalize the system.
A demonstration of this metric can be found in the following diagram:
Figure 13: QA Double Filter Approach
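In arithmetic terms, the double-filter flow can be sketched as below. The scale and aggregation here are assumptions for illustration only, not the project's exact formula, and the two judge callables stand in for LLM judges:

```python
def qa_double_filter(gold_stmts, gen_stmts, is_captured, passes_filter):
    """Sketch of the two-pass (double filter) scoring flow.

    is_captured(gold, gens) -> bool: gold statement fully covered by generation.
    passes_filter(stmt)     -> bool: extra statement is relevant AND faithful.
    """
    # Filter 1: content capture - what fraction of gold statements survive?
    captured = [g for g in gold_stmts if is_captured(g, gen_stmts)]
    cc = 10 * len(captured) / len(gold_stmts)  # scored out of 10 (assumed scale)

    # Filter 2: extra generated statements are rewarded if relevant and
    # faithful; failing statements drag this score down as noise.
    extras = [s for s in gen_stmts if s not in gold_stmts]
    rewarded = sum(1 for s in extras if passes_filter(s))
    rf = 10 * rewarded / len(extras) if extras else 10.0

    return cc, rf  # a composite QADF would blend these two scores
```

The point of the second filter is asymmetry: extra content is not automatically noise, but it must earn its place by passing both relevance and faithfulness checks.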
SQL Agent Evaluation
To assess the performance of our SQL agent, we created a set of golden question-answer pairs based on the final structured dataset. These manually curated pairs represent accurate, ground-truth responses to a range of analytical queries, including average prices, trends, and geospatial comparisons.
We used these golden pairs to evaluate the correctness and completeness of the agent’s generated SQL queries and answers, allowing us to systematically measure model performance and guide further improvements.
Figure 14: SQL Agent Evaluation
The three metrics include:
- Correctness: Compares the final response from the agentic workflow against the final response in the gold answers, evaluated using a Claude model from Anthropic as an LLM judge.
- Routing Correct Score: Checks whether the right “tool”, i.e. agent (QA RAG agent or SQL lookup agent), was called during an end-to-end run.
- SQL Result Score: Compares the SQL query results from the SQL lookup agent against the SQL query results in the gold answers. Two versions of the metric were created: an exact match score, which requires exact textual agreement between the agent's results and the gold answers, and a fuzzy match score, which measures semantic similarity between them using the Sentence Transformers library.
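The two match modes can be sketched as follows. The project computes semantic similarity with Sentence Transformers embeddings; `difflib` serves here as a dependency-free, character-level stand-in:

```python
import difflib

def exact_match_score(pred: str, gold: str) -> float:
    """1.0 only when the result strings agree exactly (case-insensitive)."""
    return float(pred.strip().lower() == gold.strip().lower())

def fuzzy_match_score(pred: str, gold: str) -> float:
    """Similarity in [0, 1]. A semantic embedding model (Sentence
    Transformers, cosine similarity) replaces this character-level
    ratio in the actual evaluation."""
    return difflib.SequenceMatcher(None, pred.lower(), gold.lower()).ratio()
```

The fuzzy variant matters because SQL results often differ only in formatting (e.g. `520000` vs `520,000`), which exact matching scores as a total miss.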
Figure 15: Agent Scoreboard. An example of what the evaluations look like within LangSmith where each column represents the metrics being measured (From left to right: Correctness, Routing Score, SQL Result Score (fuzzy)) while each row is the data that is being tested on.
Key Learnings and Impact
- Intricate and complex website HTML structures pose a significant challenge to traditional RAG systems
- Context-aware chunks offer a strong solution, delivering good performance with traditional RAG, but require non-trivial manual effort to hardcode extraction rules
- When coupled with LLM HTML Comprehension, existing systems can be given substantial boosts in performance compared to traditional RAG.
- HTML RAG's greatest limitation is the requirement to pass a large HTML context to the LLM, potentially overloading its context
- We propose a proven solution to work around the context problem by passing multiple HTML files to individual smaller LLMs (adept at HTML interpretation) to generate precision summaries of the context for the input question, in an asynchronous manner
- Current metric frameworks for the RAG QA landscape remain fragmented - we developed a proprietary solution that blends existing metrics from RAGAS and NVIDIA
