AI-Driven Image Crawler for Hospitality Content Optimization
Our partner is a leading provider of digital solutions for the hospitality sector. They aimed to improve the visual content on their platform by automating the process of collecting, filtering, and labeling hotel images from websites.
The goal was to reduce manual work, ensure that only the most relevant images are showcased, and ultimately improve the user experience on hotel listings.
Solution
Unidatalab developed an AI-powered, serverless image crawler leveraging AWS services. The solution uses advanced machine learning models to automate every step of the process: from extracting images to filtering and scoring them for relevance.
How it works
The pipeline starts with image scrapers that extract image URLs from the specified hotel websites. They collect images from different sections of each site, including galleries and main pages.
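As a rough illustration of the extraction step, the sketch below pulls image URLs out of a single HTML page using only the Python standard library. The class and function names are hypothetical, and the preference for a `data-src` attribute over `src` is an assumption about lazy-loaded galleries, not a detail of the actual crawler.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class ImageURLParser(HTMLParser):
    """Collects absolute image URLs from <img> tags in one HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Assumption: lazy-loaded galleries keep the real URL in data-src.
        src = attr_map.get("data-src") or attr_map.get("src")
        if not src:
            return
        url = urljoin(self.base_url, src)  # resolve relative paths
        if url not in self.urls:           # de-duplicate, keep page order
            self.urls.append(url)


def extract_image_urls(html: str, base_url: str) -> List[str]:
    """Return the unique image URLs found in `html`, in document order."""
    parser = ImageURLParser(base_url)
    parser.feed(html)
    return parser.urls
```

A real crawler would also need to fetch the pages, follow gallery links, and look at CSS backgrounds or `srcset` attributes; those parts are left out here.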
Images are then filtered by size, format, and naming conventions (e.g., files that look like logos or icons are ignored), so that only images likely to be usable are processed further.
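The filtering rules could look something like the following sketch. The allowed extensions, blocked name fragments, and minimum dimensions are illustrative assumptions; the thresholds used in the project are not stated here.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import urlparse

# All three settings below are assumed values for illustration.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
BLOCKED_NAME_PARTS = ("logo", "icon", "sprite", "favicon")
MIN_WIDTH, MIN_HEIGHT = 400, 300


@dataclass(frozen=True)
class ImageCandidate:
    url: str
    width: int   # pixels
    height: int  # pixels


def is_candidate_acceptable(img: ImageCandidate) -> bool:
    """Apply the format, naming, and size checks in that order."""
    path = PurePosixPath(urlparse(img.url).path)  # ignores the query string
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        return False
    name = path.name.lower()
    if any(part in name for part in BLOCKED_NAME_PARTS):
        return False
    return img.width >= MIN_WIDTH and img.height >= MIN_HEIGHT
```

Substring matching on file names is deliberately crude: it is cheap to run before any image analysis, at the cost of occasionally rejecting a legitimate file whose name happens to contain one of the blocked fragments.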
Amazon Rekognition analyzes each image and extracts semantic labels that describe its visual content. An LLM hosted on Amazon Bedrock then scores each image against the hotel’s description and assigns a numeric relevance score.
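The sketch below shows one way the glue between the two services might look, without making any AWS calls. It reads label names from a dictionary shaped like a Rekognition `DetectLabels` response (a `Labels` list whose items carry `Name` and `Confidence`), builds a scoring prompt, and parses a number out of the model's reply. The prompt wording, the 0 to 10 scale, and the confidence threshold are all assumptions made for illustration.

```python
import re
from typing import List, Optional


def label_names(detect_labels_response: dict, min_confidence: float = 80.0) -> List[str]:
    """Keep the names of labels at or above an (assumed) confidence threshold."""
    return [
        label["Name"]
        for label in detect_labels_response.get("Labels", [])
        if label.get("Confidence", 0.0) >= min_confidence
    ]


def build_scoring_prompt(hotel_description: str, labels: List[str]) -> str:
    """Hypothetical prompt asking the LLM for a single relevance number."""
    return (
        "Hotel description:\n"
        f"{hotel_description}\n\n"
        f"Image labels: {', '.join(labels)}\n\n"
        "On a scale from 0 to 10, how relevant is this image to the hotel? "
        "Reply with the number only."
    )


def parse_score(reply: str) -> Optional[float]:
    """Extract the first number in the reply and clamp it to the 0-10 scale."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None  # the model did not return a usable score
    return min(max(float(match.group()), 0.0), 10.0)
```

Parsing defensively matters because an LLM may wrap the number in extra text; returning `None` lets the caller decide whether to retry or discard the image.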
All images and their associated metadata (labels, relevance scores, image source details) are stored in a structured format in Amazon S3. A callback mechanism notifies the client when processing is complete and shares the detailed image metadata so it can be integrated on the client side.
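To make the stored metadata and the callback more concrete, here is a minimal sketch of a per-image record and the JSON body a callback might carry. The field names, the object key layout, and the `status` value are hypothetical; the actual schema is not described in this case study.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ImageRecord:
    source_url: str          # where the image was found
    labels: List[str]        # semantic labels from the labeling step
    relevance_score: float   # numeric score from the LLM step


def metadata_key(hotel_id: str, source_url: str) -> str:
    """Assumed object key layout: one JSON document per image, per hotel."""
    digest = hashlib.sha256(source_url.encode("utf-8")).hexdigest()[:16]
    return f"hotels/{hotel_id}/images/{digest}.json"


def build_callback_payload(hotel_id: str, records: List[ImageRecord]) -> str:
    """Serialize the completion notification sent to the client."""
    images = [
        {**asdict(record), "metadata_key": metadata_key(hotel_id, record.source_url)}
        for record in records
    ]
    payload = {
        "hotel_id": hotel_id,
        "status": "completed",
        "image_count": len(images),
        "images": images,
    }
    return json.dumps(payload, sort_keys=True)
```

Deriving the key from a hash of the source URL keeps it stable across re-crawls, so the same image maps to the same metadata object.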
The crawler includes an optional scheduler to automatically re-crawl hotel websites on a set schedule (e.g., weekly or monthly), keeping the platform’s image library current.
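The decision behind scheduled re-crawling can be reduced to a small check like the one below. The schedule names and the 30-day approximation of "monthly" are assumptions; in a serverless deployment a scheduled trigger would presumably run a check of this kind, but that wiring is not shown.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed mapping from schedule name to interval; "monthly" is approximated.
INTERVALS = {
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
}


def is_recrawl_due(last_crawled_at: Optional[datetime], schedule: str, now: datetime) -> bool:
    """Return True if the site was never crawled or its interval has elapsed."""
    if last_crawled_at is None:
        return True
    return now - last_crawled_at >= INTERVALS[schedule]


# Example: a site crawled 7 days ago on a weekly schedule is due again.
now = datetime(2024, 1, 15, tzinfo=timezone.utc)
print(is_recrawl_due(datetime(2024, 1, 8, tzinfo=timezone.utc), "weekly", now))
```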
Our challenges:
Time-consuming manual image collection
Previously, collecting and curating high-quality hotel images was a labor-intensive process. This manual approach was prone to errors and inconsistencies, leading to incomplete or irrelevant visual content for hotels on the platform.
Inconsistent image quality and relevance
Not all images collected from hotel websites were suitable for showcasing to customers. Filtering out low-quality or irrelevant images required significant manual review, making it challenging to maintain a polished and visually engaging digital presence.
Scalability and reliability concerns
With the constant addition of new hotel partners and updates to existing listings, manual curation was not scalable. A more automated and intelligent approach was needed to keep the platform up to date with fresh and relevant imagery.
Project stages
We started by developing the Python-based crawler and integrating it with AWS services for image filtering, labeling, and relevance scoring. A REST API microservice was designed to trigger the entire pipeline upon receiving a hotel URL.
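A request that triggers the pipeline has to be validated before any crawling starts. The sketch below shows one possible validation step for the request body; the `hotel_url` and `callback_url` field names and the `InvalidRequest` exception are hypothetical and stand in for whatever contract the real API defines.

```python
import json
from urllib.parse import urlparse


class InvalidRequest(ValueError):
    """Raised when the trigger request body cannot be accepted."""


def _require_http_url(payload: dict, field: str) -> str:
    value = payload.get(field)
    if not isinstance(value, str):
        raise InvalidRequest(f"'{field}' must be a string")
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise InvalidRequest(f"'{field}' must be an http(s) URL")
    return value


def parse_trigger_request(body: str) -> dict:
    """Validate the JSON body and return the fields the pipeline needs."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise InvalidRequest("body is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise InvalidRequest("body must be a JSON object")
    job = {"hotel_url": _require_http_url(payload, "hotel_url")}
    if "callback_url" in payload:  # assumed to be optional
        job["callback_url"] = _require_http_url(payload, "callback_url")
    return job
```

Rejecting malformed input at the API boundary keeps the later, more expensive stages from running on URLs that cannot be crawled.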
In the next phase, we addressed edge cases, such as websites with unusual structures or access restrictions. We moved the system to a fully serverless architecture on AWS Lambda, added caching of processed images to avoid redundant work, and introduced scheduled crawling to keep datasets current.
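Caching processed images can be sketched as a lookup keyed by a hash of the image content, so that an unchanged image found on a later crawl is not labeled and scored again. This in-memory version is only an illustration under that assumption; a serverless system would need a persistent store, and the project's actual cache key and storage are not described here.

```python
import hashlib
from typing import Callable, Dict


class ProcessedImageCache:
    """Maps a content fingerprint to the result of processing that image."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}

    @staticmethod
    def fingerprint(image_bytes: bytes) -> str:
        # Identical bytes give an identical key, regardless of the source URL.
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_process(self, image_bytes: bytes, process: Callable[[bytes], dict]) -> dict:
        """Run `process` only for images that have not been seen before."""
        key = self.fingerprint(image_bytes)
        if key not in self._results:
            self._results[key] = process(image_bytes)
        return self._results[key]
```

Keying on content rather than URL also covers the case where a hotel site serves the same file from several pages.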