
Building an AI Web Classifier at Enterprise Scale
How Web Categorization Works
Have you ever clicked on a webpage at your workplace and been blocked because access is restricted? Or, if you don't have an ad-blocker installed, have you noticed that the ads on a page often match its content?
This is done using website categorization services that provide ad-tech platforms, web filters, and more with data about the content of these web pages.
Hello, I’m Dan, and today I'd like to share some insights we’ve gained while building an enterprise-grade AI categorization service, WebClassifAI.
Who Relies on Web Categorization?
The usefulness of website categorization extends far beyond these examples, supporting a broad range of services.
A non-exhaustive list includes DSPs (Demand-Side Platforms), SSPs (Supply-Side Platforms), web filters, SEO tools & experts, brand safety providers, banks & fintech companies, schools, mobile carriers & broadband providers, and affiliate marketing platforms.
With such diversity, a single solution cannot fulfill the needs of every service. For example, web filters require millisecond response times, while brand-safety platforms demand real-time data with high prediction accuracy.
That is why web categorization products offer different services, such as offline databases, text-only content scraping, or a more comprehensive analysis that includes a web page's text and images.
Under the Hood: The Core Architecture
While the final product can vary depending on the use case, the core implementation remains the same.

In its simplest form, every application of this type consists of two main parts: web content extraction and data categorization. Let's explore each of these parts in the following sections.
Step 1: Harvesting Web Content
What distinguishes a standard web classifier from an enterprise-grade one is its reliability in consistently extracting content from a wide variety of websites.
Here, you have two options: use a third-party scraping provider (which is more expensive) or build a custom scraper.
For the latter, you'll need different implementations depending on the content you want to scrape.
If you only need the HTML text and metadata like title, description, keywords, and tags, you can create the scraper using simple libraries for executing HTTP(S) requests, such as the ‘requests’ library for Python, or Axios, Cheerio, and the native fetch function for JavaScript.
If you also need to extract dynamic content, which is loaded by JavaScript, you will need to use a headless browser like Playwright, Puppeteer, or Selenium.
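As a rough sketch of the first option, the snippet below fetches a page with Python's requests library and pulls out the title and common meta tags. BeautifulSoup is used here purely for parsing and is an assumption on my part; any HTML parser would work.

```python
import requests
from bs4 import BeautifulSoup

def fetch_static_metadata(url: str, timeout: float = 10.0) -> dict:
    """Fetch a page over HTTP(S) and extract its title and basic meta tags."""
    response = requests.get(
        url,
        timeout=timeout,
        headers={"User-Agent": "Mozilla/5.0 (compatible; example-classifier/1.0)"},
    )
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    description = soup.find("meta", attrs={"name": "description"})
    keywords = soup.find("meta", attrs={"name": "keywords"})

    return {
        "url": url,
        "title": soup.title.string.strip() if soup.title and soup.title.string else "",
        "description": description.get("content", "") if description else "",
        "keywords": keywords.get("content", "") if keywords else "",
        "html": response.text,  # kept for the content-extraction step later
    }
```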
The main challenge here is bypassing protections such as Cloudflare, country restrictions, or other anti-bot measures. For this, you will need to implement techniques like rotating proxies and user agents, setting appropriate request headers, and adding sensible delays between requests to the same domain. This applies to both methods described above.
Of course, a retry mechanism must be in place to handle connection, network, or other retryable errors.
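Here is a minimal sketch of what such a retry mechanism might look like. The user-agent pool, status codes, and backoff values are illustrative assumptions; a production scraper would combine this with rotating proxies.

```python
import random
import time

import requests

# Illustrative pool; in production this would be much larger and paired with rotating proxies.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def fetch_with_retries(url: str, max_attempts: int = 3, base_delay: float = 2.0) -> requests.Response:
    """Fetch a URL, retrying on network errors and retryable HTTP status codes."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            response = requests.get(
                url,
                timeout=10,
                headers={"User-Agent": random.choice(USER_AGENTS)},
            )
            if response.status_code not in RETRYABLE_STATUS:
                return response
            last_error = RuntimeError(f"HTTP {response.status_code} for {url}")
        except requests.RequestException as exc:  # connection errors, timeouts, etc.
            last_error = exc
        # Exponential backoff between attempts to avoid hammering the same domain.
        time.sleep(base_delay * (2 ** attempt))
    raise last_error
```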
Step 2: From Noise to Signal
Once the content has been extracted, you cannot send it directly to the AI/ML algorithm.
Only the significant content, such as the article body, should be sent to the model. All of the "junk," like HTML tags, cookie modals, and navigation and footer links, needs to be removed beforehand.
There are multiple open-source libraries capable of extracting the main content, such as ReadabilityJS, Trafilatura, Dragnet, or Goose3. Some, like Trafilatura, specialize in news articles or blogs, while others, such as ReadabilityJS, are better for JavaScript-heavy pages.
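For example, with Trafilatura the extraction step can be as small as the sketch below; it is shown as one option among the libraries above, not a recommendation.

```python
import trafilatura

def extract_main_text(html: str, url: str | None = None) -> str:
    """Strip boilerplate (navigation, footers, cookie banners) and keep the main body text."""
    extracted = trafilatura.extract(html, url=url, include_comments=False)
    return extracted or ""

# Trafilatura can also fetch the page itself if you skip the custom scraper:
downloaded = trafilatura.fetch_url("https://example.com/some-article")
main_text = extract_main_text(downloaded) if downloaded else ""
```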
This part becomes especially difficult on websites with a messy structure where the important content is found primarily in images or videos. For these cases, more complex methods are needed, such as taking a screenshot of the page and analyzing it with Computer Vision algorithms.
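A minimal sketch of the screenshot half of that approach, using Playwright; the Computer Vision or multimodal model that analyzes the resulting image is out of scope here.

```python
from playwright.sync_api import sync_playwright

def capture_screenshot(url: str, path: str = "page.png") -> str:
    """Render the page in a headless browser and save a full-page screenshot."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=path, full_page=True)
        browser.close()
    return path
```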
An Optional Step: Handling Multiple Languages
If a web page's content is not in English, you may need to add a translation step, especially if your classifier is an ML model trained exclusively on English data.
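One possible sketch of such a step is shown below, assuming langdetect for language detection and a Hugging Face translation model. The model name is only an example; in practice you would pick models per source language and handle long texts more carefully.

```python
from langdetect import detect
from transformers import pipeline

# Example multilingual-to-English model; chosen here only for illustration.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-mul-en")

def to_english(text: str) -> str:
    """Translate non-English page text to English before classification."""
    if detect(text) == "en":
        return text
    # Crude fixed-size chunking to stay within the model's input limit.
    chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
    return " ".join(out["translation_text"] for out in translator(chunks))
```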
Step 3: The Classification Engine
Finally, we have arrived at the core of the service, which classifies the content according to a given taxonomy. Here, there are multiple options depending on the clients' needs.
The Keyword-Based Approach: Heuristics
You could create a heuristic algorithm using bag-of-words techniques to classify content based on category-related keywords.
This is a fast and easy way to categorize content, and its key advantage is that you can add new categories without much complexity.
The disadvantage is that the algorithm doesn't consider the semantic context. For example, a page containing the word 'apple' could be about the fruit or the technology company; a simple keyword-based approach would struggle to differentiate between the two.
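A toy version of such a heuristic might look like the following. The categories and keyword lists are made up for illustration; a real taxonomy would have hundreds of categories with carefully curated vocabularies.

```python
import re
from collections import Counter

# Illustrative keyword lists; a real taxonomy would be far larger.
CATEGORY_KEYWORDS = {
    "Technology": {"software", "smartphone", "app", "cloud", "startup"},
    "Food & Drink": {"recipe", "restaurant", "flavor", "ingredient", "dessert"},
    "Sports": {"match", "league", "goal", "tournament", "coach"},
}

def classify_by_keywords(text: str) -> str:
    """Return the category whose keywords occur most often in the text."""
    tokens = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = {
        category: sum(tokens[word] for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best_category, best_score = max(scores.items(), key=lambda item: item[1])
    return best_category if best_score > 0 else "Uncategorized"
```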
The Balanced Approach: Traditional ML
A better approach is to build a Machine Learning algorithm that understands context, not just category-specific words. However, building such a system is considerably more complex.
This approach is recommended if your application requires a good balance between performance and high prediction accuracy.
The dataset is built by running each URL through the extraction steps described earlier and then carefully labelling the resulting content according to one or more taxonomies (such as the IAB Content Taxonomy). With this dataset, you can train multiple algorithms, such as Naive Bayes, Support Vector Machines, Random Forests, and 1D neural networks. No single algorithm is universally better than another; the choice depends heavily on the task at hand (dataset size, number of categories, etc.). However, we have found SVMs to perform very well for URL classification.
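As an illustration, a TF-IDF plus linear SVM baseline takes only a few lines with scikit-learn. The hyperparameters below are reasonable defaults rather than tuned values, and `train_texts`/`train_labels` stand in for the labelled dataset described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_text_classifier() -> Pipeline:
    """TF-IDF over word unigrams and bigrams feeding a linear SVM."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)),
        ("svm", LinearSVC(C=1.0)),
    ])

# `train_texts` are extracted page contents, `train_labels` their taxonomy categories.
# model = build_text_classifier()
# model.fit(train_texts, train_labels)
# predicted_categories = model.predict(test_texts)
```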
The State-of-the-Art: Transformers and LLMs
If high precision and reasoning are your main requirements, then training a transformer model or fine-tuning a Large Language Model (LLM) will yield the best results.
Their biggest advantage, particularly for multimodal models, is the ability to analyze a website's images alongside its text, which significantly increases prediction accuracy on pages with sparse text.
Here, depending on your budget, you have many options: from using pre-trained models like BERT, RoBERTa, and Longformer (which are in the 100M-parameter range) to larger models like Llama 3, DeepSeek, Gemini 1.5 Flash, or GPT-4o.
But these also come with disadvantages: larger models mean higher infrastructure costs and greater latency. These issues are exacerbated when choosing a cloud-based API instead of a self-hosted solution.
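For a sense of how little code a transformer-based prototype needs, here is a zero-shot classification sketch using Hugging Face's pipeline API with an off-the-shelf NLI model. This is a way to test a taxonomy before investing in fine-tuning; the candidate labels and input text are illustrative.

```python
from transformers import pipeline

# Off-the-shelf NLI model used for zero-shot classification; no fine-tuning required.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

candidate_labels = ["Technology & Computing", "Food & Drink", "Sports", "Personal Finance"]

result = classifier(
    "The new smartphone ships with an upgraded camera and a faster chipset.",
    candidate_labels=candidate_labels,
    multi_label=True,  # a page can belong to several categories at once
)
print(list(zip(result["labels"], result["scores"])))
```

Fine-tuning a model like BERT or RoBERTa on the labelled dataset from the previous section will generally outperform a zero-shot baseline like this, at the cost of the training pipeline and infrastructure discussed above.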
Bringing It All Together
That was a high-level overview of what we've learned about website categorization while creating WebClassifAI, an API built for businesses and services that need real-time, high-precision classification for custom or IAB taxonomies.
We love that passionate ML engineers are interested in this subject. If you have any questions or want to learn more, send us a message at contact@webclassifai.com. Happy coding! 🚀