com.norconex.collector.http.handler
Interface IURLNormalizer

All Superinterfaces:
Serializable
All Known Implementing Classes:
GenericURLNormalizer

public interface IURLNormalizer
extends Serializable

Responsible for normalizing URLs. Normalization takes a raw URL and rewrites it into its most basic or standard form, so that different URLs become "equivalent". This eliminates URL variations that point to the same content (e.g. URLs carrying temporary session information).

Normalization takes place right after URLs are extracted from a document, before any of these URLs is considered for further processing. Returning null effectively tells the crawler not to consider the URL for processing at all (it won't go through the regular document processing flow). To exclude URLs as part of the regular document processing flow instead, consider IURLFilter (it may leave a trace in the logs and gives you more options).

Implementors that also implement IXMLConfigurable must name their XML tag urlNormalizer to ensure it gets loaded properly.
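As an illustration of the contract described above, here is a minimal sketch of a custom implementation. It is not the behavior of GenericURLNormalizer. The class name SessionStrippingNormalizer and its rules (strip a ";jsessionid=" path parameter, lower-case the scheme and host, drop the fragment and any user info, return null for anything that is not an absolute http or https URL) are assumptions chosen for the example, and the interface is redeclared locally only so the sketch compiles on its own.

```java
import java.io.Serializable;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Local stand-in for com.norconex.collector.http.handler.IURLNormalizer,
// redeclared here only so this sketch is self-contained.
interface IURLNormalizer extends Serializable {
    String normalizeURL(String url);
}

// Hypothetical implementation; name and rules are assumptions for illustration.
class SessionStrippingNormalizer implements IURLNormalizer {
    private static final long serialVersionUID = 1L;

    @Override
    public String normalizeURL(String url) {
        if (url == null) {
            return null;
        }
        // Remove a ";jsessionid=..." path parameter (case-insensitive),
        // up to the start of the query string or fragment.
        String cleaned = url.trim().replaceAll("(?i);jsessionid=[^?#]*", "");
        try {
            URI uri = new URI(cleaned);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return null; // relative or opaque URL: tell the crawler to skip it
            }
            scheme = scheme.toLowerCase(Locale.ROOT);
            if (!scheme.equals("http") && !scheme.equals("https")) {
                return null; // e.g. ftp: links are skipped in this sketch
            }
            // Rebuild from raw (still-encoded) parts; the fragment and
            // any user info are intentionally left out.
            StringBuilder sb = new StringBuilder();
            sb.append(scheme).append("://").append(host.toLowerCase(Locale.ROOT));
            if (uri.getPort() != -1) {
                sb.append(':').append(uri.getPort());
            }
            if (uri.getRawPath() != null) {
                sb.append(uri.getRawPath());
            }
            if (uri.getRawQuery() != null) {
                sb.append('?').append(uri.getRawQuery());
            }
            return sb.toString();
        } catch (URISyntaxException e) {
            return null; // unparsable input is skipped rather than propagated
        }
    }
}
```

With these assumed rules, "HTTP://Example.COM/page;jsessionid=ABC123?x=1#top" and "http://example.com/page?x=1" both normalize to the same string, while a mailto: link yields null and is never considered by the crawler.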

Author:
Pascal Essiembre

Method Summary
 String normalizeURL(String url)
          Normalize the given URL.
 

Method Detail

normalizeURL

String normalizeURL(String url)
Normalize the given URL.

Parameters:
url - the URL to normalize
Returns:
the normalized URL
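The following caller-side sketch shows how the return value might be consumed: null results are dropped and equivalent URLs collapse into a single entry. This is not the collector's actual code. The class ExtractedUrlSketch and the helper normalizeAll are hypothetical names, and the nested interface is a local stand-in that mirrors the signature documented above.

```java
import java.io.Serializable;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder class; nothing here is part of the collector itself.
class ExtractedUrlSketch {

    // Local stand-in mirroring the documented interface signature.
    interface IURLNormalizer extends Serializable {
        String normalizeURL(String url);
    }

    // Normalizes each extracted URL, skips those the normalizer rejects
    // with null, and keeps one entry per normalized form in first-seen order.
    static Set<String> normalizeAll(List<String> extracted, IURLNormalizer normalizer) {
        Set<String> unique = new LinkedHashSet<>();
        for (String raw : extracted) {
            String normalized = normalizer.normalizeURL(raw);
            if (normalized != null) {
                unique.add(normalized);
            }
        }
        return unique;
    }
}
```

Using a set here is a design choice for the example only: it makes the "equivalent URLs" idea visible, since two raw URLs that normalize to the same string occupy one slot.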


Copyright © 2009-2013 Norconex Inc. All Rights Reserved.