com.norconex.collector.http.handler.impl
Class GenericURLNormalizer

java.lang.Object
  extended by com.norconex.collector.http.handler.impl.GenericURLNormalizer
All Implemented Interfaces:
IURLNormalizer, IXMLConfigurable, Serializable

public class GenericURLNormalizer
extends Object
implements IURLNormalizer, IXMLConfigurable

Generic implementation of IURLNormalizer that should satisfy most URL normalization needs. This implementation relies on URLNormalizer. Please refer to it for complete documentation and examples.

By default, this class applies these RFC 3986 normalizations:

To override this default, specify a new list of normalizations to apply, either via the setNormalizations(GenericURLNormalizer.Normalization...) method or via XML configuration. Each normalization is identified by a code name. Refer to URLNormalizer for the complete list of supported code names and a full description of each.

In addition, this class allows you to specify any number of URL value replacements using regular expressions.

XML configuration usage:

  <urlNormalizer class="com.norconex.collector.http.handler.impl.GenericURLNormalizer">
    <normalizations>
      (normalization code names, comma-separated)
    </normalizations>
    <replacements>
      <replace>
         <match>(regex pattern to match)</match>
         <replacement>(optional replacement value, defaults to blank)</replacement>
      </replace>
      (... repeat replace tag as needed ...)
    </replacements>
  </urlNormalizer>
 

Example:

  <urlNormalizer class="com.norconex.collector.http.handler.impl.GenericURLNormalizer">
    <normalizations>
      lowerCaseSchemeHost, upperCaseEscapeSequence, removeDefaultPort, 
      removeDotSegments, removeDirectoryIndex, removeFragment, addWWW 
    </normalizations>
    <replacements>
      <replace><match>&amp;view=print</match></replace>
      <replace>
         <match>(&amp;type=)(summary)</match>
         <replacement>$1full</replacement>
      </replace>
    </replacements>
  </urlNormalizer>
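
The effect of the two replacements in the example above can be illustrated with plain java.util.regex semantics. This is only a sketch: it assumes each replace entry behaves like String.replaceAll applied in declaration order, with a missing replacement treated as blank. The class ReplacementSketch, the method applyReplacements, and the example.com URL are hypothetical and are not part of this API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReplacementSketch {

    // Applies each (regex, replacement) pair in insertion order.
    // A null replacement is treated as blank, mirroring the optional
    // <replacement> tag (assumption, see lead-in).
    static String applyReplacements(String url, Map<String, String> rules) {
        String result = url;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String replacement = rule.getValue() == null ? "" : rule.getValue();
            result = result.replaceAll(rule.getKey(), replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same two rules as the XML example, written as Java regex strings.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("&view=print", null);           // strip the print-view flag
        rules.put("(&type=)(summary)", "$1full"); // keep group 1, swap the value

        String url = "http://example.com/doc?id=1&view=print&type=summary";
        // prints http://example.com/doc?id=1&type=full
        System.out.println(applyReplacements(url, rules));
    }
}
```

Note that the ampersand is written as &amp; only inside the XML file. Once the configuration is parsed, the pattern handed to the regular expression engine contains a plain & character, as in the Java strings above.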

Author:
Pascal Essiembre
See Also:
Serialized Form

Nested Class Summary
static class GenericURLNormalizer.Normalization
           
 class GenericURLNormalizer.Replace
           
 
Constructor Summary
GenericURLNormalizer()
           
 
Method Summary
 boolean equals(Object obj)
           
 GenericURLNormalizer.Normalization[] getNormalizations()
           
 GenericURLNormalizer.Replace[] getReplaces()
           
 int hashCode()
           
 void loadFromXML(Reader in)
           
 String normalizeURL(String url)
          Normalize the given URL.
 void saveToXML(Writer out)
           
 void setNormalizations(GenericURLNormalizer.Normalization... normalizations)
           
 void setReplaces(GenericURLNormalizer.Replace... replaces)
           
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

GenericURLNormalizer

public GenericURLNormalizer()
Method Detail

normalizeURL

public String normalizeURL(String url)
Description copied from interface: IURLNormalizer
Normalize the given URL.

Specified by:
normalizeURL in interface IURLNormalizer
Parameters:
url - the URL to normalize
Returns:
the normalized URL
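
To give a feel for what a normalized URL looks like, the following sketch reproduces four of the normalizations named in the XML example (lowerCaseSchemeHost, removeDefaultPort, removeDotSegments, removeFragment) using only java.net.URI. It does not show how this class or URLNormalizer implements them. The class Rfc3986Sketch and its normalize method are hypothetical, and the sketch assumes an absolute http or https URL with a host.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class Rfc3986Sketch {

    static String normalize(String url) throws URISyntaxException {
        // URI.normalize() collapses "." and ".." path segments.
        URI uri = new URI(url).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Absolute URL with host expected: " + url);
        }
        // Scheme and host are case-insensitive, so lower-case both.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        int port = uri.getPort();
        boolean defaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);

        // Rebuild from the raw (still escaped) components, leaving out
        // the default port and the fragment.
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(host);
        if (port != -1 && !defaultPort) {
            sb.append(':').append(port);
        }
        sb.append(uri.getRawPath());
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // prints http://example.com/a/c?x=1
        System.out.println(normalize("HTTP://Example.COM:80/a/./b/../c?x=1#top"));
    }
}
```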

getNormalizations

public GenericURLNormalizer.Normalization[] getNormalizations()

setNormalizations

public void setNormalizations(GenericURLNormalizer.Normalization... normalizations)

getReplaces

public GenericURLNormalizer.Replace[] getReplaces()

setReplaces

public void setReplaces(GenericURLNormalizer.Replace... replaces)

loadFromXML

public void loadFromXML(Reader in)
Specified by:
loadFromXML in interface IXMLConfigurable

saveToXML

public void saveToXML(Writer out)
               throws IOException
Specified by:
saveToXML in interface IXMLConfigurable
Throws:
IOException

hashCode

public int hashCode()
Overrides:
hashCode in class Object

equals

public boolean equals(Object obj)
Overrides:
equals in class Object

toString

public String toString()
Overrides:
toString in class Object


Copyright © 2009-2013 Norconex Inc. All Rights Reserved.