java.lang.Object
  com.norconex.collector.http.handler.impl.GenericURLNormalizer
public class GenericURLNormalizer
Generic implementation of IURLNormalizer that should satisfy most URL normalization needs. This implementation relies on URLNormalizer. Please refer to it for complete documentation and examples.
By default, this class applies these RFC 3986 normalizations:
To override this default, specify a new list of normalizations to apply, via the setNormalizations(GenericURLNormalizer.Normalization...) method or via XML configuration. Each normalization is identified by a code name. The following is the complete list of code names for supported normalizations. See URLNormalizer for a full description of each:
lowerCaseSchemeHost
upperCaseEscapeSequence
decodeUnreservedCharacters
removeDefaultPort
addTrailingSlash
removeDotSegments
removeDirectoryIndex
removeFragment
replaceIPWithDomainName
unsecureScheme
secureScheme
removeDuplicateSlashes
removeWWW
addWWW
sortQueryParameters
removeEmptyParameters
removeTrailingQuestionMark
removeSessionIds
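To make a few of these code names concrete, the following is a minimal sketch of what lowerCaseSchemeHost, removeDefaultPort and removeFragment do to a URL. It uses only java.net.URI and is not the URLNormalizer implementation; the class name NormalizationSketch and its method are hypothetical, and the sketch assumes an absolute http or https URL with a host and no user info.

```java
import java.net.URI;
import java.util.Locale;

/** Illustrative only: NOT the URLNormalizer code this class relies on. */
class NormalizationSketch {

    /** Applies three of the listed normalizations to an absolute http(s) URL. */
    static String normalize(String url) {
        URI uri = URI.create(url);

        // lowerCaseSchemeHost: scheme and host are case-insensitive (RFC 3986).
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // removeDefaultPort: drop :80 for http and :443 for https.
        int port = uri.getPort();
        boolean isDefaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);

        StringBuilder out = new StringBuilder();
        out.append(scheme).append("://").append(host);
        if (port != -1 && !isDefaultPort) {
            out.append(':').append(port);
        }
        if (uri.getRawPath() != null) {
            out.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        // removeFragment: the "#..." part is simply not copied over.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://WWW.Example.com:80/a/b.html?x=1#top"));
    }
}
```

Path and query are copied in their raw (still encoded) form so the sketch does not accidentally change escape sequences, which are the concern of separate normalizations such as upperCaseEscapeSequence.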
In addition, this class allows you to specify any number of URL value replacements using regular expressions.
XML configuration usage:
<urlNormalizer class="com.norconex.collector.http.handler.impl.GenericURLNormalizer">
  <normalizations>
    (normalization code names, comma separated)
  </normalizations>
  <replacements>
    <replace>
      <match>(regex pattern to match)</match>
      <replacement>(optional replacement value, defaults to blank)</replacement>
    </replace>
    (... repeat replace tag as needed ...)
  </replacements>
</urlNormalizer>
Example:
<urlNormalizer class="com.norconex.collector.http.handler.impl.GenericURLNormalizer">
  <normalizations>
    lowerCaseSchemeHost, upperCaseEscapeSequence, removeDefaultPort,
    removeDotSegments, removeDirectoryIndex, removeFragment, addWWW
  </normalizations>
  <replacements>
    <replace><match>&amp;view=print</match></replace>
    <replace>
      <match>(&amp;type=)(summary)</match>
      <replacement>$1full</replacement>
    </replace>
  </replacements>
</urlNormalizer>
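The effect of the two replacements in this example can be sketched with plain java.util.regex semantics. This is an illustration under assumptions, not the class's actual code: it assumes each pattern is applied to the whole URL in declaration order with replace-all behavior, and the names ReplacementSketch, applyReplacements and exampleReplacements are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: mimics the example's <replace> entries with String.replaceAll. */
class ReplacementSketch {

    /** Applies each regex replacement in insertion order (an assumption). */
    static String applyReplacements(String url, Map<String, String> replacements) {
        String result = url;
        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            result = result.replaceAll(entry.getKey(), entry.getValue());
        }
        return result;
    }

    /** The two match/replacement pairs from the XML example. */
    static Map<String, String> exampleReplacements() {
        Map<String, String> replacements = new LinkedHashMap<>();
        // No <replacement> tag: the match is replaced with a blank value.
        replacements.put("&view=print", "");
        // $1 refers to the first capture group, "&type=".
        replacements.put("(&type=)(summary)", "$1full");
        return replacements;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/page?id=1&view=print&type=summary";
        System.out.println(applyReplacements(url, exampleReplacements()));
    }
}
```

With these assumptions, a URL ending in `?id=1&view=print&type=summary` loses the `&view=print` parameter and has `type=summary` rewritten to `type=full`.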
Nested Class Summary | |
---|---|
static class | GenericURLNormalizer.Normalization |
class | GenericURLNormalizer.Replace |
Constructor Summary |
---|
GenericURLNormalizer() |
Method Summary | |
---|---|
boolean | equals(Object obj) |
GenericURLNormalizer.Normalization[] | getNormalizations() |
GenericURLNormalizer.Replace[] | getReplaces() |
int | hashCode() |
void | loadFromXML(Reader in) |
String | normalizeURL(String url): Normalize the given URL. |
void | saveToXML(Writer out) |
void | setNormalizations(GenericURLNormalizer.Normalization... normalizations) |
void | setReplaces(GenericURLNormalizer.Replace... replaces) |
String | toString() |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
public GenericURLNormalizer()
Method Detail |
---|
public String normalizeURL(String url)
Description copied from interface: IURLNormalizer
Normalize the given URL.
Specified by: normalizeURL in interface IURLNormalizer
Parameters: url - the URL to normalize
public GenericURLNormalizer.Normalization[] getNormalizations()
public void setNormalizations(GenericURLNormalizer.Normalization... normalizations)
public GenericURLNormalizer.Replace[] getReplaces()
public void setReplaces(GenericURLNormalizer.Replace... replaces)
public void loadFromXML(Reader in)
Specified by: loadFromXML in interface IXMLConfigurable
public void saveToXML(Writer out) throws IOException
Specified by: saveToXML in interface IXMLConfigurable
Throws: IOException
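For readers who want to see how the XML format documented above maps onto data, here is a rough sketch that reads the normalization code names and the match/replacement pairs with the JDK's DOM parser. It is not how loadFromXML is implemented; the class ConfigReadingSketch and its methods are hypothetical, and splitting on commas and defaulting a missing replacement to an empty string are assumptions drawn from the format description.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: NOT the loadFromXML implementation. */
class ConfigReadingSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration", e);
        }
    }

    /** Returns the comma-separated code names, trimmed, in document order. */
    static List<String> readNormalizations(String xml) {
        NodeList nodes = parse(xml).getElementsByTagName("normalizations");
        List<String> names = new ArrayList<>();
        if (nodes.getLength() == 0) {
            return names;
        }
        for (String name : nodes.item(0).getTextContent().split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    /** Returns {match, replacement} pairs; a missing replacement becomes "". */
    static List<String[]> readReplacements(String xml) {
        NodeList nodes = parse(xml).getElementsByTagName("replace");
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element replace = (Element) nodes.item(i);
            pairs.add(new String[] {
                childText(replace, "match"),
                childText(replace, "replacement")
            });
        }
        return pairs;
    }

    private static String childText(Element parent, String tag) {
        NodeList found = parent.getElementsByTagName(tag);
        return found.getLength() == 0 ? "" : found.item(0).getTextContent();
    }
}
```

Because the configuration is XML, an ampersand inside a match pattern has to be written as `&amp;`; the parser hands back the unescaped `&` to the regex.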
public int hashCode()
Overrides: hashCode in class Object
public boolean equals(Object obj)
Overrides: equals in class Object
public String toString()
Overrides: toString in class Object