com.norconex.collector.http.db.impl
Class DerbyCrawlURLDatabase

java.lang.Object
  extended by com.norconex.collector.http.db.impl.DerbyCrawlURLDatabase
All Implemented Interfaces:
ICrawlURLDatabase

public class DerbyCrawlURLDatabase
extends Object
implements ICrawlURLDatabase


Constructor Summary
DerbyCrawlURLDatabase(HttpCrawlerConfig config, boolean resume)
           
 
Method Summary
 int getActiveCount()
          Gets the number of active URLs (currently being processed).
 CrawlURL getCached(String url)
          Gets the cached URL from the previous time the crawler was run (e.g. for comparison purposes).
 int getProcessedCount()
          Gets the number of URLs processed.
 int getQueueSize()
          Gets the size of the URL queue (number of URLs left to process).
 boolean isActive(String url)
          Whether the given URL is currently being processed (i.e. active).
 boolean isCacheEmpty()
          Whether there are any URLs in the cache from a previous crawler run.
 boolean isProcessed(String url)
          Whether the given URL has been processed.
 boolean isQueued(String url)
          Whether the given URL is in the queue or not (waiting to be processed).
 boolean isQueueEmpty()
          Whether there are any URLs to process in the queue.
 boolean isVanished(CrawlURL crawlURL)
          Whether a URL has been deleted.
 CrawlURL next()
          Returns the next URL to be processed and marks it as being "active" (i.e. currently being processed).
 void processed(CrawlURL crawlURL)
          Marks this URL as processed.
 void queue(String url, int depth)
          Queues a URL for future processing.
 void queueCache()
          Queues URLs cached from a previous run so they can be processed again.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DerbyCrawlURLDatabase

public DerbyCrawlURLDatabase(HttpCrawlerConfig config,
                             boolean resume)
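
For illustration, a minimal sketch of constructing the Derby-backed database. The no-argument HttpCrawlerConfig constructor, the import paths, and the meaning of resume (picking up a previously interrupted crawl) are assumptions for this sketch; in practice the configuration comes from the crawler being run.

    // Assumed package locations; they may differ between collector versions.
    import com.norconex.collector.http.crawler.HttpCrawlerConfig;
    import com.norconex.collector.http.db.ICrawlURLDatabase;
    import com.norconex.collector.http.db.impl.DerbyCrawlURLDatabase;

    HttpCrawlerConfig config = new HttpCrawlerConfig(); // assumed no-arg constructor
    // resume = false starts a fresh crawl; true presumably resumes an interrupted one.
    ICrawlURLDatabase database = new DerbyCrawlURLDatabase(config, false);
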
Method Detail

queue

public final void queue(String url,
                        int depth)
Description copied from interface: ICrawlURLDatabase
Queues a URL for future processing.

Specified by:
queue in interface ICrawlURLDatabase
Parameters:
url - the URL to eventually be processed
depth - how many clicks away from starting URL(s)
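
For example, seed URLs might be queued at depth 0, with discovered links queued one level deeper (the URLs below are placeholders):

    // Seed the queue with a starting URL (zero clicks away).
    database.queue("http://example.com/", 0);
    // A link extracted from that page would typically be queued one click deeper.
    database.queue("http://example.com/about.html", 1);
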

isQueueEmpty

public final boolean isQueueEmpty()
Description copied from interface: ICrawlURLDatabase
Whether there are any URLs to process in the queue.

Specified by:
isQueueEmpty in interface ICrawlURLDatabase
Returns:
true if the queue is empty

getQueueSize

public final int getQueueSize()
Description copied from interface: ICrawlURLDatabase
Gets the size of the URL queue (number of URLs left to process).

Specified by:
getQueueSize in interface ICrawlURLDatabase
Returns:
queue size

isQueued

public final boolean isQueued(String url)
Description copied from interface: ICrawlURLDatabase
Whether the given URL is in the queue or not (waiting to be processed).

Specified by:
isQueued in interface ICrawlURLDatabase
Parameters:
url - the URL
Returns:
true if the URL is in the queue

next

public final CrawlURL next()
Description copied from interface: ICrawlURLDatabase
Returns the next URL to be processed and marks it as being "active" (i.e. currently being processed).

Specified by:
next in interface ICrawlURLDatabase
Returns:
next URL
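
A minimal processing-loop sketch using only methods documented on this page; the actual fetching and link extraction are left out:

    while (!database.isQueueEmpty()) {
        CrawlURL crawlURL = database.next();  // marks the URL as active
        // ... fetch the document, extract and queue child URLs (not shown) ...
        database.processed(crawlURL);         // will not be processed again this run
    }
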

isActive

public final boolean isActive(String url)
Description copied from interface: ICrawlURLDatabase
Whether the given URL is currently being processed (i.e. active).

Specified by:
isActive in interface ICrawlURLDatabase
Parameters:
url - the URL
Returns:
true if active

getActiveCount

public final int getActiveCount()
Description copied from interface: ICrawlURLDatabase
Gets the number of active URLs (currently being processed).

Specified by:
getActiveCount in interface ICrawlURLDatabase
Returns:
number of active URLs.

getCached

public CrawlURL getCached(String url)
Description copied from interface: ICrawlURLDatabase
Gets the cached URL from the previous time the crawler was run (e.g. for comparison purposes).

Specified by:
getCached in interface ICrawlURLDatabase
Parameters:
url - URL cached from previous run
Returns:
the cached URL
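
A hedged sketch of using the cache for change detection. The getUrl() and getDocChecksum() accessors on CrawlURL are hypothetical names, used only to illustrate comparing the current URL against its cached counterpart:

    CrawlURL previous = database.getCached(crawlURL.getUrl());  // getUrl() is hypothetical
    if (previous != null) {
        // Compare the current document against the cached one, e.g. by checksum
        // (getDocChecksum() is hypothetical; any stored comparison value would do).
        boolean unchanged =
                previous.getDocChecksum().equals(crawlURL.getDocChecksum());
    }
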

isCacheEmpty

public final boolean isCacheEmpty()
Description copied from interface: ICrawlURLDatabase
Whether there are any URLs in the cache from a previous crawler run.

Specified by:
isCacheEmpty in interface ICrawlURLDatabase
Returns:
true if the cache is empty

processed

public final void processed(CrawlURL crawlURL)
Description copied from interface: ICrawlURLDatabase
Marks this URL as processed. Processed URLs will not be processed again in the same crawl run.

Specified by:
processed in interface ICrawlURLDatabase

isProcessed

public final boolean isProcessed(String url)
Description copied from interface: ICrawlURLDatabase
Whether the given URL has been processed.

Specified by:
isProcessed in interface ICrawlURLDatabase
Parameters:
url - the URL
Returns:
true if processed

getProcessedCount

public final int getProcessedCount()
Description copied from interface: ICrawlURLDatabase
Gets the number of URLs processed.

Specified by:
getProcessedCount in interface ICrawlURLDatabase
Returns:
number of URLs processed.
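
The three counters lend themselves to simple progress reporting, for example:

    System.out.println(String.format(
            "queued=%d, active=%d, processed=%d",
            database.getQueueSize(),
            database.getActiveCount(),
            database.getProcessedCount()));
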

queueCache

public final void queueCache()
Description copied from interface: ICrawlURLDatabase
Queues URLs cached from a previous run so they can be processed again. This method is normally called when a job is done crawling, and entries remain in the cache. Those are re-processed in case they changed or are no longer valid.

Specified by:
queueCache in interface ICrawlURLDatabase
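
A sketch of an end-of-crawl pass combining queueCache() and isVanished(); whether the caller or the crawler framework drives this loop is an assumption here:

    // When the main crawl is done, re-queue whatever is still in the cache.
    database.queueCache();
    while (!database.isQueueEmpty()) {
        CrawlURL crawlURL = database.next();
        // ... re-fetch the URL and update its state (not shown) ...
        if (database.isVanished(crawlURL)) {
            // The URL was valid last run but is now invalid (e.g. NOT_FOUND):
            // a typical reaction is to flag the document for deletion downstream.
        }
        database.processed(crawlURL);
    }
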

isVanished

public final boolean isVanished(CrawlURL crawlURL)
Description copied from interface: ICrawlURLDatabase
Whether a URL has been deleted. To find this out, the URL has to be in an invalid state (e.g. NOT_FOUND) and must exist in the URL cache in a valid state.

Specified by:
isVanished in interface ICrawlURLDatabase
Parameters:
crawlURL - the URL
Returns:
true if the URL is considered deleted (vanished)


Copyright © 2009-2013 Norconex Inc. All Rights Reserved.