This is an exported version of the JIRA issue tracker. Please use the Google Code site to open new tickets or report updates to these existing tickets. Feel free to contact the mailing list with any questions.

[SCB-26] Upgrade to HttpClient 3.0.1 from HttpClient 2.0.2
Created: Mon, 23 Oct 2006 12:25:36 -0700 (PDT)  Updated: Mon, 23 Oct 2006 12:25:36 -0700 (PDT)

Status:Open
Project:Scraping
Component/s:scraping-engine
Affects Version/s:scraping-engine-0.6
Fix Version/s:

Type:Improvement  Priority:Major
Reporter:Alan B. Canon  Assignee:Henri Yandell
Resolution:Unresolved 
Environment:


 Description   
Problem statement: The 0.5 release of scraping-engine depends on HttpClient 2.0.2. A more recent stable version, HttpClient 3.0.1, is available from Apache Commons, and building against it would give scraping-engine access to its improved functionality. However, scraping-engine calls HttpClient.startSession(), which is deprecated in HttpClient 2.0.2 and entirely absent from the HttpClient 3.0.1 API. The method is referenced in two classes within scraping-engine, HttpFetcher and HttpsFetcher, where it is invoked as part of those classes' own startSession() methods.

Suggested fix: reimplement startSession() within HttpFetcher and HttpsFetcher to use methods compatible with HttpClient 3.0.1, including a call to the constructors for org.apache.commons.httpclient.HttpURL and org.apache.commons.httpclient.HttpsURL. For example, the body of the proposed new implementation of HttpsFetcher.startSession() looks like this:

client.getHostConfiguration().setHost(
    new HttpsURL(
        cfg.getString("username"),
        cfg.getString("password"),
        url.getHost(),
        port,
        url.getPath(),
        url.getQuery()
    )
);

Modules affected:

src/java/org/osjava/scraping/AbstractHttpFetcher.java
src/java/org/osjava/scraping/HttpFetcher.java
src/java/org/osjava/scraping/HttpsFetcher.java

The benefit of the suggested fix is that scraping-engine could then be built and run against either HttpClient 2.0.2 or 3.0.1.

A potential pitfall of the above method is that the constructors for HttpURL and HttpsURL throw a subclass of java.io.IOException, namely org.apache.commons.httpclient.URIException. This may cause a compile-time incompatibility with existing custom implementations of Fetcher. However, within the existing API, the only invocation of the AbstractHttpFetcher.startSession() method is found within AbstractHttpFetcher.fetch(), and it already traps IOExceptions.
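To illustrate why the existing handler in fetch() would cover the new exception, here is a minimal sketch. It does not use the real scraping-engine or HttpClient classes: SketchURIException stands in for org.apache.commons.httpclient.URIException, and SketchAbstractFetcher and SketchHttpsFetcher are hypothetical reductions of AbstractHttpFetcher and HttpsFetcher. The throws clause on the abstract method is an assumption about how the change might be declared, and the URL check merely simulates a constructor failure.

```java
import java.io.IOException;

// Stand-in for org.apache.commons.httpclient.URIException, which is a
// subclass of java.io.IOException. Declared locally so the sketch compiles
// without the HttpClient jar.
class SketchURIException extends IOException {
    SketchURIException(String message) {
        super(message);
    }
}

// Hypothetical reduction of AbstractHttpFetcher: fetch() is the only caller
// of startSession() and already traps IOException.
abstract class SketchAbstractFetcher {
    protected abstract void startSession(String url) throws IOException;

    public String fetch(String url) {
        try {
            startSession(url);
            return "fetched " + url;
        } catch (IOException e) {
            // The stand-in URIException lands here because it is an IOException.
            return "fetch failed: " + e.getMessage();
        }
    }
}

// Hypothetical reduction of HttpsFetcher. The startsWith check simulates the
// HttpsURL constructor rejecting its arguments.
class SketchHttpsFetcher extends SketchAbstractFetcher {
    @Override
    protected void startSession(String url) throws SketchURIException {
        if (!url.startsWith("https://")) {
            throw new SketchURIException("not an https URL: " + url);
        }
    }
}

class ThrowsClauseSketch {
    public static void main(String[] args) {
        SketchAbstractFetcher fetcher = new SketchHttpsFetcher();
        System.out.println(fetcher.fetch("https://example.org/"));
        System.out.println(fetcher.fetch("ftp://example.org/"));
    }
}
```

In this sketch no change to fetch() is needed. Only code that calls startSession() directly, outside a handler for IOException, would have to be adjusted.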

As an alternative to adding the throws clause, the new implementations could catch this exception, log a message, and then consume the exception rather than re-throwing it (although a later exception is almost certain to occur as a consequence). A hybrid between this method and that described above could be achieved by adding a property file setting that conditionally suppresses the throwing of URIExceptions from startSession().
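The hybrid approach might look roughly like the sketch below. Everything in it is illustrative: HybridURIException stands in for org.apache.commons.httpclient.URIException, the property name fetcher.suppressUriExceptions is invented for this example, java.util.Properties is used in place of the engine's own configuration object, and configureHost() only simulates the HttpsURL constructor failing.

```java
import java.io.IOException;
import java.util.Properties;
import java.util.logging.Logger;

// Stand-in for org.apache.commons.httpclient.URIException.
class HybridURIException extends IOException {
    HybridURIException(String message) {
        super(message);
    }
}

class HybridSessionSketch {
    // Hypothetical property name; not an existing scraping-engine setting.
    static final String SUPPRESS_KEY = "fetcher.suppressUriExceptions";

    private static final Logger LOG = Logger.getLogger("HybridSessionSketch");

    // Returns true if the session was configured, false if a URI problem was
    // logged and suppressed. Rethrows when suppression is not enabled.
    static boolean startSession(Properties cfg, String url) throws HybridURIException {
        try {
            configureHost(url);
            return true;
        } catch (HybridURIException e) {
            boolean suppress = Boolean.parseBoolean(cfg.getProperty(SUPPRESS_KEY, "false"));
            if (suppress) {
                LOG.warning("Suppressed URI problem in startSession(): " + e.getMessage());
                return false;
            }
            throw e;
        }
    }

    // Simulates the HttpsURL constructor rejecting its arguments.
    private static void configureHost(String url) throws HybridURIException {
        if (!url.startsWith("https://")) {
            throw new HybridURIException("not an https URL: " + url);
        }
    }
}
```

Defaulting the setting to false keeps the exception visible unless a deployment explicitly opts in to suppression, which seems the safer default given that a later failure is likely anyway.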