[SCB-26] Upgrade to HttpClient 3.0.1 from HttpClient 2.0.2 | |
| Status: | Open |
| Project: | Scraping |
| Component/s: | scraping-engine |
| Affects Version/s: | scraping-engine-0.6 |
| Fix Version/s: | |
| Type: | Improvement | Priority: | Major |
| Reporter: | Alan B. Canon | Assignee: | Henri Yandell |
| Resolution: | Unresolved | ||
| Environment: | |||
| Description |
| Problem statement: The 0.5 release of scraping-engine depends on HttpClient 2.0.2. A more recent stable version of this dependency, HttpClient 3.0.1, is available from Apache Commons. HttpClient 3.0.1 offers improved functionality that would expand the capabilities of scraping-engine if the latter could be built against it. However, the HttpClient.startSession() method of HttpClient 2.0.2, which scraping-engine uses even though it is deprecated, is entirely absent from the HttpClient 3.0.1 API. It is referenced in two classes within scraping-engine, HttpFetcher and HttpsFetcher, where it is invoked as part of those classes' own startSession() methods.

Suggested fix: reimplement startSession() within HttpFetcher and HttpsFetcher to use methods compatible with HttpClient 3.0.1, including a call to the constructors for org.apache.commons.httpclient.HttpURL and org.apache.commons.httpclient.HttpsURL. For example, the body of the proposed new implementation of HttpsFetcher.startSession() looks like this:

    client.getHostConfiguration().setHost(
        new HttpsURL(
            cfg.getString("username"),
            cfg.getString("password"),
            url.getHost(),
            port,
            url.getPath(),
            url.getQuery()
        )
    );

Modules affected:
    src/java/org/osjava/scraping/AbstractHttpFetcher.java
    src/java/org/osjava/scraping/HttpFetcher.java
    src/java/org/osjava/scraping/HttpsFetcher.java

The benefit of the suggested fix is that scraping-engine could then run against either HttpClient 2.0.2 or 3.0.1. A potential pitfall is that the constructors for HttpURL and HttpsURL throw org.apache.commons.httpclient.URIException, a subclass of java.io.IOException, so startSession() would need a throws clause. This may cause a compile-time incompatibility with existing custom implementations of Fetcher. However, within the existing API, the only invocation of AbstractHttpFetcher.startSession() is in AbstractHttpFetcher.fetch(), which already catches IOException.
As an alternative to adding the throws clause, the new implementations could catch this exception, log a message, and then swallow the exception rather than re-throwing it (although a later exception is almost certain to occur as a consequence). A hybrid of this approach and the one described above could be achieved by adding a property file setting that conditionally suppresses the throwing of URIExceptions from startSession(). |
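To make the hybrid option easier to discuss, here is a minimal, self-contained sketch of how a property could switch between re-throwing and logging-then-swallowing. It deliberately avoids a dependency on HttpClient: UriException is a hypothetical stand-in for org.apache.commons.httpclient.URIException (both extend java.io.IOException), buildUrl() is a stand-in for the HttpURL/HttpsURL constructors, and the property name fetcher.suppressUriExceptions is invented for illustration. None of these names exist in scraping-engine.

```java
import java.io.IOException;
import java.util.Properties;
import java.util.logging.Logger;

class StartSessionSketch {

    /** Hypothetical stand-in for URIException; like the real class, it extends IOException. */
    static class UriException extends IOException {
        UriException(String message) {
            super(message);
        }
    }

    private static final Logger LOG =
            Logger.getLogger(StartSessionSketch.class.getName());

    /** Hypothetical property name, invented for this sketch. */
    static final String SUPPRESS_KEY = "fetcher.suppressUriExceptions";

    /** Stand-in for the HttpURL/HttpsURL constructors: rejects an obviously invalid host. */
    static String buildUrl(String host, int port, String path) throws UriException {
        if (host == null || host.isEmpty() || host.indexOf(' ') >= 0) {
            throw new UriException("invalid host: '" + host + "'");
        }
        return "https://" + host + ":" + port + path;
    }

    /**
     * Returns the URL on success. On failure, either re-throws (default) or,
     * if the suppress property is "true", logs a warning and returns null.
     */
    static String startSession(Properties cfg, String host, int port, String path)
            throws IOException {
        try {
            return buildUrl(host, port, path);
        } catch (UriException e) {
            boolean suppress =
                    Boolean.parseBoolean(cfg.getProperty(SUPPRESS_KEY, "false"));
            if (suppress) {
                // Swallow the exception; a later failure is still likely.
                LOG.warning("startSession failed, continuing: " + e.getMessage());
                return null;
            }
            throw e; // Default: propagate to the caller, e.g. fetch().
        }
    }

    public static void main(String[] args) throws IOException {
        Properties cfg = new Properties();
        cfg.setProperty(SUPPRESS_KEY, "true");
        // With suppression on, the invalid host is logged instead of thrown.
        System.out.println(startSession(cfg, "bad host", 443, "/"));
    }
}
```

Defaulting the property to "false" keeps the failure visible unless suppression is explicitly configured, which seems the safer default given that a swallowed URIException will probably surface later as a less obvious error.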