This is an exported version of the JIRA issue tracker. Please use the Google Code site to open new tickets or report updates to these existing tickets. Feel free to contact the mailing list with any questions.

[GJC-17] Performance Bottleneck: extensive use of toLowerCase in com.generationjava.web.HtmlW.getIndexOpeningTag
Created: Fri, 18 Jun 2004 12:55:12 -0700 (PDT)  Updated: Sat, 19 Jun 2004 11:16:20 -0700 (PDT)

Status:Closed
Project:Genjava
Component/s:gj-scrape
Affects Version/s:scrape-1.0
Fix Version/s:scrape-2.0

Type:Bug
Priority:Major
Reporter:None
Assignee:Henri Yandell
Resolution:Fixed 
Environment:


 Description   
Bottleneck in calling toLowerCase while getting tags, specifically in the com.generationjava.web.HtmlW class.
Comment by bayard [ Fri, 18 Jun 2004 13:47:18 -0700 (PDT) ]
Going to go with the easy option of adding a tagsToLowerCase method to HtmlW.

HtmlScraper would then call tagsToLowerCase on its scraping target.
Comment by bayard [ Fri, 18 Jun 2004 14:02:42 -0700 (PDT) ]
Changed my mind. HtmlScraper will maintain a lower-cased version of its text, switch to XmlW for optimal performance when possible, and juggle between the two versions.

Would be nice to get rid of the data variable in HtmlScraper and have it merely remember an index on the main page text. This will create quite a few method calls to substring, so that may hurt performance.
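
The idea in this comment can be sketched roughly as follows. This is a minimal illustration, not the actual HtmlScraper, HtmlW or XmlW code: the class name CachedLowerCaseText and its methods are hypothetical. It lower-cases the page once, searches the lower-cased copy with indexOf, and hands back indices that can be used against the original text, so that a case-insensitive tag lookup no longer calls toLowerCase on every search.

```java
import java.util.Locale;

/**
 * Hypothetical sketch: keeps the original text plus one lower-cased copy.
 * Assumes lower-casing preserves string length, which holds for ASCII
 * markup but not for every Unicode character.
 */
final class CachedLowerCaseText {
    private final String text;   // original page text, case preserved
    private final String lower;  // lower-cased once, reused for every search

    CachedLowerCaseText(String text) {
        this.text = text;
        // Locale.ROOT avoids locale-specific mappings such as the Turkish dotless i
        this.lower = text.toLowerCase(Locale.ROOT);
    }

    /** Index of the next opening tag with this name at or after 'from', or -1. */
    int indexOfOpeningTag(String tag, int from) {
        String needle = "<" + tag.toLowerCase(Locale.ROOT);
        int idx = lower.indexOf(needle, from);
        while (idx != -1) {
            int end = idx + needle.length();
            if (end < lower.length()) {
                char c = lower.charAt(end);
                // Accept only a complete tag name: "<td" must not match "<tdx"
                if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                    return idx;
                }
            }
            idx = lower.indexOf(needle, idx + 1);
        }
        return -1;
    }

    /** Returns a slice of the original (case-preserved) text. */
    String original(int start, int end) {
        return text.substring(start, end);
    }
}
```

A caller would remember only the index returned by indexOfOpeningTag and call original(start, end) when it needs the case-preserved content, which is where the extra substring calls mentioned above would come from.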
Comment by bayard [ Fri, 18 Jun 2004 15:21:24 -0700 (PDT) ]
This is done. Will test in the originator's application (work colleague) to see if the modifications have created the necessary dramatic speed improvement.
Comment by bayard [ Sat, 19 Jun 2004 11:16:20 -0700 (PDT) ]
First index is the number of tr tags, second is the number of td tags in them. Speed is massively improved.

1.0
==
[1,1] 0:00:00.049
[10,2] 0:00:00.201
[100,5] 0:00:01.349
[100,10] 0:00:03.522
[250,5] 0:00:05.840

2.0
==
[1,1] 0:00:00.048
[10,2] 0:00:00.016
[100,5] 0:00:00.148
[100,10] 0:00:00.104
[250,5] 0:00:00.027
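
The benchmark code itself is not part of this ticket. As a rough sketch of how inputs of the shape [tr count, td count] could be generated and timed, something like the following might be used. The class TableBenchmarkSketch, its methods, and the timed operation (a plain tag count) are all assumptions made for illustration; they stand in for the real scraping calls. Single-run timings like these are also sensitive to JVM warm-up, so they should be read as indicative only.

```java
/** Hypothetical harness: builds tables of a given shape and times a simple scan. */
final class TableBenchmarkSketch {

    /** Builds an HTML table with 'rows' tr tags, each holding 'cells' td tags. */
    static String buildTable(int rows, int cells) {
        StringBuilder sb = new StringBuilder("<table>");
        for (int r = 0; r < rows; r++) {
            sb.append("<tr>");
            for (int c = 0; c < cells; c++) {
                sb.append("<td>").append(r).append(',').append(c).append("</td>");
            }
            sb.append("</tr>");
        }
        return sb.append("</table>").toString();
    }

    /** Counts non-overlapping occurrences of 'needle' in 'text'. */
    static int countOccurrences(String text, String needle) {
        int count = 0;
        int idx = text.indexOf(needle);
        while (idx != -1) {
            count++;
            idx = text.indexOf(needle, idx + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        // Same shapes as the results above: {tr count, td count per tr}
        int[][] sizes = {{1, 1}, {10, 2}, {100, 5}, {100, 10}, {250, 5}};
        for (int[] s : sizes) {
            String html = buildTable(s[0], s[1]);
            long start = System.nanoTime();
            int tds = countOccurrences(html, "<td>"); // stand-in for the scraping work
            long micros = (System.nanoTime() - start) / 1000;
            System.out.printf("[%d,%d] %d td tags found in %d us%n", s[0], s[1], tds, micros);
        }
    }
}
```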