[GJC-17] Performance Bottleneck: extensive use of toLowerCase in com.generationjava.web.HtmlW.getIndexOpeningTag | |
| Status: | Closed |
| Project: | Genjava |
| Component/s: | gj-scrape |
| Affects Version/s: | scrape-1.0 |
| Fix Version/s: | scrape-2.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | None | Assignee: | Henri Yandell |
| Resolution: | Fixed | ||
| Environment: | |||
| Description |
| Repeated calls to toLowerCase while looking up tags are a performance bottleneck, specifically in the com.generationjava.web.HtmlW class. |
| Comment by bayard [ Fri, 18 Jun 2004 13:47:18 -0700 (PDT) ] |
| Going to go with the easy fix of adding a tagsToLowerCase method to HtmlW. HtmlScraper would then call tagsToLowerCase on its scraping target. |
| Comment by bayard [ Fri, 18 Jun 2004 14:02:42 -0700 (PDT) ] |
| Changed my mind. HtmlScraper will maintain a lower-cased version, jump to XmlW for optimal performance when possible, and juggle things. It would be nice to get rid of the data variable in HtmlScraper and have it merely remember an index into the main page text. That would create quite a few calls to substring, though, so it may hurt performance. |
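The idea in the comment above (lower-case the page once, search the cached copy, and track an index into the original text) might look roughly like the sketch below. This is not the actual HtmlW or HtmlScraper code: the class `LowerCasedPage` and all of its methods are hypothetical names, and the sketch assumes that lower-casing does not change the length of the text, which holds for typical ASCII markup.

```java
import java.util.Locale;

// Hypothetical sketch; not the real HtmlW/HtmlScraper implementation.
final class LowerCasedPage {
    private final String text;   // original page text, case preserved
    private final String lower;  // lower-cased once, used only for searching

    LowerCasedPage(String text) {
        this.text = text;
        // Assumption: lower-casing keeps the length unchanged, so an index
        // into 'lower' is also valid in 'text'.
        this.lower = text.toLowerCase(Locale.ROOT);
    }

    /** Slow pattern: lower-cases the whole page again on every lookup. */
    static int indexOfOpeningTagSlow(String text, String tag, int from) {
        return new LowerCasedPage(text).indexOfOpeningTag(tag, from);
    }

    /** Index of the next opening tag at or after 'from', or -1 if none. */
    int indexOfOpeningTag(String tag, int from) {
        String needle = "<" + tag.toLowerCase(Locale.ROOT);
        int i = lower.indexOf(needle, from);
        while (i != -1) {
            int end = i + needle.length();
            if (end >= lower.length()) {
                return -1; // truncated tag at end of input
            }
            char c = lower.charAt(end);
            // Reject longer tag names, e.g. "<track" when looking for "<tr".
            if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                return i;
            }
            i = lower.indexOf(needle, i + 1);
        }
        return -1;
    }

    /** Counts opening tags with the given name, ignoring case. */
    int countOpeningTags(String tag) {
        int count = 0;
        for (int i = indexOfOpeningTag(tag, 0); i != -1; i = indexOfOpeningTag(tag, i + 1)) {
            count++;
        }
        return count;
    }

    /** Original (case-preserved) text starting at the given index. */
    String originalFrom(int index) {
        return text.substring(index);
    }

    public static void main(String[] args) {
        String html = "<TABLE><TR><TD>a</TD><td>b</td></TR><tr><Td>c</Td></tr></TABLE>";
        LowerCasedPage page = new LowerCasedPage(html);
        System.out.println(page.countOpeningTags("tr")); // prints 2
        System.out.println(page.countOpeningTags("td")); // prints 3
        int firstTd = page.indexOfOpeningTag("td", 0);
        System.out.println(page.originalFrom(firstTd).substring(0, 4)); // prints <TD>
    }
}
```

With the cached copy, each lookup costs one indexOf over the page rather than a full lower-casing pass plus an indexOf, which is where the slow pattern spends its time as the number of lookups grows.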
| Comment by bayard [ Fri, 18 Jun 2004 15:21:24 -0700 (PDT) ] |
| This is done. Will test in the originator's application (a work colleague's) to see whether the modifications have produced the necessary dramatic speed improvement. |
| Comment by bayard [ Sat, 19 Jun 2004 11:16:20 -0700 (PDT) ] |
| The first index is the number of tr tags; the second is the number of td tags in each. Speed is massively improved. |
| Input [tr,td] | Time in 1.0 (h:mm:ss.SSS) | Time in 2.0 (h:mm:ss.SSS) |
| [1,1] | 0:00:00.049 | 0:00:00.048 |
| [10,2] | 0:00:00.201 | 0:00:00.016 |
| [100,5] | 0:00:01.349 | 0:00:00.148 |
| [100,10] | 0:00:03.522 | 0:00:00.104 |
| [250,5] | 0:00:05.840 | 0:00:00.027 |
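For anyone wanting to run a comparable measurement, inputs of the [tr,td] shapes listed above could be generated along the lines of the sketch below. The class `TableBenchmarkSketch`, its methods, and the timed workload (a plain case-insensitive count of `<td>` tags) are all hypothetical stand-ins; the ticket does not show the harness that produced the original timings, and this sketch does not call the gj-scrape API.

```java
import java.util.Locale;

// Hypothetical timing sketch; not the harness used for the figures in this ticket.
final class TableBenchmarkSketch {

    /** Builds a table with 'rows' tr tags, each containing 'cols' td tags. */
    static String buildTable(int rows, int cols) {
        StringBuilder sb = new StringBuilder("<table>");
        for (int r = 0; r < rows; r++) {
            sb.append("<tr>");
            for (int c = 0; c < cols; c++) {
                sb.append("<td>").append(r).append(',').append(c).append("</td>");
            }
            sb.append("</tr>");
        }
        return sb.append("</table>").toString();
    }

    /** Counts non-overlapping occurrences of 'needle' in 'haystack'. */
    static int count(String haystack, String needle) {
        int n = 0;
        for (int i = haystack.indexOf(needle); i != -1;
                 i = haystack.indexOf(needle, i + needle.length())) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        int[][] sizes = {{1, 1}, {10, 2}, {100, 5}, {100, 10}, {250, 5}};
        for (int[] s : sizes) {
            String html = buildTable(s[0], s[1]);
            long start = System.nanoTime();
            // Stand-in workload: replace with the scraping code under test.
            int cells = count(html.toLowerCase(Locale.ROOT), "<td>");
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("[%d,%d] %d cells in %d us%n", s[0], s[1], cells, micros);
        }
    }
}
```

Single-shot timings like these are noisy on the JVM (class loading and JIT warm-up dominate small inputs), so repeated runs are advisable before comparing versions.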