The usefulness and design of screen scraping entire websites.
Background
A while ago a client needed an easy way to duplicate the functionality of their website for use by international agents. Because some functionality had to remain with the client and some had to live on the agents' own websites, screen scraping became an alternative to duplicating entire databases and code.
The result was a small .NET application that requested the original content from the client's site, filtered and replaced regional and other commercially relevant content, and presented visitors to the agent's website with what looked like original, local agent content.
This article looks at the architectural design of a website that requests, changes and redirects another website's content to a visitor. It is not intended as a complete specification, but code is available on request.
The content is changed so that the agent can secure his or her commission on the goods or services that the source presents via the web.
Systems that might qualify as candidates for this technique range from e-commerce sites, such as shopping carts, to personnel agencies advertising jobs.
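As a rough illustration of this kind of content replacement, the sketch below swaps source details for agent details in a page. It is written in Python for brevity rather than in the ASP.NET of the original application, and the function name `localise`, the table `AGENT_REPLACEMENTS` and every value in it are hypothetical.

```python
import re

# Hypothetical replacement table: all names and values are invented
# for illustration and do not come from the original application.
AGENT_REPLACEMENTS = {
    "sales@source.example": "sales@agent.example",
    "Source Ltd": "Agent Pty",
    "Source": "Agent",
}


def localise(html: str, replacements: dict) -> str:
    """Replace every source phrase with the agent's equivalent in one pass."""
    if not replacements:
        return html
    # Longest phrases first, so "Source Ltd" wins over the shorter "Source".
    phrases = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in phrases))
    # A single regex pass means already-replaced text is never rescanned.
    return pattern.sub(lambda m: replacements[m.group(0)], html)
```

Doing all replacements in a single pass avoids one rule accidentally rewriting the output of another, which can happen when rules are applied one after the other.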
Requirements
The proxy application needs to be hosted on a web server. Because the application was written in ASP.NET, the .NET Framework is required as well.
Conceptual Design
The following diagram illustrates the subsystems and their relation to each other, with an explanation below.
Explanation
The web user requests a URL from the proxy, after which the proxy requests that URL from the true source. Content such as links and addresses is searched for and replaced, and the altered HTML is streamed back to the web user. Because all web addresses have been replaced, subsequent requests from the web user point to the proxy instead of the target site. In addition, subsequent requests can be inspected, and content entered by the web user can be replaced, altered or logged.
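The link replacement step can be sketched as follows. This is a minimal Python illustration, not the original ASP.NET code: the function name `rewrite_links`, the regular expression and the example hosts are all assumptions, and a production version would more likely use a real HTML parser than a regex.

```python
import re
from urllib.parse import urlsplit

# Matches href, src and action attributes with a quoted value.
ATTR_URL = re.compile(
    r'''(?P<attr>\b(?:href|src|action)\s*=\s*)(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)


def rewrite_links(html: str, source_base: str, proxy_base: str) -> str:
    """Point absolute links to the source site at the proxy instead."""
    source_host = urlsplit(source_base).netloc.lower()
    proxy_root = proxy_base.rstrip("/")

    def swap(match):
        url = match.group("url")
        parts = urlsplit(url)
        # Only rewrite links whose host is the source site; relative links
        # already resolve against the proxy, and other hosts are left alone.
        if parts.netloc.lower() == source_host and parts.scheme in ("http", "https", ""):
            url = proxy_root + (parts.path or "/")
            if parts.query:
                url += "?" + parts.query
            if parts.fragment:
                url += "#" + parts.fragment
        quote = match.group("q")
        return f"{match.group('attr')}{quote}{url}{quote}"

    return ATTR_URL.sub(swap, html)
```

For example, with `http://www.source.example` as the source and `http://agent.example` as the proxy, a link to `http://www.source.example/jobs?id=7` becomes `http://agent.example/jobs?id=7`, so the visitor's next click lands on the proxy again.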
Disadvantages
The most obvious risk is connectivity: if the client's site being proxied is inaccessible, no content can be served.
No content caching has been catered for or investigated in this exercise, although it might be achievable on an IIS/.NET level.
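Purely as an illustration of what such caching could look like, and not something the original application did, here is a small in-memory cache with a time-to-live, sketched in Python. The class name `TtlCache`, its parameters and the `fetch` callable are all hypothetical, and this is an application-level sketch rather than IIS or ASP.NET output caching.

```python
import time


class TtlCache:
    """Cache fetched pages in memory for a fixed number of seconds."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch          # callable: url -> page body
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # url -> (expires_at, body)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: skip the source request
        body = self._fetch(url)      # missing or expired: ask the source
        self._entries[url] = (now + self._ttl, body)
        return body
```

A cache like this would only suit pages that are the same for every visitor; anything personalised, such as a shopping cart, would have to bypass it.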
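Slow connections may also noticeably hamper performance, because every page involves two requests: one from the visitor to the proxy and one from the proxy to the source.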
Legal
This technique is closely related to website hijacking, because any content can be replaced, such as contact details, prices and descriptions. The middleman or 'agent' then presents another entity's content in its own name, for its own benefit.
A commonly observed abuse is the interception and redirection of online banking pages in order to obtain visitors' login credentials.
Another is the harvesting of internet users' credit card details during online purchases.
It is therefore always good practice to verify the URL when submitting login details, like usernames and passwords.
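The same check can be expressed in code. The Python sketch below is only an illustration: the function name `is_expected_login_url` and the example hosts are assumptions. It accepts a URL only when it uses HTTPS and its host is exactly the one expected.

```python
from urllib.parse import urlsplit


def is_expected_login_url(url: str, expected_host: str) -> bool:
    """Return True only for an HTTPS URL whose host is exactly expected_host."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Exact comparison: a look-alike such as "bank.example.agent.example"
    # merely starts with the expected name and must not pass.
    return parts.scheme == "https" and host == expected_host.lower()
```

Comparing the whole host name, rather than checking whether the expected name appears somewhere in the URL, is what catches look-alike addresses served through a proxy.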
Conclusion
The result is a simple, small application (12 KB) in which string manipulation and search-and-replace routines take the place of webpage development effort in order to produce entire, fully functional websites.
