
Web Crawler API

The Japanese Crawler project is a research and development project that I did at my company. It was a real challenge for me. My target host is a Japanese patent site, and I wanted to crawl data from it. Normally web host names are based on www, but surprisingly this one uses www7 and www8. When I first saw it, I was confused. However, it turned out not to be important for my work.
I did some research on crawlers and bots before starting the project design. At first I thought this was an ordinary web crawler project, so I downloaded some sample crawler source code from SourceForge. Then I ran the sample crawler against my host (www7.xxxx.com). There was no output at all; it showed only one page, called www7.xxx.ipdl. I couldn't believe it, because the same crawler worked with other sites such as www.google.com. After that I understood that this was not an ordinary web crawler project.
So I had to work harder. First I installed the Firebug plug-in for Mozilla Firefox in my browser. Firebug was very helpful in making the project a success. Next I started reviewing the site's code, and found that JavaScript plays a big role in it. My host (www7.xxx.com) has thousands of output pages, but in reality it has only one page. All the other pages are generated dynamically at runtime. That is a big problem, because a web crawler normally needs static links to crawl data, and my host site has no static links at all; every link is generated dynamically.
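To illustrate why the sample crawler found nothing, here is a minimal sketch of the kind of static link extraction an ordinary crawler performs. The HTML snippet, the class name `StaticLinkCheck`, and the function names `goDetail` and `goList` are made up for illustration; they are not taken from the real site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticLinkCheck {

    // Matches href="..." attributes (double-quoted only, which is enough for this sketch).
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns only the links a static crawler could follow:
    // javascript: pseudo-links and bare "#" anchors are skipped.
    static List<String> followableLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            String lower = href.toLowerCase(Locale.ROOT);
            if (href.isEmpty() || lower.startsWith("javascript:") || lower.startsWith("#")) {
                continue; // generated at runtime by script, nothing to fetch
            }
            links.add(href);
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical page where every link is produced by JavaScript.
        String html = "<a href=\"javascript:goDetail('A1')\">Doc 1</a>"
                + "<a href=\"#\" onclick=\"goList(2)\">Next</a>";
        System.out.println("Followable links: " + followableLinks(html));
    }
}
```

On a page like this the extractor returns an empty list, which matches what I saw: the crawler stops at the first page because there is nothing static to follow.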
Next I had to find the links that are generated at runtime. All of these links are generated by JavaScript using special algorithms, and these algorithms need a lot of parameters to produce an output link. I chose to emulate a web browser in my project: supply the page inputs, build the HTTP request manually, and bind the cookies to it. You can get the idea from picture 1. At times this may look like a very easy task, but it is not, because the developers of the target site have put a lot of security in place.
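The approach described above can be sketched roughly as follows, using the `java.net.http.HttpClient` that ships with Java 11 and later. This is only an illustration under assumptions, not the code from my project: the class name `SessionCrawlerSketch` and the parameter names `docId` and `page` are hypothetical, and the real site needs many more parameters.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class SessionCrawlerSketch {

    // Build the query string that the page's JavaScript would otherwise assemble.
    static String buildQuery(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // One client with a cookie store, so cookies set by an earlier
    // response are sent back automatically on later requests.
    static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Issue the manually built GET request and return the response body.
    static String fetch(HttpClient client, String baseUrl, Map<String, String> params)
            throws IOException, InterruptedException {
        URI uri = URI.create(baseUrl + "?" + buildQuery(params));
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("docId", "JP 2000-123456"); // hypothetical parameter
        params.put("page", "1");               // hypothetical parameter
        // Only prints the query; call fetch(newSessionClient(), baseUrl, params)
        // with a real base URL to actually send the request.
        System.out.println(buildQuery(params)); // prints docId=JP+2000-123456&page=1
    }
}
```

Reusing the same client for the first page and for every generated link is what keeps the session cookies attached to each request.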


Popular posts from this blog

Java Source Code to Change Local IP Address

Hi guys..

Try this code to change your local IP address.


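import java.io.IOException;

public class Chang_Ip {

    public static void main(String[] args) throws IOException, InterruptedException {
        String address = "192.168.0.201"; // new static IP address
        String mask = "255.255.255.0";    // subnet mask

        // Windows only; run from an elevated (administrator) prompt.
        // Positional form of the command: interface name, source, address, mask.
        // Each array element is passed as one argument, so the interface
        // name with spaces needs no extra quoting here.
        String[] command = { "netsh", "interface", "ip", "set", "address",
                "Local Area Connection", "static", address, mask };

        // Run netsh, show its output in this console, and wait for it to finish.
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        System.out.println("netsh exited with code " + exitCode);
    }
}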

How to enable proxy service security in ESB 4.9.0?

Security is one of the major concerns when developing API-based integrations or applications. WSO2 supports the WS-Security, WS-Policy and WS-SecurityPolicy specifications. These specifications define a behavior model for web services. Security requirements differ from one proxy service to another. WSO2 ESB provides twenty predefined, commonly used security scenarios to choose from based on the security requirements. This functionality is provided by the security management feature, which is bundled by default with the service management feature in the ESB. The configuration could be done via the web console up to the ESB 4.8.1 release, but this has been removed from ESB 4.9.0. Even though the feature is no longer provided by the ESB web console itself, the same functionality can be achieved with the new WSO2 Dev Studio. WSO2 always encourages using Dev Studio, rather than the web console, to prepare the artifacts required by the ESB. The best way to explain this scenario is by example. The following example …

How SSL Tunneling Works in the WSO2 ESB

This blog post assumes that the reader has a basic understanding of SSL tunneling and of the basic message flow of the ESB. If you are not familiar with the concepts of SSL tunneling, you can refer to my previous blog post about SSL tunneling, and you can get a detailed idea of the message flow from this article.
I will give a brief introduction to the TargetHandler to make the concepts easier to understand. As you may already know, the TargetHandler (TH) is responsible for handling requests and responses on the backend side. It maintains a status (REQUEST_READY, RESPONSE_READY, etc.) based on the events fired by the IOReactor and executes the relevant methods. For example, if a response coming from the backend side hits the ESB, the IOReactor fires the responseRecived method on the TargetHandler side. The following are the basic methods contained in the TargetHandler and their responsibilities.

Connect: This is executed when a new outgoing connection is needed.
RequestReady: Thi…