
Web Crawler API

The Japanese Crawler project was a research and development project I did at my company. It was a real challenge for me. My target host was a Japanese patent site, and I wanted to crawl data from it. Normally web hostnames start with www, but surprisingly this one used www7 and www8. When I first saw that, I was confused. However, it turned out not to be important for my work.
Before starting the project design, I did some research on crawlers and bots. At first I thought this was an ordinary web crawler project, so I downloaded some sample crawler source code from SourceForge. Then I ran the sample crawler against my host (www7.xxxx.com). Ohhhhh, there was no output; it only showed one page, called www7.xxx.ipdl. I couldn't believe it, because the same code worked with other sites such as www.google.com. After that I understood that this was not an ordinary web crawler project.
So I had to work hard. First I installed the Firebug add-on for Mozilla Firefox. Firebug was very helpful in making my project a success. Next I started reviewing the site's code. Hmmm, JavaScript was playing a big role. My host (www7.xxx.com) has thousands of output pages, but really it has only one page. All the other pages are generated dynamically at runtime. That is a big problem, because a web crawler normally needs static links to crawl data, but my host site had no static links at all; every link was generated dynamically.
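To show why the sample crawlers found nothing, here is a minimal sketch of static link extraction. It is not the code of any crawler I downloaded; the class name, the regular expression and the sample HTML are made up for illustration. A crawler that only looks for ordinary href targets sees links on a static page, but nothing usable on a page whose navigation is done by JavaScript.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticLinkSketch {

    // Very simplified href matcher; a real crawler would use an HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns only the links a static crawler could follow directly.
    public static List<String> staticLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            String target = matcher.group(1);
            // javascript: and "#" targets are resolved by scripts at runtime,
            // so there is no URL to fetch.
            if (!target.startsWith("javascript:") && !target.startsWith("#")) {
                links.add(target);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample pages, not taken from the real site.
        String staticPage = "<a href=\"/about.html\">About</a>";
        String dynamicPage = "<a href=\"javascript:openDoc(1)\">Doc 1</a>"
                + "<a href=\"#\" onclick=\"go(2)\">Doc 2</a>";

        System.out.println("Static page links: " + staticLinks(staticPage));
        System.out.println("Dynamic page links: " + staticLinks(dynamicPage));
    }
}
```

The second page yields an empty list, which is roughly the situation I was in: the pages exist, but there is nothing for an ordinary crawler to follow.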
Then I needed to find the links that are generated at runtime. All of these links are generated by JavaScript using special algorithms, and these algorithms need a lot of parameters to produce an output link. I chose to use virtual web browser methods for my project: supply the page inputs, make the HTTP request manually, and bind the cookies to it. You can get the idea from picture 1. Sometimes you may feel this is a very easy task, but it is not, because the developers of the crawled web site put a lot of security in place.
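The idea of giving the page inputs, making the HTTP request manually and binding cookies can be sketched roughly as below. This is only a sketch under assumptions, not my project code: it uses the java.net.http client from Java 11 or later, and the parameter names (docId, page) and any URL you pass in are hypothetical, since the real site's parameters are not shown here.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class ManualRequestSketch {

    // Encode the page inputs as an application/x-www-form-urlencoded body.
    public static String encodeForm(Map<String, String> inputs) {
        return inputs.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Build the POST request by hand instead of following a static link.
    public static HttpRequest buildRequest(String url, Map<String, String> inputs) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(inputs)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> inputs = new LinkedHashMap<>();
        inputs.put("docId", "12345"); // hypothetical parameter
        inputs.put("page", "1");      // hypothetical parameter
        System.out.println("Form body: " + encodeForm(inputs));

        if (args.length == 0) {
            return; // no URL given, so no request is sent
        }

        // The cookie manager stores session cookies from responses and sends
        // them back on later requests made with this client, like a browser.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .build();

        HttpResponse<String> response = client.send(
                buildRequest(args[0], inputs), HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode());
    }
}
```

Reusing one client for every request is what keeps the cookies bound to the session; creating a new client per request would lose them.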


Popular posts from this blog

Java Source Code to Change Local IP Address

Hi guys..

Try this code to change your local IP address.


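import java.io.IOException;

public class Chang_Ip {

    public static void main(String[] args) throws IOException, InterruptedException {
        String address = "192.168.0.201"; // new static IP address
        String mask = "255.255.255.0";    // subnet mask

        // Windows only. Each key=value pair must be passed as one argument,
        // otherwise netsh receives "name=" and the adapter name separately.
        String[] command = { "netsh", "interface", "ip", "set", "address",
                "name=Local Area Connection", "source=static",
                "addr=" + address, "mask=" + mask };

        // Run netsh and wait for it, so a failure shows up as a non-zero exit code.
        Process process = Runtime.getRuntime().exec(command);
        int exitCode = process.waitFor();
        System.out.println("netsh exited with code " + exitCode);
    }
}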

How to preserve HTTP headers in WSO2 ESB 4.9.0?

Preserving HTTP headers is important when invoking backend services through applications or middleware, because certain important headers are often removed or modified by the applications or middleware that handle the communication. The previous version of our WSO2 ESB, version 4.8.1, only supported preserving the “Server” and “User-Agent” header fields, but with the new ESB 4.9.0 we’ve introduced a new property (http.headers.preserve) for the passthru (repository/conf/passthru-http.properties) and Nhttp (repository/conf/nhttp.properties) transports to preserve more HTTP headers.
Passthru transport, supported header fields: Location, Keep-Alive, Content-Length, Content-Type, Date, Server, User-Agent, Host
Nhttp transport, supported header fields: Server, User-Agent, Date
You can specify the header fields that should be preserved in a comma-separated list, as shown below.
http.headers.preserve = Location, Date, Server
Note that the properties (http.user.agent.preserve, http.server.preserve), which were used …
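As a rough illustration of what a comma-separated header list means, the sketch below parses such a value into a set of header names. It is not WSO2 ESB code and says nothing about how the ESB reads the property internally; the class and method names are hypothetical. The lookup is case-insensitive because HTTP header names are case-insensitive.

```java
import java.util.Set;
import java.util.TreeSet;

public class PreserveHeadersSketch {

    // Parse a value such as "Location, Date, Server" into a
    // case-insensitive set of header names.
    public static Set<String> parsePreserveList(String value) {
        Set<String> names = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        for (String part : value.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Set<String> preserved = parsePreserveList("Location, Date, Server");
        System.out.println("Preserve Location? " + preserved.contains("location"));
        System.out.println("Preserve Host? " + preserved.contains("Host"));
    }
}
```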

How does the Scheduled Failover Message Processor help with guaranteed delivery?

Before we talk about the failover message forwarding processor, it’s better to understand the big picture of the concepts and use cases. The Scheduled Failover Message Forwarding Processor is part of the bigger picture of the message store and message processor.

Message Store Message Processor (MSMP). WSO2 ESB’s message stores and message processors are used to store incoming messages and then deliver them to a particular backend with added Quality of Service (QoS), such as throttling and guaranteed delivery. The basic advantage of the MSMP is that it allows you to send messages reliably to a backend service. These messages can be stored in different kinds of reliable storage, such as JMS or JDBC message stores. The MSMP is powered by three basic components:



1. Store Mediator.
The Store mediator is a Synapse mediator that can be used to store messages in the message store.

2. Message Store.
A message store is storage in the ESB for messages. The WSO2 ESB comes with four types of message store implementations …
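The store-then-deliver idea behind guaranteed delivery can be sketched in plain Java as below. This is only a simplified model under assumptions, not the WSO2 ESB API: the in-memory queue stands in for a message store, store() for the Store mediator, and forward() for a message processor. All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

public class StoreAndForwardSketch {

    // Stand-in for a message store: a simple in-memory FIFO queue.
    private final Deque<String> store = new ArrayDeque<>();

    // Stand-in for the Store mediator: put the message into the store.
    public void store(String message) {
        store.addLast(message);
    }

    // Stand-in for a message processor: one forwarding pass.
    // A message is removed from the store only after the backend accepts it,
    // so a failed delivery leaves it in place for the next pass.
    public int forward(Predicate<String> backend) {
        int delivered = 0;
        while (!store.isEmpty()) {
            String next = store.peekFirst();
            if (!backend.test(next)) {
                break; // backend unavailable: keep the message and retry later
            }
            store.removeFirst();
            delivered++;
        }
        return delivered;
    }

    public int pending() {
        return store.size();
    }

    public static void main(String[] args) {
        StoreAndForwardSketch msmp = new StoreAndForwardSketch();
        msmp.store("order-1");
        msmp.store("order-2");

        // First pass: the backend is down, nothing is lost.
        System.out.println("Delivered: " + msmp.forward(m -> false)
                + ", pending: " + msmp.pending());

        // Second pass: the backend is back, both messages go out in order.
        System.out.println("Delivered: " + msmp.forward(m -> true)
                + ", pending: " + msmp.pending());
    }
}
```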