The Search Engine Professionals at Rank for $ales.com --- In business since 1997.

Convergence of search engines and blogs

April 18, 2003

With about 3 billion pages in its database, Google is the most comprehensive search engine today. At a crawl rate of approximately 150 million pages a day, Google could visit all 3 billion pages in about 20 days.
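The 20-day figure follows directly from the two numbers quoted above. A minimal sketch of the arithmetic, assuming a constant crawl rate and no repeat visits (the function and constant names are illustrative only):

```python
# Figures quoted in the article (April 2003); both are approximations.
INDEX_SIZE_PAGES = 3_000_000_000    # "about 3 billion pages"
CRAWL_RATE_PER_DAY = 150_000_000    # "approximately 150 million pages a day"


def days_for_full_pass(index_size: int, pages_per_day: int) -> float:
    """Days needed to visit every page once at a constant crawl rate."""
    return index_size / pages_per_day


print(days_for_full_pass(INDEX_SIZE_PAGES, CRAWL_RATE_PER_DAY))  # 20.0
```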

However, this is not how it works. Microdoc News comments on the Grub and LookSmart initiative to build and operate a distributed crawler, and identifies yet another step towards the convergence of search engines and weblogs.

Grub: Distributed Crawling
Microdoc News registered and downloaded the Grub client. It is one of 1,096 clients running, which together crawled 47,840,029 URLs in the last 24 hours. Microdoc News dedicated one computer on its network to Grub crawling. In the six hours so far, that computer has crawled 12,750 URLs: 14% had changed, 73% were identified as unchanged, and 5% were down.

Microdoc News can see the benefits of distributed computing and the power each additional computer adds to the network. We disagree with jimlog 2.0, which argued that little would be gained:

The big problem with this project is the bandwidth needed. With a non-distributed crawler, the pages have to be downloaded to the main servers just once. With a distributed crawler, they have to be downloaded at the client, and then uploaded to the server. Uploading to the server from the client is the same as having the server download the page in the first place. So the work is doubled. While it's possible to reduce the size of the data uploaded to the central server by parsing the web pages, to build an effective search engine you need all the data, so the client can't reduce the size much. For example, Google keeps the entire page intact. [Search: By the People, For the People at jimlog 2.0]

For about 75% of the URLs it checked, the Microdoc News client did not have to communicate anything to the central Grub servers. It would seem, then, that each computer added to the network contributes at least a 70% gain in crawling power, with an overhead of about 30% carried by the central Grub servers.
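One way a client could avoid uploading unchanged pages is to compare a digest of each fetched page against the digest from its previous visit. The sketch below illustrates that idea only; it is not a description of Grub's actual protocol, and the function names, the choice of SHA-256, and the example URLs are all assumptions.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def digest(body: bytes) -> str:
    """Fingerprint of a page body (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(body).hexdigest()


def pages_to_upload(
    fetched: Iterable[Tuple[str, bytes]],
    known_digests: Dict[str, str],
) -> List[str]:
    """Return URLs whose content differs from the last recorded digest.

    Pages seen for the first time count as changed. known_digests is
    updated in place so the next pass compares against this one.
    """
    changed = []
    for url, body in fetched:
        current = digest(body)
        if known_digests.get(url) != current:
            changed.append(url)
            known_digests[url] = current
    return changed


# Hypothetical example: one changed page, one unchanged, one new.
known = {
    "http://example.com/a": digest(b"old a"),
    "http://example.com/b": digest(b"same b"),
}
fetched = [
    ("http://example.com/a", b"new a"),
    ("http://example.com/b", b"same b"),
    ("http://example.com/c", b"first visit"),
]
changed = pages_to_upload(fetched, known)
print(changed)
print(f"uploaded {len(changed)} of {len(fetched)} pages")
```

Under this assumption, only the changed share of pages (14% in the sample above) would cost upload bandwidth, which is the point of disagreement with the jimlog 2.0 argument.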

Convergence of Weblogs and Search Engines
Weblog writing is a highly distributed activity, particularly when a weblogger hosts her/his weblog on a different server than a main hosting server. Grub is an example of crawling the web using a distributed model. Both are information management activities conducted using a distributed model.

Blogging is a mechanism Google Inc. uses to good effect to identify which pages readers of the web consider important, and therefore which parts of the Internet should be crawled more often, as Peter Norvig suggests:

"I want more clues about which page to look at rather than another page. . . . It isn't a problem of computing resources but deciding what parts of the Web should be updated more frequently than others," he said. [Wired News: Building a Bigger Search Engine]

Crawling could move towards a more distributed model as well. One important step that has been glossed over to this point is Grub's "localized crawling" feature. This is where I, as both crawling agent and webmaster, place a grub.txt file in the root of each of my websites. The file is a clear pointer to a website that Grub has not had to locate for itself: the distributed crawling agent locates it for Grub. It also marks a site that the webmaster/crawling agent wants to get listed in this new search engine.

Now the blogger and the crawling agent could be one and the same person. Get the word out that the best way to get your site listed is to become a crawling agent as well . . . what is the potential? Every webmaster, every blogger, every person who wants traffic then has an incentive to be a crawling agent.

The next step is to build each crawling client so that it can detect when new pages are added on the machine that hosts it. Since a client is most likely sitting on a machine that also hosts a web site or blog, it becomes, if you like, a distributed spy that detects when that site has pages that need updating.
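Such a detection device could be as simple as comparing file modification times between two scans of the site's document root. The following is a minimal sketch under that assumption; the function names are hypothetical, and a real client would also need to map file paths to URLs.

```python
import os
from typing import Dict, List


def scan_site(root: str) -> Dict[str, float]:
    """Map every file path under root to its last modification time."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            snapshot[path] = os.path.getmtime(path)
    return snapshot


def pages_needing_crawl(
    previous: Dict[str, float],
    current: Dict[str, float],
) -> List[str]:
    """Paths that are new, or whose modification time has changed."""
    return sorted(
        path for path, mtime in current.items()
        if previous.get(path) != mtime
    )
```

The client would call `scan_site` periodically, pass the previous and current snapshots to `pages_needing_crawl`, and report the resulting paths for scheduling.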

Voting
To ensure that a webmaster cannot hack the results to get a better listing in the distributed system than the site deserves, other bloggers' sites need to provide the information used to schedule updates of my site. My client indicates that there are new pages, the scheduling machine lists those pages as due to be crawled soon, and the pages are then crawled in order of link importance, according to the number of links pointing to the site that contains each page.
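The scheduling idea can be sketched as a priority queue keyed on inbound link counts. This is an illustration of the conjecture above, not an existing Grub feature: the function name, the data shapes, and the example sites are assumptions, and the link counts are taken to come from other sites rather than from the reporting client.

```python
import heapq
from typing import Dict, Iterable, List, Tuple


def crawl_order(
    reported_pages: Iterable[Tuple[str, str]],
    inbound_links: Dict[str, int],
) -> List[str]:
    """Order reported pages by the link importance of their site.

    reported_pages: (site, page_url) pairs sent in by crawling clients.
    inbound_links: site -> number of links pointing to it from other sites.
    """
    heap = []
    for seq, (site, url) in enumerate(reported_pages):
        # heapq is a min-heap, so negate the count to pop the most-linked
        # site first; seq keeps report order for sites with equal counts.
        heapq.heappush(heap, (-inbound_links.get(site, 0), seq, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Hypothetical reports and link counts.
reported = [
    ("blog-a.example", "http://blog-a.example/new-post"),
    ("shop-b.example", "http://shop-b.example/sale"),
    ("blog-c.example", "http://blog-c.example/entry"),
]
links = {"blog-a.example": 12, "shop-b.example": 340, "blog-c.example": 12}
for url in crawl_order(reported, links):
    print(url)
```

Because a client can only report that a page is new, and its position in the queue depends on links it does not control, reporting more pages does not by itself earn a better place in the schedule.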

Now this is all conjecture, ideas pulled out of a hat, and some of it may never happen. This is, however, a space to watch. Convergence is happening now, and the next few months should bring exciting new developments.


Source: Microdoc News



Copyright © Rank for Sales 2003