Indexing the whole site
To crawl and index the whole site, click on the 'Generate' link in the menu and then click on the first button, 'Crawl Site'. The program then starts crawling from the root page as defined in the set up.
It will collect all the links from the page, discarding any links that do not have the same base domain name as defined in the set up.
It will also discard any urls that are in the exclude folder list.
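The link filtering described above can be sketched as follows. This is only an illustration in Python, not the program's own code: BASE_DOMAIN, EXCLUDE_FOLDERS and filter_links are hypothetical names standing in for the values entered in the set up, and matching excluded folders by path prefix is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical settings standing in for the values entered in the set up.
BASE_DOMAIN = "www.example.com"
EXCLUDE_FOLDERS = ["/admin/", "/private/"]


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def filter_links(page_url, html):
    """Return the links on the page that the crawler would keep."""
    collector = LinkCollector()
    collector.feed(html)
    kept = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # mailto:, javascript: and similar are not pages
        if parts.netloc.lower() != BASE_DOMAIN:
            continue  # different domain: discard
        if any(parts.path.startswith(folder) for folder in EXCLUDE_FOLDERS):
            continue  # inside an excluded folder (assumed prefix match): discard
        kept.append(absolute)
    return kept
```

Calling filter_links with the url of a page and its HTML returns only the absolute urls that are on the configured domain and outside the excluded folders.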
Once it has a valid url, it extracts the text from the page. This includes the title, the meta description if available, the meta keywords if available, the body text and any text in the alt tags of images.
It discards any text inside comment tags, because it is only concerned with 'visible' text.
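As an illustration of which text is gathered, here is a minimal Python sketch built on the standard html.parser module. It is not the program's own code. VisibleTextExtractor and extract_text are hypothetical names, and skipping the contents of script and style tags is an assumption that the description above does not state.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Gathers title, meta description/keywords, body text and image alt text."""

    SKIPPED = {"script", "style"}  # assumption: not treated as visible text

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() in ("description", "keywords"):
            self.chunks.append(attrs.get("content") or "")
        elif tag == "img" and attrs.get("alt"):
            self.chunks.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments are passed to handle_comment, which is not overridden here,
        # so commented-out text never reaches this method.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the collected text of one page as a single string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(chunk for chunk in parser.chunks if chunk)
```

The title and body text arrive through handle_data, while the meta and alt values are read from tag attributes, which is why they are handled separately.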
The text is then collected together and the words which are to be excluded are removed. The remaining words are stored in the database with their number of occurrences.
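The counting step might look like the Python sketch below. EXCLUDED_WORDS and count_words are hypothetical names, the sample exclusion list is invented for illustration, and the rule used to split text into words is an assumption. The database write is left out.

```python
import re
from collections import Counter

# Hypothetical exclusion list; the real list is whatever the program is configured with.
EXCLUDED_WORDS = {"the", "and", "a", "of", "to", "is"}


def count_words(text):
    """Lower-case the text, drop excluded words, and count the rest."""
    # Assumption: a word is a run of letters, digits or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(word for word in words if word not in EXCLUDED_WORDS)
```

The resulting mapping of word to count corresponds to what would be stored for each page.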
The script continues by recursively looking for new links until all links have been found and all pages have been indexed.
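The overall crawl can be pictured with the short Python sketch below. It uses a queue and a set of already seen urls instead of recursion, which visits the same pages but is not necessarily how the program does it. The names crawl and get_links are hypothetical, and get_links is assumed to return the links that survive the domain and exclude folder checks.

```python
from collections import deque


def crawl(root_url, get_links):
    """Visit every page reachable from root_url exactly once.

    get_links is a hypothetical callable that returns the already
    filtered links found on a page.
    """
    seen = {root_url}
    queue = deque([root_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # the page would be indexed at this point
        for link in get_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order
```

The seen set is what guarantees the crawl finishes even when pages link back to each other.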
When you click on the 'Crawl Site' button you are taken to a page that monitors the progress of the indexing. This page is refreshed every few seconds and displays the number of words indexed and the last page to be indexed.
Indexing changed pages only
If you wish to index only changed pages, that is, pages that have been modified since the last index, click on the 'Changed pages' button.
In this situation the script does not crawl the site, so any new pages are not found. The script looks up the url in the database and checks the page size. It will re-index the page if the page size has changed.
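The size comparison can be sketched in Python as below. The function name and the dictionary shapes are hypothetical; in the program the stored sizes come from the database and the current sizes from fetching each known url.

```python
def pages_to_reindex(stored_sizes, current_sizes):
    """Return the urls whose size differs from the size recorded at the last index.

    stored_sizes:  {url: size in bytes} as held in the database (hypothetical shape)
    current_sizes: {url: size in bytes} as measured now
    """
    return [
        url
        for url, old_size in stored_sizes.items()
        if url in current_sizes and current_sizes[url] != old_size
    ]
```

Only urls already in the database are checked, so a new page is never picked up this way. A page whose content has changed but whose size is the same would also be skipped by a check of this kind.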
This method greatly improves the speed of indexing.
When you click on the 'Changed pages' button, you are taken to a progress page indicating the state of indexing.
Incorporating searchdb into your site
The crawler requires standard href links on the web pages to crawl its way from one page to the next. If your site uses a DHTML drop-down menu system, there may not be any such links available for the crawler to follow.
In those situations you will need to create a site map or place fixed links onto each page. The site map may also help to index your site with external search engines such as Google.
Note that the crawler does not attempt to crawl pages which are not part of the defined domain. This is to prevent the crawler disappearing into the internet.
Excluding defined text on a page
You may have certain parts of your web page which you do not wish to be indexed. This may be menu or footer details which appear on every page. To exclude text, place comments around the text as follows :
<!-- exclude_index_start //-->ignore this section
<!-- exclude_index_end //-->
You may place as many exclude comments on a page as you wish.
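One way such markers could be honoured is sketched below in Python. This is an assumption about the mechanism, not the program's own code. EXCLUDE_BLOCK and strip_excluded are hypothetical names, and the tolerance for extra spaces and letter case inside the comments is also an assumption.

```python
import re

# Matches everything from a start marker to the next end marker, inclusive.
EXCLUDE_BLOCK = re.compile(
    r"<!--\s*exclude_index_start\s*//-->.*?<!--\s*exclude_index_end\s*//-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_excluded(html):
    """Remove every section wrapped in the exclude comments."""
    # The non-greedy .*? stops at the first end marker, so several
    # separate excluded sections on one page are each removed.
    return EXCLUDE_BLOCK.sub(" ", html)
```

Running this before the text is extracted means that menu or footer text inside the markers never reaches the index.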
Web configuration file (web.config)
As well as the configuration entries needed to run the program, you may have to make other modifications so that your ASP.NET applications run, such as:
<customErrors mode="Off" />
Database update errors
A common error is not having the correct permissions on the Access database. The database requires write permissions.
You may get error messages such as 'Query needs an updatable recordset' or similar. This indicates that you have not set the permissions on the Access database.
If you are testing it on a local PC, right-click on the Access database and click on Properties, then click on the Security tab. Click on the user name to set permissions for an internet browser. This is usually 'User'. Change the permissions for this user to Write.
If you do not see the Security tab, it is probably because simple file sharing is switched on. In Windows Explorer, go to Tools, Folder Options, View and uncheck 'Use simple file sharing'.
Extract text
The extract text displayed in the search results depends on the option you choose in the set up displays.
You may choose to display the text defined in the meta description tag, which is placed in the head of the web page.
e.g. <meta name="description" content="this is a page about something">
If you do not select this option then the script will extract the words from the body of the page. In this situation, you may define a start point from which the text is extracted.
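The choice between the two sources can be sketched in Python as follows. This is an illustration only: build_extract and its parameters are hypothetical names, treating the start point as a piece of text to search for is an assumption, and so are the word limit and the fallback to body text when no meta description exists.

```python
def build_extract(meta_description, body_text, use_meta, start_marker="", max_words=25):
    """Choose the text shown under a search result."""
    if use_meta and meta_description:
        return meta_description
    # Assumption: fall back to the body when the page has no meta description.
    text = body_text
    if start_marker:
        position = text.find(start_marker)
        if position != -1:
            text = text[position:]  # begin the extract at the start point
    # Assumption: the extract is limited to a fixed number of words.
    return " ".join(text.split()[:max_words])
```

With a start point set, menu text that comes before it in the body is left out of the extract.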
| Copyright © 2010 | Page updated April 2010 |