LDSE - local domain search engine
This program is the result of my diploma thesis (of the same title). It is
a distributed search engine consisting of multiple search nodes that report their results
to a master server. Each node is responsible for indexing
and querying a local node (ideally a single web server). The nodes
are connected hierarchically. Every super node can run a query against
its own index and can also query all, or a subset, of its sub-nodes. It does so
by determining which sub-nodes can give the best results for the query.
Every node can be queried directly, even if it is not the top node. It then uses
its own data and the data of its sub-nodes to answer the query.
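
The query routing described above can be sketched roughly as follows. This is an illustration only, not the LDSE code: the class `SearchNode`, its method names, and the heuristic for choosing sub-nodes (counting how many query terms are known somewhere in a sub-node's subtree) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a hierarchical search node (not the LDSE classes).
class SearchNode {
    private final String name;
    // Local index of this node: term -> documents containing the term.
    private final Map<String, Set<String>> localIndex = new HashMap<>();
    private final List<SearchNode> subNodes = new ArrayList<>();

    SearchNode(String name) { this.name = name; }

    String name() { return name; }

    void addSubNode(SearchNode node) { subNodes.add(node); }

    void index(String document, String... terms) {
        for (String term : terms) {
            localIndex.computeIfAbsent(term, t -> new HashSet<>()).add(document);
        }
    }

    // Assumed heuristic: number of query terms known anywhere in this subtree.
    int estimate(List<String> terms) {
        int score = 0;
        for (String term : terms) {
            if (knows(term)) score++;
        }
        return score;
    }

    private boolean knows(String term) {
        if (localIndex.containsKey(term)) return true;
        for (SearchNode sub : subNodes) {
            if (sub.knows(term)) return true;
        }
        return false;
    }

    // Answers a query from the local index plus the most promising sub-nodes.
    // A document matches if it contains at least one of the terms.
    Set<String> query(List<String> terms, int maxSubNodes) {
        Set<String> results = new TreeSet<>();
        for (String term : terms) {
            results.addAll(localIndex.getOrDefault(term, Collections.emptySet()));
        }
        // Keep only sub-nodes that can contribute, best estimate first.
        List<SearchNode> candidates = new ArrayList<>();
        for (SearchNode sub : subNodes) {
            if (sub.estimate(terms) > 0) candidates.add(sub);
        }
        candidates.sort(
                Comparator.comparingInt((SearchNode n) -> n.estimate(terms)).reversed());
        int limit = Math.min(maxSubNodes, candidates.size());
        for (SearchNode sub : candidates.subList(0, limit)) {
            results.addAll(sub.query(terms, maxSubNodes));
        }
        return results;
    }
}
```

Because `query` is an ordinary method on every node, calling it on a sub-node instead of the top node gives the "query any node directly" behaviour: the node answers from its own index and from whatever lies below it.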
Special features are:
- distributed search engine
- tolerant of spelling errors and other word forms
- separated data server and data gatherer
- supports many file formats via a plugin mechanism; currently supported are
  - PDF
  - HTML
  - plain text
  - ZIP and gzip files
- works with any relational database (possibly with small changes because of differences in the SQL dialect)
  - tested with InstantDB and Oracle Lite
- can gather data via HTTP or from local file system
- HTTP spider is resistant to loops (if a document links to itself, but under another path)
- HTTP spider is resistant to HTML errors (like missing quotes around parameter values or unescaped &'s)
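
One way to make a spider resistant to the loops mentioned in the list above, where the same document is reachable under different paths, is to remember a digest of every fetched document and to skip content that has already been seen. The following is a sketch of that idea only. Whether LDSE detects loops this way is not stated here, and the class `SeenDocuments` and its method `markIfNew` are hypothetical names.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers which document contents were already fetched.
class SeenDocuments {
    private final Set<String> digests = new HashSet<>();

    // Returns true the first time this content is seen, false for repeats,
    // regardless of the URL or path under which it was fetched.
    boolean markIfNew(byte[] content) {
        return digests.add(hex(sha256(content)));
    }

    private static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

A spider would call `markIfNew` after downloading a page and follow the page's links only when it returns true. Comparing content rather than URLs is what catches a document that links to itself under a different path.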
The fault tolerance is provided by a so-called "trigram index". This index takes
all trigrams (three-letter combinations) contained in a word and stores them
in a reverse index that maps each trigram to the words containing it. A second
reverse index maps the words to the documents.
This gives fast queries and tolerance of misspelled words (either in the searched documents
or in the query). It can also find substrings in words.
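
The two reverse indexes can be sketched roughly like this. It is a minimal in-memory illustration, not the LDSE implementation (which stores its data in a relational database): the class name `TrigramIndexSketch`, the method names, and the matching rule (a word matches if it contains at least a given share of the query's trigrams) are assumptions made for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a trigram index with two reverse indexes.
class TrigramIndexSketch {
    // First reverse index: trigram -> words containing it.
    private final Map<String, Set<String>> trigramToWords = new HashMap<>();
    // Second reverse index: word -> documents containing it.
    private final Map<String, Set<String>> wordToDocuments = new HashMap<>();

    // All three-letter combinations of a word; words shorter than three
    // letters yield no trigrams and cannot be found by this sketch.
    static Set<String> trigrams(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= w.length(); i++) {
            result.add(w.substring(i, i + 3));
        }
        return result;
    }

    void add(String document, String word) {
        String w = word.toLowerCase(Locale.ROOT);
        wordToDocuments.computeIfAbsent(w, k -> new HashSet<>()).add(document);
        for (String t : trigrams(w)) {
            trigramToWords.computeIfAbsent(t, k -> new HashSet<>()).add(w);
        }
    }

    // Returns the documents containing a word that shares at least
    // minShare (0..1) of the query's trigrams. This rule is an assumption.
    Set<String> search(String query, double minShare) {
        Set<String> queryTrigrams = trigrams(query);
        Map<String, Integer> hits = new HashMap<>();
        for (String t : queryTrigrams) {
            for (String w : trigramToWords.getOrDefault(t, Collections.emptySet())) {
                hits.merge(w, 1, Integer::sum);
            }
        }
        Set<String> documents = new TreeSet<>();
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() >= minShare * queryTrigrams.size()) {
                documents.addAll(wordToDocuments.get(e.getKey()));
            }
        }
        return documents;
    }
}
```

With the word "distributed" indexed, the misspelled query "distribted" still shares six of its eight trigrams with it, so a threshold of 0.5 finds the document. A query such as "gram" with a threshold of 1.0 behaves like a substring search, because every trigram of the query must occur in the matching word.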
|
Download
The current version is 0.5.4 (source, all)
(each file is about 2.0 MB)
|
Change History
- the default gatherer.xml specified a wrong class name for the HTMLDocument
(missing the de.)
- the HTTPSpider no longer pops up windows for authorization and cookies
- DataServer now checks for null-strings read from the input
- build.xml now references Acme.zip for the server classpath
- changed DataClient
  - the input reader is no longer emptied when issuing a command to the server
  - this is because with JDK 1.3 the input blocks even though ready() returns true :(
- changed the query string in query.xml to "document"
- added the directory 'testdata' containing 9 simple files, changed 'gatherer.xml' to use this directory
- corrected the start script for the queryclient (tstuder.zip was missing)
- corrected the start script for the adminclient (wrong JAR file name, wrong class name)
- corrected the client JARs (resource bundles were missing)
- renamed resource bundles so that English is now the default (instead of German)
- corrected config files for the servlet (used wrong paths)
- included the missing source for the tstuder table component
- better error handling in Gatherer
  - now an exception is printed if a spider class cannot be found
- changed gatherer.xml to contain the correct class names for the document handler classes
- corrected more occurrences of wrong column names (SIZE instead of DOC_SIZE, ID instead of DOC_ID) :(
- the last-changed date is now stored as a string (milliseconds since 1/1/1970), because with JDK 1.3
and InstantDB only the day, but not the time, was stored when using DATE as the data type
- corrected the buildfile to exclude unnecessary files from the distributions
(not released)
- Ant 1.2 is now used as the build tool
- reworked the build.xml
  - introduced filesets and patternsets to define the file sets for the individual programs
  - introduced configuration variables (e.g. for the database) so that settings can be
    changed quickly
  - an up-to-date check is done before building the JARs
  - no temporary files are used for building the JARs
- moved some common classes to the new package de.hendriklipka.ldse.server.common
- translated the description of the communication protocols into English
- bugfix: there were wrong column names (MIMETYPE instead of MIME_TYPE) in the class RDBMSDBConnection
- NOTE to myself: test ALL functions before releasing code :(
- document server and gatherer can now read their config files from another directory
(so they could be read from the cfg subdirectory...)
- corrected the startup scripts to use the correct class names
- build.xml was changed to include all necessary classes in the JAR files
- build.xml was changed to create correct binary and source distributions
- documentserver.xml was changed to use InstantDB as the default database
- translated more documentation into English (config)
- new email address for contact: ldse@hendriklipka.de
- initial release