Hendrik Lipka - LDSE - local domain search engine

LDSE - local domain search engine

This program is a result of my diploma thesis (with the same title). It is a distributed search engine, consisting of multiple search nodes which report their results to a master server. Each node is responsible for indexing and querying a local domain (ideally a single web server). The nodes are connected hierarchically. Every super node can execute a query against its own index and can also query all, or a subset, of its sub-nodes. The subset is chosen by determining which sub-nodes can give the best results for the query.
It is possible to query every node directly, even if it is not the top node. It then uses its own data and the data of its sub-nodes to answer the query.
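
The hierarchical query described above can be sketched roughly as follows. This is only an illustration, not LDSE's actual code: the class SearchNode, the estimate() scoring and the maxSubNodes parameter are all hypothetical. The text above says only that the most promising sub-nodes are selected, not how that is decided.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a node in the hierarchy; none of these names come from LDSE.
class SearchNode {
    private final String name;
    private final List<String> localDocuments = new ArrayList<>(); // stands in for the local index
    private final List<SearchNode> subNodes = new ArrayList<>();

    SearchNode(String name) { this.name = name; }

    void addDocument(String text) { localDocuments.add(text.toLowerCase()); }

    void addSubNode(SearchNode node) { subNodes.add(node); }

    // Assumed relevance estimate: number of documents in this subtree that contain the term.
    int estimate(String term) {
        String needle = term.toLowerCase();
        int count = 0;
        for (String doc : localDocuments) {
            if (doc.contains(needle)) count++;
        }
        for (SearchNode sub : subNodes) {
            count += sub.estimate(needle);
        }
        return count;
    }

    // Answer a query from the local index plus the most promising sub-nodes.
    List<String> query(String term, int maxSubNodes) {
        String needle = term.toLowerCase();
        List<String> hits = new ArrayList<>();
        for (String doc : localDocuments) {
            if (doc.contains(needle)) hits.add(name + ": " + doc);
        }
        // Keep only sub-nodes that can contribute, best estimate first.
        List<SearchNode> candidates = new ArrayList<>(subNodes);
        candidates.removeIf(n -> n.estimate(needle) == 0);
        candidates.sort(Comparator.comparingInt((SearchNode n) -> n.estimate(needle)).reversed());
        int limit = Math.min(maxSubNodes, candidates.size());
        for (SearchNode sub : candidates.subList(0, limit)) {
            hits.addAll(sub.query(term, maxSubNodes));
        }
        return hits;
    }
}
```

Because query() can be called on any SearchNode, a node in the middle of the hierarchy answers with its own data and that of its sub-nodes. A real implementation would probably base the estimate on summary statistics held by the super node rather than on a recursive call.
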
Special features are:
- distributed search engine
- tolerant of spelling errors and other word forms
- separate data server and data gatherer
- supports many file formats via a plugin mechanism; currently supported are:
  - PDF
  - HTML
  - plain text
  - ZIP and gzip files
- uses any relational database (possibly with small changes because of differences in the SQL dialect)
  - tested with InstantDB and Oracle Lite
- can gather data via HTTP or from the local file system
  - the HTTP spider is resistant to loops (for example, a document that links to itself under a different path)
  - the HTTP spider is resistant to HTML errors (such as missing quotation marks in parameters or unquoted &'s)
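
One way a spider can avoid loops of the kind mentioned above is to remember a fingerprint of every document it has already fetched and to skip a document whose content it has seen under another URL. The sketch below shows that idea only. How LDSE's HTTP spider actually detects loops is not described here, and the class LoopGuard and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of LDSE: remembers visited URLs and content fingerprints.
class LoopGuard {
    private final Set<String> seenUrls = new HashSet<>();
    private final Set<String> seenFingerprints = new HashSet<>();

    // Returns true only if neither this URL nor this content has been seen before.
    boolean shouldIndex(String url, String content) {
        if (!seenUrls.add(url)) {
            return false; // the same URL was already visited
        }
        // The same content under a different path is treated as a loop.
        return seenFingerprints.add(fingerprint(content));
    }

    // SHA-256 digest of the content, rendered as a hex string.
    private static String fingerprint(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

A spider using such a guard would call shouldIndex() after fetching each page and would follow the page's links only when it returns true.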

For the fault tolerance, a so-called "trigram index" is used. This index takes all trigrams (3-letter combinations) contained in a word and stores them in a reverse index. A second reverse index maps the words to the documents. This gives fast queries and tolerance of misspelled words (either in the searched documents or in the query). It can also find substrings within words.
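
The two reverse indexes can be illustrated with a small in-memory sketch. It makes several assumptions: the class TrigramIndex, its methods and the minShare threshold are hypothetical, and the matching rule (a word matches if it contains a given share of the query's trigrams) is just one simple choice. LDSE itself keeps its data in a relational database, and its actual scoring is not described here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory sketch of a trigram index; not LDSE's actual classes.
class TrigramIndex {
    // First reverse index: trigram -> words containing it.
    private final Map<String, Set<String>> trigramToWords = new HashMap<>();
    // Second reverse index: word -> documents containing it.
    private final Map<String, Set<String>> wordToDocs = new HashMap<>();

    // All distinct 3-letter combinations in a word, lower-cased.
    static Set<String> trigrams(String word) {
        String w = word.toLowerCase();
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 3 <= w.length(); i++) {
            result.add(w.substring(i, i + 3));
        }
        return result;
    }

    void add(String docId, String word) {
        String w = word.toLowerCase();
        wordToDocs.computeIfAbsent(w, k -> new TreeSet<>()).add(docId);
        for (String t : trigrams(w)) {
            trigramToWords.computeIfAbsent(t, k -> new TreeSet<>()).add(w);
        }
    }

    // Documents with a word that contains at least minShare of the query's trigrams.
    Set<String> search(String query, double minShare) {
        Set<String> queryTrigrams = trigrams(query);
        Map<String, Integer> shared = new HashMap<>();
        for (String t : queryTrigrams) {
            for (String w : trigramToWords.getOrDefault(t, Collections.emptySet())) {
                shared.merge(w, 1, Integer::sum);
            }
        }
        Set<String> docs = new TreeSet<>();
        for (Map.Entry<String, Integer> e : shared.entrySet()) {
            if (e.getValue() >= minShare * queryTrigrams.size()) {
                docs.addAll(wordToDocs.get(e.getKey()));
            }
        }
        return docs;
    }
}
```

With "distributed" indexed for one document, the misspelled query "distibuted" still shares six of its eight trigrams with the indexed word, so a threshold of 0.5 finds the document. A query such as "base" with a threshold of 1.0 finds documents containing "database", because every trigram of the query occurs in that word. Queries shorter than three letters have no trigrams and return nothing in this sketch.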

Download

The current version is 0.5.4 (source, all); each file is about 2.0 MB.

Change History

version 0.5.4 ( bin / src )

- the default gatherer.xml specified a wrong class name for the HTMLDocument (the "de." prefix was missing)
- the HTTPSpider no longer pops up windows for authorization and cookies
- DataServer now checks for null-strings read from the input
- build.xml now references Acme.zip for the server classpath

version 0.5.3 ( bin / src )

- changed DataClient
  - the input reader is no longer emptied when issuing a command to the server
  - this is because, with JDK 1.3, reading the input blocks even though ready() returns true :(
- changed the query string in query.xml to "document"
- added the directory 'testdata' containing 9 simple files, changed 'gatherer.xml' to use this directory
- corrected the start script for the queryclient (tstuder.zip was missing)
- corrected the start script for the adminclient (wrong JAR file name, wrong class name)
- corrected the client JARs (resource bundles were missing)
- renamed resource bundles so that English is now the default (instead of German)
- corrected config files for the servlet (used wrong paths)
- included the missing source for the tstuder table component
- better error handling in Gatherer
- now an exception is printed if a spider class cannot be found
- changed gatherer.xml to contain the correct class names for the document handler classes
- corrected more occurrences of wrong column names (SIZE instead of DOC_SIZE, ID instead of DOC_ID) :(
- the last-changed date is now stored as a string (milliseconds since 1/1/1970), because with JDK 1.3 InstantDB stored only the day, but not the time, when DATE was used as the data type
- corrected the buildfile to exclude unnecessary files from the distributions

version 0.5.2 ( bin / src )

(not released)
- Ant 1.2 is now used as the build tool
- reworked the build.xml
  - introduced filesets and patternsets to define the sets of files for the individual programs
  - introduced configuration variables (e.g. for the database) so that settings can be changed quickly
  - an up-to-date check is done before building the JARs
  - no temporary files are used when building the JARs
- moved some common classes to the new package de.hendriklipka.ldse.server.common
- translated the description of the communication protocols into English
- bugfix: there were wrong column names (MIMETYPE instead of MIME_TYPE) in the class RDBMSDBConnection
- NOTE to myself: test ALL functions before releasing code :(

version 0.5.1 ( bin / src )

- document server and gatherer can now read their config files from another directory (so they could be read from the cfg subdirectory...)
- corrected the startup scripts to use the correct class names
- build.xml was changed to include all necessary classes in the JAR files
- build.xml was changed to create correct binary and source distributions
- documentserver.xml was changed to use InstantDB as the default database
- translated more documentation into English (config)
- new email address for contact: ldse@hendriklipka.de

version 0.5.0 ( bin / src )

- initial release