Each of us has been faced with the problem of searching for
information more than once. Regardless of the data source we
are using (the Internet, the file system on our hard drive, a
database or the global information system of a big company), the
problems can be many: the sheer physical volume of the data
being searched, unstructured information, differing file types
and the difficulty of wording the search query accurately.
We have already reached the stage where the amount of data
on one single PC is comparable to the amount of text data stored
in a proper library. As for unstructured data flows, in the
future they are only going to increase, and at a very rapid
pace. If for an average user this might be just a minor
misfortune, for a big company the absence of control over
information can mean significant problems. So the need to
create search systems and technologies that simplify and
accelerate access to the necessary information arose long ago.
Such systems are numerous, and not every one of them is based
on a unique technology. Choosing the right one depends directly
on the specific tasks to be solved in the future. While the
demand for perfect data searching and processing tools is
steadily growing, let’s consider the state of affairs on the
supply side.
Without going deeply into the peculiarities of the technology,
all search programs and systems can be divided into three
groups: global Internet systems, turnkey business solutions
(corporate data searching and processing technologies) and
simple phrasal or file search on a local computer. Different
directions presumably mean different solutions.
Local search
Everything is clear about search on a local PC. It is not
remarkable for any particular functionality, except for the
choice of file type (media, text etc.) and the search location.
Just enter the name of the file you are looking for (or part of
its text, for example in a Word document) and that’s it. The
speed and the result depend fully on the text entered into the
query line. There is no intelligence in this: the search simply
looks through the available files to determine their relevance.
This is understandable: what’s the use of creating a
sophisticated system for such uncomplicated needs?
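The “simply looking through the available files” approach can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name naive_search is hypothetical, files are read as plain UTF-8 text, and real desktop search tools also handle binary formats such as Word documents.

```python
from pathlib import Path


def naive_search(root, query):
    """Return files under root whose name or text contains the query.

    Hypothetical helper: no index, no ranking, just a linear scan.
    """
    needle = query.lower()
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # First try the file name, as a "search by name" would.
        if needle in path.name.lower():
            hits.append(path)
            continue
        # Then fall back to scanning the content as plain text.
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it
        if needle in text.lower():
            hits.append(path)
    return hits
```

Every query rereads every file, which is tolerable on one hard drive and hopeless at the scale discussed in the next section.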
Global search technologies
Matters are entirely different for the search systems operating
on the global network. One can’t rely simply on looking through
the available data. The huge volume of unstructured information
(Yandex, for instance, can boast an index of more than 11
terabytes of data) makes simple search not only ineffective but
also slow and labor-intensive. That’s why the focus has lately
shifted towards optimizing search and improving its quality.
But the scheme is still very simple (apart from the secret
innovations of each individual system): phrasal search through
an indexed database with proper consideration for morphology
and synonyms. Undoubtedly, such an approach works
but doesn’t solve the problem completely. After reading dozens
of articles dedicated to improving search with Google or
Yandex, one can arrive at the conclusion that, without knowing
the hidden capabilities of these systems, finding a relevant
document for a query takes more than a minute, and sometimes
more than an hour. The problem is that such an implementation
of search depends heavily on the query word or phrase entered
by the user. The vaguer the query, the worse the search.
This has become an axiom, or dogma,
whichever you prefer. Of course, by intelligently using the key
functions of the search systems and carefully choosing the
phrase by which documents and sites are searched, it is possible
to get acceptable results. But this would be the result of
painstaking mental work and of time wasted on looking through
irrelevant information in the hope of finding at least some
clues on how to improve the search query. In general, the scheme
is as follows: enter a phrase, look through several results,
realize that the query was not the right one, enter a new
phrase, and repeat these steps until the relevance of the
results reaches the highest possible level. But even then the
chances of finding the right document remain slim. No average
user will voluntarily go for the sophistication of “advanced
search” (although it is equipped with a number of very useful
functions such as the choice of language, file format etc.). The
best would be to simply enter a word or phrase and get a
ready answer, without particular concern for the means of
getting it. Let the horse think: it has a big head. Maybe this
is not exactly to the point, but the name of one of the Google
search functions, “I’m Feeling Lucky”, characterizes the
existing search technologies very well. Nevertheless, the
technology works, not ideally and not always living up to
expectations, but if you allow for the complexity of searching
through the chaos of Internet data, it is acceptable.
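The scheme named above, phrasal search through an indexed database with consideration for morphology and synonyms, can be sketched in a few lines of Python. This is a toy model, not the method of Google or Yandex: the suffix-stripping stem function stands in for real morphological analysis, and the SYNONYMS table is a hypothetical two-word thesaurus.

```python
import re
from collections import defaultdict

# Hypothetical thesaurus, keyed by stemmed word.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}


def stem(word):
    """Crude suffix stripping, standing in for real morphology."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each stemmed term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in words(text):
            index[stem(word)].add(doc_id)
    return index


def search(index, query):
    """Ids of documents matching every query word or one of its synonyms."""
    result = None
    for word in words(query):
        key = stem(word)
        variants = {key} | {stem(s) for s in SYNONYMS.get(key, ())}
        matches = set()
        for variant in variants:
            matches |= index.get(variant, set())
        result = matches if result is None else result & matches
    return result or set()
```

With documents such as “Repairing old cars” and “Automobile repairs and service”, the query “car repair” matches both, because stemming and the synonym table bridge the different word forms. The result still depends entirely on which words the user happened to type.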
Corporate systems
Third on the list are the turnkey solutions based on search
technologies. They are meant for serious companies and
corporations that possess really large databases and are
equipped with all sorts of information systems and documents.
In principle, the technologies themselves can also be used for
home needs. For example, a programmer working remotely from the
office will make good use of search to reach program source
code scattered across his hard drive. But these are
particulars. The main application of the technology is still
solving the problem of quickly and accurately searching through
large data volumes and working with various information
sources. Such systems usually operate by a very simple scheme
(although there are undoubtedly numerous unique methods of
indexing and processing queries beneath the surface): phrasal
search with proper consideration for all the stem forms,
synonyms etc., which once again brings us to the human factor.
When using such technology the user must first word the query
phrases that will serve as the search criteria and that are
presumed to occur in the documents to be retrieved. But there is
no guarantee that the user will be able to choose or remember
the correct phrase unaided and, furthermore, that the search by
this phrase will be satisfactory. Another key point is the
speed of processing a
query. Of course, when the whole document is used instead of a
couple of words, the accuracy of search increases manifold. But
to date this opportunity has not been used because such a
process is so resource-hungry. The point is that search by words
or phrases will not give us results that are highly similar to
what we need, while search by a phrase as long as a whole
document consumes much time and computer resources. Here is an
example: when a one-word query is processed there is no
considerable difference in speed: whether it takes 0.1 or 0.001
second is not of crucial importance to the user. But take an
average-size document containing about 2000 unique words, and
the search with consideration for morphology (stem forms) and
thesaurus (synonyms), together with generating a relevant list
of results as in key word search, will take several dozen
minutes (which is unacceptable for a user).
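The arithmetic behind this example can be made explicit with a rough Python estimate. Everything beyond the 2000 unique words and the 0.1 and 0.001 second figures is an assumption made for illustration: the expansion factors (5 stem forms and 3 synonyms per word) are hypothetical, the two timings are treated here as the cost of a single index lookup, and lookups are assumed to run one after another.

```python
def estimate_lookups(unique_words, forms_per_word=1, synonyms_per_word=0):
    """Index lookups needed when each word, and each of its synonyms,
    is expanded into all of its stem forms."""
    return unique_words * forms_per_word * (1 + synonyms_per_word)


def estimate_seconds(unique_words, seconds_per_lookup,
                     forms_per_word=1, synonyms_per_word=0):
    """Total time, assuming sequential lookups at a constant cost."""
    lookups = estimate_lookups(unique_words, forms_per_word, synonyms_per_word)
    return lookups * seconds_per_lookup


# Hypothetical expansion factors: 5 forms and 3 synonyms per word.
for cost in (0.001, 0.1):
    total = estimate_seconds(2000, cost, forms_per_word=5, synonyms_per_word=3)
    print(f"{cost} s per lookup: about {total / 60:.1f} minutes")
```

Under these assumptions a 2000-word query expands into 40,000 lookups, so a per-lookup cost that nobody notices in a one-word query adds up to anything from under a minute to more than an hour. Ranking the results, which the estimate ignores, only adds to that.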
Interim summary
As we can see, currently existing search systems and
technologies, although they function properly, don’t solve the
problem of search completely. Where speed is acceptable,
relevance leaves much to be desired. If the search is accurate
and adequate, it consumes lots of time and resources. It is of
course possible to solve the problem in a very obvious manner:
by increasing computing power. But equipping the office with
dozens of ultra-fast computers that will continuously process
phrasal queries consisting of thousands of unique words,
struggling through gigabytes of incoming correspondence,
technical literature, final reports and other information, is
more than irrational and uneconomical. There is a better way.
The unique similar content search
At present many companies are working intensively on developing
full text search. Current calculation speeds allow the creation
of technologies that support queries of varying complexity and
a wide array of supplementary conditions. The experience in
creating phrasal search provides these companies with the
expertise to further develop and perfect
the search technology. In particular, one of the most popular
search engines is Google, and one of its functions is called
“Similar pages”. This function lets the user view the pages
whose content is most similar to a sample page. Although it
works in principle, the function does not yet deliver relevant
results: they are mostly vague and of low relevance, and
sometimes it finds no similar pages at all. Most probably, this
is the result of the chaotic and unstructured nature of
information on the Internet. But once the precedent has been
created, the advent of a perfect search without a hitch is just
a matter of time. As for the corporate data
processing and knowledge retrieval systems, matters are much
worse. Functioning technologies (as opposed to those existing
only on paper) are very few. And no giant or so-called search
technology guru has so far succeeded in creating a real similar
content search. Maybe the reason is that it’s not desperately
needed, or maybe it is too hard to implement. But there is a
functioning one nonetheless.
SoftInform Search Technology, developed by SoftInform, is a
technology for finding documents similar in content to a
sample. It enables fast and accurate search for documents of
similar content in any volume of data. The technology is based
on a mathematical model that analyzes the document structure
and selects words, word combinations and text arrays, which
results in a list of the documents most similar to the sample
text, each with a relevance percentage. In contrast to standard
phrasal search, similar content search does not require key
words to be determined beforehand: the search is conducted
through the whole
document. The technology works with several sources of
information, which can be stored both in text files (txt, doc,
rtf, pdf, htm and html formats) and in information systems
built on the most popular databases (Access, MS SQL, Oracle, as
well as any SQL-supporting database). It additionally supports
synonym and important-word functions that make a more specific
search possible. Similar content search significantly cuts the
time wasted on searching for and reviewing identical or very
similar documents, shortens processing at the stage of entering
data into an archive by avoiding duplicate documents, and helps
form sets of data on a given subject. Another advantage of the
SoftInform technology is that it is not very demanding of
computing power and processes data at very high speed even on
ordinary office computers. This technology is not just a
theoretical development.
It has been tested and successfully implemented in a project
providing legal advice by phone, where the speed of information
retrieval is of crucial importance. And it will undoubtedly be
more than useful in the knowledge base, analytical service or
support department of any large firm. The universality and
effectiveness of the SoftInform Search Technology allow it to
solve a wide spectrum of problems that arise while processing
information. These include the fuzziness of information (at the
document entry stage it is possible to determine immediately
whether such a document is already in the database), the
similarity analysis of documents already entered into the
database, and the search for semantically similar documents,
which saves time spent on selecting appropriate key words and
viewing irrelevant documents.
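The general idea of using a whole document as the query and getting back a list ranked by a relevance percentage can be sketched in Python. The article does not disclose SoftInform’s mathematical model, so this sketch swaps in a well-known textbook technique, cosine similarity over word counts. The function names and the 95 percent duplicate threshold in the usage note are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Count word occurrences; the whole text acts as the query."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_percent(a, b):
    """Cosine similarity of two word-count vectors, as a percentage."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 100.0 * dot / norm if norm else 0.0


def find_similar(sample, docs, threshold=0.0):
    """Return (percent, doc_id) pairs at or above threshold, best first."""
    sample_vec = vectorize(sample)
    scored = [
        (cosine_percent(sample_vec, vectorize(text)), doc_id)
        for doc_id, text in docs.items()
    ]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
```

Calling find_similar(new_document, archive, threshold=95.0) before adding a document to an archive is one simple way to flag likely duplicates at the entry stage. No key words are chosen by the user at any point, which is the property the article emphasizes.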
Perspectives
Besides its primary purpose (fast, high-quality search for
information in huge volumes of texts, archives and databases),
an Internet direction could also be defined. For example, it is
possible to develop an expert system for processing incoming
correspondence and news, which would become an important tool
for analysts at different companies. This would be possible
mainly thanks to the unique similar content search technology,
so far absent from all existing systems except SearchInform.
The problem of spamming search
engines with so-called doorways (hidden pages stuffed with key
words that redirect to a site’s main pages and are used to raise
its ranking in search engines) and the e-mail spam problem
(a more intelligent analysis would ensure a higher level of
security) would also be solved with the help of this technology.
But the most interesting prospect for the SoftInform Search
technology is the creation of a new Internet search engine
whose main competitive advantage would be the ability to search
not just by key words but also for similar web pages, which
would make search more flexible, comfortable and efficient.
In conclusion, it can be stated with confidence that the future
belongs to full text search technologies, both on the Internet
and in corporate search systems. Unlimited development
potential, adequate results and fast processing of queries of
any size make this technology much more comfortable and put it
in high demand. SoftInform Search technology might not be the
pioneer, but it is a functioning, stable and unique one with no
existing analogues (as confirmed by an active Eurasian patent).
To my mind, even with the help of “similar search” it will be
difficult to find a similar technology.