Do You Know What You're Searching For?
Are You Finding It?
Gregory L Fordham
Search technology is essential to both finding the smoking gun and controlling discovery costs in the digital age. The old-fashioned manual approach is simply not practical for a variety of reasons.
The success of computerized search requires more than brainstorming search terms and entering them in someone’s black box technology. Indeed, how the black box actually treats those terms and performs its analysis can also be important when formulating search plans.
The first technology that searchers are likely to encounter is a single pass search, which is also known as a live search. This approach begins at the start of the search population and proceeds sequentially through the data looking for the desired search terms.
While single pass or live search technologies have the advantage of being able to start immediately without first developing an index, they also typically lack robust functionality.
For example, they typically support only single term searches, not Boolean or proximity searches. When they do handle multiple terms, it is because the search pattern itself spells out how the terms are positioned relative to one another.
Single pass, or live, search technologies therefore inherit the limitations of single term searches: they produce overly inclusive results that require more review time.
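As a rough illustration of the single pass approach, the sketch below scans every document from start to finish for one term. The function name, the sample memos and the case-insensitive substring match are assumptions made for the example, not a description of any particular product.

```python
def live_search(documents, term):
    """Scan each document sequentially for a single term (no index is built)."""
    term = term.lower()
    hits = []
    for doc_id, text in documents.items():
        # Every search re-reads the full text of every document.
        if term in text.lower():
            hits.append(doc_id)
    return hits


# Hypothetical sample population.
documents = {
    "memo1": "The falcon sighting was logged by the bird survey team.",
    "memo2": "Falcons tickets for the football game are in the mail.",
    "memo3": "Quarterly budget review is attached.",
}

print(live_search(documents, "falcon"))  # ['memo1', 'memo2']
```

Both the bird memo and the football memo come back, which is the over-inclusiveness described above: a single term cannot say which falcon is meant.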
The second technology is indexed search. This approach first scans through the population and catalogs the terms it finds, much like the index to a book. Searches are then run against the terms in the index rather than against the documents themselves.
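The book-index analogy can be sketched as an inverted index, a mapping from each term to the documents that contain it. This is a minimal sketch: the tokenizing rule (lowercase runs of letters and digits), the function names and the sample memos are assumptions, and real engines differ in how they split and normalize terms.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Catalog each term once, recording which documents contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def lookup(index, term):
    """Answer a search from the index without re-reading any document."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample population.
documents = {
    "memo1": "The falcon sighting was logged by the bird survey team.",
    "memo2": "Falcons tickets for the football game are in the mail.",
    "memo3": "Quarterly budget review is attached.",
}

index = build_index(documents)
print(lookup(index, "falcon"))   # ['memo1']
print(lookup(index, "falcons"))  # ['memo2']
```

In this sketch the index stores whole terms, so "falcon" and "falcons" are separate entries. Whether an engine treats them as the same term depends on features such as stemming or wildcards, which is one example of why the behavior of the black box matters when choosing search terms.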
These types of search engines can do single word searches and pattern searches. They also offer more robust features like Boolean and proximity searches.
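To show how an index makes Boolean and proximity searches possible, the sketch below records the position of each word as well as the document it appears in. Boolean operators then become set operations, and a proximity search compares word positions. The helper names (docs_with, within), the word-count measure of distance and the sample memos are assumptions for illustration only.

```python
import re
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> document -> list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word][doc_id].append(position)
    return index


def docs_with(index, term):
    """Documents containing the term, as a set for Boolean combination."""
    return set(index[term]) if term in index else set()


def within(index, term_a, term_b, distance):
    """Documents where the two terms occur within `distance` words of each other."""
    hits = set()
    for doc_id in docs_with(index, term_a) & docs_with(index, term_b):
        positions_a = index[term_a][doc_id]
        positions_b = index[term_b][doc_id]
        if any(abs(a - b) <= distance for a in positions_a for b in positions_b):
            hits.add(doc_id)
    return hits


# Hypothetical sample population.
documents = {
    "memo1": "The falcon nest was seen near the bird sanctuary.",
    "memo2": "The falcon mascot greeted fans before the football game.",
    "memo3": "A bird flew over the stadium during the football game.",
}
index = build_positional_index(documents)

# Boolean: falcon AND NOT football
print(docs_with(index, "falcon") - docs_with(index, "football"))  # {'memo1'}
# Proximity: falcon within 3 words of nest
print(within(index, "falcon", "nest", 3))  # {'memo1'}
```

The single term "falcon" would return two memos here. Adding the second condition removes the football memo, which is the kind of narrowing a single term search cannot do.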
Interestingly, in various studies Boolean searches have proven better than other techniques at finding relevant documents and eliminating false positives. The additional conditions in a complex Boolean search limit the results to documents that satisfy all of them, which screens out many of the irrelevant hits a single term would return.
The indexed search also has advantages over the single pass or live search in that it is very well suited to highly iterative searching. Iterative searching is essential to formulating search terms: the searcher tests the terms, samples their result sets, and tailors the terms as needed.
The third engine type is commonly called context search, which is typically a form of indexed search that has been embellished with vector data, synonyms and definition interpretation.
The goal of context search is to do a better job of locating relevant documents through context analysis and relevance ranking. In addition, it aims to go beyond the search terms and find relevant documents that might not even contain them.
In its simplest form, the concept of context is illustrated with the term falcon. If you are searching for documents about falcons, are you searching for football teams and their players, or are you searching for birds?
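One ingredient of context search, synonym expansion, can be sketched in a few lines. The synonym table below is entirely hypothetical, and commercial context engines rely on far richer techniques than a lookup table, so this only illustrates how a document can be returned without containing the search term.

```python
import re

# Hypothetical synonym table; a real engine would derive related terms differently.
SYNONYMS = {"falcon": {"falcon", "kestrel", "raptor"}}


def tokenize(text):
    """Lowercase words in the text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def expanded_search(documents, term):
    """Return documents containing the term or any listed synonym."""
    wanted = SYNONYMS.get(term, {term})
    return [doc_id for doc_id, text in documents.items() if tokenize(text) & wanted]


# Hypothetical sample population.
documents = {
    "memo1": "A kestrel was nesting on the courthouse roof.",
    "memo2": "The falcon mascot led the team onto the field.",
    "memo3": "Budget review attached.",
}

print(expanded_search(documents, "falcon"))  # ['memo1', 'memo2']
```

The first memo is found even though it never uses the word falcon. The football memo is still returned, though, so expansion alone does not settle which meaning of the term the searcher intended.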
Despite its intuitive appeal, context search has not been found to outperform Boolean keyword searches in litigation-type environments. While context search may prove superior in collections of unrelated data like the internet, by the time data is identified and collected in a preservation effort, its relationship to a particular context has already been narrowed.
Of course, the particular search technology may not make much difference if the underlying data is flawed. The old adage “garbage in, garbage out” is especially appropriate to computerized searches.
Many litigators perform their searches on converted data. In other words, they have had their electronic documents converted to TIFF images, with the associated text obtained either through OCR or by extracting text from the native documents.
Since many electronic documents contain non-searchable data such as images, the text in that data can only be obtained through an OCR process. OCR, however, is not perfect and can introduce erroneous characters into the text, which can then frustrate search technology performance.
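The effect of a single OCR error is easy to demonstrate. In the sketch below, the OCR output is invented for the example and has a zero in place of the letter "o". An exact term match misses the document. The approximate match from Python's standard difflib module is shown only as one possible way to spot such near misses; it is an assumption of this sketch, not a technique drawn from any particular review tool.

```python
import difflib
import re

# Hypothetical OCR output: "contract" was misread as "c0ntract".
ocr_text = "This c0ntract was signed by both parties in Atlanta."
words = re.findall(r"[a-z0-9]+", ocr_text.lower())

# An exact term match fails because the term never appears as typed.
exact_hit = "contract" in words

# An approximate match reports words that are similar to the search term.
near_hits = difflib.get_close_matches("contract", words, n=3, cutoff=0.8)

print(exact_hit)  # False
print(near_hits)  # ['c0ntract']
```

Loosening the match in this way also invites false positives, so it trades one review burden for another.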
Certainly, staying with the native files is a solution to the OCR error problems. Of course, searchers will need to recognize the limitations of that choice as well.
If the searching is performed on native files, there are still other factors that need consideration. For example, can the search engine access documents within a zip file? Even traditional documents like Word and Excel files can be more difficult in versions 2007 and later, since they adopted an XML-based format that is stored inside a compressed ZIP container.
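The container issue can be illustrated with Python's standard zipfile module. A .docx file is a ZIP archive whose main text lives in an XML part named word/document.xml. The sketch below builds a tiny stand-in archive in memory with invented content, then compares a search of the raw bytes with a search of the extracted text. The tag-stripping regular expression is a rough simplification, not a full XML or Word parser.

```python
import io
import re
import zipfile

# Build a minimal stand-in for a .docx file in memory (hypothetical content).
xml_body = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>The falcon contract is attached.</w:t></w:r></w:p></w:body>"
    "</w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("word/document.xml", xml_body)
raw_bytes = buffer.getvalue()

# A search of the raw file sees only compressed data.
raw_hit = b"contract" in raw_bytes

# A search that first opens the container and reads the XML part sees the text.
with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
    extracted_xml = archive.read("word/document.xml").decode("utf-8")
text = re.sub(r"<[^>]+>", " ", extracted_xml)  # crude removal of XML tags
extracted_hit = "contract" in text

print(raw_hit, extracted_hit)
```

A tool that reads only the raw bytes of such a file would miss the term, while one that opens the container would find it, so it is worth confirming which of the two a given search engine does.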
As the preceding illustrates, there are numerous technological issues that must be navigated and understood by those using computerized search tools. Remarkably, the complexities don’t stop there. There can still be other issues outside the technology entirely. But without this knowledge, you might not know what you are really searching for.