Just as a Web search engine needs to organize data so users can find results for a search, document classification allows organizations to make important information simple to find. Document categorization works differently from search engine algorithms because specific keywords can have different meanings, so the method must be able to gauge the context of specific business documents. With supervised document classification, the user labels a set of documents, which the automated system then uses as a model. In the unsupervised method, documents are organized mathematically based on similar words and phrases.
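To make the supervised case concrete, the following is a minimal sketch of one common technique, a Naive Bayes text classifier, trained on a handful of labeled documents. The article does not name a specific algorithm, so the choice of Naive Bayes, the function names, and the sample documents and categories are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return text.lower().split()


def train(labeled_docs):
    """Build per-category word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Start from the prior: how common the category is in the training set.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled documents supplied by the user.
training_set = [
    ("invoice payment due amount", "finance"),
    ("quarterly budget payment report", "finance"),
    ("employee vacation leave policy", "hr"),
    ("new employee onboarding policy", "hr"),
]
model = train(training_set)
predicted = classify("payment amount for the invoice", model)
```

Here the user's only manual work is supplying the labels in `training_set`; the word statistics that drive each decision are derived from those examples rather than written by hand.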
The user has the most control over document classification when rule-based classification is used. The context, categories, and rules are all created from manual input. During document retrieval, everything is categorized according to the exact rules the user specified. The supervised method also requires categories to be assigned manually. The step of writing out the rules the system should follow, however, is completed automatically.
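A rule-based classifier of the kind described above can be sketched in a few lines. The rule format used here (a category paired with keywords that must all appear) is only one possible design, and the rules, category names, and function name are hypothetical.

```python
import re

# Hypothetical hand-written rules: a document gets a category only if
# every keyword listed for that category appears in it.
RULES = [
    ("invoice", {"invoice", "payment"}),
    ("contract", {"agreement", "party"}),
]
DEFAULT_CATEGORY = "uncategorized"


def classify_by_rules(text, rules=RULES):
    """Return the first category whose required keywords all occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    for category, required in rules:
        if required <= words:  # subset test: all required keywords present
            return category
    return DEFAULT_CATEGORY


label = classify_by_rules("Payment is due on this invoice.")
```

Because rules are checked in order, the person writing them controls exactly which category wins when a document satisfies more than one rule.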
With document clustering, also called unsupervised classification, the groupings and categories are all created automatically. There is no manual input of rules, which has both advantages and disadvantages. The process saves time because no rules need to be written, and it often surfaces similar documents that were not initially considered similar. The downside is that documents that were never intended to share a category might be grouped together. The more automated approach is also more taxing on computer systems.
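As an illustration of grouping without labels or rules, the sketch below clusters documents by word overlap (Jaccard similarity) in a single pass. This is a deliberately simple stand-in for the clustering algorithms used in practice; the similarity measure, the `threshold` value, and the sample documents are assumptions.

```python
def word_set(text):
    """Return the set of lowercase words in a document."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold=0.3):
    """Group document indices; each cluster is represented by its first member."""
    clusters = []
    for index, text in enumerate(documents):
        words = word_set(text)
        for members in clusters:
            representative = word_set(documents[members[0]])
            if jaccard(words, representative) >= threshold:
                members.append(index)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([index])
    return clusters


documents = [
    "invoice payment due",
    "payment due for invoice",
    "employee leave policy",
    "leave policy for new employee",
]
groups = cluster(documents)
```

Lowering `threshold` merges more documents into each group, which shows the trade-off noted above: less manual effort, but a greater chance that unrelated documents end up together.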
To find a balance between the supervised and unsupervised methods, computer specialists have devised semi-supervised document classification. Documents that have been categorized manually are combined with document sets that are not labeled. Programs that can associate information from both sets use the data to learn how each document is classified. Retaining some control over the classification process aids information retrieval. Document clustering also becomes more efficient when documents can be clustered by shared phrases, as with Suffix Tree Clustering, especially for documents that are stored online.
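One way to combine labeled and unlabeled documents is self-training: the most confidently matched unlabeled document is given a label, added to the labeled set, and then helps label the rest. The sketch below assumes this approach, along with word-overlap similarity and a hypothetical `min_similarity` cutoff; the article does not say which semi-supervised technique is meant.

```python
def words(text):
    """Return the set of lowercase words in a document."""
    return set(text.lower().split())


def overlap(a, b):
    """Jaccard similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def self_train(labeled, unlabeled, min_similarity=0.2):
    """Propagate labels to unlabeled documents, most confident match first."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    while remaining:
        best = None  # (similarity, text, label) of the strongest match so far
        for text in remaining:
            for known_text, label in labeled:
                score = overlap(words(text), words(known_text))
                if best is None or score > best[0]:
                    best = (score, text, label)
        if best is None or best[0] < min_similarity:
            break  # nothing left resembles a labeled document closely enough
        _, text, label = best
        labeled.append((text, label))
        remaining.remove(text)
    return labeled, remaining


# Hypothetical data: two manually labeled documents and three unlabeled ones.
labeled_docs = [
    ("invoice payment due", "finance"),
    ("employee leave policy", "hr"),
]
unlabeled_docs = [
    "payment due next month",
    "next month budget forecast",
    "cafeteria lunch menu",
]
result, leftover = self_train(labeled_docs, unlabeled_docs)
assigned = dict(result)
```

In this run, "next month budget forecast" shares no words with either manually labeled document; it receives a label only because "payment due next month" was labeled first. Documents that never clear the cutoff are returned in `leftover` for manual review.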
Information science has explored various ways to make data mining more efficient. Most businesses are connected to the Internet, so Web mining needs to take as little time as possible for relevant documents to be found. Computer scientists have also created several algorithms that organize documents hierarchically. Each is effective in its own way, and document classification continues to be studied and defined by different software programs and custom corporate methods.