Sometimes confused with information retrieval, which returns whole documents rather than structured facts, information extraction (IE) is a process computer systems use to pull relevant data out of larger bodies of data according to a set of predefined criteria. The idea behind information extraction is to make it easy to identify and assimilate the data relevant to a particular activity, without the need to manually comb through large amounts of information to find exactly what is required. The process is similar in spirit to concept mining and web scraping, in that all of these approaches seek to collect useful information from a wider pool of available data.
The general approach to information extraction relies on software capable of scanning machine-readable sources. These can include hard-copy documents that have been scanned into electronic files, spreadsheets and word-processing documents, or the data held in the readable fields of a database. Typically, parameters are set that give a program access to these sources so it can scan them quickly, using specific criteria to prioritize and pull out certain types of information from the available pool. This differs from a simple search: rather than matching specific words or phrases per se, the method uses natural language processing, which evaluates not only the words themselves but also their context and the meaning that context implies.
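To make that distinction concrete, the short sketch below contrasts a plain keyword check with NLP-driven extraction. It assumes the open-source spaCy library and its small English model ("en_core_web_sm"); both are illustrative choices, not part of any particular IE product.

```python
# A minimal sketch of the difference between keyword matching and
# NLP-driven extraction, assuming spaCy and its "en_core_web_sm"
# English model are installed (illustrative choices only).
import spacy

text = "Acme Corp. acquired Widget Ltd. on 3 March 2021 for $40 million."

# Simple search: only tells us whether a word appears anywhere.
print("acquired" in text)          # True, but no context or structure

# Information extraction: the NLP pipeline tags entities, so the
# context (who, what, when, how much) is recovered as well.
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)    # e.g. "Acme Corp." ORG, "3 March 2021" DATE
```

The keyword check only confirms that a term occurs; the entity pass recovers who acted, when, and for how much, which is the kind of contextual information described above.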
The complexities involved make information extraction difficult to manage on a global scale, although there are IE tools that work very well with a limited body of data, such as the electronic files housed on a corporation's servers or a pool of sources drawn from a small number of news feeds. With this approach it is possible to identify some type of event, limit the returns to events involving a certain number of participants, and have the resulting data arranged by date.
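The following sketch shows roughly how such a limited-scope pipeline might look, again assuming spaCy for entity tagging plus the python-dateutil package for flexible date parsing; the feed contents, the two-participant threshold, and the field names are all hypothetical.

```python
# A hedged sketch of the scenario above: scanning a small pool of news
# items, keeping only events with a minimum number of participants, and
# arranging the results by date. spaCy and python-dateutil are assumed
# to be installed; both are illustrative, not tied to any IE product.
import spacy
from dateutil import parser as dateparser

nlp = spacy.load("en_core_web_sm")

feed = [
    "Jane Doe and John Smith met investors in Berlin on 12 May 2022.",
    "The quarterly report was published on 1 April 2022.",
    "Ana Lopez, Wei Chen and Omar Ali signed the merger on 20 March 2022.",
]

events = []
for item in feed:
    doc = nlp(item)
    people = [e.text for e in doc.ents if e.label_ == "PERSON"]
    dates = [e.text for e in doc.ents if e.label_ == "DATE"]
    if len(people) >= 2 and dates:   # limit returns by participant count
        events.append({"date": dateparser.parse(dates[0]),
                       "participants": people,
                       "text": item})

# Arrange the surviving events chronologically.
for event in sorted(events, key=lambda e: e["date"]):
    print(event["date"].date(), event["participants"])
```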
As with many forms of technology, the tools used for information extraction are continually being refined. Since the beginning of the 21st century, the ability to set parameters and draw on ever-larger bodies of electronic data in the search for relevant information has improved significantly. This includes the ability to take large volumes of unstructured data and use those parameters to impose some order or structure on it, making it all the more useful for future searches.
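As a brief illustration of that structuring step, the snippet below stores hypothetical extracted fields as uniform JSON records that later searches can filter directly; the record layout and file name are assumptions made for the example only.

```python
# Once fields have been pulled from unstructured text, storing them as
# uniform records (here, JSON) makes later searches straightforward.
# The record layout and file name are illustrative assumptions.
import json

extracted = [
    {"event": "merger signing", "date": "2022-03-20",
     "participants": ["Ana Lopez", "Wei Chen", "Omar Ali"]},
    {"event": "investor meeting", "date": "2022-05-12",
     "participants": ["Jane Doe", "John Smith"]},
]

with open("events.json", "w", encoding="utf-8") as fh:
    json.dump(extracted, fh, indent=2)

# A future "search" is now a simple filter over structured fields.
with open("events.json", encoding="utf-8") as fh:
    records = json.load(fh)
print([r["event"] for r in records if r["date"] >= "2022-04-01"])
```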