Web data acquisition and processing
Abstract: This network information acquisition system builds a reusable information service by tracking, monitoring, and acquiring real-time Internet information according to customized rules. It collects the specific information that users are interested in from a variety of network sources, including webpages, blogs, and BBS forums, classifies it automatically, and delivers it to end users in multiple forms.
Solution contents:
Promptly capture the network information that users require, such as hot news, market intelligence, industry information, and policies and regulations.
Data navigation
* Website navigation: specify the target website channels, etc.;
* Metadata definition: define the keywords used for extraction;
* Visual configuration tools for acquisition tasks, so that users can add tasks themselves at any time.
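As an illustration of what a configured acquisition task might contain, the following minimal sketch models one task as a record and checks it before use. The class name `AcquisitionTask`, its fields, and the validation rules are assumptions made for this example, not the system's actual configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class AcquisitionTask:
    """Hypothetical description of a single acquisition task."""
    name: str
    start_url: str                                      # website channel to acquire from
    keywords: list[str] = field(default_factory=list)   # metadata keywords used for extraction
    interval_minutes: int = 60                          # assumed default acquisition interval

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the task is usable."""
        problems = []
        if not self.start_url.startswith(("http://", "https://")):
            problems.append("start_url must be an http(s) URL")
        if not self.keywords:
            problems.append("at least one extraction keyword is required")
        if self.interval_minutes < 1:
            problems.append("interval_minutes must be at least 1")
        return problems


task = AcquisitionTask(
    name="policy-news",
    start_url="https://example.com/policy/",
    keywords=["title", "publish_date", "body"],
    interval_minutes=5,
)
print(task.validate())
```

A visual configuration tool would most likely produce and edit records of roughly this shape, with the validation step run before a task is saved.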
Data grabber
* Support paging through navigation pages and content pages;
* Support a variety of webpage types: static webpages, dynamic webpages, and document pages (Word, Excel, PDF, etc.);
* Support the acquisition of embedded tables;
* Support acquisition and parsing of article attachments (Word, Excel, PDF, etc.);
* Automatic testing of the metadata extracted from analysis results;
* Removal of duplicate acquisition results.
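One common way to remove duplicate acquisition results is to fingerprint the body text of each record and keep only the first record for each fingerprint. The sketch below assumes each record is a dictionary with a `body` field. The function names and the normalization rules (collapsing whitespace, ignoring case) are illustrative choices, not the system's documented behavior.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each distinct body fingerprint."""
    seen = set()
    unique = []
    for record in records:
        key = fingerprint(record["body"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"url": "https://example.com/a", "body": "New policy released today."},
    {"url": "https://example.com/a?page=1", "body": "New  policy released\ntoday."},
    {"url": "https://example.com/b", "body": "Market update."},
]
unique = deduplicate(records)
print([r["url"] for r in unique])
```

Exact hashing only catches copies that match after normalization. Detecting near-duplicates, such as reposts with small edits, would need a similarity measure instead.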
Data editing
* Data review: check the completeness and accuracy of the collected information;
* Editing and modification of collected data.
Data post-processing
* Standardize the metadata of collected data;
* Paragraph formatting of the body of the collected data;
* Hyperlink processing of the collected data;
* Classification of the collected data.
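Two of the post-processing steps above, hyperlink processing and paragraph formatting, can be sketched with the Python standard library. This example assumes that hyperlink processing means resolving relative links against the page URL, and that paragraph formatting means collapsing stray whitespace inside paragraphs separated by blank lines. Both interpretations, and the function names, are assumptions.

```python
import re
from urllib.parse import urljoin


def absolutize_links(html: str, base_url: str) -> str:
    """Rewrite double-quoted href values as absolute URLs."""
    def replace(match: re.Match) -> str:
        absolute = urljoin(base_url, match.group(2))
        return f"{match.group(1)}{absolute}{match.group(3)}"
    return re.sub(r'(href=")([^"]*)(")', replace, html)


def format_paragraphs(body: str) -> str:
    """Collapse whitespace within paragraphs; separate paragraphs by one blank line."""
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in re.split(r"\n\s*\n", body)]
    return "\n\n".join(p for p in paragraphs if p)


page_url = "https://example.com/news/index.html"
html = '<a href="/a/b.html">one</a> <a href="c.html">two</a>'
print(absolutize_links(html, page_url))
print(format_paragraphs("  First   line\ncontinues \n\n\n Second "))
```

The regular expression handles only simple double-quoted `href` attributes. A production implementation would more likely use an HTML parser.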
Data release
* Database release;
* XML release, used to support online retrieval and query, customized push, publishing, etc.;
* Automatic updating and acquisition of data;
* Automatic acquisition of new information from the target website: the acquisition interval is configurable (minimum 1 minute), or a fixed time can be set for batch acquisition, for example starting incremental batch acquisition at 12:00 pm Beijing Time.
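The two scheduling modes described above, a repeating interval with a 1-minute minimum and a fixed daily time in Beijing Time, can be sketched as follows. The function names are hypothetical, and Beijing Time is modeled as a fixed UTC+8 offset. The sketch only computes the next run time and does not run a scheduler.

```python
from datetime import datetime, time, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # Beijing Time: UTC+8, no daylight saving


def next_interval_run(last_run: datetime, interval_minutes: int) -> datetime:
    """Next run for interval mode; intervals shorter than 1 minute are rejected."""
    if interval_minutes < 1:
        raise ValueError("interval must be at least 1 minute")
    return last_run + timedelta(minutes=interval_minutes)


def next_fixed_run(now: datetime, at: time) -> datetime:
    """Next occurrence of the daily time `at` (Beijing Time) strictly after `now`."""
    local_now = now.astimezone(BEIJING)
    candidate = datetime.combine(local_now.date(), at, tzinfo=BEIJING)
    if candidate <= local_now:
        candidate += timedelta(days=1)
    return candidate


now = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)  # 11:00 Beijing Time
print(next_interval_run(now, 1))
print(next_fixed_run(now, time(12, 0)))
```

Keeping all timestamps timezone-aware avoids ambiguity when the acquisition servers run in a different time zone from the configured schedule.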
Solution implementation: The implementation period is determined by the scale of information acquisition and the definition of the information to be extracted.