Web data acquisition and processing

Abstract: This is a network information resource acquisition system that builds a reusable information service through customized tracking, monitoring, and acquisition of real-time Internet information. It collects the specific information users are interested in from a variety of network sources, including webpages, blogs, and BBS forums, and, after automatic classification, delivers it to end users in multiple forms.

Solution contents:

Promptly capture the online content users need, such as breaking news, market intelligence, industry information, and policies and regulations.

Data navigation
       * Website navigation: specify the target websites, channels, etc.;
       * Metadata definition: define the keywords used for extraction;
       * Visual task configuration tools that let users add acquisition tasks themselves at any time.
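
As an illustration of what an acquisition task definition might hold, the following is a minimal sketch in Python. The class name, field names, and example URL are all hypothetical and are not taken from the product; the sketch only assumes that a task pairs a target website channel with extraction keywords, as described above.

```python
from dataclasses import dataclass, field


@dataclass
class AcquisitionTask:
    """One user-defined acquisition task (all field names are hypothetical)."""

    name: str
    start_url: str  # website or channel to acquire from
    keywords: list[str] = field(default_factory=list)  # terms used for extraction

    def matches(self, text: str) -> bool:
        """Return True if the text contains any configured keyword (case-insensitive)."""
        lowered = text.lower()
        return any(keyword.lower() in lowered for keyword in self.keywords)


# Example task: watch a policy channel for two keywords.
task = AcquisitionTask(
    name="policy-watch",
    start_url="https://example.com/policy/",
    keywords=["regulation", "policy"],
)
print(task.matches("New Regulation on data export"))
```

A visual configuration tool could, under this assumption, simply create and store records of this shape.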

Data grabber
       * Support paging through navigation (listing) pages and content pages;
       * Support a variety of webpage types: static webpages, dynamic webpages, document webpages (Word, Excel, PDF, etc.);
       * Support the acquisition of embedded tables;
       * Support the acquisition and parsing of article attachments (Word, Excel, PDF, etc.);
       * Automatic testing of metadata extraction results;
       * Removal of duplicate acquisition results.
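
The removal of duplicate results could be approached in several ways; the sketch below shows one common technique, assumed here for illustration only: fingerprint the normalized body text with a hash and keep the first record per fingerprint. The record layout (`url`, `body`) and function names are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record for each distinct body fingerprint."""
    seen: set[str] = set()
    unique = []
    for record in records:
        key = fingerprint(record["body"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# The first two records differ only in URL, spacing, and letter case.
records = [
    {"url": "https://example.com/a", "body": "Market  report for Q1"},
    {"url": "https://example.com/a?ref=rss", "body": "market report for q1 "},
    {"url": "https://example.com/b", "body": "New industry regulation"},
]
unique_records = deduplicate(records)
print([record["url"] for record in unique_records])
```

Exact hashing only catches copies that are identical after normalization; near-duplicates with edited wording would need a similarity-based method instead.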

Data editing
       * Data review: check the completeness and accuracy of the collected information;
       * Editing and modification of collected data.

Data post-processing
       * Standardize the metadata of the collected data;
       * Format the body text of the collected data into paragraphs;
       * Process hyperlinks in the collected data;
       * Classify the collected data.
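
The text does not say what hyperlink processing involves. One plausible step, assumed here purely for illustration, is resolving relative links in a collected page to absolute URLs so they remain valid once the content leaves its original site. The sketch uses only the Python standard library; the class name and example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin resolves relative paths against the page's own URL
                self.links.append(urljoin(self.base_url, href))


collector = LinkCollector("https://example.com/news/2024/item.html")
collector.feed('<p><a href="../archive.html">Archive</a> <a href="/about">About</a></p>')
print(collector.links)
```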

Data release
       * Database release;
       * XML release, used to support online retrieval and querying, customized push, publishing, etc.;
       * Automatic updating and acquisition of data;
       * Automatic acquisition of new information from the target website: the interval is configurable (minimum 1 minute), or a fixed time can be set for batch acquisition, for example starting incremental batch acquisition at 12:00 pm Beijing Time.
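
The two scheduling modes described above (a repeating interval with a 1-minute minimum, or a fixed daily time in Beijing Time) can be sketched as follows. This is a minimal illustration, not the product's scheduler: the function names are hypothetical, and clamping a too-short interval up to the minimum is an assumption about how the limit might be enforced.

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # Beijing Time is UTC+8 with no daylight saving
MIN_INTERVAL = timedelta(minutes=1)     # minimum interval stated above


def next_interval_run(last_run: datetime, interval: timedelta) -> datetime:
    """Interval mode: intervals shorter than the minimum are raised to it (assumed policy)."""
    return last_run + max(interval, MIN_INTERVAL)


def next_fixed_run(now: datetime, hour: int, minute: int = 0) -> datetime:
    """Fixed-time mode: next occurrence of hour:minute Beijing Time strictly after `now`."""
    local = now.astimezone(BEIJING)
    candidate = local.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= local:
        candidate += timedelta(days=1)
    return candidate


now = datetime(2024, 1, 1, 13, 30, tzinfo=BEIJING)
print(next_interval_run(now, timedelta(seconds=10)))
print(next_fixed_run(now, hour=12))
```

A fixed UTC+8 offset is used instead of a named time zone so the sketch has no dependency on an installed time zone database.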

Solution implementation: The implementation timeline is determined by the scale of information acquisition and the information extraction definitions.