CONTINUOUS TOP-K MONITORING ON DOCUMENT STREAMS

1croreprojects@gmail.com

ABSTRACT:

 

The efficient processing of document streams plays an important role in many information filtering systems. Emerging applications, such as news update filtering and social network notifications, must present end users with the content most relevant to their preferences. In this work, user preferences are indicated by a set of keywords. A central server monitors the document stream and continuously reports to each user the top-k documents that are most relevant to her keywords. Our objective is to support large numbers of users and high stream rates while refreshing the top-k results almost instantaneously. Our solution abandons the traditional frequency-ordered indexing approach. Instead, it follows an identifier-ordering paradigm that better suits the nature of the problem. When complemented with a novel, locally adaptive technique, our method offers (i) proven optimality w.r.t. the number of considered queries per stream event, and (ii) an order of magnitude shorter response time (i.e., time to refresh the query results) than the current state of the art.

 

 

AIM OF THE PROJECT:

Our objective is to support large numbers of users and high stream rates, while refreshing the top-k results almost instantaneously.

 

EXISTING SYSTEM:

 

Traditional text search deals with snapshot (i.e., one-off) top-k queries over static document collections. The inverted file is the standard index for organizing documents. It comprises a list for every term in the dictionary; the list for a term holds an entry for each document that contains the term. By sorting the lists in decreasing term frequency, and with appropriate use of thresholding, a snapshot query can be answered by processing only the top parts of the relevant lists. Because of this sorting, we refer to that paradigm as frequency-ordering. This common practice for snapshot queries has been followed by most approaches for continuous top-k search, albeit adapted to the "standing" nature of continuous queries and the highly dynamic characteristics of the document stream.
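
To make the frequency-ordering paradigm concrete, the following is a minimal sketch of a snapshot top-k query over a frequency-ordered inverted file. It assumes a deliberately simple scoring function (the sum of the query terms' frequencies in the document) and uses the classic threshold algorithm as the stopping rule; the text above only says "thresholding", so that choice, like the names `build_index` and `top_k`, is an assumption made for illustration.

```python
import heapq
from collections import defaultdict


def build_index(docs):
    """docs: {doc_id: {term: frequency}}.
    Returns {term: [(frequency, doc_id), ...]} sorted by decreasing frequency."""
    index = defaultdict(list)
    for doc_id, freqs in docs.items():
        for term, freq in freqs.items():
            index[term].append((freq, doc_id))
    for postings in index.values():
        postings.sort(key=lambda entry: -entry[0])
    return index


def top_k(index, docs, query_terms, k):
    """Answer a snapshot query by scanning the relevant lists top-down in parallel."""
    lists = [index[t] for t in query_terms if t in index]

    def score(doc_id):
        # Assumed scoring function: sum of the query terms' frequencies.
        return sum(docs[doc_id].get(t, 0) for t in query_terms)

    best = []      # min-heap of (score, doc_id) holding the current top-k
    seen = set()
    depth = 0
    while any(depth < len(postings) for postings in lists):
        threshold = 0  # upper bound on the score of any document not seen yet
        for postings in lists:
            if depth < len(postings):
                freq, doc_id = postings[depth]
                threshold += freq
                if doc_id not in seen:
                    seen.add(doc_id)
                    s = score(doc_id)
                    if len(best) < k:
                        heapq.heappush(best, (s, doc_id))
                    elif s > best[0][0]:
                        heapq.heapreplace(best, (s, doc_id))
        depth += 1
        # Stop once no unseen document can beat the current k-th best score.
        if len(best) == k and best[0][0] >= threshold:
            break
    return sorted(best, reverse=True)
```

In this sketch only the top parts of the lists are read when the leading entries already dominate. The cost that motivates the proposed system is that, on a stream, these frequency-ordered lists would have to be maintained for every arriving document.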

 

DISADVANTAGES:

 

The amount of information made available to users far exceeds their capacity to discover and understand it.

The search engines currently in use do not return results efficiently for a given query.

They do not deliver up-to-date information continuously.

The primary performance metric in our work is the time required to refresh (update) the results of all continuous top-k queries on documents (CTQDs) in response to stream events.

 

PROPOSED SYSTEM:

 

We propose an ID-ordering methodology for CTQDs. Our methodology involves three dimensions. First, we reverse the role of the documents and the queries. That is, we index the (relatively static) queries and probe the streaming documents against that index, in order to eliminate the need for index maintenance due to stream events.

Second, since we index user queries which, unlike the documents, typically comprise just a few terms (i.e., they are extremely sparse), we can apply ID-ordering effectively to the query index.

Third, we complement the resulting method, Reverse ID-Ordering (RIO), with a novel, locally adaptive technique that produces tighter processing bounds.
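
The first two points can be illustrated with a minimal sketch: the queries are indexed, each term list is kept in query-ID order, and an arriving document is probed against the index by merging the lists of its terms. The class `QueryIndex`, its methods, and the weighted-sum scoring are hypothetical names and assumptions introduced here, not the authors' implementation. The sketch omits the pruning bounds of the third point and does not handle document expiry.

```python
import bisect
import heapq
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def _contributions(postings, doc_weight):
    """Yield (query_id, partial score) pairs for one term list, in query-ID order."""
    for query_id, query_weight in postings:
        yield query_id, query_weight * doc_weight


class QueryIndex:
    def __init__(self, k):
        self.k = k
        self.lists = defaultdict(list)  # term -> [(query_id, weight)], sorted by query_id
        self.results = {}               # query_id -> min-heap of (score, doc_id)

    def add_query(self, query_id, weights):
        self.results[query_id] = []
        for term, weight in weights.items():
            bisect.insort(self.lists[term], (query_id, weight))

    def process(self, doc_id, doc_weights):
        """Probe one arriving document; return the IDs of queries whose top-k changed."""
        streams = [_contributions(self.lists[term], weight)
                   for term, weight in doc_weights.items() if term in self.lists]
        # ID-ordering lets a single merge visit each matching query exactly once.
        merged = heapq.merge(*streams, key=itemgetter(0))
        changed = []
        for query_id, group in groupby(merged, key=itemgetter(0)):
            score = sum(part for _, part in group)
            heap = self.results[query_id]
            if len(heap) < self.k:
                heapq.heappush(heap, (score, doc_id))
                changed.append(query_id)
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
                changed.append(query_id)
        return changed

    def top_k(self, query_id):
        return sorted(self.results[query_id], reverse=True)
```

Note that the index is modified only when a query is registered; processing a document reads the lists but never updates them, which is the point of reversing the roles of documents and queries.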

Our advanced approach, Minimal Reverse ID-Ordering (MRIO), outperforms the current state of the art by one order of magnitude. MRIO employs novel bounds that offer proven optimality w.r.t. the number of considered queries per stream event.

MRIO is more than two times faster than RIO, demonstrating that a skillful adaptation of ID-ordering to CTQDs alone (as in RIO) is not enough to derive the improvements achieved in this work.

We further improve the performance of MRIO by restructuring its query index (i.e., rearranging the queries inside) to better exploit locality and strengthen the pruning effectiveness of its bounds.  

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, which is generally a written word form.

 

ADVANTAGES:

 

Our advanced approach (MRIO) outperforms the current state-of-the-art by one order of magnitude.

MRIO employs novel bounds that offer proven optimality w.r.t. the number of considered queries per stream event.

MRIO is more than two times faster than RIO, demonstrating that a skillful adaptation of ID-ordering to CTQDs alone (as in RIO) is not enough to derive the improvements achieved in this work.

We further improve the performance of MRIO by restructuring its query index (i.e., rearranging the queries inside) to better exploit locality and strengthen the pruning effectiveness of its bounds.

An efficient, continuously updated retrieval system is used to process continuous text queries over the document stream.

To identify the affected queries effectively and to avoid heavy index maintenance costs, the queries are indexed instead of the documents.

 

ALGORITHM:

RIO (Reverse ID-Ordering) 

Reports the correct result for every query when a new document arrives.

MRIO (Minimal Reverse ID-Ordering)

Invokes the minimum possible number of iterations subject to the ID-ordering execution paradigm.

This approach renders MRIO optimal (minimal) in terms of the number of iterations required to process a document arrival.
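
The role of bounds in limiting the number of considered queries can be illustrated with a much coarser test than the one MRIO uses. The sketch below assumes non-negative weights, a weighted-sum score, and that a document enters a query's result only if its score strictly exceeds that query's current k-th best score (its threshold). The functions `build_bounds` and `may_affect_any` are hypothetical. They implement a single global bound per term, not the locally adaptive bounds of MRIO, and they carry no optimality guarantee.

```python
import math
from collections import defaultdict


def build_bounds(queries, thresholds):
    """queries: {query_id: {term: weight}}; thresholds: {query_id: k-th best score}.
    For each term, keep the largest threshold-normalised weight among its queries."""
    bounds = defaultdict(float)
    for query_id, weights in queries.items():
        theta = thresholds[query_id]
        for term, weight in weights.items():
            # A query with no threshold yet (fewer than k results) accepts anything.
            ratio = math.inf if theta <= 0 else weight / theta
            bounds[term] = max(bounds[term], ratio)
    return bounds


def may_affect_any(doc_weights, bounds):
    """Return False only when no query can possibly admit this document."""
    upper = sum(bounds.get(term, 0.0) * weight
                for term, weight in doc_weights.items() if weight > 0)
    # score(q, d) / threshold(q) <= upper for every query q, so upper <= 1 is safe to skip.
    return upper > 1.0
```

When `may_affect_any` returns False, the document can be discarded without visiting any query. When it returns True, individual queries still have to be examined, and it is in that step that tighter, locally adaptive bounds reduce the work further.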

Stemming Algorithm

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, which is generally a written word form.
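
As a rough illustration of the idea, the sketch below strips a few common English suffixes so that different inflections of a keyword map to the same index term. The document does not say which stemming algorithm the project uses; `naive_stem` and its suffix list are assumptions made for this example, and an established algorithm such as Porter's would handle many cases that this one gets wrong.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order; illustrative only


def naive_stem(word):
    """Lower-case the word and strip the first matching suffix,
    provided at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```

With this function, "monitors" and "monitoring" both reduce to "monitor", so a query containing either form matches documents containing the other.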

 

FUTURE WORK:

Future enhancements are inevitable for any software project. Some of the enhancements applicable to this project are as follows. In the future, we plan to extend the system to documents tagged with metadata and to apply special scoring mechanisms to such documents. Another direction is to extract updated data directly from the database.

 

SPECIFICATION:

HARDWARE SPECIFICATION:

 

  • System        : Pentium IV 2.4 GHz.
  • Hard Disk     : 40 GB.
  • Floppy Drive  : 1.44 MB.
  • Monitor       : 15" VGA Colour.
  • Mouse         : Logitech.
  • RAM           : 2 Mb.

 

SOFTWARE SPECIFICATION:

 

  • Operating System  : Windows XP/7
  • Coding Language   : ASP.NET, C#
  • Tool              : Visual Studio 2010
  • Database          : SQL Server 2008

 



