Personal Web Revisitation by Context and Content Keywords with Relevance Feedback (IEEE 2017).


Getting back to previously viewed web pages is a common yet difficult task for users because of the large volume of information they personally access on the web. This paper leverages the natural human recall process, which relies on episodic and semantic memory cues, and presents a personal web revisitation technique called WebPagePrev that works through context and content keywords. We discuss the underlying techniques for acquiring, storing, decaying, and utilizing context and content memories for page re-finding. A relevance feedback mechanism is also included to adapt to each individual’s memory strength and revisitation habits. Our 6-month user study shows that: (1) Compared with the existing web revisitation tool Memento, the History List Searching method, and the Search Engine method, the proposed WebPagePrev delivers the best re-finding quality in finding rate (92.10%), average F1-measure (0.4318), and average rank error (0.3145). (2) Our dynamic management of context and content memories, including the decay and reinforcement strategy, can mimic users’ retrieval and recall mechanism. With relevance feedback, the finding rate of WebPagePrev increases by 9.82%, the average F1-measure increases by 47.09%, and the average rank error decreases by 19.44% compared to a stable memory management strategy. Among the time, location, and activity context factors in WebPagePrev, activity is the best recall cue, and context+content based re-finding delivers the best performance compared to context based re-finding and content based re-finding.


Nowadays, the web plays a significant role in delivering information to users’ fingertips. A web page is located by a fixed URL and displays its content as a time-varying snapshot. Among common web behaviors, web revisitation means re-finding previously viewed web pages: not only the page URL, but also the page snapshot at the access timestamp [1]. A 6-week user study with 23 participants showed that nearly 58% of web accesses were revisitations [2]. Another 1-year user study involving 114 participants revealed that around 40% of queries were re-finding requests [3]. According to [4], on average every second page loaded had already been visited by the same user, and the ratio of revisited pages among all visits ranges between 20% and 72%. Psychological studies show that humans rely on both episodic memory and semantic memory to recall information or events from the past. Episodic memory receives and stores temporally dated episodes or events, together with their spatial-temporal relations, while semantic memory is a structured record of the facts, meanings, concepts, and skills that one has acquired from the external world. Semantic information is derived from accumulated episodic memory.


Three kinds of access context are captured: access time, access location, and concurrent activities. Access time is known directly, while access location can be derived from the IP address of the user’s computing device. By calling a public IP localization API, we can map the IP address (e.g., “”) to a region (e.g., “Beijing, Tsinghua University”). To obtain a higher-precision location, we further build an IP region geocoding database, which can translate a static IP address to a concrete place such as “Lab Building, Room 216”. If the user’s GPS information is available, a public GPS localization application can also help localize the user to a Point of Interest (POI) in the region. The user’s concurrent activities are inferred from the computer programs running before and after the page access. We continuously monitor changes of the user’s focused program window, which can be a web page, a Word file, a chat program window, etc., during the user’s interaction with the computer. Once a user visits a web page for longer than a threshold τc, the computer programs that run interleaved with the current web access program for longer than τc are taken as the associated computer programs (i.e., context activities).
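The association rule at the end of the paragraph above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the focus log format (`FocusInterval`), the function name `associated_programs`, and the reading of "interleaving" as "focus time accumulated between the first and last focus on the page" are all hypothetical choices made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class FocusInterval(NamedTuple):
    program: str   # hypothetical label of the focused window
    start: float   # seconds
    end: float     # seconds


def associated_programs(log, page, tau_c):
    """Return the programs taken as context activities of `page`.

    Assumption: a program is "interleaved" with the page if it held focus
    between the first and the last time the page itself was focused.
    """
    page_spans = [iv for iv in log if iv.program == page]
    page_time = sum(iv.end - iv.start for iv in page_spans)
    if not page_spans or page_time <= tau_c:
        return set()  # the page was not viewed longer than the threshold

    lo = min(iv.start for iv in page_spans)
    hi = max(iv.end for iv in page_spans)

    # Accumulate each other program's focus time inside the page session.
    focus_time = defaultdict(float)
    for iv in log:
        if iv.program == page:
            continue
        overlap = min(iv.end, hi) - max(iv.start, lo)
        if overlap > 0:
            focus_time[iv.program] += overlap

    return {prog for prog, total in focus_time.items() if total > tau_c}


log = [
    FocusInterval("page:paper", 0, 40),
    FocusInterval("editor", 40, 100),
    FocusInterval("chat", 100, 105),
    FocusInterval("page:paper", 105, 150),
]
print(associated_programs(log, "page:paper", tau_c=30))  # {'editor'}
```

In this toy log the editor held focus for 60 seconds during the page session and is kept, while the 5-second chat window falls below the threshold and is dropped.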

Content Extraction and Management Module:

Apart from access context, users may also get back to previously viewed pages through content keywords. Instead of extracting content terms from the full web page, we only consider the page segments shown on the screen. There are many term weighting schemes in the information retrieval field; the most generic one is term frequency-inverse document frequency (tf-idf) [36]. For personalized web revisitation, however, merely counting the occurrences of a term in the presented page segment is not enough. The user’s page browsing behaviors (e.g., visitation time length and whether the term was highlighted), as well as the page’s subject headings, are also taken as indicators of the user’s impression and potential interest for later recall. In a similar manner to access context, we bind an impression score to each extracted content term d, indicating how likely the user is to refer to it for recall, based on the four normalized features.
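To make the idea of an impression score concrete, the sketch below combines four normalized features of a term into one value. The excerpt does not state how the features are combined, so the weighted sum with equal weights used here is purely an assumption, as are the names `TermFeatures` and `impression_score`.

```python
from dataclasses import dataclass


@dataclass
class TermFeatures:
    frequency: float    # normalized occurrence count in the displayed segment
    dwell: float        # normalized visitation time length
    highlighted: float  # 1.0 if the user highlighted the term, else 0.0
    in_heading: float   # 1.0 if the term appears in a subject heading, else 0.0


def impression_score(f, weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four features; the equal-weight sum is an assumed choice."""
    values = (f.frequency, f.dwell, f.highlighted, f.in_heading)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("features must be normalized to [0, 1]")
    return sum(w * v for w, v in zip(weights, values))


# A term seen fairly often, on a page viewed for a while, and highlighted.
score = impression_score(TermFeatures(0.5, 0.5, 1.0, 0.0))
print(score)
```

Keeping every feature in [0, 1] and the weights summing to 1 keeps the score in [0, 1] as well, which lets it be read as a likelihood-style value.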


Now each web page w accessed by the user is bound to a probabilistic context tree (denoted w#tree) and a probabilistic term list (denoted w#list). Let W be the set of the user’s previously accessed web pages. A revisit query posed by the user at time t is expressed as $W_m = Q(W, Q_c, Q_d, t)$, where $Q_c$ is a set of context keywords, $Q_d$ is a set of content keywords, and the answer $W_m$ is a ranked list of matched web pages from W. The computation of content ranking is straightforward: $\mathrm{dRank}(w\#list \mid Q_d, t) = \prod_{q_d \in Q_d} \mathrm{dIs}(w, q_d, t)$, which is the product of the impression scores of the terms matching the content keywords $Q_d$. A content ranking example with respect to $Q_d$ = {“retarget”, “project”} is illustrated in Fig. 5(a). By comparison, the computation of context ranking $\mathrm{cRank}(w\#tree \mid Q_c, t)$ is more complex, because a context keyword may appear in the titles of multiple tree nodes. We first split a context tree into multiple satisfactory subtrees, so that each subtree contains every search keyword once and only once. We then compute the ranking of each subtree, and finally merge the ranking results by $\mathrm{cRank}(w\#tree \mid Q_c, t) = \sum_{i=1}^{n} \mathrm{cRank}(w\#tree_{sub_i} \mid Q_c, t)$. To calculate the ranking score of each subtree, we first determine the matched node set $V = \{\nu_1, \nu_2, \cdots\}$. For each context node $\nu$ in $V$, we calculate the matching score by $\mathrm{mAs}(Q_c, \nu, t) = \frac{|Q_c \cap \nu.title|}{|\nu.title|} \cdot \mathrm{cAs}(w, \nu, t)$. For an ancestor node $\nu_i$ with a matching child node $\nu_j$ ($\nu_j \prec_a \nu_i$), we calculate the joint matching score of $\nu_i$ and $\nu_j$ by $\mathrm{mAs}(Q_c, \{\nu_i, \nu_j\}, t) = \mathrm{mAs}(Q_c, \nu_i \cap \nu_j, t) = \mathrm{mAs}(Q_c, \nu_j, t)$. The reason is that the ancestor node $\nu_i$ can be derived from $\nu_j$ in the context tree, so the two are dependent. If the user remembers a context node at a lower, more detailed level, he or she can directly infer the corresponding nodes at the upper levels along an upward path.
Therefore, the ancestor nodes with matched child nodes can first be pruned so that the remaining nodes are independent, and $\mathrm{cRank}(w\#tree_{sub} \mid Q_c, t)$ is the product of the remaining nodes’ matching scores. Fig. 5(b) gives the two smallest subtrees that satisfy $Q_c$ = {“busy”, “programming”, “read”, “at lab”}, where “at” is removed as a stop word.
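The ranking rules above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `ContextNode` class, the function names, the toy scores, and the choice of scoring a missing content keyword as 0 are assumptions, and the step that splits a tree into satisfactory subtrees is taken as given (each subtree is passed in as a list of nodes).

```python
from dataclasses import dataclass
from math import prod
from typing import Optional


@dataclass(eq=False)
class ContextNode:
    title: frozenset   # non-empty set of title words, stop words removed
    c_as: float        # association score cAs(w, v, t), assumed precomputed
    parent: Optional["ContextNode"] = None


def d_rank(term_scores, q_d):
    """Product of impression scores; a missing keyword scores 0 (assumption)."""
    return prod(term_scores.get(q, 0.0) for q in q_d)


def m_as(q_c, node):
    """Matching score: share of matched title words times cAs."""
    return len(q_c & node.title) / len(node.title) * node.c_as


def ancestors(node):
    while node.parent is not None:
        node = node.parent
        yield node


def c_rank_subtree(q_c, nodes):
    """Prune matched ancestors of matched nodes, then multiply the rest."""
    matched = [v for v in nodes if q_c & v.title]
    covered = {id(a) for v in matched for a in ancestors(v)}
    kept = [v for v in matched if id(v) not in covered]
    return prod(m_as(q_c, v) for v in kept) if kept else 0.0


def c_rank(q_c, subtrees):
    """Sum of the subtree rankings."""
    return sum(c_rank_subtree(q_c, nodes) for nodes in subtrees)


# Toy subtree with hypothetical scores (not the values of Fig. 5).
busy = ContextNode(frozenset({"busy"}), 0.5)
prog = ContextNode(frozenset({"programming"}), 0.8, parent=busy)
read = ContextNode(frozenset({"read", "paper"}), 0.5, parent=busy)
lab = ContextNode(frozenset({"lab"}), 0.5)

q_c = frozenset({"busy", "programming", "read", "lab"})
context_score = c_rank(q_c, [[busy, prog, read, lab]])
content_score = d_rank({"retarget": 0.5, "project": 0.5}, {"retarget", "project"})
print(context_score, content_score)
```

In the toy subtree, the node titled "busy" is pruned because two of its children also match, so the subtree score is the product of the three remaining matching scores (0.8, 0.25, and 0.5).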


Drawing on how human memory organizes and exploits episodic events and semantic words in information recall, this paper presents a personal web revisitation technique based on context and content keywords. Context instances and page content are organized as probabilistic context trees and probabilistic term lists, respectively, which evolve dynamically through decay and reinforcement with relevance feedback. Our experimental results demonstrate the effectiveness and applicability of the proposed technique.